Bug 10482 - [2.6.25 REGRESSION] Lenovo 3000 N100 does not wake up from ACPI S3 state
Summary: [2.6.25 REGRESSION] Lenovo 3000 N100 does not wake up from ACPI S3 state
Status: CLOSED CODE_FIX
Alias: None
Product: ACPI
Classification: Unclassified
Component: EC
Hardware: All Linux
Importance: P1 normal
Assignee: Alexey Starikovskiy
URL:
Keywords:
Depends on: 11559
Blocks: 7216 9832
Reported: 2008-04-19 05:48 UTC by Thomas Bächler
Modified: 2008-10-17 14:39 UTC
CC List: 5 users

See Also:
Kernel Version: 2.6.25
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
syslog-ng kernel.log output from boot until before the suspend (34.22 KB, text/plain)
2008-04-21 13:33 UTC, Thomas Bächler
debug patch (2.58 KB, patch)
2008-04-21 14:24 UTC, Venkatesh Pallipadi
Kernel configuration (75.90 KB, text/plain)
2008-04-23 03:57 UTC, Thomas Bächler
My kernel configuration (51.28 KB, text/plain)
2008-05-09 15:41 UTC, Olav Morken
dmesg output without debug patch applied (44.62 KB, text/plain)
2008-05-09 15:42 UTC, Olav Morken
dmesg output with debug patch applied (66.03 KB, text/plain)
2008-05-09 15:44 UTC, Olav Morken
Test patch (739 bytes, patch)
2008-05-09 16:59 UTC, Venkatesh Pallipadi
dmesg output with patch applied (65.37 KB, text/plain)
2008-05-10 03:09 UTC, Olav Morken
debug patch 3 (2.39 KB, patch)
2008-05-12 17:29 UTC, Venkatesh Pallipadi
dmesg output from debug patch 3 (59.07 KB, text/plain)
2008-05-13 11:35 UTC, Olav Morken
dmesg output from sysrq (226.45 KB, text/plain)
2008-05-20 13:09 UTC, Olav Morken
Patch which adds some debug logging (1.78 KB, patch)
2008-05-21 12:45 UTC, Olav Morken
dmesg output with patch applied (46.55 KB, text/plain)
2008-05-21 12:48 UTC, Olav Morken
Log from cpu_idle_wait with ssleep. (46.71 KB, text/plain)
2008-05-22 06:36 UTC, Olav Morken
Running _WAK on CPU0 (48.46 KB, text/plain)
2008-05-22 18:24 UTC, Olav Morken
dmidecode output (12.81 KB, text/plain)
2008-07-09 22:26 UTC, Olav Morken
dmidecode output from Lenovo 3000 N100 (10.22 KB, text/plain)
2008-07-20 12:38 UTC, Thomas Bächler

Description Thomas Bächler 2008-04-19 05:48:14 UTC
Latest working kernel version: 2.6.25-rc8-git8
Earliest failing kernel version: 2.6.25-rc8-git9
Distribution: Arch Linux
Hardware Environment: Lenovo 3000 N100, Core 2 Duo T5300
Problem Description:

Since 2.6.25-rc9, my Lenovo 3000 N100 (Core 2 Duo T5300), running x86_64 does not wake up from ACPI S3 (Suspend to RAM) any more. The display remains dark. It's possible that this bug also occurs on i686, I currently have no way of testing that though.

Fix: Revert http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=783e391b7b5b273cd20856d8f6f4878da8ec31b3
Comment 1 Venkatesh Pallipadi 2008-04-19 08:14:25 UTC
Rafael,

Any clues about this? This hang looks like it is happening while the secondary CPUs are being brought online during resume.

We go through the CPU_ONLINE notifier function acpi_cpu_soft_notify() ->
acpi_processor_cst_has_changed() -> cpuidle_pause_and_lock() -> cpuidle_uninstall_idle_handler(), which calls smp_call_function().

Looks like secondary CPUs do not like smp_call_function at this point. I am digging through the resume code, but thought you may have a quicker answer.
Comment 2 Rafael J. Wysocki 2008-04-19 13:15:57 UTC
No, unfortunately I have no idea why that might happen.
Comment 3 Thomas Gleixner 2008-04-19 13:40:51 UTC
I wonder if this might be related to mwait_idle(). Just an idea which shot through my head.
Comment 4 Venkatesh Pallipadi 2008-04-21 13:10:50 UTC
Thomas,

Can you attach the 'dmesg' output from 2.6.25, taken before actually suspending?
Comment 5 Thomas Bächler 2008-04-21 13:29:41 UTC
I actually unapplied the commit mentioned above and suspended successfully lots of times since then. I guess you want the messages from my boot, so I am attaching that.
Comment 6 Thomas Bächler 2008-04-21 13:33:24 UTC
Created attachment 15830 [details]
syslog-ng kernel.log output from boot until before the suspend
Comment 7 Venkatesh Pallipadi 2008-04-21 14:24:56 UTC
Created attachment 15831 [details]
debug patch


Can you try 2.6.25 with the attached debug patch?

Try suspend-resume with this patch. The system should wake up cleanly after resume. I have added a few debug prints in the patch that should give us more hints about the failure.

I tried to reproduce this on a Lenovo T61 and other Core 2 Duo based boxes here, without much success so far. Can you also attach the .config file that you are using?

Thanks.
Comment 8 Thomas Bächler 2008-04-21 23:45:20 UTC
This is my .config:
http://dev.archlinux.org/~aaron/svn/viewvc/bin/cgi/viewvc.cgi/kernel26/repos/testing-x86_64/config.x86_64?revision=374&view=markup

Will be a few days though until I can build and try another kernel.
Comment 9 Thomas Bächler 2008-04-23 03:57:22 UTC
Created attachment 15859 [details]
Kernel configuration

The link I posted above has changed, so I would rather upload the file here.
Comment 10 Venkatesh Pallipadi 2008-04-30 13:48:40 UTC
Sorry about the delayed response on this one. I have finally figured out the reason for this oops. I should have a patch for you to test soon.
Comment 11 Venkatesh Pallipadi 2008-04-30 13:53:36 UTC
Oops, sorry. Wrong bug. Comment #10 above was meant for bug #10394...

This one still needs info: some debug messages from the debug patch above.
Comment 12 Olav Morken 2008-05-09 15:39:48 UTC
I have the same problem, and have identified the same commit as the cause of
the problem. This is on a ThinkPad Z61t.

When I resume the laptop from suspend to ram, it will resume "slightly", making
the hibernation light blink. The resume operation will finish if I insert or
remove AC power at that point.

I tested with the debug patch, but the system didn't resume as it should with
that patch applied.

I will attach my .config, the dmesg output without the debug patch applied and
the dmesg output with the debug patch applied.
Comment 13 Olav Morken 2008-05-09 15:41:28 UTC
Created attachment 16089 [details]
My kernel configuration
Comment 14 Olav Morken 2008-05-09 15:42:59 UTC
Created attachment 16090 [details]
dmesg output without debug patch applied
Comment 15 Olav Morken 2008-05-09 15:44:21 UTC
Created attachment 16091 [details]
dmesg output with debug patch applied

This log contains the output of two resume operations.
Comment 16 Venkatesh Pallipadi 2008-05-09 16:59:54 UTC
Created attachment 16092 [details]
Test patch

Thanks for the detailed logs.

Can you try the attached patch (without the earlier debug patch) and check whether the resume regression goes away?
Comment 17 Olav Morken 2008-05-10 03:09:19 UTC
Created attachment 16093 [details]
dmesg output with patch applied

Unfortunately the patch didn't solve the problem. This is the dmesg output from
three suspend-resume cycles with the patch applied.
Comment 18 Venkatesh Pallipadi 2008-05-12 17:29:52 UTC
Created attachment 16118 [details]
debug patch 3


This is getting increasingly weird.

The only thing I can think of is that removing the ssleep(1) in the cleanup patch is somehow affecting the resume.

Can you try yet another debug patch, which is basically the same as the earlier debug patch with an ssleep(1) added, and see whether that helps it resume correctly?

Thanks.
Comment 19 Olav Morken 2008-05-13 11:35:25 UTC
Created attachment 16126 [details]
dmesg output from debug patch 3

My laptop resumes correctly with that patch applied. This is the dmesg output
from two suspend-resume cycles.
Comment 20 Venkatesh Pallipadi 2008-05-19 16:02:20 UTC
Looking at the logs, I still cannot make out where we actually hang, or why. I am not sure whether the ssleep(1) in patch 3 is preventing the hang altogether or just hiding it for the time being.

Can you give some more information about the hang at resume? With the base kernel, without patch 3 (with patch 1 if possible), can you try the following please:

1) Make sure CONFIG_MAGIC_SYSRQ is enabled in your config.

2) Suspend the system and resume

3) It should hang at this point

4) Try "Alt" -> "PrintScreen/SysRq" -> "p"
       "Alt" -> "PrintScreen/SysRq" -> "t"

5) plug/unplug ac to make the system continue to resume.

At this point you should have debug messages from the 'hang' time in dmesg.
Attach the dmesg here.

Thanks.
Comment 21 Olav Morken 2008-05-20 13:09:50 UTC
Created attachment 16219 [details]
dmesg output from sysrq

This is the dmesg output with the sysrq output. The output is from two
suspend-resume cycles.

Thank you for taking the time to look into this problem.
Comment 22 Olav Morken 2008-05-21 12:42:55 UTC
I looked at the problem some more, using printk and sysrq to narrow it down.
It looks like the problem occurs during the execution of the _WAK ACPI
function. The _WAK-function on my laptop calls Sleep(100). This sleep doesn't
finish.

I added debug logging to the acpi_ex_system_do_suspend function. This shows
that this function calls acpi_os_sleep(how_long), but that this call doesn't
return (before I unplug the power).

I will attach a patch which shows the debug logging I added, and the output
from dmesg.
Comment 23 Olav Morken 2008-05-21 12:45:48 UTC
Created attachment 16230 [details]
Patch which adds some debug logging

This patch shows where I added the printk's for debug logging.
Comment 24 Olav Morken 2008-05-21 12:48:23 UTC
Created attachment 16231 [details]
dmesg output with patch applied

This is the dmesg output from a suspend/resume cycle with the patch applied.
The SysRq output shows the place where the resume hangs.
Comment 25 Venkatesh Pallipadi 2008-05-21 16:51:17 UTC
Thanks for the detective work :-). This is interesting...

Along with your patch, can you also add an ssleep(1) in cpu_idle_wait, with a printk before and after the ssleep? From your earlier logs, it looks like cpu_idle_wait is getting called much later than the _WAK function, so it is still not fully clear how that is helping this _WAK part. This experiment can give us more hints.
Comment 26 Olav Morken 2008-05-22 06:36:04 UTC
Created attachment 16245 [details]
Log from cpu_idle_wait with ssleep.

This is the log output with some printk's added to cpu_idle_wait. I also changed
printk to always show which cpu it is executing on (the @0 and @1 after the
timestamp).
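The per-CPU tag described above can be illustrated with a small userspace sketch. It is not the actual printk change from the attached patch: the function name format_tagged_line and the exact "[time @cpu]" layout are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format one log line as "[seconds.microseconds @cpu] message".
 * Hypothetical helper: the real patch changed printk inside the kernel,
 * and its exact output layout is not reproduced here.
 * Returns the number of characters snprintf would have written.
 */
int format_tagged_line(char *buf, size_t size,
                       unsigned long sec, unsigned long usec,
                       int cpu, const char *msg)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "[%5lu.%06lu @%d] %s", sec, usec, cpu, msg);
}
```

With a tag like this, two interleaved lines such as "@0" and "@1" make it possible to tell which CPU was executing a given sleep when reading the attached logs.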

To me it looks like a sleep on CPU 1 won't wake up unless/until a sleep on
CPU 0 has been executed. When the power cable is unplugged, some ACPI code is
executed on CPU 0, and that code sleeps (three sleeps actually: 500ms,
100ms & 100ms). I will attach a log showing this.

The sleep in cpu_idle_wait has the same effect, as it also executes on CPU 0.
Comment 27 Olav Morken 2008-05-22 07:22:14 UTC
(In reply to comment #26)
Actually, just sleeping on CPU 0 does not wake up the sleep on CPU 1. I added a small sysrq handler which executed a sleep. The sleep executed (on CPU 0), but it didn't wake up CPU 1.
Comment 28 Venkatesh Pallipadi 2008-05-22 17:20:39 UTC
Len Brown suggested another thing to test: run _WAK on CPU 0 and check whether it wakes OK. Can you try a change like this?
(Note that suspend_ops->finish calls _WAK internally.)


Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c	2008-05-01 15:31:18.000000000 -0700
+++ linux-2.6/kernel/power/main.c	2008-05-22 17:13:52.000000000 -0700
@@ -279,17 +279,17 @@ int suspend_devices_and_enter(suspend_st
 			goto Resume_devices;
 	}
 
+	error = disable_nonboot_cpus();
 	if (suspend_test(TEST_PLATFORM))
 		goto Finish;
 
-	error = disable_nonboot_cpus();
 	if (!error && !suspend_test(TEST_CPUS))
 		suspend_enter(state);
 
-	enable_nonboot_cpus();
  Finish:
 	if (suspend_ops->finish)
 		suspend_ops->finish();
+	enable_nonboot_cpus();
  Resume_devices:
 	device_resume();
  Resume_console:
Comment 29 Olav Morken 2008-05-22 18:24:40 UTC
Created attachment 16252 [details]
Running _WAK on CPU0

When running _WAK on CPU0 it boots correctly. The dmesg output is attached.
Note that cpu_idle_wait is still run straight after _WAK is called.
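The effect of the reordering tested here can be sketched as a toy model. This is a simplified illustration, not kernel code: the enum values and the function cpus_online_at_finish are hypothetical names, and the model only tracks how many CPUs are online when the platform finish step (which evaluates _WAK) runs.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified resume-path steps; illustrative names, not kernel symbols. */
enum step { STEP_SUSPEND_ENTER, STEP_ENABLE_NONBOOT_CPUS, STEP_PLATFORM_FINISH };

/*
 * Walk the steps in order and return how many CPUs are online when the
 * platform finish step runs. Returns -1 if there is no finish step.
 */
int cpus_online_at_finish(const enum step *steps, size_t n, int total_cpus)
{
    int online = 1;   /* only the boot CPU is up right after wake-up */

    assert(total_cpus >= 1);
    for (size_t i = 0; i < n; i++) {
        if (steps[i] == STEP_ENABLE_NONBOOT_CPUS)
            online = total_cpus;
        else if (steps[i] == STEP_PLATFORM_FINISH)
            return online;
    }
    return -1;
}

/* Order before the test patch: secondary CPUs come up first, then _WAK. */
const enum step original_order[] = {
    STEP_SUSPEND_ENTER, STEP_ENABLE_NONBOOT_CPUS, STEP_PLATFORM_FINISH
};

/* Order with the test patch: _WAK runs first, then secondary CPUs come up. */
const enum step patched_order[] = {
    STEP_SUSPEND_ENTER, STEP_PLATFORM_FINISH, STEP_ENABLE_NONBOOT_CPUS
};
```

In this model the patched order leaves a single CPU online at the finish step, so _WAK can only execute on CPU 0, whereas in the original order it may be scheduled on either CPU.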
Comment 30 Olav Morken 2008-06-22 08:44:40 UTC
I decided to look a bit at this bug again today, and tried to upgrade the BIOS on my laptop. This fixed the original problem. However, I decided to try the latest version of the 2.6.26-kernel, and that one has the same problem, except that I can't make resuming continue by unplugging AC power.

git bisect points to commit 1b7fc5aae8867046f8d3d45808309d5b7f2e036a, which is ACPI: EC: Use msleep instead of udelay while waiting for event.

Some more testing shows that enabling CONFIG_HIGH_RES_TIMERS makes this problem go away for me.
Comment 31 Zhang Rui 2008-06-22 19:15:43 UTC
So you mean the original resume issue is gone, and the new issue (unplugging AC breaks resume) is introduced by the commit 1b7fc5aae8867046f8d3d45808309d5b7f2e036a?
Comment 32 Olav Morken 2008-06-22 22:31:37 UTC
No, sorry. I meant that 1b7fc5aae8867046f8d3d45808309d5b7f2e036a introduces a new resume-problem with the same symptom as the previous (hang with blinking "suspended-light"). However, I am no longer able to wake the laptop from that hang by unplugging AC power.

Actually, I believe that the "unplug AC to continue waking"-"feature" was removed earlier in the 2.6.26-series of patches. I believe I bisected it to b77d81b2678950077088956da4638c26853389fc - EC: Replace broken controller workarounds with poll mode.
Comment 33 Zhang Rui 2008-07-06 23:13:25 UTC
what's your BIOS version?
could you please verify if you can reproduce bug 10223?
Comment 34 Olav Morken 2008-07-09 11:00:05 UTC
(In reply to comment #33)
> what's your BIOS version?
> could you please verify if you can reproduce bug 10223?
> 

My laptop is a Thinkpad Z61t, and I have never seen anything like what is described in bug 10223 on it. The fan appears to adjust its speed as appropriate. My BIOS version is 2.24.


To summarize this bug as I have seen it:

1. After 783e391b7b5b273cd20856d8f6f4878da8ec31b3, my laptop wouldn't resume from suspend-to-ram. It would begin resuming, set the suspend-light to blink, and freeze/hang. I was able to unfreeze it by unplugging/plugging in AC power.

2. After commit 1b7fc5aae8867046f8d3d45808309d5b7f2e036a was added to the 2.6.26-tree, I could no longer unfreeze it by unplugging AC power.

3. At this point I updated my BIOS to version 2.24. This fixed the original problem, but 1b7fc5aae8867046f8d3d45808309d5b7f2e036a still causes it to hang.

4. Enabling CONFIG_HIGH_RES_TIMERS made this hang go away, and I can resume normally.

As far as I have been able to understand, all hangs are related to sleeping - for some reason a sleep at that point in the resume process on that CPU does not end automatically. I assume that unplugging/plugging in AC power caused an interrupt which woke up the sleeping CPU. However, after a sleep was added to the EC-code, that code will hang.
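The hypothesis in this summary can be sketched as a toy model. It is only an illustration of the reasoning, not a description of the kernel's timer or EC code: struct toy_cpu, sleep_wait and busy_wait are hypothetical names, and time is counted in abstract steps. The assumption is that a sleeping waiter only makes progress when a tick or some other interrupt wakes its CPU, while a busy-wait counts time by itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy CPU: a sleeper only makes progress when a wake-up event reaches it. */
struct toy_cpu {
    bool tick_works;   /* does the periodic timer tick reach this CPU? */
    int  irq_at_step;  /* step at which an external interrupt arrives, -1 for never */
};

/*
 * Sleep-style wait (in the spirit of msleep): the waiter is parked and only
 * re-checks its deadline when a tick or an external interrupt wakes the CPU.
 * Returns the step at which the wait ends, or -1 if it is still parked
 * after max_steps (the "hang").
 */
int sleep_wait(const struct toy_cpu *cpu, int deadline, int max_steps)
{
    assert(deadline >= 1 && max_steps >= 1);
    for (int step = 1; step <= max_steps; step++) {
        bool woken = cpu->tick_works || step == cpu->irq_at_step;
        if (woken && step >= deadline)
            return step;
    }
    return -1;
}

/*
 * Busy-wait (in the spirit of udelay): the CPU spins and counts time itself,
 * so it finishes at the deadline whether or not any interrupt arrives.
 */
int busy_wait(const struct toy_cpu *cpu, int deadline, int max_steps)
{
    (void)cpu;   /* the outcome does not depend on interrupt delivery */
    assert(deadline >= 1 && max_steps >= 1);
    return deadline <= max_steps ? deadline : -1;
}
```

In this model a CPU with no working tick hangs in sleep_wait until an external interrupt arrives (the AC plug event in the description above), while busy_wait completes regardless, which is consistent with the hang appearing once the EC code switched from udelay to msleep.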

Comment 35 Zhang Rui 2008-07-09 20:20:32 UTC
would you please attach the dmidecode?
Comment 36 Olav Morken 2008-07-09 22:26:40 UTC
Created attachment 16779 [details]
dmidecode output
Comment 37 Thomas Bächler 2008-07-20 12:27:17 UTC
Okay, I am sorry for not reporting back for so long, but I think Olav Morken provided some useful information. I simply didn't have the time to look into this.

The status right now is this:

1) With 2.6.26, resuming is still broken. The sleep LED is turned off, but the display isn't enabled and the machine freezes (looks the same as with .25, but I never tried the "unplug AC" workaround).

2) Above it was suggested to update the BIOS. However, I cannot find a way to update my BIOS at Lenovo's website: The only thing I can find is a Windows executable which apparently updates the Lenovo drivers and the BIOS, but I don't have Windows installed. Olav, can you tell me how to update the BIOS?

If I find some time in the next few days, what should I try to get this fixed?
Comment 38 Thomas Bächler 2008-07-20 12:35:43 UTC
(In reply to comment #33)
> what's your BIOS version?
> could you please verify if you can reproduce bug 10223?
> 

I have a similar, but not identical bug: sometimes (unrelated to resume), while the Coretemp temperature is going down, the ACPI temperature stays the same. The fan is running continuously. I can fix this issue (sometimes) by heating up the CPU so that the ACPI temperature rises again, and then waiting for it to cool down. I can also fix this issue by suspending and resuming. So it is not exactly 10223, but there is something wrong with ACPI here, and it might be related.
Comment 39 Thomas Bächler 2008-07-20 12:38:37 UTC
Created attachment 16910 [details]
dmidecode output from Lenovo 3000 N100
Comment 40 Thomas Bächler 2008-07-20 13:15:36 UTC
Only reverting 1b7fc5aae8867046f8d3d45808309d5b7f2e036a actually fixes the issue in 2.6.26 (no BIOS update, no other changes).
Comment 41 Olav Morken 2008-07-20 13:44:47 UTC
I'm afraid I don't know how to update the BIOS on your laptop. For mine, they provide a bootable iso-file, linked from the download-page for the normal update utility (under "Additional information" on http://www-307.ibm.com/pc/support/site.wss/document.do?lndocid=MIGR-64409). I was not able to find a similar link in the download page for the Lenovo 3000 N100, so I assume they don't provide one.
Comment 42 Thomas Bächler 2008-07-20 13:51:23 UTC
Thank you, I couldn't find an ISO for mine either (only the Win XP/Vista tools). However, from the changelog it appears the improvements are mostly unrelated to suspending.

If I get a second hard drive somewhere, I could install Windows to update the BIOS, but I don't think that will happen anytime soon.


Just like in 2.6.25, I now have a temporary fix for the problem (see comment #40).
Comment 43 Zhang Rui 2008-07-20 18:21:57 UTC
Alexey, any ideas? :)
Comment 44 Olav Morken 2008-09-13 15:14:39 UTC
The bug I was seeing seems to be fixed in the latest 2.6.27 rc. Some testing and bisection shows that it was fixed by the following commit:
nohz: fix wrong event handler after online an offlined cpu
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=3c4fbe5e01d7e5309be5045e7ae0db20a049e6dc

However, I see that Thomas Bächler has CONFIG_HIGH_RES_TIMERS=y in his .config, so I guess that I was seeing a different bug than he was.
Comment 45 Thomas Bächler 2008-09-14 08:41:54 UTC
Thanks for reminding me: I just built 2.6.27-rc6-00088-g6bfb09a and the bug seems fixed. I just booted the machine and suspended/resumed twice successfully. Maybe it was fixed for me before the commit you mentioned already.

I have no time to bisect right now, but if any of the involved developers deem it necessary, I will.
Comment 46 Rafael J. Wysocki 2008-09-14 16:33:20 UTC
Well, I don't think it's necessary.

Please reopen if the problem happens again.
Comment 47 Thomas Bächler 2008-09-15 00:38:36 UTC
It still hangs occasionally. This didn't happen in 2.6.24, but with my patched 2.6.25 and 2.6.26. I will reopen this as soon as I am able to reproduce it.
Comment 48 Thomas Bächler 2008-09-18 10:50:07 UTC
I have to reopen this. Resuming still fails approximately one fourth of the time and I still have to find a pattern as to when it succeeds and when it fails (it seems random, sometimes it fails the first time I try, sometimes I can suspend a few times until it fails).
Comment 49 Thomas Bächler 2008-10-11 03:22:19 UTC
Okay, I can happily report that with -rc7 and now with the final release, I didn't have a single resume crash. This is again as stable as it was with 2.6.24.
