Bug 5037
Summary: | S3 sleep hangs often [TP 600X] | ||
---|---|---|---|
Product: | ACPI | Reporter: | Sanjoy Mahajan (sanjoy) |
Component: | Power-Sleep-Wake | Assignee: | Shaohua (shaohua.li) |
Status: | CLOSED PATCH_ALREADY_AVAILABLE | ||
Severity: | normal | CC: | acpi-bugzilla, bunk, linux, mingo |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.13-rc6 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 5062 | ||
Attachments: |
suspend script
initramfs 2.6.13-rc6 config that has the S3 hangs script to diff kernel configs dmesgs during S3 sleep hang debug patch dmsgs during S3 sleep hang (using workaround) good sleep/wake, then bad one (console log) |
Description
Sanjoy Mahajan
2005-08-09 16:34:08 UTC
Created attachment 5566 [details]
suspend script
The script logs the modules and lspci output before putting the computer to
sleep, but I couldn't find any correlation between the modules loaded and
whether it hangs. Sadly.
Please try revert your hard disk driver's change, and report it back. > Please try revert your hard disk driver's change
Sure. But I'm not sure which change you mean (maybe let me know a git
commit id?).
-Sanjoy
can you isolate when this broke between 2.6.11.4 and 2.6.13-rc6? probably good to start with vanilla 2.6.12 and divide from there. Created attachment 5598 [details]
initramfs
Please compile a kernel without any driver and let attached file as your
initrd.
Note: the initrd doesn't touch any device. It does something like:
echo "Start init ............"
echo Mounting proc
mount -t proc /proc /proc
echo Mounting sysfs
mount -t sysfs /sys /sys
echo "See how many devices we have .... "
cat /proc/interrupts
echo "Parepare enter S3 ..... "
sleep 5
echo "mem" > /sys/power/state
echo "we are still living ...."
sleep 5
while true; do true; done
So be carefule don't make the initrd as your working kernel's initrd.
I suspect some drivers are broken. With the initrd, we can do s3 without any
driver, so it might help us narrow down the bug.
> probably good to start with vanilla 2.6.12 and divide from there. I just tried 2.6.12.3 and it works fine; at least, I haven't hit the bug in many suspend/resume cycles, already a lot more than needed to hit the bug with 2.6.13-rc6. I'll keep bisecting. One difficulty is that S3 wakeup broke for me between rc1-git7 and rc2-git1 (see <http://bugzilla.kernel.org/show_bug.cgi?id=4989>), so I can't do more than one sleep in that region. But to trigger the hang on sleep usually requires two or three sleep/wake cycles. Hopefully the bisection converges upon a kernel that resumes fine. Just to be safe, I'll first try -rc6 with the same (working) .config as for 2.6.12.3. I'll also try the initrd suggestion. -Sanjoy I've been running 2.6.13-rc6 with a config as similar as possible to the working 2.6.12.3: I copied over the 2.6.12.3 config, ran 'make oldconfig' and accepted the defaults (either by hitting return a lot or feeding /dev/null as the stdin). I haven't been able to crash the S3 sleep yet in 7 or 8 cycles, with the network card plugged in (using the same suspend script that is attached). So perhaps changing the config causes a problem, or reveals a problem. Here's a unified diff between the configs (0 lines of context, with the comment and blank lines ignored, and the @@ lines taken out). The first 'file' is the working one, the second 'file' is the broken one. I'll attach diff_configs.sh in case anyone else finds it useful. diff_configs.sh 2.6.13-rc6-same-config-as-2612 2.6.13-rc6 --- 2.6.13-rc6-same-config-as-2612 2005-08-11 13:00:28.000000000 -0400 +++ 2.6.13-rc6 2005-08-11 13:00:28.000000000 -0400 +CONFIG_LOCK_KERNEL=y -CONFIG_PREEMPT_NONE=y +CONFIG_PREEMPT=y +CONFIG_PREEMPT_BKL=y +CONFIG_HAVE_DEC_LOCK=y -CONFIG_HZ_250=y -CONFIG_HZ=250 +CONFIG_HZ_1000=y +CONFIG_HZ=1000 -CONFIG_ACPI_ASUS=y -CONFIG_ACPI_TOSHIBA=m +CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m -CONFIG_USB_MON=y +CONFIG_DEBUG_PREEMPT=y I'm not a betting man, but if I had to bet, I'd guess that the preempt options have caused (or exposed) a problem. So, how to debug? Should I try these config options in 2.6.12.3 (basically enable preempts) and see whether it starts to hang? Or should I put in some debugging into 2.6.13-rc6? More and more I suspect this problem is correlated with the semaphore errors I reported in Bug #5008 and Bug #4967. Running 'stress-poll' from Bug #5008 produces semaphore errors if and only if the kernel will show those S3 hangs. For example, I haven't been able to produce the semaphore error in this kernel, which is -rc6 with the working config; but I could produce them in the -rc6 with the hanging config. Created attachment 5602 [details]
2.6.13-rc6 config that has the S3 hangs
Created attachment 5603 [details]
script to diff kernel configs
Produces a human-useful diff of two kernel configs.
The semaphore hang issue is likely an EC issue.
>Re: config.
My test kernel enabled preempt, but it works. So it's a little strange, can
you change your config slightly and see which config option causes the hang?
Created attachment 5611 [details] dmesgs during S3 sleep hang I recompiled the working -rc6 but adding low-latency preemption, which is one option in the 'make menuconfig' that sets a few others. I get the semaphore error again (Bug #5008) and it hung on S3 sleep on my first try. Also as expected, 'acpi' did NOT produce the extra <RETURN>: so Bug #5049 did not happen, though I didn't run it long enough to notice if I got doubled keystrokes just typing in emacs. The attachment is the console output across ah serial console with a bit of extra acpi debugging enabled (my suspend script enables it just in case). It just keeps looping and looping, so I eventually stopped it after capturing 2000+ lines and rebooted. Here's the sort-of-diff from the working config: --- 2.6.13-rc6-same-config-as-2612 2005-08-12 01:48:47.000000000 -0400 +++ 2.6.13-rc6-2612-with-preempt 2005-08-12 01:48:47.000000000 -0400 +CONFIG_LOCK_KERNEL=y -CONFIG_PREEMPT_NONE=y +CONFIG_PREEMPT=y +CONFIG_PREEMPT_BKL=y +CONFIG_HAVE_DEC_LOCK=y +CONFIG_DEBUG_PREEMPT=y Next I could try 'desktop system' in the make menuconfig, which may not enable so many extra config settings. I didn't want to disable/enable any by hand since they probably depend on each other and not every combination will produce a valid kernel. Created attachment 5612 [details]
debug patch
Quite strange some thermal zone methods still are executed in suspend process.
It might be because kacpid isn't stoped. how about try the workaround?
I'm recompiling the version that breaks but adding the debug patch and will let you know how it goes. Meanwhile I just recompiled with voluntary preempts (instead of low-latency desktop, i.e. fully preemptible). That changes just a couple config items compared to the working config: -CONFIG_PREEMPT_NONE=y +CONFIG_PREEMPT_VOLUNTARY=y I'm running the resulting kernel now and it seems to suspend/resume fine (though I haven't yet had as much testing, so problems may show up later if they are more subtle). The semaphore errors show up (Bug #5008), though it takes quite a while running 'stress-poll' to trigger them. Because of the preemption, the system is still responsive while stress-poll is running (with the no-preemption kernel, i.e. the one that for sure works, the system felt very sluggish). The voluntary-preemption kernel also doesn't have problems with 'cardctl eject' whereas the fully premptible kernel says "unable to remove socket power". Note to self: I sent a note to linux-pcmcia reporting this problem, but I should also file a bugzilla report. Created attachment 5617 [details] dmsgs during S3 sleep hang (using workaround) S3 sleep just hung even with the workaround. Attached is what showed up on the serial console (just kept going until I finally rebooted). The temperature lines are still there, which is strange. Is my hacked DSDT at fault? (Bug#4989 has the acpidmp output.) I spoke too soon about 2.6.12.3. The version I normally use hasn't hung, but I recompiled it with these options (got by selecting preemptible kernel in make menuconfig): CONFIG_PREEMPT=y CONFIG_HAVE_DEC_LOCK=y CONFIG_LOCK_KERNEL=y CONFIG_PREEMPT_BKL=y CONFIG_DEBUG_PREEMPT=y After a lot of use (several sleep/wake cycles, wlan card in and out), this kernel also hung with an endless loop when put to sleep. Too bad too. It had lasted for a while, so I assumed it was good when I did the bisection, and had just compiled the midpoint of 2.6.12 (mistakenly assumed good) and 2.6.13-rc6 (known bad) to test. Now I wonder whether even 2.6.11.4 w/ preemption will also hang (the regular 2.6.11.4 that I ran for a while never hung like the others are doing). Just tried the low-latency preemption version of 2.6.13-rc6 (which often hangs on sleep) but with PREEMPT_BKL unset. It hung on the second sleep/wake cycle. So that eliminates one more option as the culprit. Below is the minimal config change (from my testing so far) that produces the hang. It's the diff between the mild preemption config, which didn't hang after lots of cycles so I'm assuming for now that it was okay, and the low-latency config with PREEMPT_BKL unset (the + is for th kernel with the hang): +CONFIG_DEBUG_PREEMPT=y +CONFIG_HAVE_DEC_LOCK=y +CONFIG_LOCK_KERNEL=y -CONFIG_PREEMPT_VOLUNTARY=y +CONFIG_PREEMPT=y For further testing, I'm not sure which of these options I can vary independently. For example, can I turn off CONFIG_HAVE_DEC_LOCK, or is its setting determined by other options? Its entry in the Kconfig is config HAVE_DEC_LOCK bool depends on (SMP || PREEMPT) && X86_CMPXCHG default y and I didn't find any menuconfig way to turn it off. Created attachment 5650 [details]
good sleep/wake, then bad one (console log)
Attachment is a serial-console log showing a good sleep/wake and then a bad
one. It is from a kernel compiled with preempts (but w/o preempt_bkl), and
with the workaround (and a couple patches to debug other acpi problems:
ec_burst mode and ibm acpi hotkey).
The log supports my latest theory, that if thermal polling happens at the wrong
time, then the sleep hangs. Here's the longer version.
To quiet the fan, I set the THM2 trip point to 50 C and the polling frequency
to 10 seconds for THM0 and THM2 (with polling disabled, the fan does not turn
on and off with the new trip point, although it seems to work with the default
DSDT trip point -- a matter to investigate and perhaps file a report, but
another story).
So every 10 seconds, THM0 methods are run; and every 10 seconds some THM2
methods are run. If those methods are run during a critical section of going
to sleep (I think it's the PTS method), then trouble happens. But that can
happen only with a fully preemptible kernel. With a non-preemptible kernel, or
even one with only voluntary preempts, the THM0 or THM2 methods won't be run,
and all is well.
This theory would explain why the hang happens only once in a while. Suppose
that the critical going-to-sleep section lasts 0.5 seconds. The thermal
polling happens every 10 seconds, so 5% of the time it will happen during the
critical section (or 10% of the time, if the two thermal zone pollings are
independent). So on average the system will hang once every 20 (or 10) times.
One prediction, therefore, is that lowering the polling interval will increase
the chance of hanging the system. So I lowered it to 1 second for THM0 and
THM2 (with a 0.3 second sleep after setting the first, so that the two do not
trigger at the same time, which should maximize the chance of hitting the
critical section). As expected, with this faster polling the sleep hangs more
frequently.
The attached serial-console log file (acpi_debug=0x1F) shows a good sleep/wake,
then a bad one (tries to sleep, then infinite loops, so I reboot). Here is an
egrep of it for @@@@ or Method: lines (the @@@@ lines are put there by my
suspend script). The first SLEEP goes okay, and the thermal methods start to
run and show up in the log. In the second SLEEP, they aren't there, perhaps
because they ran at a dangerous time during PTS, hung the system, and the log
never showed them.
@@@@ SLEEP
@@@@ Tue Aug 16 11:29:05 EDT 2005
@@@@ Linux sanjoy 2.6.13-rc6-2612-with-preempt-without-bkl+patches
@@@@ BOOT_IMAGE=-preempt-bkl ro root=305 idebus=66 apm=off acpi=force
pci=noacpi console=ttyS0,115200 console=tty0 acpi_sleep=s3_bios
Execute Method: [\_TZ_.THM2._TMP] (Node c1576608)
Execute Method: [\_TZ_.THM0._TMP] (Node c1576a08)
Execute Method: [\_TZ_.THM2._TMP] (Node c1576608)
Execute Method: [\_TZ_.THM0._TMP] (Node c1576a08)
Execute Method: [\_TZ_.THM2._TMP] (Node c1576608)
Execute Method: [\_SB_.LID0._PSW] (Node c15558c8)
Execute Method: [\_SB_.SLPB._PSW] (Node c15557c8)
Execute Method: [\_TZ_.THM0._TMP] (Node c1576a08)
Execute Method: [\_PTS] (Node c1575548)
Execute Method: [\_TZ_.THM2._TMP] (Node c1576608)
Execute Method: [\_TZ_.THM0._TMP] (Node c1576a08)
Execute Method: [\_TZ_.THM2._TMP] (Node c1576608)
Execute Method: [\_SI_._SST] (Node c15752c8)
Execute Method: [\_TZ_.THM0._TMP] (Node c1576a08)
Execute Method: [\_TZ_.THM2._TMP] (Node c1576608)
Execute Method: [\_TZ_.THM0._TMP] (Node c1576a08)
@@@@ WAKE
@@@@ Tue Aug 16 11:30:18 EDT 2005
[2nd sleep, but no wake:]
@@@@ SLEEP
@@@@ Tue Aug 16 11:30:42 EDT 2005
@@@@ Linux sanjoy 2.6.13-rc6-2612-with-preempt-without-bkl+patches
@@@@ BOOT_IMAGE=-preempt-bkl ro root=305 idebus=66 apm=off acpi=force
pci=noacpi console=ttyS0,115200 console=tty0 acpi_sleep=s3_bios
Execute Method: [\_SB_.LID0._PSW] (Node c15558c8)
Execute Method: [\_SB_.SLPB._PSW] (Node c15557c8)
Execute Method: [\_PTS] (Node c1575548)
If I enable ACPI_LV_MUTEX and/or ACPI_LV_THREADS, would that help to pinpoint the hang? *** Bug 4926 has been marked as a duplicate of this bug. *** Sadly, it happened again even with turning off the thermal polling. But it has taken many days of using sleep/wake, so the bug triggers less often since I turned off thermal polling before the sleep. Perhaps some other thread jumps in and causes problems, but less often? Has this issue ever been resolved? I am experiencing a similar phenomenon on a T60 with 2.6.15 and CONFIG_LOCK_KERNEL=y # CONFIG_PREEMPT_NONE is not set CONFIG_PREEMPT=y CONFIG_PREEMPT_BKL=y Does anyone have a reliably working config for S3 suspend on a T60 ? Try the patch given in Bug 6749. It fixed S3 suspend on my 600X. Without the patch my T60 hangs once in a while (though a lot fewer after I started unlinking the fan module before suspending). But I haven't had the time to try that patch on the T60. Is the 600X still available to reproduce this issue with 2.6.22? I gave it away to a good local cause a few months ago. I would have sent it to you, if I'd thought of it. Sorry! Though the patch in 6749 fixed the problem, as far as I could find. |