Bug 5037

Summary: S3 sleep hangs often [TP 600X]
Product: ACPI Reporter: Sanjoy Mahajan (sanjoy)
Component: Power-Sleep-WakeAssignee: Shaohua (shaohua.li)
Status: CLOSED PATCH_ALREADY_AVAILABLE    
Severity: normal CC: acpi-bugzilla, bunk, linux, mingo
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.13-rc6 Subsystem:
Regression: --- Bisected commit-id:
Bug Depends on:    
Bug Blocks: 5062    
Attachments: suspend script
initramfs
2.6.13-rc6 config that has the S3 hangs
script to diff kernel configs
dmesgs during S3 sleep hang
debug patch
dmsgs during S3 sleep hang (using workaround)
good sleep/wake, then bad one (console log)

Description Sanjoy Mahajan 2005-08-09 16:34:08 UTC
Kernel that worked: 2.6.11.4 could sleep/wake an arbitrary number of times
(2.6.12.3 might be okay but I didn't use it enough to say for sure), as long as
I unloaded uhci_hcd.  

Distribution: Debian testing
Hardware Environment: TP 600X
Software Environment: patched DSDT (see Bug #4989 for acpidmp, etc.)

Problem Description: The system frequently hangs while putting the system to
sleep (S3).  I have not found a way to reproduce it reliably, even after binary
searching for a set of problematic modules.  But it almost never happens on the
first sleep/wake cycle, sometimes not even on the second.   But after enough
cycles, it happens.  I unload possibly problematic modules (prism54, xircom_cb,
xircom_tulip_cb).  It has happened with those modules unloaded and no cards in
the slot; it has happened with those modules unloaded but with a Intersil wlan
card (prism54) or a Xircom ethernet/modem combo card (xircom_cb, xircom_tulip_cb).  

I turn on ACPI debugging (I'll attach the suspend script), so I see an endless
series of ACPI debugging messages on the console.  Unfortunately, because
sleep/wake messes up the vga adapter, it's impossible to read the messages
(since the hang almost never, maybe never happens on the first sleep/wake, when
I would be able to read the messages).

When I get back from this trip, I have a serial console setup (to another
thinkpad), and now always boot with console=, so I will then connect this laptop
to the other one and hope that it crashes whbile it is being watched carefully!

Steps to reproduce:  Can't reliably reproduce it, though it happens frequently
enough (maybe every 5th sleep).
Comment 1 Sanjoy Mahajan 2005-08-09 16:36:20 UTC
Created attachment 5566 [details]
suspend script

The script logs the modules and lspci output before putting the computer to
sleep, but I couldn't find any correlation between the modules loaded and
whether it hangs.  Sadly.
Comment 2 Shaohua 2005-08-09 18:45:29 UTC
Please try revert your hard disk driver's change, and report it back.
Comment 3 Sanjoy Mahajan 2005-08-09 21:19:36 UTC
> Please try revert your hard disk driver's change

Sure.  But I'm not sure which change you mean (maybe let me know a git
commit id?).

-Sanjoy

Comment 4 Len Brown 2005-08-10 18:51:04 UTC
can you isolate when this broke between 2.6.11.4 and 2.6.13-rc6? 
probably good to start with vanilla 2.6.12 and divide from there. 
 
Comment 5 Shaohua 2005-08-10 22:19:00 UTC
Created attachment 5598 [details]
initramfs

Please compile a kernel without any driver and let attached file as your
initrd.
Note: the initrd doesn't touch any device. It does something like:
      echo "Start init ............"
      echo Mounting proc
      mount -t proc /proc /proc
      echo Mounting sysfs
      mount -t sysfs /sys /sys
      echo "See how many devices we have .... "
      cat /proc/interrupts
      echo "Parepare enter S3 ..... "
      sleep 5
      echo "mem" > /sys/power/state
      echo "we are still living ...."
      sleep 5
      while true; do true; done
So be carefule don't make the initrd as your working kernel's initrd.
I suspect some drivers are broken. With the initrd, we can do s3 without any
driver, so it might help us narrow down the bug.
Comment 6 Sanjoy Mahajan 2005-08-11 07:15:45 UTC
> probably good to start with vanilla 2.6.12 and divide from there.

I just tried 2.6.12.3 and it works fine; at least, I haven't hit the
bug in many suspend/resume cycles, already a lot more than needed to
hit the bug with 2.6.13-rc6.  I'll keep bisecting.

One difficulty is that S3 wakeup broke for me between rc1-git7 and
rc2-git1 (see <http://bugzilla.kernel.org/show_bug.cgi?id=4989>), so I
can't do more than one sleep in that region.  But to trigger the hang
on sleep usually requires two or three sleep/wake cycles.  Hopefully
the bisection converges upon a kernel that resumes fine.

Just to be safe, I'll first try -rc6 with the same (working) .config
as for 2.6.12.3.

I'll also try the initrd suggestion.

-Sanjoy

Comment 7 Sanjoy Mahajan 2005-08-11 10:19:17 UTC
I've been running 2.6.13-rc6 with a config as similar as possible to the working
2.6.12.3: I copied over the 2.6.12.3 config, ran 'make oldconfig' and accepted
the defaults (either by hitting return a lot or feeding /dev/null as the stdin).

I haven't been able to crash the S3 sleep yet in 7 or 8 cycles, with the network
card plugged in (using the same suspend script that is attached).  So perhaps
changing the config causes a problem, or reveals a problem.  

Here's a unified diff between the configs (0 lines of context, with the comment
and blank lines ignored, and the @@ lines taken out).  The first 'file' is the
working one, the second 'file' is the broken one.  I'll attach diff_configs.sh
in case anyone else finds it useful.

diff_configs.sh 2.6.13-rc6-same-config-as-2612 2.6.13-rc6
--- 2.6.13-rc6-same-config-as-2612	2005-08-11 13:00:28.000000000 -0400
+++ 2.6.13-rc6	2005-08-11 13:00:28.000000000 -0400
+CONFIG_LOCK_KERNEL=y
-CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT=y
+CONFIG_PREEMPT_BKL=y
+CONFIG_HAVE_DEC_LOCK=y
-CONFIG_HZ_250=y
-CONFIG_HZ=250
+CONFIG_HZ_1000=y
+CONFIG_HZ=1000
-CONFIG_ACPI_ASUS=y
-CONFIG_ACPI_TOSHIBA=m
+CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m
-CONFIG_USB_MON=y
+CONFIG_DEBUG_PREEMPT=y

I'm not a betting man, but if I had to bet, I'd guess that the preempt options
have caused (or exposed) a problem.  So, how to debug?  Should I try these
config options in 2.6.12.3 (basically enable preempts) and see whether it starts
to hang?  Or should I put in some debugging into 2.6.13-rc6?

More and more I suspect this problem is correlated with the semaphore errors I
reported in Bug #5008 and Bug #4967.  Running 'stress-poll' from Bug #5008
produces semaphore errors if and only if the kernel will show those S3 hangs.  
For example, I haven't been able to produce the semaphore error in this kernel,
which is -rc6 with the working config; but I could produce them in the -rc6 with
the hanging config.
Comment 8 Sanjoy Mahajan 2005-08-11 10:21:55 UTC
Created attachment 5602 [details]
2.6.13-rc6 config that has the S3 hangs
Comment 9 Sanjoy Mahajan 2005-08-11 10:22:55 UTC
Created attachment 5603 [details]
script to diff kernel configs

Produces a human-useful diff of two kernel configs.
Comment 10 Shaohua 2005-08-11 18:33:31 UTC
The semaphore hang issue is likely an EC issue.
>Re: config.
My test kernel enabled preempt, but it works. So it's a little strange, can 
you change your config slightly and see which config option causes the hang?
Comment 11 Sanjoy Mahajan 2005-08-11 22:55:33 UTC
Created attachment 5611 [details]
dmesgs during S3 sleep hang

I recompiled the working -rc6 but adding low-latency preemption, which is one
option in the 'make menuconfig' that sets a few others.  I get the semaphore
error again (Bug #5008) and it hung on S3 sleep on my first try.  Also as
expected, 'acpi' did NOT produce the extra <RETURN>: so Bug #5049 did not
happen, though I didn't run it long enough to notice if I got doubled
keystrokes just typing in emacs.

The attachment is the console output across ah serial console with a bit of
extra acpi debugging enabled (my suspend script enables it just in case).  It
just keeps looping and looping, so I eventually stopped it after capturing
2000+ lines and rebooted.

Here's the sort-of-diff from the working config:

--- 2.6.13-rc6-same-config-as-2612	2005-08-12 01:48:47.000000000 -0400
+++ 2.6.13-rc6-2612-with-preempt	2005-08-12 01:48:47.000000000 -0400
+CONFIG_LOCK_KERNEL=y
-CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT=y
+CONFIG_PREEMPT_BKL=y
+CONFIG_HAVE_DEC_LOCK=y
+CONFIG_DEBUG_PREEMPT=y

Next I could try 'desktop system' in the make menuconfig, which may not enable
so many extra config settings.	I didn't want to disable/enable any by hand
since they probably depend on each other and not every combination will produce
a valid kernel.
Comment 12 Shaohua 2005-08-11 23:31:36 UTC
Created attachment 5612 [details]
debug patch

Quite strange some thermal zone methods still are executed in suspend process.
It might be because kacpid isn't stoped. how about try the workaround?
Comment 13 Sanjoy Mahajan 2005-08-12 00:15:00 UTC
I'm recompiling the version that breaks but adding the debug patch and will let
you know how it goes.  Meanwhile I just recompiled with voluntary preempts
(instead of low-latency desktop, i.e. fully preemptible).  That changes just a
couple config items compared to the working config:

-CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_VOLUNTARY=y

I'm running the resulting kernel now and it seems to suspend/resume fine (though
I haven't yet had as much testing, so problems may show up later if they are
more subtle).  

The semaphore errors show up (Bug #5008), though it takes quite a while running
'stress-poll' to trigger them.  Because of the preemption, the system is still
responsive while stress-poll is running (with the no-preemption kernel, i.e. the
one that for sure works, the system felt very sluggish).

The voluntary-preemption kernel also doesn't have problems with 'cardctl eject'
whereas the fully premptible kernel says "unable to remove socket power".  Note
to self: I sent a note to linux-pcmcia reporting this problem, but I should also
file a bugzilla report.
Comment 14 Sanjoy Mahajan 2005-08-12 07:26:57 UTC
Created attachment 5617 [details]
dmsgs during S3 sleep hang (using workaround)

S3 sleep just hung even with the workaround.  Attached is what showed up on the
serial console (just kept going until I finally rebooted).  The temperature
lines are still there, which is strange.  Is my hacked DSDT at fault?  (Bug#4989 has the acpidmp output.)
Comment 15 Sanjoy Mahajan 2005-08-12 21:00:57 UTC
I spoke too soon about 2.6.12.3.  The version I normally use hasn't hung, but I
recompiled it with these options (got by selecting preemptible kernel in make
menuconfig):

CONFIG_PREEMPT=y
CONFIG_HAVE_DEC_LOCK=y
CONFIG_LOCK_KERNEL=y
CONFIG_PREEMPT_BKL=y
CONFIG_DEBUG_PREEMPT=y

After a lot of use (several sleep/wake cycles, wlan card in and out), this
kernel also hung with an endless loop when put to sleep.  

Too bad too.  It had lasted for a while, so I assumed it was good when I did the
bisection, and had just compiled the midpoint of 2.6.12 (mistakenly assumed
good) and 2.6.13-rc6 (known bad) to test.  Now I wonder whether even 2.6.11.4 w/
preemption will also hang (the regular 2.6.11.4 that I ran for a while never
hung like the others are doing).
Comment 16 Sanjoy Mahajan 2005-08-15 15:10:27 UTC
Just tried the low-latency preemption version of 2.6.13-rc6 (which often hangs
on sleep) but with PREEMPT_BKL unset.  It hung on the second sleep/wake cycle. 
So that eliminates one more option as the culprit.

Below is the minimal config change (from my testing so far) that produces the
hang.  It's the diff between the mild preemption config, which didn't hang after
lots of cycles so I'm assuming for now that it was okay, and the low-latency
config with PREEMPT_BKL unset (the + is for th kernel with the hang):

+CONFIG_DEBUG_PREEMPT=y
+CONFIG_HAVE_DEC_LOCK=y
+CONFIG_LOCK_KERNEL=y
-CONFIG_PREEMPT_VOLUNTARY=y
+CONFIG_PREEMPT=y

For further testing, I'm not sure which of these options I can vary
independently.  For example, can I turn off CONFIG_HAVE_DEC_LOCK, or is its
setting determined by other options? Its entry in the Kconfig is

config HAVE_DEC_LOCK
	bool
	depends on (SMP || PREEMPT) && X86_CMPXCHG
	default y

and I didn't find any menuconfig way to turn it off.
Comment 17 Sanjoy Mahajan 2005-08-16 09:21:51 UTC
Created attachment 5650 [details]
good sleep/wake, then bad one (console log)

Attachment is a serial-console log showing a good sleep/wake and then a bad
one.  It is from a kernel compiled with preempts (but w/o preempt_bkl), and
with the workaround (and a couple patches to debug other acpi problems:
ec_burst mode and ibm acpi hotkey).

The log supports my latest theory, that if thermal polling happens at the wrong
time, then the sleep hangs.  Here's the longer version.  

To quiet the fan, I set the THM2 trip point to 50 C and the polling frequency
to 10 seconds for THM0 and THM2 (with polling disabled, the fan does not turn
on and off with the new trip point, although it seems to work with the default
DSDT trip point -- a matter to investigate and perhaps file a report, but
another story).  

So every 10 seconds, THM0 methods are run; and every 10 seconds some THM2
methods are run.  If those methods are run during a critical section of going
to sleep (I think it's the PTS method), then trouble happens.  But that can
happen only with a fully preemptible kernel.  With a non-preemptible kernel, or
even one with only voluntary preempts, the THM0 or THM2 methods won't be run,
and all is well.

This theory would explain why the hang happens only once in a while.  Suppose
that the critical going-to-sleep section lasts 0.5 seconds.  The thermal
polling happens every 10 seconds, so 5% of the time it will happen during the
critical section (or 10% of the time, if the two thermal zone pollings are
independent).  So on average the system will hang once every 20 (or 10) times.

One prediction, therefore, is that lowering the polling interval will increase
the chance of hanging the system.  So I lowered it to 1 second for THM0 and
THM2 (with a 0.3 second sleep after setting the first, so that the two do not
trigger at the same time, which should maximize the chance of hitting the
critical section).  As expected, with this faster polling the sleep hangs more
frequently.

The attached serial-console log file (acpi_debug=0x1F) shows a good sleep/wake,
then a bad one (tries to sleep, then infinite loops, so I reboot).  Here is an
egrep of it for @@@@ or Method: lines (the @@@@ lines are put there by my
suspend script).  The first SLEEP goes okay, and the thermal methods start to
run and show up in the log.  In the second SLEEP, they aren't there, perhaps
because they ran at a dangerous time during PTS, hung the system, and the log
never showed them.

@@@@ SLEEP
@@@@ Tue Aug 16 11:29:05 EDT 2005
@@@@ Linux sanjoy 2.6.13-rc6-2612-with-preempt-without-bkl+patches
@@@@ BOOT_IMAGE=-preempt-bkl ro root=305 idebus=66 apm=off acpi=force
     pci=noacpi console=ttyS0,115200 console=tty0 acpi_sleep=s3_bios
Execute Method: [\_TZ_.THM2._TMP] (Node c1576608)
Execute Method: [\_TZ_.THM0._TMP] (Node c1576a08)
Execute Method: [\_TZ_.THM2._TMP] (Node c1576608)
Execute Method: [\_TZ_.THM0._TMP] (Node c1576a08)
Execute Method: [\_TZ_.THM2._TMP] (Node c1576608)
Execute Method: [\_SB_.LID0._PSW] (Node c15558c8)
Execute Method: [\_SB_.SLPB._PSW] (Node c15557c8)
Execute Method: [\_TZ_.THM0._TMP] (Node c1576a08)
Execute Method: [\_PTS] (Node c1575548)
Execute Method: [\_TZ_.THM2._TMP] (Node c1576608)
Execute Method: [\_TZ_.THM0._TMP] (Node c1576a08)
Execute Method: [\_TZ_.THM2._TMP] (Node c1576608)
Execute Method: [\_SI_._SST] (Node c15752c8)
Execute Method: [\_TZ_.THM0._TMP] (Node c1576a08)
Execute Method: [\_TZ_.THM2._TMP] (Node c1576608)
Execute Method: [\_TZ_.THM0._TMP] (Node c1576a08)
@@@@ WAKE
@@@@ Tue Aug 16 11:30:18 EDT 2005

[2nd sleep, but no wake:]
@@@@ SLEEP
@@@@ Tue Aug 16 11:30:42 EDT 2005
@@@@ Linux sanjoy 2.6.13-rc6-2612-with-preempt-without-bkl+patches
@@@@ BOOT_IMAGE=-preempt-bkl ro root=305 idebus=66 apm=off acpi=force
     pci=noacpi console=ttyS0,115200 console=tty0 acpi_sleep=s3_bios
Execute Method: [\_SB_.LID0._PSW] (Node c15558c8)
Execute Method: [\_SB_.SLPB._PSW] (Node c15557c8)
Execute Method: [\_PTS] (Node c1575548)
Comment 18 Sanjoy Mahajan 2005-08-17 20:37:05 UTC
If I enable ACPI_LV_MUTEX and/or ACPI_LV_THREADS, would that help to pinpoint
the hang?
Comment 19 Shaohua 2005-08-17 23:23:05 UTC
*** Bug 4926 has been marked as a duplicate of this bug. ***
Comment 20 Sanjoy Mahajan 2005-08-21 19:55:11 UTC
Sadly, it happened again even with turning off the thermal polling.  But it has
taken many days of using sleep/wake, so the bug triggers less often since I
turned off thermal polling before the sleep.  Perhaps some other thread jumps in
and causes problems, but less often?
Comment 21 Kasper Peeters 2006-11-01 08:32:23 UTC
Has this issue ever been resolved? I am experiencing a similar phenomenon on a
T60 with 2.6.15 and

 CONFIG_LOCK_KERNEL=y
 # CONFIG_PREEMPT_NONE is not set
 CONFIG_PREEMPT=y
 CONFIG_PREEMPT_BKL=y

Does anyone have a reliably working config for S3 suspend on a T60 ?
Comment 22 Sanjoy Mahajan 2006-11-01 08:39:09 UTC
Try the patch given in Bug 6749.  It fixed S3 suspend on my 600X.
Without the patch my T60 hangs once in a while (though a lot fewer
after I started unlinking the fan module before suspending).  But I
haven't had the time to try that patch on the T60.

Comment 23 Len Brown 2007-08-18 20:51:11 UTC
Is the 600X still available to reproduce this issue with 2.6.22?
Comment 24 Sanjoy Mahajan 2007-08-19 19:42:53 UTC
I gave it away to a good local cause a few months ago.  I would have
sent it to you, if I'd thought of it.  Sorry!  Though the patch in
6749 fixed the problem, as far as I could find.