Kernel that worked: 2.6.11.4 could sleep/wake an arbitrary number of times (2.6.12.3 might be okay but I didn't use it enough to say for sure), as long as I unloaded uhci_hcd. Distribution: Debian testing Hardware Environment: TP 600X Software Environment: patched DSDT (see Bug #4989 for acpidmp, etc.) Problem Description: The system frequently hangs while putting the system to sleep (S3). I have not found a way to reproduce it reliably, even after binary searching for a set of problematic modules. But it almost never happens on the first sleep/wake cycle, sometimes not even on the second. But after enough cycles, it happens. I unload possibly problematic modules (prism54, xircom_cb, xircom_tulip_cb). It has happened with those modules unloaded and no cards in the slot; it has happened with those modules unloaded but with a Intersil wlan card (prism54) or a Xircom ethernet/modem combo card (xircom_cb, xircom_tulip_cb). I turn on ACPI debugging (I'll attach the suspend script), so I see an endless series of ACPI debugging messages on the console. Unfortunately, because sleep/wake messes up the vga adapter, it's impossible to read the messages (since the hang almost never, maybe never happens on the first sleep/wake, when I would be able to read the messages). When I get back from this trip, I have a serial console setup (to another thinkpad), and now always boot with console=, so I will then connect this laptop to the other one and hope that it crashes whbile it is being watched carefully! Steps to reproduce: Can't reliably reproduce it, though it happens frequently enough (maybe every 5th sleep).
Created attachment 5566 [details] suspend script The script logs the modules and lspci output before putting the computer to sleep, but I couldn't find any correlation between the modules loaded and whether it hangs. Sadly.
Please try revert your hard disk driver's change, and report it back.
> Please try revert your hard disk driver's change Sure. But I'm not sure which change you mean (maybe let me know a git commit id?). -Sanjoy
can you isolate when this broke between 2.6.11.4 and 2.6.13-rc6? probably good to start with vanilla 2.6.12 and divide from there.
Created attachment 5598 [details] initramfs Please compile a kernel without any driver and let attached file as your initrd. Note: the initrd doesn't touch any device. It does something like: echo "Start init ............" echo Mounting proc mount -t proc /proc /proc echo Mounting sysfs mount -t sysfs /sys /sys echo "See how many devices we have .... " cat /proc/interrupts echo "Parepare enter S3 ..... " sleep 5 echo "mem" > /sys/power/state echo "we are still living ...." sleep 5 while true; do true; done So be carefule don't make the initrd as your working kernel's initrd. I suspect some drivers are broken. With the initrd, we can do s3 without any driver, so it might help us narrow down the bug.
> probably good to start with vanilla 2.6.12 and divide from there. I just tried 2.6.12.3 and it works fine; at least, I haven't hit the bug in many suspend/resume cycles, already a lot more than needed to hit the bug with 2.6.13-rc6. I'll keep bisecting. One difficulty is that S3 wakeup broke for me between rc1-git7 and rc2-git1 (see <http://bugzilla.kernel.org/show_bug.cgi?id=4989>), so I can't do more than one sleep in that region. But to trigger the hang on sleep usually requires two or three sleep/wake cycles. Hopefully the bisection converges upon a kernel that resumes fine. Just to be safe, I'll first try -rc6 with the same (working) .config as for 2.6.12.3. I'll also try the initrd suggestion. -Sanjoy
I've been running 2.6.13-rc6 with a config as similar as possible to the working 2.6.12.3: I copied over the 2.6.12.3 config, ran 'make oldconfig' and accepted the defaults (either by hitting return a lot or feeding /dev/null as the stdin). I haven't been able to crash the S3 sleep yet in 7 or 8 cycles, with the network card plugged in (using the same suspend script that is attached). So perhaps changing the config causes a problem, or reveals a problem. Here's a unified diff between the configs (0 lines of context, with the comment and blank lines ignored, and the @@ lines taken out). The first 'file' is the working one, the second 'file' is the broken one. I'll attach diff_configs.sh in case anyone else finds it useful. diff_configs.sh 2.6.13-rc6-same-config-as-2612 2.6.13-rc6 --- 2.6.13-rc6-same-config-as-2612 2005-08-11 13:00:28.000000000 -0400 +++ 2.6.13-rc6 2005-08-11 13:00:28.000000000 -0400 +CONFIG_LOCK_KERNEL=y -CONFIG_PREEMPT_NONE=y +CONFIG_PREEMPT=y +CONFIG_PREEMPT_BKL=y +CONFIG_HAVE_DEC_LOCK=y -CONFIG_HZ_250=y -CONFIG_HZ=250 +CONFIG_HZ_1000=y +CONFIG_HZ=1000 -CONFIG_ACPI_ASUS=y -CONFIG_ACPI_TOSHIBA=m +CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m -CONFIG_USB_MON=y +CONFIG_DEBUG_PREEMPT=y I'm not a betting man, but if I had to bet, I'd guess that the preempt options have caused (or exposed) a problem. So, how to debug? Should I try these config options in 2.6.12.3 (basically enable preempts) and see whether it starts to hang? Or should I put in some debugging into 2.6.13-rc6? More and more I suspect this problem is correlated with the semaphore errors I reported in Bug #5008 and Bug #4967. Running 'stress-poll' from Bug #5008 produces semaphore errors if and only if the kernel will show those S3 hangs. For example, I haven't been able to produce the semaphore error in this kernel, which is -rc6 with the working config; but I could produce them in the -rc6 with the hanging config.
Created attachment 5602 [details] 2.6.13-rc6 config that has the S3 hangs
Created attachment 5603 [details] script to diff kernel configs Produces a human-useful diff of two kernel configs.
The semaphore hang issue is likely an EC issue. >Re: config. My test kernel enabled preempt, but it works. So it's a little strange, can you change your config slightly and see which config option causes the hang?
Created attachment 5611 [details] dmesgs during S3 sleep hang I recompiled the working -rc6 but adding low-latency preemption, which is one option in the 'make menuconfig' that sets a few others. I get the semaphore error again (Bug #5008) and it hung on S3 sleep on my first try. Also as expected, 'acpi' did NOT produce the extra <RETURN>: so Bug #5049 did not happen, though I didn't run it long enough to notice if I got doubled keystrokes just typing in emacs. The attachment is the console output across ah serial console with a bit of extra acpi debugging enabled (my suspend script enables it just in case). It just keeps looping and looping, so I eventually stopped it after capturing 2000+ lines and rebooted. Here's the sort-of-diff from the working config: --- 2.6.13-rc6-same-config-as-2612 2005-08-12 01:48:47.000000000 -0400 +++ 2.6.13-rc6-2612-with-preempt 2005-08-12 01:48:47.000000000 -0400 +CONFIG_LOCK_KERNEL=y -CONFIG_PREEMPT_NONE=y +CONFIG_PREEMPT=y +CONFIG_PREEMPT_BKL=y +CONFIG_HAVE_DEC_LOCK=y +CONFIG_DEBUG_PREEMPT=y Next I could try 'desktop system' in the make menuconfig, which may not enable so many extra config settings. I didn't want to disable/enable any by hand since they probably depend on each other and not every combination will produce a valid kernel.
Created attachment 5612 [details] debug patch Quite strange some thermal zone methods still are executed in suspend process. It might be because kacpid isn't stoped. how about try the workaround?
I'm recompiling the version that breaks but adding the debug patch and will let you know how it goes. Meanwhile I just recompiled with voluntary preempts (instead of low-latency desktop, i.e. fully preemptible). That changes just a couple config items compared to the working config: -CONFIG_PREEMPT_NONE=y +CONFIG_PREEMPT_VOLUNTARY=y I'm running the resulting kernel now and it seems to suspend/resume fine (though I haven't yet had as much testing, so problems may show up later if they are more subtle). The semaphore errors show up (Bug #5008), though it takes quite a while running 'stress-poll' to trigger them. Because of the preemption, the system is still responsive while stress-poll is running (with the no-preemption kernel, i.e. the one that for sure works, the system felt very sluggish). The voluntary-preemption kernel also doesn't have problems with 'cardctl eject' whereas the fully premptible kernel says "unable to remove socket power". Note to self: I sent a note to linux-pcmcia reporting this problem, but I should also file a bugzilla report.
Created attachment 5617 [details] dmsgs during S3 sleep hang (using workaround) S3 sleep just hung even with the workaround. Attached is what showed up on the serial console (just kept going until I finally rebooted). The temperature lines are still there, which is strange. Is my hacked DSDT at fault? (Bug#4989 has the acpidmp output.)
I spoke too soon about 2.6.12.3. The version I normally use hasn't hung, but I recompiled it with these options (got by selecting preemptible kernel in make menuconfig): CONFIG_PREEMPT=y CONFIG_HAVE_DEC_LOCK=y CONFIG_LOCK_KERNEL=y CONFIG_PREEMPT_BKL=y CONFIG_DEBUG_PREEMPT=y After a lot of use (several sleep/wake cycles, wlan card in and out), this kernel also hung with an endless loop when put to sleep. Too bad too. It had lasted for a while, so I assumed it was good when I did the bisection, and had just compiled the midpoint of 2.6.12 (mistakenly assumed good) and 2.6.13-rc6 (known bad) to test. Now I wonder whether even 2.6.11.4 w/ preemption will also hang (the regular 2.6.11.4 that I ran for a while never hung like the others are doing).
Just tried the low-latency preemption version of 2.6.13-rc6 (which often hangs on sleep) but with PREEMPT_BKL unset. It hung on the second sleep/wake cycle. So that eliminates one more option as the culprit. Below is the minimal config change (from my testing so far) that produces the hang. It's the diff between the mild preemption config, which didn't hang after lots of cycles so I'm assuming for now that it was okay, and the low-latency config with PREEMPT_BKL unset (the + is for th kernel with the hang): +CONFIG_DEBUG_PREEMPT=y +CONFIG_HAVE_DEC_LOCK=y +CONFIG_LOCK_KERNEL=y -CONFIG_PREEMPT_VOLUNTARY=y +CONFIG_PREEMPT=y For further testing, I'm not sure which of these options I can vary independently. For example, can I turn off CONFIG_HAVE_DEC_LOCK, or is its setting determined by other options? Its entry in the Kconfig is config HAVE_DEC_LOCK bool depends on (SMP || PREEMPT) && X86_CMPXCHG default y and I didn't find any menuconfig way to turn it off.
Created attachment 5650 [details] good sleep/wake, then bad one (console log) Attachment is a serial-console log showing a good sleep/wake and then a bad one. It is from a kernel compiled with preempts (but w/o preempt_bkl), and with the workaround (and a couple patches to debug other acpi problems: ec_burst mode and ibm acpi hotkey). The log supports my latest theory, that if thermal polling happens at the wrong time, then the sleep hangs. Here's the longer version. To quiet the fan, I set the THM2 trip point to 50 C and the polling frequency to 10 seconds for THM0 and THM2 (with polling disabled, the fan does not turn on and off with the new trip point, although it seems to work with the default DSDT trip point -- a matter to investigate and perhaps file a report, but another story). So every 10 seconds, THM0 methods are run; and every 10 seconds some THM2 methods are run. If those methods are run during a critical section of going to sleep (I think it's the PTS method), then trouble happens. But that can happen only with a fully preemptible kernel. With a non-preemptible kernel, or even one with only voluntary preempts, the THM0 or THM2 methods won't be run, and all is well. This theory would explain why the hang happens only once in a while. Suppose that the critical going-to-sleep section lasts 0.5 seconds. The thermal polling happens every 10 seconds, so 5% of the time it will happen during the critical section (or 10% of the time, if the two thermal zone pollings are independent). So on average the system will hang once every 20 (or 10) times. One prediction, therefore, is that lowering the polling interval will increase the chance of hanging the system. So I lowered it to 1 second for THM0 and THM2 (with a 0.3 second sleep after setting the first, so that the two do not trigger at the same time, which should maximize the chance of hitting the critical section). As expected, with this faster polling the sleep hangs more frequently. The attached serial-console log file (acpi_debug=0x1F) shows a good sleep/wake, then a bad one (tries to sleep, then infinite loops, so I reboot). Here is an egrep of it for @@@@ or Method: lines (the @@@@ lines are put there by my suspend script). The first SLEEP goes okay, and the thermal methods start to run and show up in the log. In the second SLEEP, they aren't there, perhaps because they ran at a dangerous time during PTS, hung the system, and the log never showed them. @@@@ SLEEP @@@@ Tue Aug 16 11:29:05 EDT 2005 @@@@ Linux sanjoy 2.6.13-rc6-2612-with-preempt-without-bkl+patches @@@@ BOOT_IMAGE=-preempt-bkl ro root=305 idebus=66 apm=off acpi=force pci=noacpi console=ttyS0,115200 console=tty0 acpi_sleep=s3_bios Execute Method: [\_TZ_.THM2._TMP] (Node c1576608) Execute Method: [\_TZ_.THM0._TMP] (Node c1576a08) Execute Method: [\_TZ_.THM2._TMP] (Node c1576608) Execute Method: [\_TZ_.THM0._TMP] (Node c1576a08) Execute Method: [\_TZ_.THM2._TMP] (Node c1576608) Execute Method: [\_SB_.LID0._PSW] (Node c15558c8) Execute Method: [\_SB_.SLPB._PSW] (Node c15557c8) Execute Method: [\_TZ_.THM0._TMP] (Node c1576a08) Execute Method: [\_PTS] (Node c1575548) Execute Method: [\_TZ_.THM2._TMP] (Node c1576608) Execute Method: [\_TZ_.THM0._TMP] (Node c1576a08) Execute Method: [\_TZ_.THM2._TMP] (Node c1576608) Execute Method: [\_SI_._SST] (Node c15752c8) Execute Method: [\_TZ_.THM0._TMP] (Node c1576a08) Execute Method: [\_TZ_.THM2._TMP] (Node c1576608) Execute Method: [\_TZ_.THM0._TMP] (Node c1576a08) @@@@ WAKE @@@@ Tue Aug 16 11:30:18 EDT 2005 [2nd sleep, but no wake:] @@@@ SLEEP @@@@ Tue Aug 16 11:30:42 EDT 2005 @@@@ Linux sanjoy 2.6.13-rc6-2612-with-preempt-without-bkl+patches @@@@ BOOT_IMAGE=-preempt-bkl ro root=305 idebus=66 apm=off acpi=force pci=noacpi console=ttyS0,115200 console=tty0 acpi_sleep=s3_bios Execute Method: [\_SB_.LID0._PSW] (Node c15558c8) Execute Method: [\_SB_.SLPB._PSW] (Node c15557c8) Execute Method: [\_PTS] (Node c1575548)
If I enable ACPI_LV_MUTEX and/or ACPI_LV_THREADS, would that help to pinpoint the hang?
*** Bug 4926 has been marked as a duplicate of this bug. ***
Sadly, it happened again even with turning off the thermal polling. But it has taken many days of using sleep/wake, so the bug triggers less often since I turned off thermal polling before the sleep. Perhaps some other thread jumps in and causes problems, but less often?
Has this issue ever been resolved? I am experiencing a similar phenomenon on a T60 with 2.6.15 and CONFIG_LOCK_KERNEL=y # CONFIG_PREEMPT_NONE is not set CONFIG_PREEMPT=y CONFIG_PREEMPT_BKL=y Does anyone have a reliably working config for S3 suspend on a T60 ?
Try the patch given in Bug 6749. It fixed S3 suspend on my 600X. Without the patch my T60 hangs once in a while (though a lot fewer after I started unlinking the fan module before suspending). But I haven't had the time to try that patch on the T60.
Is the 600X still available to reproduce this issue with 2.6.22?
I gave it away to a good local cause a few months ago. I would have sent it to you, if I'd thought of it. Sorry! Though the patch in 6749 fixed the problem, as far as I could find.