Most recent kernel where this bug did not occur: 2.6.15 Distribution: Debian Hardware Environment: TP 600X, bios 1.11, modified DSDT so that S3 worked with earlier kernels (e.g. 2.6.15). Software Environment: Problem Description: The first sleep and wake goes fine. The second sleep hangs at the 'Stopping tasks'. With tons of debugging turned out, the second sleep does not hang, but the wakeup hangs. With 0x1F as the acpi debug_level, the second sleep still hangs and I got some useful output across a serial console (attached). In the second sleep (after the second 'Stopping tasks'), you can see it repeating a sequence of a few exregion lines. Steps to reproduce: Run the sleep.sh script (attached), wake it up, then run it again.
Created attachment 7203 [details] log file showing 1st (ok) and 2nd (failed) sleep attempts
Created attachment 7204 [details] S3 sleep script
I'm using 2.6.16-rc1 patched with acpi-release-20060113-2.6.16-rc1.diff With vanilla 2.6.15 (i.e. no extra acpi release patch), sleeping never hangs. Once in a while (maybe 1 in 30 times) wakeup hangs, but I haven't had a serial console handy then (on the subway) to check why.
Just tried vanilla 2.6.16-rc2 (with no extra acpi patches) and it has the same problem, although the exregion-NNNN numbers are slightly different compared to -rc1. I updated the kernel-version header in the bug report.
Looks like maybe the _PTS method is in an infinite loop waiting for the hardware to respond. Please post the DSDT and/or acpidump for the machine
Created attachment 7261 [details] acpidmp output Here is the acpidmp output. The DSDT is modified relative to the vanilla DSDT from the BIOS, in order to get S3 to work at all (plus a few changes to the temperature code to fix another DSDT bug, etc.). I'll also attach the original DSDT in the next attachment.
Created attachment 7262 [details] original DSDT from BIOS 1.11
You are stuck in the loop in the method below. I would guess that the code is waiting for a response from the SMI bios and it never happens. Method (SMPI, 1, NotSerialized) { Store (S_AX, Local0) Store (0x81, APMD) While (LEqual (S_AH, 0xA6)) { Sleep (0x64) Store (Local0, S_AX) Store (0x81, APMD) } } OperationRegion (MNVS, SystemMemory, 0x23FDF000, 0x1000) Field (MNVS, DWordAcc, NoLock, Preserve) { Offset (0xFC0), S_AX, 16, OperationRegion (APMC, SystemIO, 0xB2, 0x01) Field (APMC, ByteAcc, NoLock, Preserve) { APMD, 8 Store (Local0, S_AX) exregion-0182 [29] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0 Store (0x81, APMD) exregion-0287 [30] ex_system_io_space_han: system_iO 1 (8 width) Address=00000000000000B2
Although perhaps I gave you slightly inaccurate information. I ran the acpidmp on 2.6.15.1, where S3 mostly works so I use that kernel except when debugging ACPI. But the exregion-NNNN offsets are with 2.6.16-rc1 or -rc2 (slightly different numbers in each case). Do you need an acpidmp when running 2.6.16-rc2, even though the DSDT is the same as for 2.6.15.1? > You are stuck in the loop in the method below. I would guess that > the code is waiting for a response from the SMI bios and it never > happens. What could prevent 2.6.15[.1] from getting stuck there?
It looks to me like the ACPI/AML interpreter is doing what it was told to do. I am not an expert on sleep issues, so I can't help further.
Can you try to locate change that breaks it with git bisect?
Good idea. I've just started: Bisecting: 1675 revisions left to test after this
'git bisect' narrowed the problematic commit to: commit 292dd876ee765c478b27c93cc51e93a558ed58bf Merge: d4ec6c7cc9a15a7a529719bc3b84f46812f9842e 9fdb62af92c741addbea15545f214a6e89460865 Author: Len Brown <len.brown@intel.com> Date: Fri Jan 27 17:18:29 2006 -0500 Pull release into acpica branch This commit had a slightly different bug from all the other ones I marked as bad. This one hung on the first S3 sleep, with the same endless loop pattern as the other bad ones (which hang on the second S3 sleep). I'll attach the bisect log in case it's useful.
Created attachment 7304 [details] git bisect log
Please try pci=noacpi to test.
I use pci=noacpi with every kernel (it's in the lilo.conf append= line). It's been there so long that I forget why I put it there, but I think it was needed for S3 sleep to work at all.
Here is the commit (I'll attach the diff) that makes the second S3 sleep go into the endless loop when the loaded modules are exactly thermal, processor, intel_agp, and agpgart: 53f11d4ff8797bcceaf014e62bd39f16ce84baec is first bad commit diff-tree 53f11d4ff8797bcceaf014e62bd39f16ce84baec (from 02b28a33aae93a3b53068e 0858d62f8bcaef60a3) Author: Len Brown <len.brown@intel.com> Date: Mon Dec 5 16:46:36 2005 -0500 [ACPI] Enable Embedded Controller (EC) interrupt mode by default "ec_intr=0" reverts to polling "ec_burst=" no longer exists. Signed-off-by: Len Brown <len.brown@intel.com> Acked-by: Luming Yu <luming.yu@intel.com> :040000 040000 9eec66712c68ebe372b2fb2c8d78bdc99df942ab e7e62cd09983730aee468ed d4ba1cce50786b7e5 M Documentation :040000 040000 6e7db46918f6124f64a11f6757560078a8a27519 aa8abb1023024902300cb2e 7a5bf74acd8c579e8 M drivers If I boot with ec_intr=0, the second sleep works fine. Here is the full story. First I tried a system with the minimal set of modules to boot and run X (S3 sleep-wake wrecks the VGA consoles, but X restores fine with 'chvt 1; sleep 0.5 ; chvt 7' on wakeup). So I stopped every service and unloaded all modules possible, which left only intel_agp and agpgart. Then, for each of the usual loaded modules (except for sound modules, which often has sleep-wake problems anyway), I tried: 1. Load the module 2. S3 sleep-wake-sleep-wake 3. Unload the module to see whether the second sleep went into the infinite loop (visible across the console). The one culprit I found was 'thermal' (which brings in 'processor'). Other modules didn't trigger the problem. Then I recompiled using a minimal config, with only networking (for X to work) and the acpi modules, and maybe a few others that I couldn't avoid, and retested to make sure 'thermal' still triggered the problem, which it did. I used this config, or one just like it for 2.6.15, as a base for all other kernels in the bisection search, using the nearest ancestor's .config and then yes '' | make oldconfig to make the new .config The attached test_s3.sh script gives the steps for testing the thermal module. Eventually bisection converged to the commit above, and then I retested that kernel with 'ec_intr=0'. Is this a problem with the TP 600X hardware, in which case I'll just use ec_intr=0 forever, or is there more software debugging (DSDT or related to the diff)? I can turn on gobs of ACPI debugging and send the useful parts of the log file.
Created attachment 7326 [details] bash script for two S3 sleep-wake cycles testing thermal module
Created attachment 7327 [details] diff of the problematic commit: git diff 53f11d4f^..53f11d4f
what could happpen if you unload thermal, processor before suspend (ec_intr=1 kernel)?
With only intel_agp, agpgart loaded (so no processor or thermal or almost anything!), it suspends and wakes fine. That's a very minimal configuration, too minimal to be useful. So I'll also try it with the usual set of modules but with thermal and processor unloaded (I already unload the sound modules and prism54 or else they wake up in an unhappy state).
With a small set of modules loaded (thermal, processor, fan, button, ac, battery, intel_agp, agpgart), the sleeps are fine if thermal+processor are unloaded beforehand (ec_intr=1). If instead fan is unloaded before sleep, then the second sleep hangs in the usual endless loop.
cat you attach /proc/interrupts before 2nd suspend hang?
>modified DSDT so that S3 worked with earlier kernels (e.g. 2.6.15). Also please try un-modified DSDT.
I just tried the 'first failing commit' (02b28a33) with the vanilla (BIOS 1.11) DSDT. And not only did S3 sleep happen, but the thermal+processor bug didn't show up: i.e. it did two (actually) many S3 sleep-wake cycles. So the problem is due to something in the modified DSDT, which is either a problem itself or it exposes another problem. [Picture of me hanging head in shame for putting people to the trouble.] My only excuse is that long ago (2.6.11), the machine wouldn't S3 sleep at all with the vanilla DSDT, and I patched it as recommended by the ACPI experts. However, the current ACPI interpreter seems to handle the 600X's mangy DSDT. Now I agree 100% with Len Brown that it's best to have ACPI handle marginal DSDT's, not hack the DSDT.
I'm now running/testing an unmodified DSDT with 2.6.16-rc5. For a while I had no S3 hangs, but I just noticed them again. The error is the same as with the modified DSDT (with slightly different offsets): exregion-0185 [36] ex_system_memory_space: system_memory 0 (32 width) Address=0 000000023FDFFC0 exregion-0185 [36] ex_system_memory_space: system_memory 1 (32 width) Address=0 000000023FDFFC0 exregion-0290 [36] ex_system_io_space_han: system_iO 1 (8 width) Address=000000 00000000B2 repeated endlessly. I think the problem resurfaced once I decided to let my sleep.sh script leave the thermal driver loaded before going into S3 (suspecting that the bug might come back if I did that). So I susect that my modified DSDT didn't cause the S3 problems, it merely exposed one even in the minimal configuration. I'll attach the vanilla DSDT from /proc/acpi/dsdt
Created attachment 7554 [details] vanilla DSDT from /proc/acpi/dsdt This is the vanilla DSDT from BIOS 1.11 (the latest BIOS) w/ 2.6.16-rc5. This boot had ec_intr=0 (I was checking that the S3 hangs go away, and they do). If that addition to the kernel command line can change the DSDT, let me know and I'll repost after rebooting without ec_intr=0.
Hmmm, Then, thermal issue issue ? acpi trace log may help.
By "acpi trace log", do you mean using one of the acpi_debug_level options?
yes, please dig out the useful info from acpi trace log.
> yes, please dig out the useful info from acpi trace log. /proc/cmdline: auto BOOT_IMAGE=16-rc5 ro root=305 idebus=66 apm=off acpi=force pci=noacpi console=ttyS0,115200 console=tty0 acpi_sleep=s3_bios cpufreq.debug=7 acpi_debug=0xffffffff where 16-rc5 is my LILO label for 2.6.16-rc5 The extract from the dmesgs: Stopping tasks: ========================================================| Execute Method: [\_SB_.LID0._PSW] (Node c1564808) exregion-0185 [35] ex_system_memory_space: system_memory 0 (32 width) Address=0 000000023FDFF40 acpi_ec-0458 [37] ec_intr_read : Read [02] from address [32] acpi_ec-0508 [37] ec_intr_write : Wrote [06] to address [32] Execute Method: [\_SB_.SLPB._PSW] (Node c1564708) Execute Method: [\_S3_] (Node c157a988) exregion-0185 [35] ex_system_memory_space: system_memory 0 (32 width) Address=0 000000023FDFF40 Execute Method: [\_PTS] (Node c157ab48) exregion-0185 [35] ex_system_memory_space: system_memory 0 (32 width) Address=0 000000023FDFF40 exregion-0185 [35] ex_system_memory_space: system_memory 0 (32 width) Address=0 000000023FDFF40 exregion-0185 [35] ex_system_memory_space: system_memory 0 (32 width) Address=0 000000023FDFF40 exregion-0185 [36] ex_system_memory_space: system_memory 0 (32 width) Address=0 000000023FDFFC0 exregion-0185 [36] ex_system_memory_space: system_memory 1 (32 width) Address=0 000000023FDFFC0 exregion-0185 [36] ex_system_memory_space: system_memory 0 (32 width) Address=0 000000023FDFFC4 exregion-0185 [36] ex_system_memory_space: system_memory 1 (32 width) Address=0 000000023FDFFC4 exregion-0185 [35] ex_system_memory_space: system_memory 0 (32 width) Address=0 000000023FDFFC0 exregion-0290 [36] ex_system_io_space_han: system_iO 1 (8 width) Address=000000 00000000B2 exregion-0185 [35] ex_system_memory_space: system_memory 0 (32 width) Address=0 000000023FDFFC0 exregion-0185 [36] ex_system_memory_space: system_memory 0 (32 width) Address=0 000000023FDFFC0 exregion-0185 [36] ex_system_memory_space: system_memory 1 (32 width) Address=0 000000023FDFFC0 exregion-0290 [36] ex_system_io_space_han: system_iO 1 (8 width) Address=000000 00000000B2 And then these above four lines (exregion-0185, -0185, -0185, -0290) repeat until I reboot.
Created attachment 7566 [details] full dmesgs with 2.6.16-rc5, vanilla DSDT Here are the full dmesgs from boot through the second S3 sleep (which is the one that fails). Search for the second "@@@@ SLEEP" (each one is sent across the serial console by the sleep.sh script). The sleep.sh script now turns off thermal polling before the sleep, by echoing '0' to the polling_frequency /proc files.
Created attachment 7570 [details] full dmesgs with boot-time debuggings set to 0x10 Here are the dmesgs. I booted with acpi_dbg_level=0x10 acpi_dbg_layer=0x10 and then did two sleep-wake cycles with no thermal module (both went fine), then one cycle with the thermal module loaded (went fine), and then the usual failing second sleep with the thermal module still loaded. The sleep-wake cycles were done with acpi_debug_level=0x1F (rather than the 0x10 boot value). I put ### marks where I've added comments to the dmesgs.
Created attachment 7655 [details] dmesgs with debugging patch to kernel and DSDT Here is a tar.gz file with the patch showing the changes relative to a vanilla kernel and to a DSDT with only THM0. I booted with debug_{layer,level}=0x10 as you suggested. The file names within the archive explain what each log file is.
Hmm, the debug object info was eliminated, I'm sorry, try acpi_dbg_level=0x1F
> Hmm, the debug object info was eliminated, > I'm sorry, try acpi_dbg_level=0x1F I did that with dbg_layer still 0x10. This time it hung on the first sleep, in the usual spot: Stopping tasks: ====================================================| Execute Method: [\_SB_.LID0._PSW] (Node c1564808) Execute Method: [\_SB_.SLPB._PSW] (Node c1564708) Execute Method: [\_S3_] (Node c157a988) Execute Method: [\_PTS] (Node c157ab48)
Did you see any debug info before stopping tasks............ I want to track the calls to I2RB, and understand if CHKS return different. Actually, this problem is pretty wired. We need to know if last I2RB works right before hang. Thanks, Luming
> Did you see any debug info before stopping tasks............ No, that was all there was. That makes sense though. The debug info is only in the I2RC (copy of I2RB), which is called only from THM0._TMP. With thermal polling off, there shouldn't be any calls to I2RC. Should I get rid of the UPDT and I2RB copies, and just modify I2RB itself to have the Debug objects? > I want to track the calls to I2RB, and understand if CHKS return > different. Following the depth-first strategy, let's make sure that the debug info is being printed as you want, then use it to investigate the hangs. Perhaps I'm not using the right debug_layer/level settings? 'acpi -t' will query the THM0 temperature and won't cause a hang, so I could use it to test various debug_level settings and see which ones produce the object you're looking for. Once that's settled, then I can use it on the sleep-wake cycles. > Actually, this problem is pretty wired. We need to know if last I2RB > works right before hang. I don't think I2RB (or its copy) is being called during the 2nd sleep.
>No, that was all there was. That makes sense though. The debug info >is only in the I2RC (copy of I2RB), which is called only from >THM0._TMP. With thermal polling off, there shouldn't be any calls to >I2RC. I assume the last call to THM0._TMP, in turn last call to UPDT, in turn last call to I2RB(c) trigger the later 2nd suspend hang. So, I need to understand the call trace of those last calls. >Should I get rid of the UPDT and I2RB copies, and just modify I2RB >itself to have the Debug objects? NO. >so Icould use it to test various debug_level settings and see which ones >produce the object you're looking for. Once that's settled, then I >can use it on the sleep-wake cycles. Yes
> I assume the last call to THM0._TMP, in turn last call to UPDT, in > turn last call to I2RB(c) trigger the later 2nd suspend hang. Oh, and it's called in the first wakeup, which must be the culprit: Restarting tasks...<7>hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000 done Execute Method: [\_SI_._SST] (Node e3f8a848) Execute Method: [\_WAK] (Node e3f8aa48) Execute Method: [\_TZ_.THM0._PSV] (Node e3f8bdc8) Execute Method: [\_TZ_.THM0._TC1] (Node e3f8bd48) Execute Method: [\_TZ_.THM0._TC2] (Node e3f8bd08) Execute Method: [\_TZ_.THM0._TSP] (Node e3f8bcc8) Execute Method: [\_TZ_.THM0._AC0] (Node e3f8bec8) Execute Method: [\_SI_._SST] (Node e3f8a848) Execute Method: [\_TZ_.THM0._TMP] (Node e3f8bf08) uhci_hcd 0000:00:07.2: suspend_rh (auto-stop) Execute Method: [\_SB_.LID0._PSW] (Node c1574808) Execute Method: [\_SB_.SLPB._PSW] (Node c1574708) >> so Icould use acpi -t to test various debug_level settings and see >> which ones produce the object you're looking for. Once that's >> settled, then I can use it on the sleep-wake cycles. > Yes What settings should I try? And what should I look for in the dmesgs (what does this debug object look like)?
>What settings should I try? And what should I look for in the dmesgs >(what does this debug object look like)? try boot kernel without any debug selection option. cat /proc/acpi/debug_level, and cat /proc/acpi/debug_layer you can see the supported debug option.
> cat /proc/acpi/debug_level, and cat /proc/acpi/debug_layer I've done that many times but the several names are ambiguous to me. For example, there's INFO and VERBOSE_INFO; or FUNCTIONS and NAMES and INIT_NAMES. I also don't understand why DEBUG_OBJECT didn't work before to print the Debug variable (I'd tried debug_level=0x1F). Maybe I need more debug_layers. But in the debug_layer what's the difference between the EXECUTER and the DISPATCHER? Description Hex SET ACPI_LV_ERROR 0x00000001 [ ] ACPI_LV_WARN 0x00000002 [ ] ACPI_LV_INIT 0x00000004 [ ] ACPI_LV_DEBUG_OBJECT 0x00000008 [ ] ACPI_LV_INFO 0x00000010 [*] ACPI_LV_INIT_NAMES 0x00000020 [ ] ACPI_LV_PARSE 0x00000040 [ ] ACPI_LV_LOAD 0x00000080 [ ] ACPI_LV_DISPATCH 0x00000100 [ ] ACPI_LV_EXEC 0x00000200 [ ] ACPI_LV_NAMES 0x00000400 [ ] ACPI_LV_OPREGION 0x00000800 [ ] ACPI_LV_BFIELD 0x00001000 [ ] ACPI_LV_TABLES 0x00002000 [ ] ACPI_LV_VALUES 0x00004000 [ ] ACPI_LV_OBJECTS 0x00008000 [ ] ACPI_LV_RESOURCES 0x00010000 [ ] ACPI_LV_USER_REQUESTS 0x00020000 [ ] ACPI_LV_PACKAGE 0x00040000 [ ] ACPI_LV_ALLOCATIONS 0x00100000 [ ] ACPI_LV_FUNCTIONS 0x00200000 [ ] ACPI_LV_OPTIMIZATIONS 0x00400000 [ ] ACPI_LV_MUTEX 0x01000000 [ ] ACPI_LV_THREADS 0x02000000 [ ] ACPI_LV_IO 0x04000000 [ ] ACPI_LV_INTERRUPTS 0x08000000 [ ] ACPI_LV_AML_DISASSEMBLE 0x10000000 [ ] ACPI_LV_VERBOSE_INFO 0x20000000 [ ] ACPI_LV_FULL_TABLES 0x40000000 [ ] ACPI_LV_EVENTS 0x80000000 [ ] -- debug_level = 0x00000010 (* = enabled) Description Hex SET ACPI_UTILITIES 0x00000001 [ ] ACPI_HARDWARE 0x00000002 [ ] ACPI_EVENTS 0x00000004 [ ] ACPI_TABLES 0x00000008 [ ] ACPI_NAMESPACE 0x00000010 [*] ACPI_PARSER 0x00000020 [ ] ACPI_DISPATCHER 0x00000040 [ ] ACPI_EXECUTER 0x00000080 [ ] ACPI_RESOURCES 0x00000100 [ ] ACPI_CA_DEBUGGER 0x00000200 [ ] ACPI_OS_SERVICES 0x00000400 [ ] ACPI_CA_DISASSEMBLER 0x00000800 [ ] ACPI_COMPILER 0x00001000 [ ] ACPI_TOOLS 0x00002000 [ ] ACPI_ALL_DRIVERS 0xFFFF0000 [ ] -- debug_layer = 0x00000010 (* = enabled, - = partial)
Created attachment 7656 [details] commented dmesgs: boot, ok sleep-wake, then sleep hang Here are commented dmesgs (lines beginning with #) for a failing 2nd sleep, with debugging info! I booted with debug_layer=0x10, debug_level=0x1F, and then used 'acpi -t' with different values until I settled on debug_layer=0x90, debug_level=0x1F, as printing the [ACPI Debug] lines without too much extra junk. Probably it could be still cleaner. The first chunk is the boot messages plus me messing around with acpi -t to get the right debug values; then the first sleep-wake cycle (which worked); and then the second sleep (which hung). There's one strange spot I can't figure out: in the wakeup msgs, the [ACPI Debug] comes as part of the _SST and not as part of the THM0._TMP (I marked it in the dmesgs).
># why does the Debug line show up here, and not right after the THM0._TMP >line? >[ACPI Debug] Integer: 0x00000000 Good question! Could you try option: acpi_serialize
> Could you try option: acpi_serialize Here's the kernel command line: BOOT_IMAGE=904b0f361ebf ro root=305 idebus=66 apm=off acpi=force pci=noacpi console=ttyS0,115200 console=tty0 acpi_sleep=s3_bios cpufreq.debug=7 acpi_dbg_level=0x10 acpi_dbg_layer=0x10 acpi_dbg_layer=0x90 acpi_dbg_level=0x1F acpi_serialize Here are the dmesgs from the first cycle, which shows the Debug still showing late: eth0: removing device Unloaded prism54 driver PM: Preparing system for mem sleep Stopping tasks: =====================================================| Execute Method: [\_SB_.LID0._PSW] (Node c1574808) exregion-0185 [20] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 Execute Method: [\_SB_.SLPB._PSW] (Node c1574708) Execute Method: [\_S3_] (Node e3f8a908) exregion-0185 [20] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 Execute Method: [\_PTS] (Node e3f8aac8) exregion-0185 [20] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 exregion-0185 [20] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 exregion-0185 [20] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 exregion-0185 [21] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0 exregion-0185 [21] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFFC0 exregion-0185 [21] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC4 exregion-0185 [21] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFFC4 exregion-0185 [20] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0 exregion-0290 [21] ex_system_io_space_han: system_iO 1 (8 width) Address=00000000000000B2 exregion-0185 [20] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0 exregion-0290 [21] ex_system_io_space_han: system_iO 0 (16 width) Address=000000000000EF20 exregion-0290 [21] ex_system_io_space_han: system_iO 1 (16 width) Address=000000000000EF20 exregion-0290 [20] ex_system_io_space_han: system_iO 0 (8 width) Address=000000000000EF37 exregion-0185 [20] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 exregion-0185 [20] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 exregion-0185 [20] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 Execute Method: [\_SI_._SST] (Node e3f8a848) exregion-0185 [20] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 exregion-0185 [20] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 exregion-0185 [20] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 uhci_hcd 0000:00:07.2: suspend_rh uhci_hcd 0000:00:07.2: uhci_suspend uhci_hcd 0000:00:07.2: --> PCI D0/legacy PM: Entering mem sleep Intel machine check architecture supported. Intel machine check reporting enabled on CPU#0. Back to C! PM: Finishing wakeup. Execute Method: [\_GPE._L0B] (Node e3f8a7c8) PCI: Found IRQ 11 for device 0000:00:02.0 PCI: Sharing IRQ 11 with 0000:00:06.0 PCI: Sharing IRQ 11 with 0000:01:00.0 PCI: Found IRQ 11 for device 0000:00:02.1 uhci_hcd 0000:00:07.2: PCI legacy resume PCI: Found IRQ 11 for device 0000:00:07.2 uhci_hcd 0000:00:07.2: uhci_resume uhci_hcd 0000:00:07.2: uhci_check_and_reset_hc: legsup = 0x2000 uhci_hcd 0000:00:07.2: Performing full reset usb usb1: root hub lost power or was reset uhci_hcd 0000:00:07.2: suspend_rh usb usb1: finish resume uhci_hcd 0000:00:07.2: wakeup_rh Restarting tasks...<7>hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000 done Execute Method: [\_SI_._SST] (Node e3f8a848) exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 Execute Method: [\_WAK] (Node e3f8aa48) exregion-0185 [23] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 exregion-0185 [23] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFF40 exregion-0185 [23] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 exregion-0185 [23] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFF40 exregion-0185 [23] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 exregion-0185 [23] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFF40 exregion-0290 [23] ex_system_io_space_han: system_iO 0 (16 width) Address=000000000000EF20 exregion-0290 [23] ex_system_io_space_han: system_iO 1 (16 width) Address=000000000000EF20 exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 exregion-0185 [23] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0 exregion-0185 <7>uhci_hcd 0000:00:07.2: suspend_rh (auto-stop) [23] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFFC0 exregion-0185 [23] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC4 exregion-0185 [23] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFFC4 exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0 exregion-0290 [23] ex_system_io_space_han: system_iO 1 (8 width) Address=00000000000000B2 exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0 exregion-0290 [25] ex_system_io_space_han: system_iO 1 (8 width) Address=00000000000015EE exregion-0290 [24] ex_system_io_space_han: system_iO 0 (8 width) Address=00000000000015EF exregion-0185 [23] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF4C exregion-0185 [23] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFF4C exregion-0185 [23] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF4C exregion-0185 [23] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFF4C exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 exregion-0185 [23] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC8 exregion-0185 [23] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFFC8 exregion-0185 [23] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0 exregion-0185 [23] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFFC0 exregion-0185 [23] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC4 exregion-0185 [23] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFFC4 exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0 exregion-0290 [23] ex_system_io_space_han: system_iO 1 (8 width) Address=00000000000000B2 exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0 exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 exregion-0365 [22] ex_pci_config_space_ha: pci_config 0 (8) Seg(0000) Bus(0000) Dev(0001) Func(0000) Reg(00b0) exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 exregion-0290 [22] ex_system_io_space_han: system_iO 0 (8 width) Address=000000000000EF37 exregion-0290 [23] ex_system_io_space_han: system_iO 0 (8 width) Address=000000000000EF37 exregion-0290 [23] ex_system_io_space_han: system_iO 1 (8 width) Address=000000000000EF37 exregion-0365 [23] ex_pci_config_space_ha: pci_config 0 (32) Seg(0000) Bus(0000) Dev(0004) Func(0000) Reg(00dc) exregion-0365 [23] ex_pci_config_space_ha: pci_config 1 (32) Seg(0000) Bus(0000) Dev(0004) Func(0000) Reg(00dc) exregion-0290 [23] ex_system_io_space_han: system_iO 0 (8 width) Address=000000000000EF37 exregion-0290 [23] ex_system_io_space_han: system_iO 1 (8 width) Address=000000000000EF37 exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 exregion-0290 [23] ex_system_io_space_han: system_iO 0 (8 width) Address=000000000000EF37 exregion-0290 [23] ex_system_io_space_han: system_iO 1 (8 width) Address=000000000000EF37 exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 exregion-0290 [23] ex_system_io_space_han: system_iO 0 (8 width) Address=000000000000EF35 exregion-0290 [23] ex_system_io_space_han: system_iO 1 (8 width) Address=000000000000EF35 exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 exregion-0185 [23] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF4C exregion-0185 [23] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFF4C exregion-0365 [23] ex_pci_config_space_ha: pci_config 0 (32) Seg(0000) Bus(0000) Dev(0007) Func(0001) Reg(0040) exregion-0365 [23] ex_pci_config_space_ha: pci_config 1 (32) Seg(0000) Bus(0000) Dev(0007) Func(0001) Reg(0040) exregion-0365 [23] ex_pci_config_space_ha: pci_config 0 (32) Seg(0000) Bus(0000) Dev(0007) Func(0001) Reg(0040) exregion-0365 [23] ex_pci_config_space_ha: pci_config 1 (32) Seg(0000) Bus(0000) Dev(0007) Func(0001) Reg(0040) exregion-0365 [23] ex_pci_config_space_ha: pci_config 0 (32) Seg(0000) Bus(0000) Dev(0007) Func(0001) Reg(0040) exregion-0365 [23] ex_pci_config_space_ha: pci_config 1 (32) Seg(0000) Bus(0000) Dev(0007) Func(0001) Reg(0040) exregion-0365 [23] ex_pci_config_space_ha: pci_config 0 (8) Seg(0000) Bus(0000) Dev(0001) Func(0000) Reg(00b1) exregion-0365 [23] ex_pci_config_space_ha: pci_config 1 (8) Seg(0000) Bus(0000) Dev(0001) Func(0000) Reg(00b1) exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 Execute Method: [\_TZ_.THM0._PSV] (Node e3f8bdc8) Execute Method: [\_TZ_.THM0._TC1] (Node e3f8bd48) exregion-0185 [30] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF38 Execute Method: [\_TZ_.THM0._TC2] (Node e3f8bd08) exregion-0185 [30] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF38 Execute Method: [\_TZ_.THM0._TSP] (Node e3f8bcc8) exregion-0185 [30] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF3C Execute Method: [\_TZ_.THM0._AC0] (Node e3f8bec8) exregion-0185 [31] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 Execute Method: [\_TZ_.THM0._TMP] (Node e3f8bf08) exregion-0185 [32] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 Execute Method: [\_SI_._SST] (Node e3f8a848) exregion-0185 [47] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 exregion-0185 [47] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 exregion-0185 [47] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 exregion-0185 [47] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 [ACPI Debug] Integer: 0x00000000 Execute Method: [\_SB_.LID0._PSW] (Node c1574808) exregion-0185 [23] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40 Execute Method: [\_SB_.SLPB._PSW] (Node c1574708) ds: ds_open(socket 0) ds: ds_open(socket 1) ds: ds_open(socket 2) pccard: card ejected from slot 1 PCMCIA: socket e3be9828: *** DANGER *** unable to remove socket power ds: ds_release(socket 0) ds: ds_release(socket 1) No surprise: It hung on the second cycle, with the same log as before.
>Here are the dmesgs from the first cycle, which shows the Debug still >showing late: It should be due to kacpid preempted the execution of acpi_leave_sleep_state. Let continue hack with: chage 2. in acpi_pm_finish: Right after: arg.integer.value = ACPI_SST_WORKING; status = acpi_evaluate_object (NULL, METHOD_NAME__SST, &arg_list, NULL); if (ACPI_FAILURE (status) && status != AE_NOT_FOUND) { ACPI_REPORT_ERROR (("Method _SST failed, %s\n", acpi_format_exception (status))); } insert: acpi_in_suspend =NO 6. in acpi_thermal_check, just before: acpi_thermal_get_temperature(tz) insert: if (acpi_in_suspend == YES) return_VOID; Let's check if acpi_leave_sleep_state won't be preempted, and exuection sequence is like: _SST _WAK _SST
Okay, I first merged the acpi_in_suspend changes from before with the Debug and I2RBcopy changes to the DSDT (which still has only THM0 in it). Then I put in your latest two changes, but booting gives: Freeing unused kernel memory: 228k freed FATAL: kernel too old kernel panic not syncing: Attempted to kill init! ================== change 1 ============================ change 2. in acpi_pm_finish: Right after: arg.integer.value = ACPI_SST_WORKING; status = acpi_evaluate_object (NULL, METHOD_NAME__SST, &arg_list, NULL); if (ACPI_FAILURE (status) && status != AE_NOT_FOUND) { ACPI_REPORT_ERROR (("Method _SST failed, %s\n", acpi_format_exception (status))); } insert: acpi_in_suspend =NO That code didn't appear in acpi_pm_finish (in sleep/main.c), but a close relative of it appeared at the end of acpi_leave_sleep_state (in hardware/hwsleep.c). So, in acpi_pm_finish, I moved the acpi_in_suspend = FALSE line to come after apci_leave_sleep_state returns. I'm using 2.6.16-rc5; are you using 2.6.16? Anyway here's what acpi_pm_finish now looks like: static int acpi_pm_finish(suspend_state_t pm_state) { u32 acpi_state = acpi_suspend_states[pm_state]; acpi_leave_sleep_state(acpi_state); acpi_in_suspend = FALSE; /* used to be one line above */ acpi_disable_wakeup_device(acpi_state); /* reset firmware waking vector */ acpi_set_firmware_waking_vector((acpi_physical_address) 0); if (init_8259A_after_S1) { printk("Broken toshiba laptop -> kicking interrupts\n"); init_8259A(0); } return 0; } ====================== change 2 ============================ 6. in acpi_thermal_check, just before: acpi_thermal_get_temperature(tz) insert: if (acpi_in_suspend == YES) return_VOID; I made that change too. ============================================================ For completeness, here's the diff between the vanilla kernel and the one I just tried (let me know if I should also attach it at the bugzilla), except I haven't included the modified DSDT: diff -r ac486e270597 -r 9b6b830abe8f .config --- a/.config Sat Mar 18 08:35:34 2006 -0500 +++ b/.config Sun Mar 26 23:06:16 2006 -0500 @@ -217,7 +217,8 @@ CONFIG_ACPI_THERMAL=m # CONFIG_ACPI_ASUS is not set CONFIG_ACPI_IBM=m # CONFIG_ACPI_TOSHIBA is not set -# CONFIG_ACPI_CUSTOM_DSDT is not set +CONFIG_ACPI_CUSTOM_DSDT=y +CONFIG_ACPI_CUSTOM_DSDT_FILE="../../dsdt/600x.hex" CONFIG_ACPI_BLACKLIST_YEAR=0 CONFIG_ACPI_DEBUG=y CONFIG_ACPI_EC=y diff -r ac486e270597 -r 9b6b830abe8f drivers/acpi/ec.c --- a/drivers/acpi/ec.c Sat Mar 18 08:35:34 2006 -0500 +++ b/drivers/acpi/ec.c Sun Mar 26 23:06:16 2006 -0500 @@ -456,6 +456,7 @@ static int acpi_ec_intr_read(union acpi_ acpi_hw_low_level_read(8, data, &ec->common.data_addr); ACPI_DEBUG_PRINT((ACPI_DB_INFO, "Read [%02x] from address [%02x]\n", *data, address)); + acpi_os_sleep (1); end: up(&ec->intr.sem); @@ -506,6 +507,7 @@ static int acpi_ec_intr_write(union acpi ACPI_DEBUG_PRINT((ACPI_DB_INFO, "Wrote [%02x] to address [%02x]\n", data, address)); + acpi_os_sleep (1); up(&ec->intr.sem); diff -r ac486e270597 -r 9b6b830abe8f drivers/acpi/osl.c --- a/drivers/acpi/osl.c Sat Mar 18 08:35:34 2006 -0500 +++ b/drivers/acpi/osl.c Sun Mar 26 23:06:16 2006 -0500 @@ -634,6 +634,8 @@ static void acpi_os_execute_deferred(voi return_VOID; } +extern int acpi_in_suspend; + acpi_status acpi_os_queue_for_execution(u32 priority, acpi_osd_exec_callback function, void *context) @@ -643,6 +645,9 @@ acpi_os_queue_for_execution(u32 priority struct work_struct *task; ACPI_FUNCTION_TRACE("os_queue_for_execution"); + if (acpi_in_suspend) /* in case kacpid is causing the queue */ + /* should use a better error code, but oh well */ + return_ACPI_STATUS(AE_OK); ACPI_DEBUG_PRINT((ACPI_DB_EXEC, "Scheduling function [%p(%p)] for deferred execution.\n", diff -r ac486e270597 -r 9b6b830abe8f drivers/acpi/sleep/main.c --- a/drivers/acpi/sleep/main.c Sat Mar 18 08:35:34 2006 -0500 +++ b/drivers/acpi/sleep/main.c Sun Mar 26 23:06:16 2006 -0500 @@ -19,6 +19,12 @@ #include <acpi/acpi_drivers.h> #include "sleep.h" +/* for functions putting machine to sleep to know that we're + suspending, so that they can careful about what AML methods they + invoke (to avoid trying untested BIOS code paths) */ +int acpi_in_suspend; +EXPORT_SYMBOL(acpi_in_suspend); + u8 sleep_states[ACPI_S_STATE_COUNT]; static struct pm_ops acpi_pm_ops; @@ -55,6 +61,8 @@ static int acpi_pm_prepare(suspend_state printk("acpi_pm_prepare does not support %d \n", pm_state); return -EPERM; } + acpi_os_wait_events_complete(NULL); + acpi_in_suspend = TRUE; return acpi_sleep_prepare(acpi_state); } @@ -132,6 +140,7 @@ static int acpi_pm_finish(suspend_state_ u32 acpi_state = acpi_suspend_states[pm_state]; acpi_leave_sleep_state(acpi_state); + acpi_in_suspend = FALSE; acpi_disable_wakeup_device(acpi_state); /* reset firmware waking vector */ diff -r ac486e270597 -r 9b6b830abe8f drivers/acpi/thermal.c --- a/drivers/acpi/thermal.c Sat Mar 18 08:35:34 2006 -0500 +++ b/drivers/acpi/thermal.c Sun Mar 26 23:06:16 2006 -0500 @@ -79,6 +79,8 @@ static int tzp; static int tzp; module_param(tzp, int, 0); MODULE_PARM_DESC(tzp, "Thermal zone polling frequency, in 1/10 seconds.\n"); + +extern int acpi_in_suspend; static int acpi_thermal_add(struct acpi_device *device); static int acpi_thermal_remove(struct acpi_device *device, int type); @@ -683,6 +685,8 @@ static void acpi_thermal_run(unsigned lo static void acpi_thermal_run(unsigned long data) { struct acpi_thermal *tz = (struct acpi_thermal *)data; + if (acpi_in_suspend) /* thermal methods might cause a hang */ + return_VOID; /* so don't do them */ if (!tz->zombie) acpi_os_queue_for_execution(OSD_PRIORITY_GPE, acpi_thermal_check, (void *)data); @@ -705,6 +709,8 @@ static void acpi_thermal_check(void *dat state = tz->state; + if (acpi_in_suspend) + return_VOID; result = acpi_thermal_get_temperature(tz); if (result) return_VOID; @@ -1224,6 +1230,9 @@ static void acpi_thermal_notify(acpi_han struct acpi_device *device = NULL; ACPI_FUNCTION_TRACE("acpi_thermal_notify"); + + if (acpi_in_suspend) /* thermal methods might cause a hang */ + return_VOID; /* so don't do them */ if (!tz) return_VOID;
why? I tried your patch, boot fine! Could you try to remove that patch, and build from scratch? Thanks, Luming
Created attachment 7686 [details] dmesgs from mostly working kernel! (.tgz file) Extracted from the README included in this .tgz file: The previous kernel failed with 'kernel panic'. In trying to figure out the problem I decided to add printk's rather than try an unmodified kernel (since I know that works). So this one is the same kernel but with printk's around all the uses of and changes to acpi_in_suspend. This kernel boots fine; and I don't know why. Kernel command line: BOOT_IMAGE=6eb2c00a8f62 ro root=305 idebus=66 apm=off acpi=force pci=noacpi console=ttyS0,115200 console=tty0 acpi_sleep=s3_bios cpufreq.debug=7 acpi_dbg_level=0x10 acpi_dbg_layer=0x10 acpi_dbg_layer=0x90 acpi_dbg_level=0x1F acpi_serialize Anyway, since it booted fine, I could test it. It did one sleep/wake cycle fine, as usual. And on the second cycle, it went to sleep! But it wouldn't wake up with the Fn key, and I had to briefly slide the power switch to wake it up. I had thought it was hung there forever, so I slid the power switch to turn it off and reboot. I was surprised to find that it began waking up. It finished waking up and then shutdown (almost certainly because I used the power switch). So the suspend hangs seems improved, but a mystery is why I had to use the power switch. Perhaps it's related to the acpi_serialize, which makes the acpi system act more like Windows. A perhaps related data point is from an earlier experiment: I told the vanilla DSDT to pretend to be Win98 (by hacking it to return True when it did that test); the resulting kernel also wouldn't wake up with the 'Fn' key and I had to use the power switch. So my next test will be to try the 6eb2c00a8f62 kernel without acpi_serialize. Contents of the .tgz file: README : This file cycle1.dmesg : dmesgs from the first sleep/wake cycle cycle2.log : dmesgs extracted from the syslog (hence the .log extension) for the second sleep/wake from-serial-console.dmesg : everything captured across the serial console from bootup to shutdown hacks.diff : the full diff between a vanilla kernel/DSDT/config and this one, including the modified DSDT. config : the kernel config (includes the diff in the file above)
Created attachment 7691 [details] dmesgs from booting without acpi_serialize These dmesgs are from the kernel in the previous attachment (the kernel that needed the power switch for the 2nd wakeup), but without acpi_serialize. It hung on the fIRST sleep, in the usual way. Maybe worth tracking down what goes wrong here, since the bug is really easy to trigger now?
To comment# 49: Hmm, we made a progress! :-) Did you try 2nd, 3rd, 4th... sleep cycles? Maybe you can try vanilla DSDT too? To comment# 50: According to the experience of how _TMP affecting the BIOS behavior, I doubt the quality of the BIOS on this box. So, I recommend you use acpi_serialize as default, just because there could have some un-tested code paths causing trouble in future. If you verify eveything is ok, then we can move on to the Fn-key-wakeup issue. After that, probably, you can come up with a real/clean patch.
> Hmm, we made a progress! :-) I agree. > Did you try 2nd, 3rd, 4th... sleep cycles? No, because the power switch (required to wake it up) also shut it down. Oh, I can tell the acpid to shut up and not power it off, and see if I can then wake it up without having it shut down. Then I can do 3rd and 4th etc. cycles. > Maybe you can try vanilla DSDT too? I'll try that as well.
> After that, probably, you can come up with a real/clean patch. There's also the issue of the kernel panic, which went away by adding a few printk's. So I don't know how reliable a patch can be if it might cause kernel panics, or not cause them but only in poorly understood situations.
> Did you try 2nd, 3rd, 4th... sleep cycles? > Maybe you can try vanilla DSDT too? Just tried 10 cycles with a vanilla DSDT and it worked fine (had to wake up cycles 2-10 with the power button).
Then, don't forget to delete the hack of removing THM{2,6,7}.
>> Just tried 10 cycles with a vanilla DSDT and it worked fine (had to >> wake up cycles 2-10 with the power button). > Then, don't forget to delete the hack of removing THM{2,6,7}. That's what I tested (the BIOS-supplied DSDT): $ grep DSDT .config # CONFIG_ACPI_CUSTOM_DSDT is not set
>$ grep DSDT .config ># CONFIG_ACPI_CUSTOM_DSDT is not set But from "from-serial-console.dmesg", I just found THM0?
> But from "from-serial-console.dmesg", I just found THM0? Right, those messages are just with THM0. Then, feeling brave, I recompiled with the BIOS DSDT and told acpid not to power off the machine when the power switch is used. I ran 10 sleep-wake cycles with that kernel, and they all worked (except for needing to use the power switch to wake up the 2nd and subsequent sleeps).
So, do you mean THM0,2,6,7 are all included in your 10 cycles testing? I mean there are NO kernel hack exists to prevent THM 2,6,7? If yeas, then, it's time to do some clean up, and make a clean-patch. :-) But, I'm still confused with the printk and kernel panic stuff.
> I mean there are NO kernel hack exists to prevent THM 2,6,7? Right, no hacks. That kernel hack was gone long ago: Once I took THM2,6,7 out of the DSDT, I removed the hack in thermal.c. So the kernel with 10 cycles used the BIOS-supplied DSDT. Below is the full diff of every file (except my own compiling scripts) relative to vanilla kernel/BIOS DSDT (i.e. config with CUSTOM_DSDT unset). > If yes, then, it's time to do some clean up, and make a clean-patch. :-) Except I'm worried about the unexplained kernel panic (when I didn't use all the printk's). Maybe we have not got to the bottom of the problem, and have merely pushed it elsewhere? Another worrisome point is that the changes made the system hang on the first sleep, unless I use acpi_serialize. But acpi_serialize is not ideal on this box, because it slows down 'acpi -t' by a factor of 4, from 0.12 sec real time to 0.5 sec (but maybe nothing can be done about that). But here's the diff: The current revision: $ hg tip changeset: 53:ec4441b5c420 tag: tip user: Sanjoy Mahajan <sanjoy@mrao.cam.ac.uk> date: Mon Mar 27 22:04:23 2006 -0500 summary: Use vanilla DSDT. The base revision (revision 0 is untarring 2.6.16-rc5.tar.gz): $ hg log -r 1 changeset: 1:ac486e270597 tag: vanilla user: Sanjoy Mahajan <sanjoy@mrao.cam.ac.uk> date: Sat Mar 18 08:35:34 2006 -0500 summary: Added my usual .config (uses vanilla DSDT). The diff of the current revision relative to the base revision: $ hg diff -r 1 diff -r ac486e270597 drivers/acpi/ec.c --- a/drivers/acpi/ec.c Sat Mar 18 08:35:34 2006 -0500 +++ b/drivers/acpi/ec.c Tue Mar 28 08:20:44 2006 -0500 @@ -456,6 +456,7 @@ static int acpi_ec_intr_read(union acpi_ acpi_hw_low_level_read(8, data, &ec->common.data_addr); ACPI_DEBUG_PRINT((ACPI_DB_INFO, "Read [%02x] from address [%02x]\n", *data, address)); + acpi_os_sleep (1); end: up(&ec->intr.sem); @@ -506,6 +507,7 @@ static int acpi_ec_intr_write(union acpi ACPI_DEBUG_PRINT((ACPI_DB_INFO, "Wrote [%02x] to address [%02x]\n", data, address)); + acpi_os_sleep (1); up(&ec->intr.sem); diff -r ac486e270597 drivers/acpi/osl.c --- a/drivers/acpi/osl.c Sat Mar 18 08:35:34 2006 -0500 +++ b/drivers/acpi/osl.c Tue Mar 28 08:20:44 2006 -0500 @@ -634,6 +634,8 @@ static void acpi_os_execute_deferred(voi return_VOID; } +extern int acpi_in_suspend; + acpi_status acpi_os_queue_for_execution(u32 priority, acpi_osd_exec_callback function, void *context) @@ -643,6 +645,12 @@ acpi_os_queue_for_execution(u32 priority struct work_struct *task; ACPI_FUNCTION_TRACE("os_queue_for_execution"); + if (acpi_in_suspend) /* in case kacpid is causing the queue */ + { + printk(KERN_INFO PREFIX "acpi_os_queue_for_execution: acpi_in_suspend=%d, so returning\n", acpi_in_suspend); + /* should use a better error code, but oh well */ + return_ACPI_STATUS(AE_OK); + } ACPI_DEBUG_PRINT((ACPI_DB_EXEC, "Scheduling function [%p(%p)] for deferred execution.\n", diff -r ac486e270597 drivers/acpi/sleep/main.c --- a/drivers/acpi/sleep/main.c Sat Mar 18 08:35:34 2006 -0500 +++ b/drivers/acpi/sleep/main.c Tue Mar 28 08:20:44 2006 -0500 @@ -19,6 +19,12 @@ #include <acpi/acpi_drivers.h> #include "sleep.h" +/* for functions putting machine to sleep to know that we're + suspending, so that they can careful about what AML methods they + invoke (to avoid trying untested BIOS code paths) */ +int acpi_in_suspend; +EXPORT_SYMBOL(acpi_in_suspend); + u8 sleep_states[ACPI_S_STATE_COUNT]; static struct pm_ops acpi_pm_ops; @@ -55,6 +61,9 @@ static int acpi_pm_prepare(suspend_state printk("acpi_pm_prepare does not support %d \n", pm_state); return -EPERM; } + acpi_os_wait_events_complete(NULL); + printk(KERN_INFO PREFIX "acpi_pm_prepare: Setting acpi_in_suspend.\n"); + acpi_in_suspend = TRUE; return acpi_sleep_prepare(acpi_state); } @@ -132,6 +141,8 @@ static int acpi_pm_finish(suspend_state_ u32 acpi_state = acpi_suspend_states[pm_state]; acpi_leave_sleep_state(acpi_state); + printk(KERN_INFO PREFIX "acpi_pm_finish: Unsetting acpi_in_suspend.\n"); + acpi_in_suspend = FALSE; acpi_disable_wakeup_device(acpi_state); /* reset firmware waking vector */ diff -r ac486e270597 drivers/acpi/thermal.c --- a/drivers/acpi/thermal.c Sat Mar 18 08:35:34 2006 -0500 +++ b/drivers/acpi/thermal.c Tue Mar 28 08:20:44 2006 -0500 @@ -79,6 +79,8 @@ static int tzp; static int tzp; module_param(tzp, int, 0); MODULE_PARM_DESC(tzp, "Thermal zone polling frequency, in 1/10 seconds.\n"); + +extern int acpi_in_suspend; static int acpi_thermal_add(struct acpi_device *device); static int acpi_thermal_remove(struct acpi_device *device, int type); @@ -683,6 +685,11 @@ static void acpi_thermal_run(unsigned lo static void acpi_thermal_run(unsigned long data) { struct acpi_thermal *tz = (struct acpi_thermal *)data; + if (acpi_in_suspend) /* thermal methods might cause a hang */ + { + printk(KERN_INFO PREFIX "acpi_thermal_run: acpi_in_suspend=%d, so returning\n", acpi_in_suspend); + return_VOID; /* so don't do them */ + } if (!tz->zombie) acpi_os_queue_for_execution(OSD_PRIORITY_GPE, acpi_thermal_check, (void *)data); @@ -705,6 +712,11 @@ static void acpi_thermal_check(void *dat state = tz->state; + if (acpi_in_suspend) /* thermal methods might cause a hang */ + { + printk(KERN_INFO PREFIX "acpi_thermal_check: acpi_in_suspend=%d, so returning\n", acpi_in_suspend); + return_VOID; + } result = acpi_thermal_get_temperature(tz); if (result) return_VOID; @@ -1224,6 +1236,12 @@ static void acpi_thermal_notify(acpi_han struct acpi_device *device = NULL; ACPI_FUNCTION_TRACE("acpi_thermal_notify"); + + if (acpi_in_suspend) /* thermal methods might cause a hang */ + { + printk(KERN_INFO PREFIX "acpi_thermal_notify: acpi_in_suspend=%d, so returning\n", acpi_in_suspend); + return_VOID; /* so don't do them */ + } if (!tz) return_VOID;
This problem is probably similar to the problem that I reported in #5037 (pre-emptible kernels would hang going to sleep). See especially comment #17 there: http://bugzilla.kernel.org/show_bug.cgi?id=5037#c17 So maybe the solution with acpi_in_suspend, once it's cleaned up, will fix the pre-emptible kernels too (I haven't run them for a while, since they so regularly would hang on sleep).
Hmmm, it could be. Please clean up and send out the updated patch to acpi ML and LKML for review. Probably, this patch needs go to mm tree. Please CC me, Len brown, and akpm. Thanks, Luming
Bad news. I made a cleaned-up patch (below) and tested it, and it hung on the 4th sleep. That was with debug_{level,layer}=0x10, so I didn't get much debugging info. It printed the 'Execute method' for _PTS, and then hung at "Exec" -- I guess printing it for SST. Just to make sure I wasn't imagining that the problem had been fixed, I return to the usual debugging: layer=0x90 and level=0x1F. And it cycled with no problem (18 tries)! So we're not done yet, and maybe we're just treating the symptoms? -Sanjoy diff -r ac486e270597 -r abd89292c539 drivers/acpi/osl.c --- a/drivers/acpi/osl.c Sat Mar 18 08:35:34 2006 -0500 +++ b/drivers/acpi/osl.c Thu Mar 30 10:59:57 2006 -0500 @@ -634,6 +634,8 @@ static void acpi_os_execute_deferred(voi return_VOID; } +extern int acpi_in_suspend; + acpi_status acpi_os_queue_for_execution(u32 priority, acpi_osd_exec_callback function, void *context) @@ -643,6 +645,8 @@ acpi_os_queue_for_execution(u32 priority struct work_struct *task; ACPI_FUNCTION_TRACE("os_queue_for_execution"); + if (acpi_in_suspend) /* in case kacpid is causing the queue */ + return_ACPI_STATUS(AE_OK); ACPI_DEBUG_PRINT((ACPI_DB_EXEC, "Scheduling function [%p(%p)] for deferred execution.\n", diff -r ac486e270597 -r abd89292c539 drivers/acpi/sleep/main.c --- a/drivers/acpi/sleep/main.c Sat Mar 18 08:35:34 2006 -0500 +++ b/drivers/acpi/sleep/main.c Thu Mar 30 10:59:57 2006 -0500 @@ -19,6 +19,12 @@ #include <acpi/acpi_drivers.h> #include "sleep.h" +/* for functions putting machine to sleep to know that we're + suspending, so that they can careful about what AML methods they + invoke (to avoid trying untested BIOS code paths) */ +int acpi_in_suspend; +EXPORT_SYMBOL(acpi_in_suspend); + u8 sleep_states[ACPI_S_STATE_COUNT]; static struct pm_ops acpi_pm_ops; @@ -55,6 +61,8 @@ static int acpi_pm_prepare(suspend_state printk("acpi_pm_prepare does not support %d \n", pm_state); return -EPERM; } + acpi_os_wait_events_complete(NULL); + acpi_in_suspend = TRUE; return acpi_sleep_prepare(acpi_state); } @@ -132,6 +140,7 @@ static int acpi_pm_finish(suspend_state_ u32 acpi_state = acpi_suspend_states[pm_state]; acpi_leave_sleep_state(acpi_state); + acpi_in_suspend = FALSE; acpi_disable_wakeup_device(acpi_state); /* reset firmware waking vector */ diff -r ac486e270597 -r abd89292c539 drivers/acpi/thermal.c --- a/drivers/acpi/thermal.c Sat Mar 18 08:35:34 2006 -0500 +++ b/drivers/acpi/thermal.c Thu Mar 30 10:59:57 2006 -0500 @@ -79,6 +79,8 @@ static int tzp; static int tzp; module_param(tzp, int, 0); MODULE_PARM_DESC(tzp, "Thermal zone polling frequency, in 1/10 seconds.\n"); + +extern int acpi_in_suspend; static int acpi_thermal_add(struct acpi_device *device); static int acpi_thermal_remove(struct acpi_device *device, int type); @@ -683,6 +685,8 @@ static void acpi_thermal_run(unsigned lo static void acpi_thermal_run(unsigned long data) { struct acpi_thermal *tz = (struct acpi_thermal *)data; + if (acpi_in_suspend) /* thermal methods might cause a hang */ + return_VOID; /* so don't do them */ if (!tz->zombie) acpi_os_queue_for_execution(OSD_PRIORITY_GPE, acpi_thermal_check, (void *)data); @@ -705,6 +709,8 @@ static void acpi_thermal_check(void *dat state = tz->state; + if (acpi_in_suspend) + return_VOID; result = acpi_thermal_get_temperature(tz); if (result) return_VOID; @@ -1224,6 +1230,9 @@ static void acpi_thermal_notify(acpi_han struct acpi_device *device = NULL; ACPI_FUNCTION_TRACE("acpi_thermal_notify"); + + if (acpi_in_suspend) /* thermal methods might cause a hang */ + return_VOID; /* so don't do them */ if (!tz) return_VOID;
The reason for 4th suspend hang could be: 1. the patch doesn't fix the known problem cleanly, we need to come up with a better patch for the known problem. 2. There are other unknown problems in kernel? in driver? in BIOS? . :-(
> 1. the patch doesn't fix the known problem cleanly, we need to come > up with a better patch for the known problem. This is quite likely, since we don't fully understand the problem. We just know that calling _TMP in the wrong place messes up the EC, but we don't know how (or what defines 'wrong place'), so we can't be sure that we fixed the whole problem and that no other methods can do the same thing.
>> 1. the patch doesn't fix the known problem cleanly, we need to come >> up with a better patch for the known problem. > >This is quite likely, since we don't fully understand the problem. We >just know that calling _TMP in the wrong place messes up the EC, but >we don't know how (or what defines 'wrong place'), so we can't be sure >that we fixed the whole problem and that no other methods can do the >same thing. Yes, this is a good question. The real fix should be in BIOS. But, the key point is we assume Windows work, and we expect linux work too. So, we do need to avoid un-tested (by windows) BIOS code path. For this bug, maybe we didn't handle all cases to stop the execution of unnecessary AML methods/ code path.
> For this bug, maybe we didn't handle all cases to stop the execution of > unnecessary AML methods/ code path. Maybe figuring out the kernel panic would help? Because the kernel panic comes and goes, depending on printk's (and the hang does too - not happening with lots of debugging, for example), it makes me think that some memory is getting corrupted somewhere. And by putting in printk's we're changing the location of what gets corrupted, so changing the result but not the (unknown) cause.
More bad news. My other sleep script, which unloads thermal before suspending, just hung on the first sleep. It has a vanilla 2.6.16-rc5 kernel with the BIOS DSDT (so no mods), but it's the first time it's ever hung. I suspect one of two recent changes: 1. The trackpoint died a couple days ago, so I had attached a USB/PS2 external mouse. But I tried sleeping again with that situation, with no problems. 2. I had booted it up and left it overnight without logging in, and so presumably without the X screensaver (not sure about that), and then logged in. Not sure if that can make a difference. Either way, it indicates that we don't know the whole story. I wonder if the approach is to disable EC updates during suspend and wake. So hack the ACPI interpreter to ignore UPDT() (is it called that in every ACPI bios implementation?) while acpi_in_suspend is true, but let everything else happen. Another thought is about the reason that I need to use the power switch to wake it up (with the mods): acpi_os_queue_for_execution is disabled by the acpi_in_suspend=TRUE setting, but maybe some methods need to run to set Fn as the wakeup key. Which maybe means I should try the hack to prevent UPDT() running.
>More bad news. My other sleep script, which unloads thermal before >suspending, just hung on the first sleep It is worthy to do same test without loading thermal moudle at all. PS. Hmmm, we need to find out an efficient way to reproduce this hung. Please apply patch at http://bugzilla.kernel.org/show_bug.cgi?id=5749#c11 and register a interested method with command like: echo "0:\\_SB.BAT0:_BST:\\_SB.BAT0:_BST:1:1" > /proc/acpi/hotkey/event_config And, execute it with command: echo -n "1:0:0:0" > /proc/acpi/hotkey/action In this way, without changing kernel, you can directly test the interested AML methods that could cause hang. The BIOS needs to be fully tested. :-(
> But, the key point is we assume Windows work, and we expect linux work too. I just saw the documentation on acpi_os_name: commit a1f9e65e2085e0a87f28a4d5a8ae43b32c087f24 Author: Len Brown <len.brown@intel.com> Date: Wed Jan 25 23:47:36 2006 -0500 [ACPI] document cmdline acpi_os_name= So I used it to simulate Windows code paths. I tested the vanilla 2.6.16-rc5 kernel with the BIOS DSDT, using acpi_os_name="Microsoft Windows" acpi_serialize It hung on the second sleep, as usual (and the same hang if I didn't use acpi_serialize). So then I wondered whether this BIOS works with Windows at all. But it has to. What else could IBM have tested it with? So Windows must do something smart during the suspend, like disabling the EC updates or like the other hacks that we've been trying.
> What else could IBM have tested it with? So Windows must >do something smart during the suspend, like disabling the EC updates >or like the other hacks that we've been trying. It's NOT called smart-thing. >It hung on the second sleep, as usual (and the same hang if I didn't >use acpi_serialize). Do you use pathc at http://bugzilla.kernel.org/show_bug.cgi?id=5989#c63? I suggest you use that patch, and acpi_serialize option for testing.
> Do you use patch at http://bugzilla.kernel.org/show_bug.cgi?id=5989#c63? > I suggest you use that patch, and acpi_serialize option for testing. Good news! I used that patch with acpi_os_name="Microsoft Windows" acpi_serialize and I haven't been able to hang it. I did 14 cycles with no problem. The first 8 cycles were with debug_{level,layer}=0x10, which was the combination that hung more easily than layer=0x90,level=0x1F. To check whether the debug params matters, I did the last 6 cycles with layer=0x90,level=0x1F. To further stress it, I turned on thermal polling: 1 second for each of the four thermal zones (starting them 0.25 seconds apart to maximize the chance that a thermal poll happens at a dangerous time during the suspend or wake). Six further cycles worked fine. Then I unloaded and loaded the thermal module, which often helps produce hangs on the next suspend, and that was fine for six more cycles. I haven't been able to hang it at all. To summarize: 1. hangs on 4th S3 : patch in #63. 2. hangs on 2nd S3 : vanilla + acpi_os_name="Microsoft Windows" 3. hangs on 2nd S3 : vanilla + acpi_os_name="Microsoft Windows" acpi_serialize 4. doesn't hang : patch + acpi_os_name="Microsoft Windows" Probably the patch + acpi_serialize will hang too, but I'm not sure. The unclean version of the patch (i.e. with debugging printk's) did hang with just acpi_serialize. The slight problems: the fan state and polling frequencies are garbled on resume. By fan state being garbled, I mean that it is sometime on even though the temps are all below the trip points, and acpi -t reports that the fan is off. So the system doesn't have the right fan state. Similarly with the thermal polling: It reports 1 second for each zone, but it's not polling at all (no TMP methods reported in the dmesgs). If I set the polling intervals to 1 second, then it starts polling again. Also, I need to wake it up using the power switch instead of the Fn key. Using the power switch produces a couple sharp clicks on the speaker.
Same problem with 2.6.17-rc4 with boot params acpi_os_name="Microsoft Windows" acpi_serialize and the vanilla (BIOS) DSDT.
*** This bug has been marked as a duplicate of 6749 ***