Bug 5989

Summary: S3 sleep hangs the second time - 600X
Product: ACPI Reporter: Sanjoy Mahajan (sanjoy)
Component: Power-Sleep-WakeAssignee: Luming Yu (luming.yu)
Status: REJECTED DUPLICATE    
Severity: normal CC: acpi-bugzilla, bunk, pavel
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.17-rc4 Subsystem:
Regression: --- Bisected commit-id:
Attachments: log file showing 1st (ok) and 2nd (failed) sleep attempts
S3 sleep script
acpidmp output
original DSDT from BIOS 1.11
git bisect log
bash script for two S3 sleep-wake cycles testing thermal module
diff of the problematic commit: git diff 53f11d4f^..53f11d4f
vanilla DSDT from /proc/acpi/dsdt
full dmesgs with 2.6.16-rc5, vanilla DSDT
full dmesgs with boot-time debuggings set to 0x10
dmesgs with debugging patch to kernel and DSDT
commented dmesgs: boot, ok sleep-wake, then sleep hang
dmesgs from mostly working kernel! (.tgz file)
dmesgs from booting without acpi_serialize

Description Sanjoy Mahajan 2006-02-01 09:18:40 UTC
Most recent kernel where this bug did not occur: 2.6.15
Distribution: Debian
Hardware Environment: TP 600X, bios 1.11, modified DSDT so that S3 worked with
earlier kernels (e.g. 2.6.15).
Software Environment: 
Problem Description: The first sleep and wake goes fine.  The second sleep hangs
at the 'Stopping tasks'.  With tons of debugging turned out, the second sleep
does not hang, but the wakeup hangs.  With 0x1F as the acpi debug_level, the
second sleep still hangs and I got some useful output across a serial console
(attached).  In the second sleep (after the second 'Stopping tasks'), you can
see it repeating a sequence of a few exregion lines.

Steps to reproduce:
Run the sleep.sh script (attached), wake it up, then run it again.
Comment 1 Sanjoy Mahajan 2006-02-01 09:20:26 UTC
Created attachment 7203 [details]
log file showing 1st (ok) and 2nd (failed) sleep attempts
Comment 2 Sanjoy Mahajan 2006-02-01 09:20:55 UTC
Created attachment 7204 [details]
S3 sleep script
Comment 3 Sanjoy Mahajan 2006-02-01 09:23:09 UTC
I'm using 2.6.16-rc1 patched with acpi-release-20060113-2.6.16-rc1.diff

With vanilla 2.6.15 (i.e. no extra acpi release patch), sleeping never hangs. 
Once in a while (maybe 1 in 30 times) wakeup hangs, but I haven't had a serial
console handy then (on the subway) to check why.
Comment 4 Sanjoy Mahajan 2006-02-04 12:21:05 UTC
Just tried vanilla 2.6.16-rc2 (with no extra acpi patches) and it has the same
problem, although the exregion-NNNN numbers are slightly different compared to
-rc1.  I updated the kernel-version header in the bug report.
Comment 5 Robert Moore 2006-02-06 09:09:53 UTC
Looks like maybe the _PTS method is in an infinite loop waiting for the 
hardware to respond. Please post the DSDT and/or acpidump for the machine
Comment 6 Sanjoy Mahajan 2006-02-06 14:21:25 UTC
Created attachment 7261 [details]
acpidmp output

Here is the acpidmp output.  The DSDT is modified relative to the vanilla DSDT
from the BIOS, in order to get S3 to work at all (plus a few changes to the
temperature code to fix another DSDT bug, etc.).  I'll also attach the original
DSDT in the next attachment.
Comment 7 Sanjoy Mahajan 2006-02-06 14:24:32 UTC
Created attachment 7262 [details]
original DSDT from BIOS 1.11
Comment 8 Robert Moore 2006-02-06 14:46:45 UTC
You are stuck in the loop in the method below. I would guess that the code is 
waiting for a response from the SMI bios and it never happens.

    Method (SMPI, 1, NotSerialized)
    {
        Store (S_AX, Local0)
        Store (0x81, APMD)
        While (LEqual (S_AH, 0xA6))
        {
            Sleep (0x64)
            Store (Local0, S_AX)
            Store (0x81, APMD)
        }
    }

    OperationRegion (MNVS, SystemMemory, 0x23FDF000, 0x1000)
    Field (MNVS, DWordAcc, NoLock, Preserve)
    {
        Offset (0xFC0), 
        S_AX,   16, 

    OperationRegion (APMC, SystemIO, 0xB2, 0x01)
    Field (APMC, ByteAcc, NoLock, Preserve)
    {
        APMD,   8


Store (Local0, S_AX)

exregion-0182 [29] ex_system_memory_space: system_memory 0 (32 width) 
Address=0000000023FDFFC0


Store (0x81, APMD)

exregion-0287 [30] ex_system_io_space_han: system_iO 1 (8 width) 
Address=00000000000000B2
Comment 9 Sanjoy Mahajan 2006-02-06 14:56:20 UTC
Although perhaps I gave you slightly inaccurate information.  I ran
the acpidmp on 2.6.15.1, where S3 mostly works so I use that kernel
except when debugging ACPI.  But the exregion-NNNN offsets are with
2.6.16-rc1 or -rc2 (slightly different numbers in each case).  Do you
need an acpidmp when running 2.6.16-rc2, even though the DSDT is the
same as for 2.6.15.1?

> You are stuck in the loop in the method below. I would guess that
> the code is waiting for a response from the SMI bios and it never
> happens.

What could prevent 2.6.15[.1] from getting stuck there?

Comment 10 Robert Moore 2006-02-06 15:04:30 UTC
It looks to me like the ACPI/AML interpreter is doing what it was told to do. 
I am not an expert on sleep issues, so I can't help further.
Comment 11 Pavel Machek 2006-02-09 15:54:33 UTC
Can you try to locate change that breaks it with git bisect?
Comment 12 Sanjoy Mahajan 2006-02-09 22:05:34 UTC
Good idea.  I've just started:

Bisecting: 1675 revisions left to test after this

Comment 13 Sanjoy Mahajan 2006-02-12 21:22:51 UTC
'git bisect' narrowed the problematic commit to:

  commit 292dd876ee765c478b27c93cc51e93a558ed58bf
  Merge: d4ec6c7cc9a15a7a529719bc3b84f46812f9842e
  9fdb62af92c741addbea15545f214a6e89460865
  Author: Len Brown <len.brown@intel.com>
  Date:   Fri Jan 27 17:18:29 2006 -0500

      Pull release into acpica branch

This commit had a slightly different bug from all the other ones I
marked as bad.  This one hung on the first S3 sleep, with the same
endless loop pattern as the other bad ones (which hang on the second
S3 sleep).

I'll attach the bisect log in case it's useful.

Comment 14 Sanjoy Mahajan 2006-02-12 21:27:24 UTC
Created attachment 7304 [details]
git bisect log
Comment 15 Luming Yu 2006-02-12 22:15:04 UTC
Please try pci=noacpi to test.
Comment 16 Sanjoy Mahajan 2006-02-12 22:17:51 UTC
I use pci=noacpi with every kernel (it's in the lilo.conf append=
line).  It's been there so long that I forget why I put it there, but
I think it was needed for S3 sleep to work at all.

Comment 17 Sanjoy Mahajan 2006-02-14 17:08:57 UTC
Here is the commit (I'll attach the diff) that makes the second S3
sleep go into the endless loop when the loaded modules are exactly
thermal, processor, intel_agp, and agpgart:

53f11d4ff8797bcceaf014e62bd39f16ce84baec is first bad commit
diff-tree 53f11d4ff8797bcceaf014e62bd39f16ce84baec (from 02b28a33aae93a3b53068e
0858d62f8bcaef60a3)
Author: Len Brown <len.brown@intel.com>
Date:   Mon Dec 5 16:46:36 2005 -0500

    [ACPI] Enable Embedded Controller (EC) interrupt mode by default
    
    "ec_intr=0" reverts to polling
    "ec_burst=" no longer exists.
    
    Signed-off-by: Len Brown <len.brown@intel.com>
    Acked-by: Luming Yu <luming.yu@intel.com>

:040000 040000 9eec66712c68ebe372b2fb2c8d78bdc99df942ab e7e62cd09983730aee468ed
d4ba1cce50786b7e5 M	Documentation
:040000 040000 6e7db46918f6124f64a11f6757560078a8a27519 aa8abb1023024902300cb2e
7a5bf74acd8c579e8 M	drivers

If I boot with ec_intr=0, the second sleep works fine.

Here is the full story.  First I tried a system with the minimal set of
modules to boot and run X (S3 sleep-wake wrecks the VGA consoles, but X
restores fine with 'chvt 1; sleep 0.5 ; chvt 7' on wakeup).  So I
stopped every service and unloaded all modules possible, which left only
intel_agp and agpgart.  Then, for each of the usual loaded modules
(except for sound modules, which often has sleep-wake problems anyway),
I tried:

1. Load the module
2. S3 sleep-wake-sleep-wake
3. Unload the module

to see whether the second sleep went into the infinite loop (visible
across the console).  The one culprit I found was 'thermal' (which
brings in 'processor').  Other modules didn't trigger the problem.

Then I recompiled using a minimal config, with only networking (for X to
work) and the acpi modules, and maybe a few others that I couldn't
avoid, and retested to make sure 'thermal' still triggered the problem,
which it did.  I used this config, or one just like it for 2.6.15, as a
base for all other kernels in the bisection search, using the nearest
ancestor's .config and then
   yes '' | make oldconfig
to make the new .config

The attached test_s3.sh script gives the steps for testing the thermal
module.  Eventually bisection converged to the commit above, and then
I retested that kernel with 'ec_intr=0'.

Is this a problem with the TP 600X hardware, in which case I'll just use
ec_intr=0 forever, or is there more software debugging (DSDT or related
to the diff)?  I can turn on gobs of ACPI debugging and send the useful
parts of the log file.

Comment 18 Sanjoy Mahajan 2006-02-14 17:10:35 UTC
Created attachment 7326 [details]
bash script for two S3 sleep-wake cycles testing thermal module
Comment 19 Sanjoy Mahajan 2006-02-14 17:14:23 UTC
Created attachment 7327 [details]
diff of the problematic commit: git diff 53f11d4f^..53f11d4f
Comment 20 Luming Yu 2006-02-14 18:10:24 UTC
what could happpen if you unload thermal, processor before suspend (ec_intr=1 
kernel)?
Comment 21 Sanjoy Mahajan 2006-02-14 18:45:14 UTC
With only intel_agp, agpgart loaded (so no processor or thermal or
almost anything!), it suspends and wakes fine.  That's a very minimal
configuration, too minimal to be useful.  So I'll also try it with the
usual set of modules but with thermal and processor unloaded (I
already unload the sound modules and prism54 or else they wake up in
an unhappy state).

Comment 22 Sanjoy Mahajan 2006-02-14 20:22:23 UTC
With a small set of modules loaded (thermal, processor, fan, button,
ac, battery, intel_agp, agpgart), the sleeps are fine if
thermal+processor are unloaded beforehand (ec_intr=1).  If instead fan
is unloaded before sleep, then the second sleep hangs in the usual
endless loop.

Comment 23 Luming Yu 2006-02-21 18:41:25 UTC
cat you attach /proc/interrupts before 2nd suspend hang?
Comment 24 Luming Yu 2006-02-27 00:52:48 UTC
>modified DSDT so that S3 worked with earlier kernels (e.g. 2.6.15).
Also please try un-modified DSDT.
Comment 25 Sanjoy Mahajan 2006-03-02 21:59:42 UTC
I just tried the 'first failing commit' (02b28a33) with the vanilla
(BIOS 1.11) DSDT.  And not only did S3 sleep happen, but the
thermal+processor bug didn't show up: i.e. it did two (actually) many
S3 sleep-wake cycles.  So the problem is due to something in the
modified DSDT, which is either a problem itself or it exposes another
problem.

[Picture of me hanging head in shame for putting people to the trouble.]

My only excuse is that long ago (2.6.11), the machine wouldn't S3
sleep at all with the vanilla DSDT, and I patched it as recommended by
the ACPI experts.  However, the current ACPI interpreter seems to
handle the 600X's mangy DSDT.

Now I agree 100% with Len Brown that it's best to have ACPI handle
marginal DSDT's, not hack the DSDT.

Comment 26 Sanjoy Mahajan 2006-03-09 21:43:01 UTC
I'm now running/testing an unmodified DSDT with 2.6.16-rc5.  For a
while I had no S3 hangs, but I just noticed them again.  The error is
the same as with the modified DSDT (with slightly different offsets):

exregion-0185 [36] ex_system_memory_space: system_memory 0 (32 width) Address=0
000000023FDFFC0
exregion-0185 [36] ex_system_memory_space: system_memory 1 (32 width) Address=0
000000023FDFFC0
exregion-0290 [36] ex_system_io_space_han: system_iO 1 (8 width) Address=000000
00000000B2

repeated endlessly.

I think the problem resurfaced once I decided to let my sleep.sh
script leave the thermal driver loaded before going into S3
(suspecting that the bug might come back if I did that).  So I susect
that my modified DSDT didn't cause the S3 problems, it merely exposed
one even in the minimal configuration.

I'll attach the vanilla DSDT from /proc/acpi/dsdt

Comment 27 Sanjoy Mahajan 2006-03-09 21:46:34 UTC
Created attachment 7554 [details]
vanilla DSDT from /proc/acpi/dsdt

This is the vanilla DSDT from BIOS 1.11 (the latest BIOS) w/ 2.6.16-rc5.  This
boot had ec_intr=0 (I was checking that the S3 hangs go away, and they do).  If
that addition to the kernel command line can change the DSDT, let me know and
I'll repost after rebooting without ec_intr=0.
Comment 28 Luming Yu 2006-03-09 21:53:19 UTC
Hmmm, Then, thermal issue issue ?
acpi trace log may help.
Comment 29 Sanjoy Mahajan 2006-03-09 21:58:40 UTC
By "acpi trace log", do you mean using one of the acpi_debug_level
options?

Comment 30 Luming Yu 2006-03-09 22:13:22 UTC
yes, please dig out the useful info from acpi trace log.
Comment 31 Sanjoy Mahajan 2006-03-10 05:37:52 UTC
> yes, please dig out the useful info from acpi trace log.

/proc/cmdline: auto BOOT_IMAGE=16-rc5 ro root=305 idebus=66 apm=off acpi=force 
pci=noacpi console=ttyS0,115200 console=tty0 acpi_sleep=s3_bios cpufreq.debug=7
 acpi_debug=0xffffffff

where 16-rc5 is my LILO label for 2.6.16-rc5

The extract from the dmesgs:

Stopping tasks: ========================================================|
Execute Method: [\_SB_.LID0._PSW] (Node c1564808)
exregion-0185 [35] ex_system_memory_space: system_memory 0 (32 width) Address=0
000000023FDFF40
 acpi_ec-0458 [37] ec_intr_read          : Read [02] from address [32]
 acpi_ec-0508 [37] ec_intr_write         : Wrote [06] to address [32]
Execute Method: [\_SB_.SLPB._PSW] (Node c1564708)
Execute Method: [\_S3_] (Node c157a988)
exregion-0185 [35] ex_system_memory_space: system_memory 0 (32 width) Address=0
000000023FDFF40
Execute Method: [\_PTS] (Node c157ab48)
exregion-0185 [35] ex_system_memory_space: system_memory 0 (32 width) Address=0
000000023FDFF40
exregion-0185 [35] ex_system_memory_space: system_memory 0 (32 width) Address=0
000000023FDFF40
exregion-0185 [35] ex_system_memory_space: system_memory 0 (32 width) Address=0
000000023FDFF40
exregion-0185 [36] ex_system_memory_space: system_memory 0 (32 width) Address=0
000000023FDFFC0
exregion-0185 [36] ex_system_memory_space: system_memory 1 (32 width) Address=0
000000023FDFFC0
exregion-0185 [36] ex_system_memory_space: system_memory 0 (32 width) Address=0
000000023FDFFC4
exregion-0185 [36] ex_system_memory_space: system_memory 1 (32 width) Address=0
000000023FDFFC4
exregion-0185 [35] ex_system_memory_space: system_memory 0 (32 width) Address=0
000000023FDFFC0
exregion-0290 [36] ex_system_io_space_han: system_iO 1 (8 width) Address=000000
00000000B2
exregion-0185 [35] ex_system_memory_space: system_memory 0 (32 width) Address=0
000000023FDFFC0
exregion-0185 [36] ex_system_memory_space: system_memory 0 (32 width) Address=0
000000023FDFFC0
exregion-0185 [36] ex_system_memory_space: system_memory 1 (32 width) Address=0
000000023FDFFC0
exregion-0290 [36] ex_system_io_space_han: system_iO 1 (8 width) Address=000000
00000000B2

And then these above four lines (exregion-0185, -0185, -0185, -0290)
repeat until I reboot.

Comment 32 Sanjoy Mahajan 2006-03-12 23:24:04 UTC
Created attachment 7566 [details]
full dmesgs with 2.6.16-rc5, vanilla DSDT

Here are the full dmesgs from boot through the second S3 sleep (which is the
one that fails).  Search for the second "@@@@ SLEEP" (each one is sent across
the serial console by the sleep.sh script).  The sleep.sh script now turns off
thermal polling before the sleep, by echoing '0' to the polling_frequency /proc
files.
Comment 33 Sanjoy Mahajan 2006-03-13 07:13:19 UTC
Created attachment 7570 [details]
full dmesgs with boot-time debuggings set to 0x10

Here are the dmesgs.  I booted with acpi_dbg_level=0x10 acpi_dbg_layer=0x10 and
then did two sleep-wake cycles with no thermal module (both went fine), then
one  cycle with the thermal module loaded (went fine), and then the usual
failing second sleep with the thermal module still loaded.  The sleep-wake
cycles were done with acpi_debug_level=0x1F (rather than the 0x10 boot value).

I put ### marks where I've added comments to the dmesgs.
Comment 34 Sanjoy Mahajan 2006-03-23 20:14:55 UTC
Created attachment 7655 [details]
dmesgs with debugging patch to kernel and DSDT

Here is a tar.gz file with the patch showing the changes relative to a vanilla
kernel and to a DSDT with only THM0.  I booted with debug_{layer,level}=0x10 as
you suggested.	The file names within the archive explain what each log file
is.
Comment 35 Luming Yu 2006-03-23 20:34:39 UTC
Hmm, the debug object info was eliminated,
I'm sorry, try acpi_dbg_level=0x1F
Comment 36 Sanjoy Mahajan 2006-03-23 21:01:28 UTC
> Hmm, the debug object info was eliminated,
> I'm sorry, try acpi_dbg_level=0x1F

I did that with dbg_layer still 0x10.  This time it hung on the first
sleep, in the usual spot:

Stopping tasks: ====================================================|
Execute Method: [\_SB_.LID0._PSW] (Node c1564808)
Execute Method: [\_SB_.SLPB._PSW] (Node c1564708)
Execute Method: [\_S3_] (Node c157a988)
Execute Method: [\_PTS] (Node c157ab48)

Comment 37 Luming Yu 2006-03-23 21:38:59 UTC
Did you see any debug info before stopping tasks............
I want to track the calls to I2RB, and understand if CHKS return different.

Actually, this problem is pretty wired. We need to know if last I2RB works 
right before hang.

Thanks,
Luming
Comment 38 Sanjoy Mahajan 2006-03-23 21:48:01 UTC
> Did you see any debug info before stopping tasks............  

No, that was all there was.  That makes sense though.  The debug info
is only in the I2RC (copy of I2RB), which is called only from
THM0._TMP.  With thermal polling off, there shouldn't be any calls to
I2RC.

Should I get rid of the UPDT and I2RB copies, and just modify I2RB
itself to have the Debug objects?

> I want to track the calls to I2RB, and understand if CHKS return
> different.

Following the depth-first strategy, let's make sure that the debug
info is being printed as you want, then use it to investigate the
hangs.  Perhaps I'm not using the right debug_layer/level settings?

'acpi -t' will query the THM0 temperature and won't cause a hang, so I
could use it to test various debug_level settings and see which ones
produce the object you're looking for.  Once that's settled, then I
can use it on the sleep-wake cycles.

> Actually, this problem is pretty wired. We need to know if last I2RB
> works right before hang.

I don't think I2RB (or its copy) is being called during the 2nd sleep.


Comment 39 Luming Yu 2006-03-23 22:06:14 UTC
>No, that was all there was.  That makes sense though.  The debug info
>is only in the I2RC (copy of I2RB), which is called only from
>THM0._TMP.  With thermal polling off, there shouldn't be any calls to
>I2RC.

I assume the last call to THM0._TMP, in turn last call to UPDT, in turn last 
call to I2RB(c) trigger the later 2nd suspend hang.
So, I need to understand the call trace of those last calls.

>Should I get rid of the UPDT and I2RB copies, and just modify I2RB
>itself to have the Debug objects?

NO.

>so Icould use it to test various debug_level settings and see which ones
>produce the object you're looking for.  Once that's settled, then I
>can use it on the sleep-wake cycles.

Yes


Comment 40 Sanjoy Mahajan 2006-03-23 22:12:34 UTC
> I assume the last call to THM0._TMP, in turn last call to UPDT, in
> turn last call to I2RB(c) trigger the later 2nd suspend hang.

Oh, and it's called in the first wakeup, which must be the culprit:

Restarting tasks...<7>hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
 done
Execute Method: [\_SI_._SST] (Node e3f8a848)
Execute Method: [\_WAK] (Node e3f8aa48)
Execute Method: [\_TZ_.THM0._PSV] (Node e3f8bdc8)
Execute Method: [\_TZ_.THM0._TC1] (Node e3f8bd48)
Execute Method: [\_TZ_.THM0._TC2] (Node e3f8bd08)
Execute Method: [\_TZ_.THM0._TSP] (Node e3f8bcc8)
Execute Method: [\_TZ_.THM0._AC0] (Node e3f8bec8)
Execute Method: [\_SI_._SST] (Node e3f8a848)
Execute Method: [\_TZ_.THM0._TMP] (Node e3f8bf08)
uhci_hcd 0000:00:07.2: suspend_rh (auto-stop)
Execute Method: [\_SB_.LID0._PSW] (Node c1574808)
Execute Method: [\_SB_.SLPB._PSW] (Node c1574708)


>> so Icould use acpi -t to test various debug_level settings and see
>> which ones produce the object you're looking for.  Once that's
>> settled, then I can use it on the sleep-wake cycles.

> Yes

What settings should I try?  And what should I look for in the dmesgs
(what does this debug object look like)?

Comment 41 Luming Yu 2006-03-23 22:18:27 UTC
>What settings should I try?  And what should I look for in the dmesgs
>(what does this debug object look like)?

try boot kernel without any debug selection option.

cat /proc/acpi/debug_level, and cat /proc/acpi/debug_layer
you can see the supported debug option.

Comment 42 Sanjoy Mahajan 2006-03-23 22:23:44 UTC
> cat /proc/acpi/debug_level, and cat /proc/acpi/debug_layer

I've done that many times but the several names are ambiguous to me.
For example, there's INFO and VERBOSE_INFO; or FUNCTIONS and NAMES and
INIT_NAMES.  I also don't understand why DEBUG_OBJECT didn't work
before to print the Debug variable (I'd tried debug_level=0x1F).
Maybe I need more debug_layers.  But in the debug_layer what's the
difference between the EXECUTER and the DISPATCHER?

Description		  Hex        SET
ACPI_LV_ERROR		  0x00000001 [ ]
ACPI_LV_WARN		  0x00000002 [ ]
ACPI_LV_INIT		  0x00000004 [ ]
ACPI_LV_DEBUG_OBJECT	  0x00000008 [ ]
ACPI_LV_INFO		  0x00000010 [*]
ACPI_LV_INIT_NAMES	  0x00000020 [ ]
ACPI_LV_PARSE		  0x00000040 [ ]
ACPI_LV_LOAD		  0x00000080 [ ]
ACPI_LV_DISPATCH	  0x00000100 [ ]
ACPI_LV_EXEC		  0x00000200 [ ]
ACPI_LV_NAMES		  0x00000400 [ ]
ACPI_LV_OPREGION	  0x00000800 [ ]
ACPI_LV_BFIELD		  0x00001000 [ ]
ACPI_LV_TABLES		  0x00002000 [ ]
ACPI_LV_VALUES		  0x00004000 [ ]
ACPI_LV_OBJECTS		  0x00008000 [ ]
ACPI_LV_RESOURCES	  0x00010000 [ ]
ACPI_LV_USER_REQUESTS	  0x00020000 [ ]
ACPI_LV_PACKAGE		  0x00040000 [ ]
ACPI_LV_ALLOCATIONS	  0x00100000 [ ]
ACPI_LV_FUNCTIONS	  0x00200000 [ ]
ACPI_LV_OPTIMIZATIONS	  0x00400000 [ ]
ACPI_LV_MUTEX		  0x01000000 [ ]
ACPI_LV_THREADS		  0x02000000 [ ]
ACPI_LV_IO		  0x04000000 [ ]
ACPI_LV_INTERRUPTS	  0x08000000 [ ]
ACPI_LV_AML_DISASSEMBLE	  0x10000000 [ ]
ACPI_LV_VERBOSE_INFO	  0x20000000 [ ]
ACPI_LV_FULL_TABLES	  0x40000000 [ ]
ACPI_LV_EVENTS		  0x80000000 [ ]
--
debug_level = 0x00000010 (* = enabled)
Description		    Hex        SET
ACPI_UTILITIES		    0x00000001 [ ]
ACPI_HARDWARE		    0x00000002 [ ]
ACPI_EVENTS		    0x00000004 [ ]
ACPI_TABLES		    0x00000008 [ ]
ACPI_NAMESPACE		    0x00000010 [*]
ACPI_PARSER		    0x00000020 [ ]
ACPI_DISPATCHER		    0x00000040 [ ]
ACPI_EXECUTER		    0x00000080 [ ]
ACPI_RESOURCES		    0x00000100 [ ]
ACPI_CA_DEBUGGER	    0x00000200 [ ]
ACPI_OS_SERVICES	    0x00000400 [ ]
ACPI_CA_DISASSEMBLER	    0x00000800 [ ]
ACPI_COMPILER		    0x00001000 [ ]
ACPI_TOOLS		    0x00002000 [ ]
ACPI_ALL_DRIVERS	    0xFFFF0000 [ ]
--
debug_layer = 0x00000010 (* = enabled, - = partial)

Comment 43 Sanjoy Mahajan 2006-03-24 00:20:47 UTC
Created attachment 7656 [details]
commented dmesgs: boot,  ok sleep-wake, then sleep hang

Here are commented dmesgs (lines beginning with #) for a failing 2nd sleep,
with debugging info!  I booted with debug_layer=0x10, debug_level=0x1F, and
then used 'acpi -t' with different values until I settled on debug_layer=0x90,
debug_level=0x1F, as printing the [ACPI Debug] lines without too much extra
junk.  Probably it could be still cleaner.

The first chunk is the boot messages plus me messing around with acpi -t to get
the right debug values; then the first sleep-wake cycle (which worked); and
then the second sleep (which hung).  There's one strange spot I can't figure
out: in the wakeup msgs, the [ACPI Debug] comes as part of the _SST and not as
part of the THM0._TMP (I marked it in the dmesgs).
Comment 44 Luming Yu 2006-03-24 06:18:55 UTC
># why does the Debug line show up here, and not right after the THM0._TMP 
>line?
>[ACPI Debug]  Integer: 0x00000000

Good question!
Could you try option: acpi_serialize

Comment 45 Sanjoy Mahajan 2006-03-24 07:47:04 UTC
> Could you try option: acpi_serialize

Here's the kernel command line:

BOOT_IMAGE=904b0f361ebf ro root=305 idebus=66 apm=off acpi=force pci=noacpi console=ttyS0,115200 console=tty0 acpi_sleep=s3_bios cpufreq.debug=7 acpi_dbg_level=0x10 acpi_dbg_layer=0x10 acpi_dbg_layer=0x90 acpi_dbg_level=0x1F acpi_serialize

Here are the dmesgs from the first cycle, which shows the Debug still
showing late:

eth0: removing device
Unloaded prism54 driver
PM: Preparing system for mem sleep
Stopping tasks: =====================================================|
Execute Method: [\_SB_.LID0._PSW] (Node c1574808)
exregion-0185 [20] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
Execute Method: [\_SB_.SLPB._PSW] (Node c1574708)
Execute Method: [\_S3_] (Node e3f8a908)
exregion-0185 [20] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
Execute Method: [\_PTS] (Node e3f8aac8)
exregion-0185 [20] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0185 [20] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0185 [20] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0185 [21] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0
exregion-0185 [21] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFFC0
exregion-0185 [21] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC4
exregion-0185 [21] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFFC4
exregion-0185 [20] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0
exregion-0290 [21] ex_system_io_space_han: system_iO 1 (8 width) Address=00000000000000B2
exregion-0185 [20] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0
exregion-0290 [21] ex_system_io_space_han: system_iO 0 (16 width) Address=000000000000EF20
exregion-0290 [21] ex_system_io_space_han: system_iO 1 (16 width) Address=000000000000EF20
exregion-0290 [20] ex_system_io_space_han: system_iO 0 (8 width) Address=000000000000EF37
exregion-0185 [20] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0185 [20] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0185 [20] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
Execute Method: [\_SI_._SST] (Node e3f8a848)
exregion-0185 [20] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0185 [20] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0185 [20] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
uhci_hcd 0000:00:07.2: suspend_rh
uhci_hcd 0000:00:07.2: uhci_suspend
uhci_hcd 0000:00:07.2: --> PCI D0/legacy
PM: Entering mem sleep
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
Back to C!
PM: Finishing wakeup.
Execute Method: [\_GPE._L0B] (Node e3f8a7c8)
PCI: Found IRQ 11 for device 0000:00:02.0
PCI: Sharing IRQ 11 with 0000:00:06.0
PCI: Sharing IRQ 11 with 0000:01:00.0
PCI: Found IRQ 11 for device 0000:00:02.1
uhci_hcd 0000:00:07.2: PCI legacy resume
PCI: Found IRQ 11 for device 0000:00:07.2
uhci_hcd 0000:00:07.2: uhci_resume
uhci_hcd 0000:00:07.2: uhci_check_and_reset_hc: legsup = 0x2000
uhci_hcd 0000:00:07.2: Performing full reset
usb usb1: root hub lost power or was reset
uhci_hcd 0000:00:07.2: suspend_rh
usb usb1: finish resume
uhci_hcd 0000:00:07.2: wakeup_rh
Restarting tasks...<7>hub 1-0:1.0: state 7 ports 2 chg 0000 evt 0000
 done
Execute Method: [\_SI_._SST] (Node e3f8a848)
exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
Execute Method: [\_WAK] (Node e3f8aa48)
exregion-0185 [23] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0185 [23] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFF40
exregion-0185 [23] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0185 [23] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFF40
exregion-0185 [23] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0185 [23] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFF40
exregion-0290 [23] ex_system_io_space_han: system_iO 0 (16 width) Address=000000000000EF20
exregion-0290 [23] ex_system_io_space_han: system_iO 1 (16 width) Address=000000000000EF20
exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0185 [23] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0
exregion-0185 <7>uhci_hcd 0000:00:07.2: suspend_rh (auto-stop)
[23] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFFC0
exregion-0185 [23] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC4
exregion-0185 [23] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFFC4
exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0
exregion-0290 [23] ex_system_io_space_han: system_iO 1 (8 width) Address=00000000000000B2
exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0
exregion-0290 [25] ex_system_io_space_han: system_iO 1 (8 width) Address=00000000000015EE
exregion-0290 [24] ex_system_io_space_han: system_iO 0 (8 width) Address=00000000000015EF
exregion-0185 [23] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF4C
exregion-0185 [23] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFF4C
exregion-0185 [23] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF4C
exregion-0185 [23] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFF4C
exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0185 [23] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC8
exregion-0185 [23] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFFC8
exregion-0185 [23] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0
exregion-0185 [23] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFFC0
exregion-0185 [23] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC4
exregion-0185 [23] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFFC4
exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0
exregion-0290 [23] ex_system_io_space_han: system_iO 1 (8 width) Address=00000000000000B2
exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFFC0
exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0365 [22] ex_pci_config_space_ha: pci_config 0 (8) Seg(0000) Bus(0000) Dev(0001) Func(0000) Reg(00b0)
exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0290 [22] ex_system_io_space_han: system_iO 0 (8 width) Address=000000000000EF37
exregion-0290 [23] ex_system_io_space_han: system_iO 0 (8 width) Address=000000000000EF37
exregion-0290 [23] ex_system_io_space_han: system_iO 1 (8 width) Address=000000000000EF37
exregion-0365 [23] ex_pci_config_space_ha: pci_config 0 (32) Seg(0000) Bus(0000) Dev(0004) Func(0000) Reg(00dc)
exregion-0365 [23] ex_pci_config_space_ha: pci_config 1 (32) Seg(0000) Bus(0000) Dev(0004) Func(0000) Reg(00dc)
exregion-0290 [23] ex_system_io_space_han: system_iO 0 (8 width) Address=000000000000EF37
exregion-0290 [23] ex_system_io_space_han: system_iO 1 (8 width) Address=000000000000EF37
exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0290 [23] ex_system_io_space_han: system_iO 0 (8 width) Address=000000000000EF37
exregion-0290 [23] ex_system_io_space_han: system_iO 1 (8 width) Address=000000000000EF37
exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0290 [23] ex_system_io_space_han: system_iO 0 (8 width) Address=000000000000EF35
exregion-0290 [23] ex_system_io_space_han: system_iO 1 (8 width) Address=000000000000EF35
exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0185 [23] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF4C
exregion-0185 [23] ex_system_memory_space: system_memory 1 (32 width) Address=0000000023FDFF4C
exregion-0365 [23] ex_pci_config_space_ha: pci_config 0 (32) Seg(0000) Bus(0000) Dev(0007) Func(0001) Reg(0040)
exregion-0365 [23] ex_pci_config_space_ha: pci_config 1 (32) Seg(0000) Bus(0000) Dev(0007) Func(0001) Reg(0040)
exregion-0365 [23] ex_pci_config_space_ha: pci_config 0 (32) Seg(0000) Bus(0000) Dev(0007) Func(0001) Reg(0040)
exregion-0365 [23] ex_pci_config_space_ha: pci_config 1 (32) Seg(0000) Bus(0000) Dev(0007) Func(0001) Reg(0040)
exregion-0365 [23] ex_pci_config_space_ha: pci_config 0 (32) Seg(0000) Bus(0000) Dev(0007) Func(0001) Reg(0040)
exregion-0365 [23] ex_pci_config_space_ha: pci_config 1 (32) Seg(0000) Bus(0000) Dev(0007) Func(0001) Reg(0040)
exregion-0365 [23] ex_pci_config_space_ha: pci_config 0 (8) Seg(0000) Bus(0000) Dev(0001) Func(0000) Reg(00b1)
exregion-0365 [23] ex_pci_config_space_ha: pci_config 1 (8) Seg(0000) Bus(0000) Dev(0001) Func(0000) Reg(00b1)
exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0185 [22] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
Execute Method: [\_TZ_.THM0._PSV] (Node e3f8bdc8)
Execute Method: [\_TZ_.THM0._TC1] (Node e3f8bd48)
exregion-0185 [30] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF38
Execute Method: [\_TZ_.THM0._TC2] (Node e3f8bd08)
exregion-0185 [30] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF38
Execute Method: [\_TZ_.THM0._TSP] (Node e3f8bcc8)
exregion-0185 [30] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF3C
Execute Method: [\_TZ_.THM0._AC0] (Node e3f8bec8)
exregion-0185 [31] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
Execute Method: [\_TZ_.THM0._TMP] (Node e3f8bf08)
exregion-0185 [32] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
Execute Method: [\_SI_._SST] (Node e3f8a848)
exregion-0185 [47] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0185 [47] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0185 [47] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
exregion-0185 [47] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
[ACPI Debug]  Integer: 0x00000000
Execute Method: [\_SB_.LID0._PSW] (Node c1574808)
exregion-0185 [23] ex_system_memory_space: system_memory 0 (32 width) Address=0000000023FDFF40
Execute Method: [\_SB_.SLPB._PSW] (Node c1574708)
ds: ds_open(socket 0)
ds: ds_open(socket 1)
ds: ds_open(socket 2)
pccard: card ejected from slot 1
PCMCIA: socket e3be9828: *** DANGER *** unable to remove socket power
ds: ds_release(socket 0)
ds: ds_release(socket 1)

No surprise: It hung on the second cycle, with the same log as before.

Comment 46 Luming Yu 2006-03-24 19:20:29 UTC
>Here are the dmesgs from the first cycle, which shows the Debug still
>showing late:
It should be due to kacpid preempted the execution of acpi_leave_sleep_state.

Let continue hack with:
chage 2. in acpi_pm_finish:
Right after: 
	arg.integer.value = ACPI_SST_WORKING;
	status = acpi_evaluate_object (NULL, METHOD_NAME__SST, &arg_list, 
NULL);
	if (ACPI_FAILURE (status) && status != AE_NOT_FOUND) {
		ACPI_REPORT_ERROR (("Method _SST failed, %s\n",
			acpi_format_exception (status)));
	}
insert: acpi_in_suspend =NO

6. in acpi_thermal_check,
just before: acpi_thermal_get_temperature(tz)
  insert: 
if (acpi_in_suspend == YES)
  return_VOID;

Let's check if acpi_leave_sleep_state won't be preempted, and exuection 
sequence is like:

_SST
_WAK
_SST

Comment 47 Sanjoy Mahajan 2006-03-26 21:07:39 UTC
Okay, I first merged the acpi_in_suspend changes from before with the
Debug and I2RBcopy changes to the DSDT (which still has only THM0 in
it).  Then I put in your latest two changes, but booting gives:

Freeing unused kernel memory: 228k freed
FATAL: kernel too old
kernel panic not syncing: Attempted to kill init!

================== change 1 ============================

  change 2. in acpi_pm_finish:
  Right after: 
	  arg.integer.value = ACPI_SST_WORKING;
	  status = acpi_evaluate_object (NULL, METHOD_NAME__SST, &arg_list, NULL);
	  if (ACPI_FAILURE (status) && status != AE_NOT_FOUND) {
		  ACPI_REPORT_ERROR (("Method _SST failed, %s\n",
			  acpi_format_exception (status)));
	  }
  insert: acpi_in_suspend =NO

That code didn't appear in acpi_pm_finish (in sleep/main.c), but a close
relative of it appeared at the end of acpi_leave_sleep_state (in
hardware/hwsleep.c).  So, in acpi_pm_finish, I moved the
acpi_in_suspend = FALSE line to come after apci_leave_sleep_state
returns.  I'm using 2.6.16-rc5; are you using 2.6.16?  Anyway here's
what acpi_pm_finish now looks like:

static int acpi_pm_finish(suspend_state_t pm_state)
{
	u32 acpi_state = acpi_suspend_states[pm_state];

	acpi_leave_sleep_state(acpi_state);
	acpi_in_suspend = FALSE;   /* used to be one line above */
	acpi_disable_wakeup_device(acpi_state);

	/* reset firmware waking vector */
	acpi_set_firmware_waking_vector((acpi_physical_address) 0);

	if (init_8259A_after_S1) {
		printk("Broken toshiba laptop -> kicking interrupts\n");
		init_8259A(0);
	}
	return 0;
}


====================== change 2 ============================

  6. in acpi_thermal_check,
  just before: acpi_thermal_get_temperature(tz)
    insert: 
  if (acpi_in_suspend == YES)
    return_VOID;

I made that change too.

============================================================

For completeness, here's the diff between the vanilla kernel and the one
I just tried (let me know if I should also attach it at the bugzilla),
except I haven't included the modified DSDT:

diff -r ac486e270597 -r 9b6b830abe8f .config
--- a/.config	Sat Mar 18 08:35:34 2006 -0500
+++ b/.config	Sun Mar 26 23:06:16 2006 -0500
@@ -217,7 +217,8 @@ CONFIG_ACPI_THERMAL=m
 # CONFIG_ACPI_ASUS is not set
 CONFIG_ACPI_IBM=m
 # CONFIG_ACPI_TOSHIBA is not set
-# CONFIG_ACPI_CUSTOM_DSDT is not set
+CONFIG_ACPI_CUSTOM_DSDT=y
+CONFIG_ACPI_CUSTOM_DSDT_FILE="../../dsdt/600x.hex"
 CONFIG_ACPI_BLACKLIST_YEAR=0
 CONFIG_ACPI_DEBUG=y
 CONFIG_ACPI_EC=y
diff -r ac486e270597 -r 9b6b830abe8f drivers/acpi/ec.c
--- a/drivers/acpi/ec.c	Sat Mar 18 08:35:34 2006 -0500
+++ b/drivers/acpi/ec.c	Sun Mar 26 23:06:16 2006 -0500
@@ -456,6 +456,7 @@ static int acpi_ec_intr_read(union acpi_
 	acpi_hw_low_level_read(8, data, &ec->common.data_addr);
 	ACPI_DEBUG_PRINT((ACPI_DB_INFO, "Read [%02x] from address [%02x]\n",
 			  *data, address));
+	acpi_os_sleep (1);
 
       end:
 	up(&ec->intr.sem);
@@ -506,6 +507,7 @@ static int acpi_ec_intr_write(union acpi
 
 	ACPI_DEBUG_PRINT((ACPI_DB_INFO, "Wrote [%02x] to address [%02x]\n",
 			  data, address));
+	acpi_os_sleep (1);
 
 	up(&ec->intr.sem);
 
diff -r ac486e270597 -r 9b6b830abe8f drivers/acpi/osl.c
--- a/drivers/acpi/osl.c	Sat Mar 18 08:35:34 2006 -0500
+++ b/drivers/acpi/osl.c	Sun Mar 26 23:06:16 2006 -0500
@@ -634,6 +634,8 @@ static void acpi_os_execute_deferred(voi
 	return_VOID;
 }
 
+extern int acpi_in_suspend;
+
 acpi_status
 acpi_os_queue_for_execution(u32 priority,
 			    acpi_osd_exec_callback function, void *context)
@@ -643,6 +645,9 @@ acpi_os_queue_for_execution(u32 priority
 	struct work_struct *task;
 
 	ACPI_FUNCTION_TRACE("os_queue_for_execution");
+	if (acpi_in_suspend)	/* in case kacpid is causing the queue */
+		/* should use a better error code, but oh well */
+	  return_ACPI_STATUS(AE_OK);
 
 	ACPI_DEBUG_PRINT((ACPI_DB_EXEC,
 			  "Scheduling function [%p(%p)] for deferred execution.\n",
diff -r ac486e270597 -r 9b6b830abe8f drivers/acpi/sleep/main.c
--- a/drivers/acpi/sleep/main.c	Sat Mar 18 08:35:34 2006 -0500
+++ b/drivers/acpi/sleep/main.c	Sun Mar 26 23:06:16 2006 -0500
@@ -19,6 +19,12 @@
 #include <acpi/acpi_drivers.h>
 #include "sleep.h"
 
+/* for functions putting machine to sleep to know that we're
+   suspending, so that they can careful about what AML methods they
+   invoke (to avoid trying untested BIOS code paths) */
+int acpi_in_suspend;
+EXPORT_SYMBOL(acpi_in_suspend);
+
 u8 sleep_states[ACPI_S_STATE_COUNT];
 
 static struct pm_ops acpi_pm_ops;
@@ -55,6 +61,8 @@ static int acpi_pm_prepare(suspend_state
 		printk("acpi_pm_prepare does not support %d \n", pm_state);
 		return -EPERM;
 	}
+	acpi_os_wait_events_complete(NULL);
+	acpi_in_suspend = TRUE;
 	return acpi_sleep_prepare(acpi_state);
 }
 
@@ -132,6 +140,7 @@ static int acpi_pm_finish(suspend_state_
 	u32 acpi_state = acpi_suspend_states[pm_state];
 
 	acpi_leave_sleep_state(acpi_state);
+	acpi_in_suspend = FALSE;
 	acpi_disable_wakeup_device(acpi_state);
 
 	/* reset firmware waking vector */
diff -r ac486e270597 -r 9b6b830abe8f drivers/acpi/thermal.c
--- a/drivers/acpi/thermal.c	Sat Mar 18 08:35:34 2006 -0500
+++ b/drivers/acpi/thermal.c	Sun Mar 26 23:06:16 2006 -0500
@@ -79,6 +79,8 @@ static int tzp;
 static int tzp;
 module_param(tzp, int, 0);
 MODULE_PARM_DESC(tzp, "Thermal zone polling frequency, in 1/10 seconds.\n");
+
+extern int acpi_in_suspend;
 
 static int acpi_thermal_add(struct acpi_device *device);
 static int acpi_thermal_remove(struct acpi_device *device, int type);
@@ -683,6 +685,8 @@ static void acpi_thermal_run(unsigned lo
 static void acpi_thermal_run(unsigned long data)
 {
 	struct acpi_thermal *tz = (struct acpi_thermal *)data;
+	if (acpi_in_suspend)	/* thermal methods might cause a hang */
+		return_VOID;	/* so don't do them */
 	if (!tz->zombie)
 		acpi_os_queue_for_execution(OSD_PRIORITY_GPE,
 					    acpi_thermal_check, (void *)data);
@@ -705,6 +709,8 @@ static void acpi_thermal_check(void *dat
 
 	state = tz->state;
 
+	if (acpi_in_suspend)
+		return_VOID;
 	result = acpi_thermal_get_temperature(tz);
 	if (result)
 		return_VOID;
@@ -1224,6 +1230,9 @@ static void acpi_thermal_notify(acpi_han
 	struct acpi_device *device = NULL;
 
 	ACPI_FUNCTION_TRACE("acpi_thermal_notify");
+
+	if (acpi_in_suspend)	/* thermal methods might cause a hang */
+		return_VOID;	/* so don't do them */
 
 	if (!tz)
 		return_VOID;

Comment 48 Luming Yu 2006-03-26 23:13:39 UTC
why? I tried your patch, boot fine!
Could you try to remove that patch, and build from scratch?

Thanks,
Luming
Comment 49 Sanjoy Mahajan 2006-03-27 08:50:05 UTC
Created attachment 7686 [details]
dmesgs from mostly working kernel! (.tgz file)

Extracted from the README included in this .tgz file:

The previous kernel failed with 'kernel panic'.  In trying to figure out the
problem I decided to add printk's rather than try an unmodified kernel (since I
know that works).  So this one is the same kernel but with printk's around all
the uses of and changes to acpi_in_suspend.  This kernel boots fine; and I
don't know why.

Kernel command line: BOOT_IMAGE=6eb2c00a8f62 ro root=305 idebus=66 apm=off
acpi=force pci=noacpi console=ttyS0,115200 console=tty0 acpi_sleep=s3_bios
cpufreq.debug=7 acpi_dbg_level=0x10 acpi_dbg_layer=0x10 acpi_dbg_layer=0x90
acpi_dbg_level=0x1F acpi_serialize

Anyway, since it booted fine, I could test it.

It did one sleep/wake cycle fine, as usual.  And on the second cycle, it went
to sleep!  But it wouldn't wake up with the Fn key, and I had to briefly slide
the power switch to wake it up. I had thought it was hung there forever, so I
slid the power switch to turn it off and reboot.  I was surprised to find that
it began waking up.  It finished waking up and then shutdown (almost certainly
because I used the power switch).

So the suspend hangs seems improved, but a mystery is why I had to use the
power switch.  Perhaps it's related to the acpi_serialize, which makes the acpi
system act more like Windows.  A perhaps related data point is from an earlier
experiment: I told the vanilla DSDT to pretend to be Win98 (by hacking it to
return True when it did that test); the resulting kernel also wouldn't wake up
with the 'Fn' key and I had to use the power switch.  So my next test will be
to try the 6eb2c00a8f62 kernel without acpi_serialize.

Contents of the .tgz file:

README		: This file
cycle1.dmesg	: dmesgs from the first sleep/wake cycle
cycle2.log	: dmesgs extracted from the syslog (hence the .log
		  extension) for the second sleep/wake
from-serial-console.dmesg : everything captured across the serial
		  console from bootup to shutdown
hacks.diff	: the full diff between a vanilla kernel/DSDT/config
		  and this one, including the modified DSDT.
config		: the kernel config (includes the diff in the file above)
Comment 50 Sanjoy Mahajan 2006-03-27 13:47:50 UTC
Created attachment 7691 [details]
dmesgs from booting without acpi_serialize

These dmesgs are from the kernel in the previous attachment (the kernel that
needed the power switch for the 2nd wakeup), but without acpi_serialize.  It
hung on the fIRST sleep, in the usual way.

Maybe worth tracking down what goes wrong here, since the bug is really easy to
trigger now?
Comment 51 Luming Yu 2006-03-27 18:25:18 UTC
To comment# 49:

Hmm, we made a progress! :-)
Did you try 2nd, 3rd, 4th... sleep cycles?
Maybe you can try vanilla DSDT too?

To comment# 50:
According to the experience of how _TMP affecting the BIOS behavior, I doubt 
the quality of the BIOS on this box. So, I recommend you use acpi_serialize as 
default, just because there could have some un-tested code paths causing 
trouble in future.

If you verify eveything is ok, then we can move on to the Fn-key-wakeup issue.
After that, probably, you can come up with a real/clean patch.

Comment 52 Sanjoy Mahajan 2006-03-27 18:28:10 UTC
> Hmm, we made a progress! :-)

I agree.

> Did you try 2nd, 3rd, 4th... sleep cycles?

No, because the power switch (required to wake it up) also shut it
down.  Oh, I can tell the acpid to shut up and not power it off, and
see if I can then wake it up without having it shut down.  Then I can
do 3rd and 4th etc. cycles.

> Maybe you can try vanilla DSDT too?

I'll try that as well.

Comment 53 Sanjoy Mahajan 2006-03-27 18:30:06 UTC
> After that, probably, you can come up with a real/clean patch.

There's also the issue of the kernel panic, which went away by adding
a few printk's.  So I don't know how reliable a patch can be if it
might cause kernel panics, or not cause them but only in poorly
understood situations.

Comment 54 Sanjoy Mahajan 2006-03-27 20:37:07 UTC
> Did you try 2nd, 3rd, 4th... sleep cycles?
> Maybe you can try vanilla DSDT too?

Just tried 10 cycles with a vanilla DSDT and it worked fine (had to
wake up cycles 2-10 with the power button).

Comment 55 Luming Yu 2006-03-28 00:46:26 UTC
Then, don't forget to delete the hack of removing THM{2,6,7}.
Comment 56 Sanjoy Mahajan 2006-03-28 04:40:49 UTC
>> Just tried 10 cycles with a vanilla DSDT and it worked fine (had to
>> wake up cycles 2-10 with the power button).

> Then, don't forget to delete the hack of removing THM{2,6,7}.

That's what I tested (the BIOS-supplied DSDT):

$ grep DSDT .config
# CONFIG_ACPI_CUSTOM_DSDT is not set

Comment 57 Luming Yu 2006-03-28 05:02:44 UTC
>$ grep DSDT .config
># CONFIG_ACPI_CUSTOM_DSDT is not set
But from "from-serial-console.dmesg", I just found THM0?
Comment 58 Sanjoy Mahajan 2006-03-28 05:08:53 UTC
> But from "from-serial-console.dmesg", I just found THM0?

Right, those messages are just with THM0.  Then, feeling brave, I
recompiled with the BIOS DSDT and told acpid not to power off the
machine when the power switch is used.  I ran 10 sleep-wake cycles
with that kernel, and they all worked (except for needing to use the
power switch to wake up the 2nd and subsequent sleeps).

Comment 59 Luming Yu 2006-03-28 05:14:43 UTC
So, do you mean THM0,2,6,7 are all included in your 10 cycles testing?
I mean there are NO kernel hack exists to prevent THM 2,6,7?
If yeas, then, it's time to do some clean up, and make a clean-patch.  :-)
But, I'm still confused with the printk and kernel panic stuff. 
Comment 60 Sanjoy Mahajan 2006-03-28 05:28:28 UTC
> I mean there are NO kernel hack exists to prevent THM 2,6,7?

Right, no hacks.  That kernel hack was gone long ago: Once I took
THM2,6,7 out of the DSDT, I removed the hack in thermal.c.  So the
kernel with 10 cycles used the BIOS-supplied DSDT.  Below is the full
diff of every file (except my own compiling scripts) relative to
vanilla kernel/BIOS DSDT (i.e. config with CUSTOM_DSDT unset).

> If yes, then, it's time to do some clean up, and make a clean-patch.  :-)

Except I'm worried about the unexplained kernel panic (when I didn't
use all the printk's).  Maybe we have not got to the bottom of the
problem, and have merely pushed it elsewhere?

Another worrisome point is that the changes made the system hang on
the first sleep, unless I use acpi_serialize.  But acpi_serialize is
not ideal on this box, because it slows down 'acpi -t' by a factor of
4, from 0.12 sec real time to 0.5 sec (but maybe nothing can be done
about that).

But here's the diff:

The current revision:
$ hg tip
changeset:   53:ec4441b5c420
tag:         tip
user:        Sanjoy Mahajan <sanjoy@mrao.cam.ac.uk>
date:        Mon Mar 27 22:04:23 2006 -0500
summary:     Use vanilla DSDT.

The base revision (revision 0 is untarring 2.6.16-rc5.tar.gz):
$ hg log -r 1
changeset:   1:ac486e270597
tag:         vanilla
user:        Sanjoy Mahajan <sanjoy@mrao.cam.ac.uk>
date:        Sat Mar 18 08:35:34 2006 -0500
summary:     Added my usual .config (uses vanilla DSDT).

The diff of the current revision relative to the base revision:
$ hg diff -r 1
diff -r ac486e270597 drivers/acpi/ec.c
--- a/drivers/acpi/ec.c	Sat Mar 18 08:35:34 2006 -0500
+++ b/drivers/acpi/ec.c	Tue Mar 28 08:20:44 2006 -0500
@@ -456,6 +456,7 @@ static int acpi_ec_intr_read(union acpi_
   acpi_hw_low_level_read(8, data, &ec->common.data_addr);
   ACPI_DEBUG_PRINT((ACPI_DB_INFO, "Read [%02x] from address [%02x]\n",
					   *data, address));
+					   acpi_os_sleep (1);
 
       end:
       up(&ec->intr.sem);
@@ -506,6 +507,7 @@ static int acpi_ec_intr_write(union acpi
 
	ACPI_DEBUG_PRINT((ACPI_DB_INFO, "Wrote [%02x] to address [%02x]\n",
					         data, address));
+						 acpi_os_sleep (1);
 
	up(&ec->intr.sem);
 
diff -r ac486e270597 drivers/acpi/osl.c
--- a/drivers/acpi/osl.c	Sat Mar 18 08:35:34 2006 -0500
+++ b/drivers/acpi/osl.c	Tue Mar 28 08:20:44 2006 -0500
@@ -634,6 +634,8 @@ static void acpi_os_execute_deferred(voi
   return_VOID;
 }
 
+extern int acpi_in_suspend;
+
 acpi_status
 acpi_os_queue_for_execution(u32 priority,
						    acpi_osd_exec_callback function, void *context)
@@ -643,6 +645,12 @@ acpi_os_queue_for_execution(u32 priority
   struct work_struct *task;
 
	ACPI_FUNCTION_TRACE("os_queue_for_execution");
+	if (acpi_in_suspend)	/* in case kacpid is causing the queue */
+	  {
+	      printk(KERN_INFO PREFIX "acpi_os_queue_for_execution: acpi_in_suspend=%d, so returning\n", acpi_in_suspend);
+			       /* should use a better error code, but oh well */
+			         return_ACPI_STATUS(AE_OK);
+				   }
 
	ACPI_DEBUG_PRINT((ACPI_DB_EXEC,
			  "Scheduling function [%p(%p)] for deferred execution.\n",
diff -r ac486e270597 drivers/acpi/sleep/main.c
--- a/drivers/acpi/sleep/main.c	Sat Mar 18 08:35:34 2006 -0500
+++ b/drivers/acpi/sleep/main.c	Tue Mar 28 08:20:44 2006 -0500
@@ -19,6 +19,12 @@
 #include <acpi/acpi_drivers.h>
 #include "sleep.h"
 
+/* for functions putting machine to sleep to know that we're
+   suspending, so that they can careful about what AML methods they
+   invoke (to avoid trying untested BIOS code paths) */
+int acpi_in_suspend;
+EXPORT_SYMBOL(acpi_in_suspend);
+
 u8 sleep_states[ACPI_S_STATE_COUNT];
 
 static struct pm_ops acpi_pm_ops;
@@ -55,6 +61,9 @@ static int acpi_pm_prepare(suspend_state
	 printk("acpi_pm_prepare does not support %d \n", pm_state);
				 return -EPERM;
				 }
+				 acpi_os_wait_events_complete(NULL);
+				 printk(KERN_INFO PREFIX "acpi_pm_prepare: Setting acpi_in_suspend.\n");
+				 acpi_in_suspend = TRUE;
				 return acpi_sleep_prepare(acpi_state);
 }
 
@@ -132,6 +141,8 @@ static int acpi_pm_finish(suspend_state_
   u32 acpi_state = acpi_suspend_states[pm_state];
 
	acpi_leave_sleep_state(acpi_state);
+	printk(KERN_INFO PREFIX "acpi_pm_finish: Unsetting acpi_in_suspend.\n");
+	acpi_in_suspend = FALSE;
	acpi_disable_wakeup_device(acpi_state);
 
	/* reset firmware waking vector */
diff -r ac486e270597 drivers/acpi/thermal.c
--- a/drivers/acpi/thermal.c	Sat Mar 18 08:35:34 2006 -0500
+++ b/drivers/acpi/thermal.c	Tue Mar 28 08:20:44 2006 -0500
@@ -79,6 +79,8 @@ static int tzp;
 static int tzp;
 module_param(tzp, int, 0);
 MODULE_PARM_DESC(tzp, "Thermal zone polling frequency, in 1/10 seconds.\n");
+
+extern int acpi_in_suspend;
 
 static int acpi_thermal_add(struct acpi_device *device);
 static int acpi_thermal_remove(struct acpi_device *device, int type);
@@ -683,6 +685,11 @@ static void acpi_thermal_run(unsigned lo
 static void acpi_thermal_run(unsigned long data)
 {
	struct acpi_thermal *tz = (struct acpi_thermal *)data;
+	if (acpi_in_suspend)	/* thermal methods might cause a hang */
+	  {
+	      printk(KERN_INFO PREFIX "acpi_thermal_run: acpi_in_suspend=%d, so returning\n", acpi_in_suspend);
+			       return_VOID;		 /* so don't do them */
+			         }
				 if (!tz->zombie)
				    acpi_os_queue_for_execution(OSD_PRIORITY_GPE,
								    acpi_thermal_check, (void *)data);
@@ -705,6 +712,11 @@ static void acpi_thermal_check(void *dat
 
	state = tz->state;
 
+	if (acpi_in_suspend)	/* thermal methods might cause a hang */
+	  {
+	      printk(KERN_INFO PREFIX "acpi_thermal_check: acpi_in_suspend=%d, so returning\n", acpi_in_suspend);
+			       return_VOID;
+			         }
				 result = acpi_thermal_get_temperature(tz);
				 if (result)
				    return_VOID;
@@ -1224,6 +1236,12 @@ static void acpi_thermal_notify(acpi_han
   struct acpi_device *device = NULL;
 
	ACPI_FUNCTION_TRACE("acpi_thermal_notify");
+
+	if (acpi_in_suspend)	/* thermal methods might cause a hang */
+	  {
+	      printk(KERN_INFO PREFIX "acpi_thermal_notify: acpi_in_suspend=%d, so returning\n", acpi_in_suspend);
+			       return_VOID;		    /* so don't do them */
+			         }
 
	if (!tz)
	   return_VOID;

Comment 61 Sanjoy Mahajan 2006-03-29 10:54:44 UTC
This problem is probably similar to the problem that I reported in #5037
(pre-emptible kernels would hang going to sleep).  See especially comment #17
there: http://bugzilla.kernel.org/show_bug.cgi?id=5037#c17

So maybe the solution with acpi_in_suspend, once it's cleaned up, will fix the
pre-emptible kernels too (I haven't run them for a while, since they so
regularly would hang on sleep).
Comment 62 Luming Yu 2006-03-30 03:53:58 UTC
Hmmm, it could be.  Please clean up and send out the updated patch to acpi ML 
and LKML for review. Probably, this patch needs go to mm tree. 
Please CC me, Len brown, and akpm. 

Thanks,
Luming
Comment 63 Sanjoy Mahajan 2006-03-30 14:13:42 UTC
Bad news.  I made a cleaned-up patch (below) and tested it, and it hung
on the 4th sleep.  That was with debug_{level,layer}=0x10, so I didn't
get much debugging info.  It printed the 'Execute method' for _PTS, and
then hung at "Exec" -- I guess printing it for SST.

Just to make sure I wasn't imagining that the problem had been fixed, I
return to the usual debugging: layer=0x90 and level=0x1F.  And it cycled
with no problem (18 tries)!

So we're not done yet, and maybe we're just treating the symptoms?

-Sanjoy


diff -r ac486e270597 -r abd89292c539 drivers/acpi/osl.c
--- a/drivers/acpi/osl.c	Sat Mar 18 08:35:34 2006 -0500
+++ b/drivers/acpi/osl.c	Thu Mar 30 10:59:57 2006 -0500
@@ -634,6 +634,8 @@ static void acpi_os_execute_deferred(voi
 	return_VOID;
 }
 
+extern int acpi_in_suspend;
+
 acpi_status
 acpi_os_queue_for_execution(u32 priority,
 			    acpi_osd_exec_callback function, void *context)
@@ -643,6 +645,8 @@ acpi_os_queue_for_execution(u32 priority
 	struct work_struct *task;
 
 	ACPI_FUNCTION_TRACE("os_queue_for_execution");
+	if (acpi_in_suspend)	/* in case kacpid is causing the queue */
+		return_ACPI_STATUS(AE_OK);
 
 	ACPI_DEBUG_PRINT((ACPI_DB_EXEC,
 			  "Scheduling function [%p(%p)] for deferred execution.\n",
diff -r ac486e270597 -r abd89292c539 drivers/acpi/sleep/main.c
--- a/drivers/acpi/sleep/main.c	Sat Mar 18 08:35:34 2006 -0500
+++ b/drivers/acpi/sleep/main.c	Thu Mar 30 10:59:57 2006 -0500
@@ -19,6 +19,12 @@
 #include <acpi/acpi_drivers.h>
 #include "sleep.h"
 
+/* for functions putting machine to sleep to know that we're
+   suspending, so that they can careful about what AML methods they
+   invoke (to avoid trying untested BIOS code paths) */
+int acpi_in_suspend;
+EXPORT_SYMBOL(acpi_in_suspend);
+
 u8 sleep_states[ACPI_S_STATE_COUNT];
 
 static struct pm_ops acpi_pm_ops;
@@ -55,6 +61,8 @@ static int acpi_pm_prepare(suspend_state
 		printk("acpi_pm_prepare does not support %d \n", pm_state);
 		return -EPERM;
 	}
+	acpi_os_wait_events_complete(NULL);
+	acpi_in_suspend = TRUE;
 	return acpi_sleep_prepare(acpi_state);
 }
 
@@ -132,6 +140,7 @@ static int acpi_pm_finish(suspend_state_
 	u32 acpi_state = acpi_suspend_states[pm_state];
 
 	acpi_leave_sleep_state(acpi_state);
+	acpi_in_suspend = FALSE;
 	acpi_disable_wakeup_device(acpi_state);
 
 	/* reset firmware waking vector */
diff -r ac486e270597 -r abd89292c539 drivers/acpi/thermal.c
--- a/drivers/acpi/thermal.c	Sat Mar 18 08:35:34 2006 -0500
+++ b/drivers/acpi/thermal.c	Thu Mar 30 10:59:57 2006 -0500
@@ -79,6 +79,8 @@ static int tzp;
 static int tzp;
 module_param(tzp, int, 0);
 MODULE_PARM_DESC(tzp, "Thermal zone polling frequency, in 1/10 seconds.\n");
+
+extern int acpi_in_suspend;
 
 static int acpi_thermal_add(struct acpi_device *device);
 static int acpi_thermal_remove(struct acpi_device *device, int type);
@@ -683,6 +685,8 @@ static void acpi_thermal_run(unsigned lo
 static void acpi_thermal_run(unsigned long data)
 {
 	struct acpi_thermal *tz = (struct acpi_thermal *)data;
+	if (acpi_in_suspend)	/* thermal methods might cause a hang */
+		return_VOID;	/* so don't do them */
 	if (!tz->zombie)
 		acpi_os_queue_for_execution(OSD_PRIORITY_GPE,
 					    acpi_thermal_check, (void *)data);
@@ -705,6 +709,8 @@ static void acpi_thermal_check(void *dat
 
 	state = tz->state;
 
+	if (acpi_in_suspend)
+		return_VOID;
 	result = acpi_thermal_get_temperature(tz);
 	if (result)
 		return_VOID;
@@ -1224,6 +1230,9 @@ static void acpi_thermal_notify(acpi_han
 	struct acpi_device *device = NULL;
 
 	ACPI_FUNCTION_TRACE("acpi_thermal_notify");
+
+	if (acpi_in_suspend)	/* thermal methods might cause a hang */
+		return_VOID;	/* so don't do them */
 
 	if (!tz)
 		return_VOID;

Comment 64 Luming Yu 2006-03-30 17:30:34 UTC
The reason for 4th suspend hang could be:
1. the patch doesn't fix the known problem cleanly, we need to come up with a 
   better patch for the known problem.
2. There are other unknown problems in kernel? in driver? in BIOS? . :-(


Comment 65 Sanjoy Mahajan 2006-03-30 17:35:03 UTC
> 1. the patch doesn't fix the known problem cleanly, we need to come
> up with a better patch for the known problem.

This is quite likely, since we don't fully understand the problem.  We
just know that calling _TMP in the wrong place messes up the EC, but
we don't know how (or what defines 'wrong place'), so we can't be sure
that we fixed the whole problem and that no other methods can do the
same thing.

Comment 66 Luming Yu 2006-03-30 17:53:56 UTC
>> 1. the patch doesn't fix the known problem cleanly, we need to come
>> up with a better patch for the known problem.
>
>This is quite likely, since we don't fully understand the problem.  We
>just know that calling _TMP in the wrong place messes up the EC, but
>we don't know how (or what defines 'wrong place'), so we can't be sure
>that we fixed the whole problem and that no other methods can do the
>same thing.

Yes, this is a good question.  The real fix should be in BIOS.

But, the key point is we assume Windows work, and we expect linux work too.

So, we do need to avoid un-tested (by windows) BIOS code path. 
For this bug, maybe we didn't handle all cases to stop the execution of 
unnecessary AML methods/ code path.

Comment 67 Sanjoy Mahajan 2006-03-30 22:29:13 UTC
> For this bug, maybe we didn't handle all cases to stop the execution of 
> unnecessary AML methods/ code path.

Maybe figuring out the kernel panic would help?  Because the kernel
panic comes and goes, depending on printk's (and the hang does too -
not happening with lots of debugging, for example), it makes me think
that some memory is getting corrupted somewhere.  And by putting in
printk's we're changing the location of what gets corrupted, so
changing the result but not the (unknown) cause.

Comment 68 Sanjoy Mahajan 2006-03-31 12:05:38 UTC
More bad news.  My other sleep script, which unloads thermal before
suspending, just hung on the first sleep.  It has a vanilla 2.6.16-rc5
kernel with the BIOS DSDT (so no mods), but it's the first time it's
ever hung.  I suspect one of two recent changes:

1. The trackpoint died a couple days ago, so I had attached a USB/PS2
   external mouse.  But I tried sleeping again with that situation,
   with no problems.

2. I had booted it up and left it overnight without logging in, and
   so presumably without the X screensaver (not sure about that), and
   then logged in.  Not sure if that can make a difference.

Either way, it indicates that we don't know the whole story.

I wonder if the approach is to disable EC updates during suspend and
wake.  So hack the ACPI interpreter to ignore UPDT() (is it called
that in every ACPI bios implementation?) while acpi_in_suspend is
true, but let everything else happen.

Another thought is about the reason that I need to use the power
switch to wake it up (with the mods): acpi_os_queue_for_execution is
disabled by the acpi_in_suspend=TRUE setting, but maybe some methods
need to run to set Fn as the wakeup key.  Which maybe means I should
try the hack to prevent UPDT() running.

Comment 69 Luming Yu 2006-03-31 19:17:24 UTC
>More bad news.  My other sleep script, which unloads thermal before
>suspending, just hung on the first sleep
It is worthy to do same test without loading thermal moudle at all.

PS.
Hmmm, we need to find out an efficient way to reproduce this hung.
Please apply patch at http://bugzilla.kernel.org/show_bug.cgi?id=5749#c11

and register a interested method with command like:
echo "0:\\_SB.BAT0:_BST:\\_SB.BAT0:_BST:1:1" > /proc/acpi/hotkey/event_config

And, execute it with command:
echo -n "1:0:0:0" > /proc/acpi/hotkey/action

In this way, without changing kernel, you can directly test the interested AML 
methods that could cause hang. The BIOS needs to be fully tested. :-(
Comment 70 Sanjoy Mahajan 2006-04-03 16:09:23 UTC
> But, the key point is we assume Windows work, and we expect linux work too.

I just saw the documentation on acpi_os_name:

commit a1f9e65e2085e0a87f28a4d5a8ae43b32c087f24
Author: Len Brown <len.brown@intel.com>
Date:   Wed Jan 25 23:47:36 2006 -0500

    [ACPI] document cmdline acpi_os_name=


So I used it to simulate Windows code paths.  I tested the vanilla
2.6.16-rc5 kernel with the BIOS DSDT, using

   acpi_os_name="Microsoft Windows" acpi_serialize

It hung on the second sleep, as usual (and the same hang if I didn't
use acpi_serialize).

So then I wondered whether this BIOS works with Windows at all.  But
it has to.  What else could IBM have tested it with?  So Windows must
do something smart during the suspend, like disabling the EC updates
or like the other hacks that we've been trying.

Comment 71 Luming Yu 2006-04-03 18:23:46 UTC
> What else could IBM have tested it with?  So Windows must
>do something smart during the suspend, like disabling the EC updates
>or like the other hacks that we've been trying.

It's NOT called smart-thing. 

>It hung on the second sleep, as usual (and the same hang if I didn't
>use acpi_serialize).
Do you use pathc at http://bugzilla.kernel.org/show_bug.cgi?id=5989#c63?

I suggest you use that patch, and acpi_serialize option for testing.


Comment 72 Sanjoy Mahajan 2006-04-03 20:47:38 UTC
> Do you use patch at http://bugzilla.kernel.org/show_bug.cgi?id=5989#c63?
> I suggest you use that patch, and acpi_serialize option for testing.

Good news!  I used that patch with 

   acpi_os_name="Microsoft Windows" acpi_serialize

and I haven't been able to hang it.  I did 14 cycles with no problem.
The first 8 cycles were with debug_{level,layer}=0x10, which was the
combination that hung more easily than layer=0x90,level=0x1F.  To
check whether the debug params matters, I did the last 6 cycles with
layer=0x90,level=0x1F.

To further stress it, I turned on thermal polling: 1 second for each
of the four thermal zones (starting them 0.25 seconds apart to
maximize the chance that a thermal poll happens at a dangerous time
during the suspend or wake).  Six further cycles worked fine.  Then I
unloaded and loaded the thermal module, which often helps produce
hangs on the next suspend, and that was fine for six more cycles.  I
haven't been able to hang it at all.

To summarize:

1. hangs on 4th S3 : patch in #63.
2. hangs on 2nd S3 : vanilla + acpi_os_name="Microsoft Windows"
3. hangs on 2nd S3 : vanilla + acpi_os_name="Microsoft Windows" acpi_serialize 
4. doesn't hang    :   patch + acpi_os_name="Microsoft Windows" 

Probably the patch + acpi_serialize will hang too, but I'm not sure.
The unclean version of the patch (i.e. with debugging printk's) did
hang with just acpi_serialize.

The slight problems: the fan state and polling frequencies are garbled
on resume.  By fan state being garbled, I mean that it is sometime on
even though the temps are all below the trip points, and acpi -t
reports that the fan is off.  So the system doesn't have the right fan
state.  Similarly with the thermal polling: It reports 1 second for
each zone, but it's not polling at all (no TMP methods reported in the
dmesgs).  If I set the polling intervals to 1 second, then it starts
polling again.

Also, I need to wake it up using the power switch instead of the Fn
key.  Using the power switch produces a couple sharp clicks on the
speaker.

Comment 73 Sanjoy Mahajan 2006-05-17 09:47:23 UTC
Same problem with 2.6.17-rc4 with boot params 

  acpi_os_name="Microsoft Windows" acpi_serialize

and the vanilla (BIOS) DSDT.
Comment 74 Luming Yu 2006-07-02 22:33:26 UTC

*** This bug has been marked as a duplicate of 6749 ***