Bug 7499

Summary: S3 resume hang - Toshiba Portege 4000
Product: Power Management Reporter: Andrey Borzenkov (arvidjaar)
Component: Hibernation/SuspendAssignee: Rafael J. Wysocki (rjwysocki)
Status: CLOSED CODE_FIX    
Severity: normal CC: acpi-bugzilla, astarikovskiy, bunk, lenb, pavel, rjwysocki, trenn
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.20-git Subsystem:
Regression: No Bisected commit-id:
Bug Depends on:    
Bug Blocks: 7216    
Attachments: acpidump
dmesg
dmidecode
interrupts
modules
lspci
kernel config
Screenshot 1 of kernel stack
screeshot 2 of kernel stack
Screenshot of spinlock lockup during resume from STR
.config used to bisect lockup problem
Recompiled DSDT with DSKQ event/signal/wait stuff compiled out
remove inderection around acpi_os_acquire_lock()
acpi spinlocks cleanup
Allows to carry out partial suspend and test selected parts of the suspend code
Don't try to carry lock across suspend

Description Andrey Borzenkov 2006-11-12 04:14:03 UTC
Most recent kernel where this bug did not occur:
Can't tell exactly. I have tried STR since about 2.6.15 and it "mostly worked" 
with intermittent lockups. Since 2.8.18 this almost never works - I am unable 
to safely resume more than once if ever; this has been reproduced on 
2.6.19-rc5 as well.

Distribution:
Mandriva cooker (post 2007)

Hardware Environment:
Toshiba Portege 4000; detailed hardware information attached

Software Environment:
Vanilla kernel 2.6.19-rc5 compiled with gcc-4.1.1. config attached.
suspend-scripts from Mandriva (set of scripts to run on suspend/resume)

Problem Description:
System stays dead on resume from STR; on display the output of previous
suspend can be seen, nothing more. The only way out is power button.

Steps to reproduce:
Try to suspend to RAM several times. Since 2.6.18 it never worked more than 
once.

Once (under 2.6.18) I was able to capture screen output; see also 
http://marc.theaimsgroup.com/?l=linux-kernel&m=115989508210686&w=2. I am 
unable to get any output under 2.6.19-rc5 unfortunately.

The bug looks similar to bug 6749, but I do not have hardware EC (as long
as I can believe dmesg output). I am also able to reproduce this problem
booting with init=/bin/sh with the following minimal module set loaded:

Module                  Size  Used by
reiserfs              262764  1
ide_disk               17024  2
alim15x3               12428  0 [permanent]
ide_core              120516  2 ide_disk,alim15x3
atkbd                  18488  0
libps2                  7496  1 atkbd
i8042                  20152  0
serio                  18520  4 atkbd,i8042
Comment 1 Andrey Borzenkov 2006-11-12 04:15:48 UTC
Created attachment 9468 [details]
acpidump
Comment 2 Andrey Borzenkov 2006-11-12 04:16:39 UTC
Created attachment 9469 [details]
dmesg
Comment 3 Andrey Borzenkov 2006-11-12 04:17:24 UTC
Created attachment 9470 [details]
dmidecode
Comment 4 Andrey Borzenkov 2006-11-12 04:18:18 UTC
Created attachment 9471 [details]
interrupts
Comment 5 Andrey Borzenkov 2006-11-12 04:22:30 UTC
Created attachment 9472 [details]
modules
Comment 6 Andrey Borzenkov 2006-11-12 04:23:37 UTC
Created attachment 9473 [details]
lspci
Comment 7 Andrey Borzenkov 2006-11-12 04:26:04 UTC
Created attachment 9474 [details]
kernel config
Comment 8 Len Brown 2007-01-16 12:09:56 UTC
can you attach the screen capture you managed to get with 2.6.18?
I do not see it at the URL noted above.

Do you hear the disk drive spilling before the suspend?
Do you hear it spinning after the resume?
Comment 9 Andrey Borzenkov 2007-01-17 10:07:15 UTC
Created attachment 10107 [details]
Screenshot 1 of kernel stack
Comment 10 Andrey Borzenkov 2007-01-17 10:10:25 UTC
Created attachment 10108 [details]
screeshot 2 of kernel stack

I am not sure about suspend, but I do not hear drive spinning when resume
hangs. That for a long time made me believe it was related to libata (I noticed
this issue when I first tried to switch to libata). But later I had the same
problem using legacy IDE drivers, so ...
Comment 11 Len Brown 2007-01-17 11:06:42 UTC
Does suspend to disk work any better than suspend-to-ram?
Comment 12 Andrey Borzenkov 2007-01-17 18:32:43 UTC
Yes, suspend to disk works just fine, including recent switch from "shutdown" 
to "platform" method. Now it also resumes automatically on lid open.
Comment 13 Andrey Borzenkov 2007-01-28 07:20:13 UTC
I finally was able to bisect it within 2 commits. Commit 
4119532c95547821dbe72d6916dfa1b2148475b3 does not allow me to run STR at all:

sh-3.1# echo mem > /sys/power/state
sh: echo: write error: Operation not permitted

Commit 4c90ece249992c7a2e3fc921e5cdb8eb92193067 and above locks up *and* 
fortunately explains, *why* it locks up! I suspect this part of commit (not 
being immediate reason but at least exposing bug):

    Converted the locking mutex used for the ACPI hardware
    to a spinlock. This change should eliminate all problems
    caused by attempting to acquire a semaphore at interrupt
    level, and it means that all ACPICA external interfaces
    that directly access the ACPI hardware can be safely
    called from interrupt level.

because it locks up with spinlock lockup in acpi_os_acquire_lock =>  
spin_lock_irqsave in interrupt context (acpi_ev_sci_xrupt_handler).

I attach screenshot of lockup (I have a couple more in case something is 
completely unreadable) and .config this has been tested with. All tests were 
conducted without modules (everything was builtin), using this grub entry:

title str
kernel (hd0,1)/home/bor/build/str-bisect/arch/i386/boot/bzImage root=/dev/hda2 
vga=791 init=/bin/sh

and doing

mount -t sysfs none /sys
echo mem > /sys/power/state

"Good" cases were retested several times also across reboot.

Also something happened between c080a3e69dfb58ae9b8c7e70a1e33f4f4e493ea7 and 
fda151d9feafc0e8418f23c716587c44394fcdbf. The former at least displays 
mentioned BUG, the latter locks up without any message at all. I have not 
tried to bisect this part. 
Comment 14 Andrey Borzenkov 2007-01-28 07:22:14 UTC
Created attachment 10210 [details]
Screenshot of spinlock lockup during resume from STR
Comment 15 Andrey Borzenkov 2007-01-28 07:23:33 UTC
Created attachment 10211 [details]
.config used to bisect lockup problem
Comment 16 Andrey Borzenkov 2007-02-17 01:05:35 UTC
Still the same in current git, as of commit 
7de970e11fb832a56c897276967fb0e49f59b313
Comment 17 Rafael J. Wysocki 2007-06-04 09:28:59 UTC
Have you checked if the bug is still present in 2.6.22-rc3?  We changed the
point at which acpi_leave_sleep_state() is called, so at least the symptoms
should have changed.
Comment 18 Andrey Borzenkov 2007-06-04 20:28:40 UTC
It still does not work in 2.6.22-rc3. On resume I hear HDD noice, FAN works 
and in upper left corner I see

Stopping tasks ... done.
Suspending console(s)

I swear to have seen such report recently already :) HDD and FAN have been 
there all the time BTW.
Comment 19 Rafael J. Wysocki 2007-06-05 03:09:06 UTC
> I swear to have seen such report recently already :)

Yes.  This is the last message in that thread, AFAICS:

http://lkml.org/lkml/2007/6/3/106
Comment 20 Rafael J. Wysocki 2007-08-08 09:38:44 UTC
Any progress with this in the newest kernels?
Comment 21 Andrey Borzenkov 2007-08-08 19:29:12 UTC
It silently hangs in 2.6.23-rc2.
Comment 23 Andrey Borzenkov 2007-08-10 10:40:02 UTC
no change :(
Comment 24 Rafael J. Wysocki 2007-08-10 12:10:34 UTC
It looks like a problem with handling SCIs.

Thomas, can you have a look at this bug report, please?
Comment 25 Thomas Renninger 2007-08-16 13:55:54 UTC
Puhh.
Some ideas (I couldn't really find anything):
  - compile ACPI_DEBUG in and increase /proc/acpi/debug_level to 0x11F before  
    suspend. At least last lines which AML stuff got executed shortly before the
    hang might be interesting.
  - The machine implements a PTS and a WAK function. Does it work if you comment 
    out kernel code that invokes these, just comment out these in drivers/acpi/
    hardware/hwsleep.c:
        /* Run the _PTS and _GTS methods */
        status = acpi_evaluate_object(NULL, METHOD_NAME__PTS, &arg_list, NULL);
        if (ACPI_FAILURE(status) && status != AE_NOT_FOUND) {
                return_ACPI_STATUS(status);
        }
        and:
        status = acpi_evaluate_object(NULL, METHOD_NAME__WAK, &arg_list, NULL);
        if (ACPI_FAILURE(status) && status != AE_NOT_FOUND) {
                ACPI_EXCEPTION((AE_INFO, status, "During Method _WAK"));
        }
  - Maybe these iasl warnings are related:
DSDT.dsl  2702:                             Wait (\_SB.PCI0.DKSQ, 0x0BB8)
Warning  1103 -                   Possible operator timeout is ignored ^
    I also have never seen ASL code playing around with signal() and event()
    things.
    Do you know how to override an DSDT table? If not, just ask.
    If you do:
    acpidump >acpidump
    apcixtract -a acpidump
    iasl -d DSDT.dat
    iasl -sa DSDT.dsl
    you get above warnings.
    Do a grep DKSQ DSDT.dsl and comment out these lines in DSDT.dsl with:
    //
    recompile the DSDT.dsl with iasl -tc DSDT.dsl (AFAIK the output file you
    need is xy.hex (dsdt.hex or DSDT.hex)
    then you have two variables in .config:
    ACPI_CUSTOM_DSDT
    and
    ACPI_CUSTOM_DSDT_FILE
    The first one must be enabled.
    The second one is a string config and must point to the above .hex file.
    The recompile the kernel and reboot and you have overridden your DSDT table
    and got rid of these strange DSKQ elements.

-> Also adding Alexey, he has deep knowledge in this stuff...
Alexey, it seems like we get a deadlock on a spinlock when waking up from suspend to ram. If above suggestions don't bring anything, I am out of ideas...
Comment 26 Thomas Renninger 2007-08-16 14:02:26 UTC
Created attachment 12409 [details]
Recompiled DSDT with DSKQ event/signal/wait stuff compiled out

This might ease up things a bit for if you don't have the latest pmtools at hand...
You should be able to add a path pointing to this file into the ACPI_CUSTOM_DSDT_FILE kernel config variable, then recompile and reboot to override the DSDT table.
Comment 27 Alexey Starikovskiy 2007-08-17 04:11:45 UTC
Do I understand right, and we get interrupt while locked?
Does it mean that spin_lock_irqsave does not lock interrupts or we misuse it?
Comment 28 Alexey Starikovskiy 2007-08-17 05:22:36 UTC
Created attachment 12423 [details]
remove inderection around acpi_os_acquire_lock()

Could you please test this patch?
Comment 29 Andrey Borzenkov 2007-08-17 21:26:06 UTC
(In reply to comment #28)
> remove inderection around acpi_os_acquire_lock()

Does not help.
(In reply to comment #25)
>   - compile ACPI_DEBUG in and increase /proc/acpi/debug_level to 0x11F before 
>     suspend. At least last lines which AML stuff got executed shortly before
> the
>     hang might be interesting.

As soon as I set debug level console is spammed with debug output; apparently this is some polling (battery?) I can blindly type "echo mem > /sys/power/state" and it suspends; there is no output when it resumes.

>   - The machine implements a PTS and a WAK function. Does it work if you
> comment 
>     out kernel code that invokes these

In this case system hangs when going to sleep :) But at least it means they are somewhere related ...

>     Do a grep DKSQ DSDT.dsl and comment out these lines in DSDT.dsl with:

Did not work either. OK I commented out only Wait ... may be I have to completely comment out all of them - will try.

Final observation. First time I got flooded with ACPI debug messages I pressed SysRq-B - and system stuck with 

BUG: spinlock_trylock failure on UP on CPU#0, events/0/5
lock: dee8ec4c, .magic: dead4ead, .owner: kacpid/33 .owner_cpu:0 

stack trace pointed on SysRq handler so it probably was not that useful. This may be simply printk race condition of course.
Comment 30 Andrey Borzenkov 2007-08-17 21:37:00 UTC
(In reply to comment #29)
> Did not work either. OK I commented out only Wait ... may be I have to
> completely comment out all of them - will try.

Nope, does not work.
Comment 31 Thomas Renninger 2007-09-09 14:04:12 UTC
No idea what happens here.
Maybe the freezer kicks in while the lock is held? I don't know much about how this should work and how locks and interrupts are handled by the freezer...
Comment 32 Rafael J. Wysocki 2007-09-09 14:09:25 UTC
The freezer only affects tasks in TASK_INTERRUPTIBLE.
Comment 33 Rafael J. Wysocki 2007-09-12 10:18:08 UTC
More precisely, if a task is holding a spinlock, it can't be frozen (please have a look at Documentation/power/freezing-of-tasks.txt in the 2.6.23-rc6 source tree for details).
Comment 34 Rafael J. Wysocki 2007-09-23 05:11:07 UTC
There are some important fixes related to ACPI/suspend in the current Linus' tree (2.6.23-rc7-git4 as of today).  Andrey, can you test it, please?
Comment 35 Andrey Borzenkov 2007-09-25 12:23:16 UTC
No change in 2.6.23-rc8
Comment 36 Alexey Starikovskiy 2007-09-25 12:29:07 UTC
Andrey,
Do you have any output if you disable all spinlock debug?
Comment 37 Andrey Borzenkov 2007-09-28 11:23:23 UTC
Disabling spinlock debug makes it resume but it can't be right. I tried to enable ACPI debug, but even full debug_level (0xffffffff) shows nothing; apparently all ACPI intercation happens after console is disabled and before it is already enabled.

Running with commit 4c90ece249992c7a2e3fc921e5cdb8eb92193067 gives me reproducible recursive spinlock on acpi_gbl_hardware_lock in acpi_ev_gpe_detect() but this seems to be red herring - this line does not even exist anymore in current code (which still hangs) and looking at code I cannot see how it can legally try to get this lock twice. So I tend to suspect some obscure memory corruption here.
Comment 38 Alexey Starikovskiy 2007-09-28 11:36:13 UTC
There are several known places in kernel which break deadlock detection. Might be we need one of the deadlock code authors to look into this...
Comment 39 Rafael J. Wysocki 2007-09-28 12:48:01 UTC
I agree.

If the symptom is not reproducible without lockdep, it may indicate a lockdep problem.
Comment 40 Andrey Borzenkov 2007-09-28 23:33:05 UTC
Please do not dismiss it that lightly. The issue is real.

I have never said anything about lockdep. I do not get any lokdep warnings. I get error due to SPINLOCK_DEBUG, which is rather different.

The problem *is* reproducible on SMP kernel without SPINLOCK_DEBUG (or without lockdep for that matter). And e.g. Mandriva ships SMP kernel as default now also for laptops

The problem is not reproducible on UP kernel without SPINLOCK_DEBUG simply because otherwise locks are not compiled at all. If it is possible to actually use spinlock on UP without debugging, I am ready to test it.
Comment 41 Rafael J. Wysocki 2007-09-29 08:29:00 UTC
I'm not dismissing the problem, but you seem to be the only person able to reproduce it right and we have no clue about what can be wrong.

I'll try to prepare a debug patch for you, please give me some time.
Comment 42 Alexey Starikovskiy 2007-09-29 11:55:16 UTC
Created attachment 12990 [details]
acpi spinlocks cleanup

Here is one more patch to make usage of spinlocks to be more visible.
Please test.
Comment 43 Andrey Borzenkov 2007-09-29 13:41:40 UTC
(In reply to comment #42)
No change, sorry. As in all previous versions this sometimes works the very first time after reboot; but second attempt in the raw still hangs.
Comment 44 Rafael J. Wysocki 2007-09-29 14:08:28 UTC
Created attachment 12991 [details]
Allows to carry out partial suspend and test selected parts of the suspend code

Please apply the attached patch on top of 2.6.23-rc8 with the patches from:

http://www.sisk.pl/kernel/hibernation_and_suspend/2.6.23-rc8/patches/

applied and do the following:

# echo 8 > /proc/sys/kernel/printk
# echo 5 > /sys/power/pm_test_level
# echo mem > /sys/power/state
(should wait for approx. 3 sec. and return to the boot prompt)
# echo 4 > /sys/power/pm_test_level
# echo mem > /sys/power/state
(should wait for approx. 3 sec. and return to the boot prompt)
...
# echo 1 > /sys/power/pm_test_level
# echo mem > /sys/power/state
(should wait for approx. 3 sec. and return to the boot prompt)

and see if you can reproduce the problem and for which test level.

[Echoing 0 to /sys/power/pm_test_level restores the normal behavior.]
Comment 45 Andrey Borzenkov 2007-09-29 14:29:08 UTC
(In reply to comment #44)
Are you sure the patch is against 2.6.23-rc8?

{pts/1}% patch -p1 --ignore-whitespace  --dry-run < ~/patch/selective_suspend.patch
patching file kernel/power/main.c
Hunk #1 succeeded at 27 with fuzz 2 (offset -4 lines).
Hunk #2 FAILED at 172.
Hunk #3 FAILED at 206.
Hunk #4 succeeded at 292 with fuzz 1 (offset -3 lines).
Hunk #5 succeeded at 420 (offset -3 lines).
2 out of 5 hunks FAILED -- saving rejects to file kernel/power/main.c.rej
patching file kernel/power/power.h
Hunk #1 succeeded at 194 (offset 12 lines).

and looking at failed hunk:

patch:
@@ -136,7 +176,10 @@ static int suspend_enter(suspend_state_t
                printk(KERN_ERR "Some devices failed to power down\n");
                goto Done;
        }
-       error = suspend_ops->enter(state);
+

and pm_suspend_enter():
                printk(KERN_ERR "Some devices failed to power down\n");
                goto Done;
        }
        error = pm_ops->enter(state);
Comment 46 Rafael J. Wysocki 2007-09-29 14:36:06 UTC
(In reply to comment #45)
> (In reply to comment #44)
> Are you sure the patch is against 2.6.23-rc8?

No, it is not, as I said in the previous post.
Comment 47 Rafael J. Wysocki 2007-09-29 16:27:00 UTC
You need additional patches and the link is in Comment #44.
Comment 48 Alexey Starikovskiy 2007-09-30 00:24:08 UTC
Created attachment 12995 [details]
Don't try to carry lock across suspend

It looks like ACPI is trying to lock read of the register, which will carry lock across the suspend, but then interrupts become enabled and we might get a 
deadlock with interrupt handler.
Please apply patch on top of 12990.
Comment 49 Andrey Borzenkov 2007-09-30 01:22:38 UTC
(In reply to comment #48)
> Created an attachment (id=12995) [details]
> Don't try to carry lock across suspend

Yes this seems to have fixed it. 30 suspend/resume cycles across reboots and system still alive. Thank you!

> Please apply patch on top of 12990.

Is it really necessary? It looks too intrusive; would it be possible to make minimal patch?

Once more - thank you! Rafael, sorry, but this arrived just as I finished compiling kernel with your patches; I guess it is not required now? Anyway thank you for your patience.
Comment 50 Alexey Starikovskiy 2007-09-30 01:50:27 UTC
bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=7499
> 
> 
> 
> 
> 
> ------- Comment #49 from arvidjaar@mail.ru  2007-09-30 01:22 -------
> (In reply to comment #48)
>> Created an attachment (id=12995)
>  --> (http://bugzilla.kernel.org/attachment.cgi?id=12995&action=view)
>  [details]
>> Don't try to carry lock across suspend
> 
> Yes this seems to have fixed it. 30 suspend/resume cycles across reboots and
> system still alive. Thank you!
> 
>> Please apply patch on top of 12990.
> 
> Is it really necessary? It looks too intrusive; would it be possible to make
> minimal patch?
> 
> Once more - thank you! Rafael, sorry, but this arrived just as I finished
> compiling kernel with your patches; I guess it is not required now? Anyway
> thank you for your patience.
> 
> 
Yes, finally I would make minimal patch. First patch was needed to make sure that 
we don't make something else stupid with our wrappers...
Comment 51 Alexey Starikovskiy 2007-09-30 11:49:17 UTC
Cleaned-up patches (2) were sent to linux-acpi mail list. Marking bug resolved.
Thanks for reporting and testing!
Comment 52 Andrey Borzenkov 2007-09-30 12:46:42 UTC
I confirm that your patches on linux-acpi fix the problem.
Comment 53 Len Brown 2009-03-31 22:12:38 UTC
2d571b33cf7efd6a894e765e3cb45587ec5b834a
ACPICA: hw: Don't carry spinlock over suspend

shipped in 2.6.24

closed.