Most recent kernel where this bug did not occur: Can't tell exactly. I have tried STR since about 2.6.15 and it "mostly worked" with intermittent lockups. Since 2.8.18 this almost never works - I am unable to safely resume more than once if ever; this has been reproduced on 2.6.19-rc5 as well. Distribution: Mandriva cooker (post 2007) Hardware Environment: Toshiba Portege 4000; detailed hardware information attached Software Environment: Vanilla kernel 2.6.19-rc5 compiled with gcc-4.1.1. config attached. suspend-scripts from Mandriva (set of scripts to run on suspend/resume) Problem Description: System stays dead on resume from STR; on display the output of previous suspend can be seen, nothing more. The only way out is power button. Steps to reproduce: Try to suspend to RAM several times. Since 2.6.18 it never worked more than once. Once (under 2.6.18) I was able to capture screen output; see also http://marc.theaimsgroup.com/?l=linux-kernel&m=115989508210686&w=2. I am unable to get any output under 2.6.19-rc5 unfortunately. The bug looks similar to bug 6749, but I do not have hardware EC (as long as I can believe dmesg output). I am also able to reproduce this problem booting with init=/bin/sh with the following minimal module set loaded: Module Size Used by reiserfs 262764 1 ide_disk 17024 2 alim15x3 12428 0 [permanent] ide_core 120516 2 ide_disk,alim15x3 atkbd 18488 0 libps2 7496 1 atkbd i8042 20152 0 serio 18520 4 atkbd,i8042
Created attachment 9468 [details] acpidump
Created attachment 9469 [details] dmesg
Created attachment 9470 [details] dmidecode
Created attachment 9471 [details] interrupts
Created attachment 9472 [details] modules
Created attachment 9473 [details] lspci
Created attachment 9474 [details] kernel config
can you attach the screen capture you managed to get with 2.6.18? I do not see it at the URL noted above. Do you hear the disk drive spilling before the suspend? Do you hear it spinning after the resume?
Created attachment 10107 [details] Screenshot 1 of kernel stack
Created attachment 10108 [details] screeshot 2 of kernel stack I am not sure about suspend, but I do not hear drive spinning when resume hangs. That for a long time made me believe it was related to libata (I noticed this issue when I first tried to switch to libata). But later I had the same problem using legacy IDE drivers, so ...
Does suspend to disk work any better than suspend-to-ram?
Yes, suspend to disk works just fine, including recent switch from "shutdown" to "platform" method. Now it also resumes automatically on lid open.
I finally was able to bisect it within 2 commits. Commit 4119532c95547821dbe72d6916dfa1b2148475b3 does not allow me to run STR at all: sh-3.1# echo mem > /sys/power/state sh: echo: write error: Operation not permitted Commit 4c90ece249992c7a2e3fc921e5cdb8eb92193067 and above locks up *and* fortunately explains, *why* it locks up! I suspect this part of commit (not being immediate reason but at least exposing bug): Converted the locking mutex used for the ACPI hardware to a spinlock. This change should eliminate all problems caused by attempting to acquire a semaphore at interrupt level, and it means that all ACPICA external interfaces that directly access the ACPI hardware can be safely called from interrupt level. because it locks up with spinlock lockup in acpi_os_acquire_lock => spin_lock_irqsave in interrupt context (acpi_ev_sci_xrupt_handler). I attach screenshot of lockup (I have a couple more in case something is completely unreadable) and .config this has been tested with. All tests were conducted without modules (everything was builtin), using this grub entry: title str kernel (hd0,1)/home/bor/build/str-bisect/arch/i386/boot/bzImage root=/dev/hda2 vga=791 init=/bin/sh and doing mount -t sysfs none /sys echo mem > /sys/power/state "Good" cases were retested several times also across reboot. Also something happened between c080a3e69dfb58ae9b8c7e70a1e33f4f4e493ea7 and fda151d9feafc0e8418f23c716587c44394fcdbf. The former at least displays mentioned BUG, the latter locks up without any message at all. I have not tried to bisect this part.
Created attachment 10210 [details] Screenshot of spinlock lockup during resume from STR
Created attachment 10211 [details] .config used to bisect lockup problem
Still the same in current git, as of commit 7de970e11fb832a56c897276967fb0e49f59b313
Have you checked if the bug is still present in 2.6.22-rc3? We changed the point at which acpi_leave_sleep_state() is called, so at least the symptoms should have changed.
It still does not work in 2.6.22-rc3. On resume I hear HDD noice, FAN works and in upper left corner I see Stopping tasks ... done. Suspending console(s) I swear to have seen such report recently already :) HDD and FAN have been there all the time BTW.
> I swear to have seen such report recently already :) Yes. This is the last message in that thread, AFAICS: http://lkml.org/lkml/2007/6/3/106
Any progress with this in the newest kernels?
It silently hangs in 2.6.23-rc2.
Hmm, can you check with these two patches: http://www.sisk.pl/kernel/hibernation_and_suspend/2.6.23-rc2/patches/20-acpi-clear-GPEs-before-enabling-IRQs.patch http://www.sisk.pl/kernel/hibernation_and_suspend/2.6.23-rc2/patches/19-ACPI-Enable-GPEs-before-_WAK-is-called.patch on top of 2.6.23-rc2?
no change :(
It looks like a problem with handling SCIs. Thomas, can you have a look at this bug report, please?
Puhh. Some ideas (I couldn't really find anything): - compile ACPI_DEBUG in and increase /proc/acpi/debug_level to 0x11F before suspend. At least last lines which AML stuff got executed shortly before the hang might be interesting. - The machine implements a PTS and a WAK function. Does it work if you comment out kernel code that invokes these, just comment out these in drivers/acpi/ hardware/hwsleep.c: /* Run the _PTS and _GTS methods */ status = acpi_evaluate_object(NULL, METHOD_NAME__PTS, &arg_list, NULL); if (ACPI_FAILURE(status) && status != AE_NOT_FOUND) { return_ACPI_STATUS(status); } and: status = acpi_evaluate_object(NULL, METHOD_NAME__WAK, &arg_list, NULL); if (ACPI_FAILURE(status) && status != AE_NOT_FOUND) { ACPI_EXCEPTION((AE_INFO, status, "During Method _WAK")); } - Maybe these iasl warnings are related: DSDT.dsl 2702: Wait (\_SB.PCI0.DKSQ, 0x0BB8) Warning 1103 - Possible operator timeout is ignored ^ I also have never seen ASL code playing around with signal() and event() things. Do you know how to override an DSDT table? If not, just ask. If you do: acpidump >acpidump apcixtract -a acpidump iasl -d DSDT.dat iasl -sa DSDT.dsl you get above warnings. Do a grep DKSQ DSDT.dsl and comment out these lines in DSDT.dsl with: // recompile the DSDT.dsl with iasl -tc DSDT.dsl (AFAIK the output file you need is xy.hex (dsdt.hex or DSDT.hex) then you have two variables in .config: ACPI_CUSTOM_DSDT and ACPI_CUSTOM_DSDT_FILE The first one must be enabled. The second one is a string config and must point to the above .hex file. The recompile the kernel and reboot and you have overridden your DSDT table and got rid of these strange DSKQ elements. -> Also adding Alexey, he has deep knowledge in this stuff... Alexey, it seems like we get a deadlock on a spinlock when waking up from suspend to ram. If above suggestions don't bring anything, I am out of ideas...
Created attachment 12409 [details] Recompiled DSDT with DSKQ event/signal/wait stuff compiled out This might ease up things a bit for if you don't have the latest pmtools at hand... You should be able to add a path pointing to this file into the ACPI_CUSTOM_DSDT_FILE kernel config variable, then recompile and reboot to override the DSDT table.
Do I understand right, and we get interrupt while locked? Does it mean that spin_lock_irqsave does not lock interrupts or we misuse it?
Created attachment 12423 [details] remove inderection around acpi_os_acquire_lock() Could you please test this patch?
(In reply to comment #28) > remove inderection around acpi_os_acquire_lock() Does not help. (In reply to comment #25) > - compile ACPI_DEBUG in and increase /proc/acpi/debug_level to 0x11F before > suspend. At least last lines which AML stuff got executed shortly before > the > hang might be interesting. As soon as I set debug level console is spammed with debug output; apparently this is some polling (battery?) I can blindly type "echo mem > /sys/power/state" and it suspends; there is no output when it resumes. > - The machine implements a PTS and a WAK function. Does it work if you > comment > out kernel code that invokes these In this case system hangs when going to sleep :) But at least it means they are somewhere related ... > Do a grep DKSQ DSDT.dsl and comment out these lines in DSDT.dsl with: Did not work either. OK I commented out only Wait ... may be I have to completely comment out all of them - will try. Final observation. First time I got flooded with ACPI debug messages I pressed SysRq-B - and system stuck with BUG: spinlock_trylock failure on UP on CPU#0, events/0/5 lock: dee8ec4c, .magic: dead4ead, .owner: kacpid/33 .owner_cpu:0 stack trace pointed on SysRq handler so it probably was not that useful. This may be simply printk race condition of course.
(In reply to comment #29) > Did not work either. OK I commented out only Wait ... may be I have to > completely comment out all of them - will try. Nope, does not work.
No idea what happens here. Maybe the freezer kicks in while the lock is held? I don't know much about how this should work and how locks and interrupts are handled by the freezer...
The freezer only affects tasks in TASK_INTERRUPTIBLE.
More precisely, if a task is holding a spinlock, it can't be frozen (please have a look at Documentation/power/freezing-of-tasks.txt in the 2.6.23-rc6 source tree for details).
There are some important fixes related to ACPI/suspend in the current Linus' tree (2.6.23-rc7-git4 as of today). Andrey, can you test it, please?
No change in 2.6.23-rc8
Andrey, Do you have any output if you disable all spinlock debug?
Disabling spinlock debug makes it resume but it can't be right. I tried to enable ACPI debug, but even full debug_level (0xffffffff) shows nothing; apparently all ACPI intercation happens after console is disabled and before it is already enabled. Running with commit 4c90ece249992c7a2e3fc921e5cdb8eb92193067 gives me reproducible recursive spinlock on acpi_gbl_hardware_lock in acpi_ev_gpe_detect() but this seems to be red herring - this line does not even exist anymore in current code (which still hangs) and looking at code I cannot see how it can legally try to get this lock twice. So I tend to suspect some obscure memory corruption here.
There are several known places in kernel which break deadlock detection. Might be we need one of the deadlock code authors to look into this...
I agree. If the symptom is not reproducible without lockdep, it may indicate a lockdep problem.
Please do not dismiss it that lightly. The issue is real. I have never said anything about lockdep. I do not get any lokdep warnings. I get error due to SPINLOCK_DEBUG, which is rather different. The problem *is* reproducible on SMP kernel without SPINLOCK_DEBUG (or without lockdep for that matter). And e.g. Mandriva ships SMP kernel as default now also for laptops The problem is not reproducible on UP kernel without SPINLOCK_DEBUG simply because otherwise locks are not compiled at all. If it is possible to actually use spinlock on UP without debugging, I am ready to test it.
I'm not dismissing the problem, but you seem to be the only person able to reproduce it right and we have no clue about what can be wrong. I'll try to prepare a debug patch for you, please give me some time.
Created attachment 12990 [details] acpi spinlocks cleanup Here is one more patch to make usage of spinlocks to be more visible. Please test.
(In reply to comment #42) No change, sorry. As in all previous versions this sometimes works the very first time after reboot; but second attempt in the raw still hangs.
Created attachment 12991 [details] Allows to carry out partial suspend and test selected parts of the suspend code Please apply the attached patch on top of 2.6.23-rc8 with the patches from: http://www.sisk.pl/kernel/hibernation_and_suspend/2.6.23-rc8/patches/ applied and do the following: # echo 8 > /proc/sys/kernel/printk # echo 5 > /sys/power/pm_test_level # echo mem > /sys/power/state (should wait for approx. 3 sec. and return to the boot prompt) # echo 4 > /sys/power/pm_test_level # echo mem > /sys/power/state (should wait for approx. 3 sec. and return to the boot prompt) ... # echo 1 > /sys/power/pm_test_level # echo mem > /sys/power/state (should wait for approx. 3 sec. and return to the boot prompt) and see if you can reproduce the problem and for which test level. [Echoing 0 to /sys/power/pm_test_level restores the normal behavior.]
(In reply to comment #44) Are you sure the patch is against 2.6.23-rc8? {pts/1}% patch -p1 --ignore-whitespace --dry-run < ~/patch/selective_suspend.patch patching file kernel/power/main.c Hunk #1 succeeded at 27 with fuzz 2 (offset -4 lines). Hunk #2 FAILED at 172. Hunk #3 FAILED at 206. Hunk #4 succeeded at 292 with fuzz 1 (offset -3 lines). Hunk #5 succeeded at 420 (offset -3 lines). 2 out of 5 hunks FAILED -- saving rejects to file kernel/power/main.c.rej patching file kernel/power/power.h Hunk #1 succeeded at 194 (offset 12 lines). and looking at failed hunk: patch: @@ -136,7 +176,10 @@ static int suspend_enter(suspend_state_t printk(KERN_ERR "Some devices failed to power down\n"); goto Done; } - error = suspend_ops->enter(state); + and pm_suspend_enter(): printk(KERN_ERR "Some devices failed to power down\n"); goto Done; } error = pm_ops->enter(state);
(In reply to comment #45) > (In reply to comment #44) > Are you sure the patch is against 2.6.23-rc8? No, it is not, as I said in the previous post.
You need additional patches and the link is in Comment #44.
Created attachment 12995 [details] Don't try to carry lock across suspend It looks like ACPI is trying to lock read of the register, which will carry lock across the suspend, but then interrupts become enabled and we might get a deadlock with interrupt handler. Please apply patch on top of 12990.
(In reply to comment #48) > Created an attachment (id=12995) [details] > Don't try to carry lock across suspend Yes this seems to have fixed it. 30 suspend/resume cycles across reboots and system still alive. Thank you! > Please apply patch on top of 12990. Is it really necessary? It looks too intrusive; would it be possible to make minimal patch? Once more - thank you! Rafael, sorry, but this arrived just as I finished compiling kernel with your patches; I guess it is not required now? Anyway thank you for your patience.
bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=7499 > > > > > > ------- Comment #49 from arvidjaar@mail.ru 2007-09-30 01:22 ------- > (In reply to comment #48) >> Created an attachment (id=12995) > --> (http://bugzilla.kernel.org/attachment.cgi?id=12995&action=view) > [details] >> Don't try to carry lock across suspend > > Yes this seems to have fixed it. 30 suspend/resume cycles across reboots and > system still alive. Thank you! > >> Please apply patch on top of 12990. > > Is it really necessary? It looks too intrusive; would it be possible to make > minimal patch? > > Once more - thank you! Rafael, sorry, but this arrived just as I finished > compiling kernel with your patches; I guess it is not required now? Anyway > thank you for your patience. > > Yes, finally I would make minimal patch. First patch was needed to make sure that we don't make something else stupid with our wrappers...
Cleaned-up patches (2) were sent to linux-acpi mail list. Marking bug resolved. Thanks for reporting and testing!
I confirm that your patches on linux-acpi fix the problem.
2d571b33cf7efd6a894e765e3cb45587ec5b834a ACPICA: hw: Don't carry spinlock over suspend shipped in 2.6.24 closed.