Bug 14668
Description
Maxim Levitsky
2009-11-22 21:25:08 UTC
Also note that I blacklisted 'battery', 'ac' and 'video' drivers, because the attempts to take lock were from them. Also, here is the backtrace <7>[ 1617.430020] CPU0: Thermal monitoring handled by SMI <4>[ 1617.430020] Enabling non-boot CPUs ... <4>[ 1617.450349] lockdep: fixing up alternatives. <6>[ 1617.450569] SMP alternatives: switching to SMP code <6>[ 1617.458575] Booting processor 1 APIC 0x1 ip 0x6000 <6>[ 1617.422535] Initializing CPU#1 <6>[ 1617.422535] Calibrating delay using timer specific routine.. 3421.80 BogoMIPS (lpj=17109034) <6>[ 1617.422535] CPU: L1 I cache: 32K, L1 D cache: 32K <6>[ 1617.422535] CPU: L2 cache: 2048K <6>[ 1617.422535] CPU: Physical Processor ID: 0 <6>[ 1617.422535] CPU: Processor Core ID: 1 <6>[ 1617.422535] CPU1: Thermal monitoring enabled (TM1) <6>[ 1617.610123] CPU1: Intel(R) Core(TM)2 Duo CPU T5450 @ 1.66GHz stepping 0d <4>[ 1617.630722] CPU1 is up <3>[ 1800.680133] INFO: task tee:4850 blocked for more than 120 seconds. <3>[ 1800.680451] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. <6>[ 1800.680840] tee D 0000000000000000 0 4850 4231 0x00000000 <3>[ 1800.681485] BUG: using smp_processor_id() in preemptible [00000000] code: khungtaskd/369 <4>[ 1800.681881] caller is show_stack_log_lvl+0x32/0x170 <4>[ 1800.682145] Pid: 369, comm: khungtaskd Not tainted 2.6.32-rc8-wl #194 <4>[ 1800.682463] Call Trace: <4>[ 1800.682648] [<ffffffff811e44ab>] debug_smp_processor_id+0xcb/0xe0 <4>[ 1800.682961] [<ffffffff8100f282>] show_stack_log_lvl+0x32/0x170 <4>[ 1800.683272] [<ffffffff81394803>] ? printk+0x3c/0x41 <4>[ 1800.683538] [<ffffffff81010977>] show_stack+0x17/0x20 <4>[ 1800.683808] [<ffffffff8103c952>] sched_show_task+0x92/0x100 <4>[ 1800.684089] [<ffffffff81086a2d>] watchdog+0x27d/0x2e0 <4>[ 1800.684351] [<ffffffff8108681d>] ? watchdog+0x6d/0x2e0 <4>[ 1800.684626] [<ffffffff813982ee>] ? _spin_unlock_irqrestore+0x6e/0x80 <4>[ 1800.684934] [<ffffffff810867b0>] ? watchdog+0x0/0x2e0 <4>[ 1800.685202] [<ffffffff8105bbce>] kthread+0x8e/0xa0 <4>[ 1800.685456] [<ffffffff8100cf1a>] child_rip+0xa/0x20 <4>[ 1800.685727] [<ffffffff8100c880>] ? restore_args+0x0/0x30 <4>[ 1800.685999] [<ffffffff8105bb40>] ? kthread+0x0/0xa0 <4>[ 1800.686255] [<ffffffff8100cf10>] ? child_rip+0x0/0x20 <4>[ 1800.686515] ffff88007e0cf828 0000000000000046 0000000000000000 0000000000000000 <4>[ 1800.687149] 0000000000000000 0000000000000000 ffff880002213c58 0000000000000002 <4>[ 1800.687784] ffff88000220efc8 00000001000202b4 ffff88007f932218 ffff88007e0cffd8 <4>[ 1800.688421] Call Trace: <4>[ 1800.688594] [<ffffffff81395ba5>] schedule_timeout+0x205/0x270 <4>[ 1800.688888] [<ffffffff8106d03b>] ? mark_held_locks+0x6b/0xa0 <4>[ 1800.689172] [<ffffffff8139824b>] ? _spin_unlock_irq+0x2b/0x60 <4>[ 1800.689462] [<ffffffff8106d335>] ? trace_hardirqs_on_caller+0x145/0x190 <4>[ 1800.689781] [<ffffffff8139726d>] __down_timeout+0x6d/0xb0 <4>[ 1800.690095] [<ffffffff810609bf>] ? down_timeout+0x1f/0x60 <4>[ 1800.690387] [<ffffffff810609f8>] down_timeout+0x58/0x60 <4>[ 1800.690659] [<ffffffff81213bcc>] acpi_os_wait_semaphore+0x49/0x57 <4>[ 1800.690964] [<ffffffff81226786>] acpi_ex_system_wait_semaphore+0x3c/0x4f <4>[ 1800.691285] [<ffffffff81221339>] acpi_ev_acquire_global_lock+0xbf/0xc3 <4>[ 1800.691600] [<ffffffff81228354>] acpi_ex_acquire_mutex_object+0x39/0x63 <4>[ 1800.691917] [<ffffffff81227fd0>] acpi_ex_acquire_global_lock+0x24/0x46 <4>[ 1800.692233] [<ffffffff81223a72>] acpi_ex_read_data_from_field+0x121/0x16f <4>[ 1800.692566] [<ffffffff8106d38d>] ? trace_hardirqs_on+0xd/0x10 <4>[ 1800.692865] [<ffffffff81228868>] acpi_ex_resolve_node_to_value+0x18c/0x218 <4>[ 1800.693207] [<ffffffff812244b7>] acpi_ex_resolve_to_value+0x203/0x20c <4>[ 1800.693521] [<ffffffff8121eab9>] acpi_ds_evaluate_name_path+0x79/0xf5 <4>[ 1800.693838] [<ffffffff8121d738>] acpi_ds_exec_end_op+0x96/0x3d6 <4>[ 1800.694132] [<ffffffff8122f0ef>] acpi_ps_parse_loop+0x7bb/0x940 <4>[ 1800.694427] [<ffffffff8122e1f9>] acpi_ps_parse_aml+0x9a/0x2d9 <4>[ 1800.694716] [<ffffffff8122f90a>] acpi_ps_execute_method+0x1e4/0x2b6 <4>[ 1800.695024] [<ffffffff8122b25d>] acpi_ns_evaluate+0xe1/0x1a8 <4>[ 1800.695309] [<ffffffff8122acb5>] acpi_evaluate_object+0x13d/0x236 <4>[ 1800.695612] [<ffffffff8106d03b>] ? mark_held_locks+0x6b/0xa0 <4>[ 1800.695897] [<ffffffff8121449f>] acpi_evaluate_integer+0x2f/0x4d <4>[ 1800.696198] [<ffffffff8106d335>] ? trace_hardirqs_on_caller+0x145/0x190 <4>[ 1800.696516] [<ffffffff81235815>] acpi_ac_get_state+0x36/0x6a <4>[ 1800.696806] [<ffffffff812358e3>] acpi_ac_resume+0x27/0x59 <4>[ 1800.697082] [<ffffffff81215dd4>] acpi_device_resume+0x25/0x2b <4>[ 1800.697373] [<ffffffff8126945a>] dpm_resume_end+0x4aa/0x500 <4>[ 1800.697658] [<ffffffff8107d9b9>] hibernation_snapshot+0xc9/0x280 <4>[ 1800.697953] [<ffffffff8107dc5d>] hibernate+0xed/0x1f0 <4>[ 1800.698218] [<ffffffff8107c46c>] state_store+0xec/0x100 <4>[ 1800.698488] [<ffffffff811d7ae7>] kobj_attr_store+0x17/0x20 <4>[ 1800.698780] [<ffffffff81137924>] sysfs_write_file+0xd4/0x150 <4>[ 1800.699066] [<ffffffff810d3d38>] vfs_write+0xb8/0x1a0 <4>[ 1800.699342] [<ffffffff810d3efc>] sys_write+0x4c/0x80 <4>[ 1800.699602] [<ffffffff8100beab>] system_call_fastpath+0x16/0x1b <4>[ 1800.699914] 2 locks held by tee/4850: <4>[ 1800.700179] #0: (&buffer->mutex){+.+.+.}, at: [<ffffffff8113788f>] sysfs_write_file+0x3f/0x150 <4>[ 1800.700838] #1: (pm_mutex){+.+.+.}, at: [<ffffffff8107db87>] hibernate+0x17/0x1f0 <0>[ 1800.701438] Kernel panic - not syncing: hung_task: blocked tasks <4>[ 1800.701716] Pid: 369, comm: khungtaskd Not tainted 2.6.32-rc8-wl #194 <4>[ 1800.702009] Call Trace: <4>[ 1800.702169] [<ffffffff8139470a>] panic+0x7d/0x13a <4>[ 1800.702408] [<ffffffff81086a59>] watchdog+0x2a9/0x2e0 <4>[ 1800.702664] [<ffffffff8108681d>] ? watchdog+0x6d/0x2e0 <4>[ 1800.702945] [<ffffffff813982ee>] ? _spin_unlock_irqrestore+0x6e/0x80 <4>[ 1800.703238] [<ffffffff810867b0>] ? watchdog+0x0/0x2e0 <4>[ 1800.703489] [<ffffffff8105bbce>] kthread+0x8e/0xa0 <4>[ 1800.703731] [<ffffffff8100cf1a>] child_rip+0xa/0x20 <4>[ 1800.703975] [<ffffffff8100c880>] ? restore_args+0x0/0x30 <4>[ 1800.704234] [<ffffffff8105bb40>] ? kthread+0x0/0xa0 <4>[ 1800.704482] [<ffffffff8100cf10>] ? child_rip+0x0/0x20 <0>[ 1800.704796] Rebooting in 60 seconds.. Created attachment 23875 [details]
acpidump
And yet another 280 cycles, video, ac, battery loaded with Xorg and everything else prove that I am not kidding, this was the last hibernation issue on my laptop. Please build kernel with "CONFIG_ACPI_DEBUG=y" Then boot with "acpi.debug_layer=0x00000004 acpi.debug_level=0x00000200" Suspend/Resume, attach the dmesg to see who acqurie/release the global lock. Folks, any update on that? The suggested test is not necessary because I know that lock is taken by accessing the EC configuration space. I do the test soon though Maxim, Please clarify comment #1: you say ac driver is blacklisted, yet trace shows ac resume function. No problem, this was backtrace before I blacklisted it. There are several methods which might work -- we could put 'if' statements into each driver ac/battery to reject procfs/sysfs calls during suspend/resume. We might do it in EC. Or we could disable (as in make it unlockable) global lock. Safest route would be first, most effective -- last. -EAGAIN seems to be appropriate value. Created attachment 24318 [details]
Refuse locking Global Lock if suspended
Please check if such patch works for your system.
Nope, same hang. However I think I got the idea about what happens. I enabled the debugging bits as you requested, and I see that during _resume_, while memory image is read, there are few instances of access to global lock, which is not and ordinary lock, but it lives in FACS, and thus is not saved in hibernation image (and this is correct). What happens is that during resume, ec transaction begins, hw lock is taken, but then in the middle, control is passed to resuming kernel, it tries to take that lock again... boom. On resume, vga buffer is not cleared, but is overwritten by the resuming kernel. Partially, message about taking the lock was visable. Resumed kernel have shown message 'Waiting for hardware Global Lock' I didn't capture the kernel log through. One note about how thermal monitoring & fan control work on my system. CPU is programmed to send an SMI, when temperature reaches a threshold, (40 C), this triggers the handler, and it programs the chipset to issue an SCI, which is handled by AML code, and it writes new temperature to the EC. This is the path that I think is triggered while resume image is read. Also EC could send an interrupt, and it will be handled by AML code too, and probably result in EC access. I think that the right solution is to abort EC transaction, (and release the lock) just before control is passed to resuming kernel (and then refuse to do more transactions, because this kernel is going away anyway) I think this is possible to do using new pm operations. After reading the EC driver sources, I understand that the only way to fix that is to release the hardware lock just before the resume. What do you think. This problem pretty much affects most notebooks. Maxim, I'm not sure what you by that... One option is to have special flag in booting kernel which prevents non-essential battery/thermal(?)/ac from installing. Second option is again in booting kernel -- release ACPI GL unconditionally (wait for it to become free ?) before transferring control to resume kernel. Could do elaborate your idea? patch or idea of it would be good :) One option is to have special flag in booting kernel which prevents non-essential battery/thermal(?)/ac from installing. ^^^ This won't work because EC can be accessed from GPE handler for example, and it is here. The second option is exactly what I meant. I did that locally, but it seems that I am going to have a lot of fun.... because I have here now problems with both suspend part and resume (2.6.33-rc3) I guess that something is broken again. I restore 2.6.32, apply the patch I did locally, and see how well it performs... Created attachment 24684 [details]
Failed attempt to fix that bug
for reference this is the latest attempt to fix that bug. Still hangs on resume.
Its is really wierd.
However with the lock busted, I made 590 suspend to disk cycles now with 2.6.33-rc3
One of the problems I had was a bug in my own driver I have here (mtd driver for xD card reader) other was the deadlocks on memory allocations during suspend.
I'm not sure the patch does exactly what you wanted. I thought it would just release the global lock instead of merely preventing it from being acquired. Also, shouldn't we actually make sure that the global lock is released before suspend to RAM? As well as in the hibernate .pre_snapshot() phase? Alex? Really, trying to keep the global hardware lock by _any_ _driver_ over a suspend-resume cycle is a plain bug that needs fixing. I mean _really_ _fixed_, not worked around by, let's face it, ugly hacks. Created attachment 24808 [details] ACPI / EC: Do not carry out EC transactions while suspended I wonder if this patch makes any difference. It is a modified version of the Alex's patch from comment #9, but (1) it closes the race between acpi_ec_transaction() and acpi_ec_suspend/resume() (so the suspend routine will actually wait for transactions in progress to complete before setting the flag and disabling the GPE) and (2) it prevents transactions from being carried out while suspended regardless of the global lock (so there's no difference in behavior between the systems with and without the global lock). It should work for you if the theory is correct and if no one else acquires the global lock independently. _However_, it may introduce regressions, because disabling the EC transactions at this point may be too early (_PTS is still going to be called after that). So, it may be necessary to move the disabling of EC transactions to the platform prepare callback, right after _PTS has been executed. >I'm not sure the patch does exactly what you wanted. I thought it would just >release the global lock instead of merely preventing it from being acquired. > >Also, shouldn't we actually make sure that the global lock is released before >suspend to RAM? As well as in the hibernate .pre_snapshot() phase? Alex? Why? My patch (which doesn't work sadly), takes the lock, and then sets global variable (lock_suspended), then it releases it (and does release the _hardware_ lock. At that point all further attempts to take the lock just succeed without touching the global lock (because of that global variable) Since it doesn't work, I assume that EC is somehow busted, thus on resume, indeed an transaction hangs in the middle, with lock held, or in other words, that resuming kernel takes the lock. I will test the above patch. BTW, if the patch above doesn't work for you, this means the global lock is acquired from somewhere else, most likely from some AML routine, and not released. That may be a result of a BIOS bug so we may need to low-level release the global lock unconditionally in the platform .prepare etc. callbacks to avoid that. That said, I don't really think it should be possible to suspend in the middle of an EC transaction in progress, so something like my patch will be necessary anyway (perhaps with the modification that EC transactions will be blocked after executing _PTS and unblocked before executing _WAK). (In reply to comment #18) > >I'm not sure the patch does exactly what you wanted. I thought it would > just > >release the global lock instead of merely preventing it from being acquired. > > > >Also, shouldn't we actually make sure that the global lock is released > before > >suspend to RAM? As well as in the hibernate .pre_snapshot() phase? Alex? > > Why? > > My patch (which doesn't work sadly), takes the lock, and then sets global > variable (lock_suspended), then it releases it (and does release the > _hardware_ > lock. I don't see that happen in the patch (perhaps I'm misreading it). How is the hardware lock released? You should have called acpi_ev_release_global_lock() (directly or indirectly) for that to happen, but I don't see that in your patch. You only seem to set acpi_gbl_global_lock_suspended, which doesn't have any effect if the hardware lock is already being held at this point (which we assume is the case, BTW). But what I do next? I call the acpi_ex_release_mutex_object, and it releases the global lock (since I didn't touch the release path). No, it doesn't. It only releases the mutex used for protecting the global lock and its reference counter (acpi_ev_global_lock_acquired) from concurrent access. The global lock is released by ACPI_RELEASE_GLOBAL_LOCK(), but that can only be called if the reference counter is 0, which is what acpi_ev_release_global_lock() does. [BTW, if I'm not mistaken, there is a bug in acpi_ev_acquire_global_lock() that should do acpi_os_release_mutex(acpi_gbl_global_lock_mutex->mutex.os_mutex) if it failed to acquire the global lock.] Ah, sorry, I overlooked the "Special case for Global Lock" in acpi_ex_acquire_mutex_object() (how broken is this?). Howver, this also has to be done during suspend because of the global lock's reference counter and mutex. Namely, if we suspended with the global lock held, the reference counter will be 1 and the mutex will be seen as held in the image kernel even if the global lock itself has been released in .pre_restore(). (In reply to comment #22) > [BTW, if I'm not mistaken, there is a bug in acpi_ev_acquire_global_lock() > that > should do acpi_os_release_mutex(acpi_gbl_global_lock_mutex->mutex.os_mutex) > if > it failed to acquire the global lock.] Scratch that, I was mistaken. :-) Sure, My goal was just to be sure that hw lock isn't held by kernel that does the resume. Now testing both GFP_NOIO and above patch (42 cycles and counting...) The patch does work (about 70 cycles), but it introduces side effects as such that both 'battery' and 'video' acpi drivers misbehave. What kind of misbehavior is it? Ah, I've just read the e-mails. Sorry for the bug in the patch, I'll attach fixed version in a minute. Created attachment 24842 [details]
ACPI / EC: Add protection from suspending in the middle of EC transaction
Fixed version of the last patch.
At least after few cycles, both battery and brightness driver works. Great, thanks a lot for testing. How many cycles does it usually take to hang the box during resume without the patch? In fact the patch from comment #30 fixes resume on my HP nx6325, which was a regression from 2.6.31, so adding this to the regression list. It takes around 40 cycles to show the problem. I do more testing tomorrow. It takes around 40 cycles to show the problem. I do more testing tomorrow. This patch survived 123 hibernate cycles. However, this patch ought to introduce regressions like was described in comment #17 (In reply to comment #36) > This patch survived 123 hibernate cycles. Good. > However, this patch ought to introduce regressions like was described in > comment #17 Bad. What exactly doesn't work? Here everything works. However if DSDT accesses EC after it was suspended, now it won't work. or if another acpi driver accesses EC in .suspend, .resume or even normally while EC is suspended. Created attachment 24863 [details]
ACPI / EC: Remover race between EC driver and suspend process
Can you check if this patch still fixes the issue for you, please?
Yep, works perfectly. Posting this from my notebook after 138 hibernate cycles. Great, thanks for testing. I've send a cleaned-up version of it to linux-acpi for review. I'll attach it here for easier reference in a minute. BTW, it appears that my nx6325 issue was only accidentally "fixed" by the previous patch, so clearing the regression flag and removing from the regression list. Created attachment 24884 [details]
ACPI / EC: Remove race between EC driver and suspend process
Cleaned-up version of the previous patch.
Created attachment 24886 [details]
ACPI / EC: Remove race between EC driver and suspend process (rev. 2)
One more update.
After discussing the previous patch with Alex I changed it so that the resume behavior is not modified by it, which shouldn't change things in your test case. Still, some testing would be highly appreciated.
Created attachment 24904 [details]
ACPI / EC: Remove race between EC driver and suspend process (rev. 3)
Unfortunately, I have to update it once again.
I overlooked the fact that the previous patch would break the error paths and test modes, so the EC transactions need to be "resumed" in more places than in the previous patch.
This patch is essentially a combination of the two previous patches. It "resumes" EC transactions as soon as reasonably possible during wake-up and at the same time it takes care of the error paths and suspend test modes as well.
Just for your information, I tested this patch and GFP_NOIO patch for 169 hibernate cycles. Its so nice to have reliable hibernation again. Thanks for testing. The GFP_NOIO patch is in the -mm tree and I hope Andrew will push it before the final 2.6.33. This one is 2.6.34 material if Len accepts it. I'll add your "tested-by" to it if Len doesn't take it earlier. Created attachment 25350 [details]
ACPI / EC / PM: Close race between EC and resume from hibernation
I've discussed the previous patch with Len and we've concluded that the problem is hibernation-specific and therefore it may be fixed by a simpler patch like the attached one.
Maxim, please check if the patch from comment #47 still fixes the problem for you. patch in comment #47 is in the acpi tree commit f6bb13aa1ea3bb26a4c783822347873f085b9000 Author: Rafael J. Wysocki <rjw@sisk.pl> Date: Thu Mar 4 01:52:58 2010 +0100 ACPI / EC / PM: Close race between EC and resume from hibernation shipped in linux-2.6.34-rc2 closed Belive me or not, but that bug exploded again here: This is wireless-testing kernel based on 2.6.34-rc3 + no PM related patches. <6>[ 16.512418] PM: Read 519208 kbytes in 11.32 seconds (45.86 MB/s) <7>[ 16.512780] PM: Image successfully loaded <5>[ 16.513801] sd 0:0:0:0: [sda] Synchronizing SCSI cache <6>[ 16.514878] ata_piix 0000:00:1f.1: PCI INT A disabled <6>[ 16.590170] PM: quiesce of devices complete after 76.674 msecs <6>[ 16.591562] PM: late quiesce of devices complete after 1.054 msecs <4>[ 16.592112] Disabling non-boot CPUs ... <6>[ 16.730028] CPU 1 is now offline <4>[ 16.730221] lockdep: fixing up alternatives. <6>[ 16.730469] SMP alternatives: switching to UP code <6>[ 240.870014] PM: Restoring platform NVS memory <4>[ 240.870014] Enabling non-boot CPUs ... <4>[ 240.890711] lockdep: fixing up alternatives. <6>[ 240.890720] SMP alternatives: switching to SMP code <6>[ 240.896468] Booting Node 0 Processor 1 APIC 0x1 <4>[ 241.100023] CPU1 is up <6>[ 241.101089] ACPI: Waking up from system sleep state S4 <3>[ 361.850150] INFO: task pm-hibernate:6030 blocked for more than 120 seconds. <3>[ 361.850553] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. <6>[ 361.851002] pm-hibernate D 0000000000000000 0 6030 5811 0x00000000 <4>[ 361.851019] ffff880047b67708 0000000000000046 ffff880000000000 ffff880047b67fd8 <4>[ 361.851037] ffff8800217a0000 00000000000121c0 0000000000004000 ffff880047b67fd8 <4>[ 361.851054] 0000000000004000 00000000000121c0 ffffffff8160e020 ffff8800217a0000 <4>[ 361.851071] Call Trace: <4>[ 361.851085] [<ffffffff8106a94d>] ? __module_text_address+0xd/0x60 <4>[ 361.851101] [<ffffffff8137007d>] schedule_timeout+0x20d/0x280 <4>[ 361.851115] [<ffffffff8104eac8>] ? __kernel_text_address+0x58/0x80 <4>[ 361.851130] [<ffffffff8137163d>] __down_timeout+0x6d/0xb0 <4>[ 361.851143] [<ffffffff8105697f>] ? down_timeout+0x1f/0x60 <4>[ 361.851156] [<ffffffff810569b8>] down_timeout+0x58/0x60 <4>[ 361.851169] [<ffffffff811f3e60>] acpi_os_wait_semaphore+0x98/0x137 <4>[ 361.851185] [<ffffffff8120b17b>] acpi_ex_system_wait_semaphore+0x40/0x8b <4>[ 361.851200] [<ffffffff81204f46>] acpi_ev_acquire_global_lock+0x11d/0x121 <4>[ 361.851215] [<ffffffff8120dfec>] acpi_ex_acquire_mutex_object+0x39/0x63 <4>[ 361.851230] [<ffffffff8120d160>] acpi_ex_acquire_global_lock+0x24/0x46 <4>[ 361.851244] [<ffffffff81207b23>] acpi_ex_write_data_to_field+0x257/0x283 <4>[ 361.851260] [<ffffffff8120c6e0>] acpi_ex_store_object_to_node+0x108/0x1fa <4>[ 361.851276] [<ffffffff8120cdc3>] acpi_ex_store+0xbe/0x29b <4>[ 361.851289] [<ffffffff8120a31a>] acpi_ex_opcode_1A_1T_1R+0x35c/0x4b7 <4>[ 361.851304] [<ffffffff811ffe13>] acpi_ds_exec_end_op+0xef/0x570 <4>[ 361.851318] [<ffffffff81217386>] acpi_ps_parse_loop+0x8b2/0xa6c <4>[ 361.851333] [<ffffffff810be8af>] ? kmem_cache_alloc+0x8f/0x110 <4>[ 361.851347] [<ffffffff812162c5>] acpi_ps_parse_aml+0x10a/0x3c1 <4>[ 361.851361] [<ffffffff81217cc0>] acpi_ps_execute_method+0x20e/0x31a <4>[ 361.851375] [<ffffffff81211c6c>] acpi_ns_evaluate+0x18c/0x2c9 <4>[ 361.851389] [<ffffffff812114e4>] acpi_evaluate_object+0x1ad/0x2f1 <4>[ 361.851403] [<ffffffff81206a81>] ? acpi_ev_walk_gpe_list+0x75/0x87 <4>[ 361.851417] [<ffffffff8120f35f>] acpi_leave_sleep_state+0xa6/0x151 <4>[ 361.851431] [<ffffffff811f548f>] acpi_pm_finish+0x34/0x49 <4>[ 361.851444] [<ffffffff811f54b2>] acpi_hibernation_finish+0xe/0x10 <4>[ 361.851458] [<ffffffff81070ebc>] hibernation_snapshot+0xec/0x2c0 <4>[ 361.851471] [<ffffffff813704e9>] ? mutex_unlock+0x9/0x10 <4>[ 361.851483] [<ffffffff8107118c>] hibernate+0xfc/0x1f0 <4>[ 361.851496] [<ffffffff8106fadc>] state_store+0xec/0x100 <4>[ 361.851510] [<ffffffff811b2ec7>] kobj_attr_store+0x17/0x20 <4>[ 361.851524] [<ffffffff8112c342>] sysfs_write_file+0xf2/0x170 <4>[ 361.851538] [<ffffffff810c6708>] vfs_write+0xb8/0x1a0 <4>[ 361.851551] [<ffffffff8137224a>] ? lockdep_sys_exit_thunk+0x35/0x67 <4>[ 361.851564] [<ffffffff810c68cc>] sys_write+0x4c/0x80 <4>[ 361.851578] [<ffffffff81002deb>] system_call_fastpath+0x16/0x1b <4>[ 361.851590] INFO: lockdep is turned off. <0>[ 361.851599] Kernel panic - not syncing: hung_task: blocked tasks <4>[ 361.851946] Pid: 352, comm: khungtaskd Tainted: P 2.6.34-rc3-wl #38 <4>[ 361.851959] Call Trace: <4>[ 361.851967] [<ffffffff8136ef25>] panic+0x82/0xfc <4>[ 361.851979] [<ffffffff8107a669>] watchdog+0x2a9/0x2e0 <4>[ 361.851991] [<ffffffff8107a42d>] ? watchdog+0x6d/0x2e0 <4>[ 361.852004] [<ffffffff81372cd3>] ? _raw_spin_unlock_irqrestore+0x33/0x60 <4>[ 361.852019] [<ffffffff8107a3c0>] ? watchdog+0x0/0x2e0 <4>[ 361.852031] [<ffffffff8105102e>] kthread+0x9e/0xb0 <4>[ 361.852044] [<ffffffff81003b74>] kernel_thread_helper+0x4/0x10 <4>[ 361.852058] [<ffffffff81050f90>] ? kthread+0x0/0xb0 <4>[ 361.852070] [<ffffffff81003b70>] ? kernel_thread_helper+0x0/0x10 <0>[ 361.852090] Rebooting in 10 seconds.. No wonder its broken. I see sole call to acpi_ec_suspend_transactions in acpi_pm_pre_restore. Well it too late to call it from that point, because that code will be executed in the *resuming* kernel. At that hw lock is already taken. This has to be called from suspending kernel Not exactly. This means _PTS doesn't prevent this thing from happening as we have assumed. Created attachment 25831 [details] ACPI / EC / PM: Fix race between EC transactions and system suspend Please try to reproduce with this patch applied (it's the patch from comment #44 rebased on top of the current mainline). Created attachment 25833 [details]
ACPI / EC / PM: Fix race between EC transactions and system suspend
Sorry, with this one applied, please (the previous one was incomplete).
Sure. About my comment #52, its totally wrong. the resuming kernel should suspend the transactions. Still something is going on. I will run my hibernate loop tonight with latest patch applied, Thanks! 150 cycles tested and still works. I guess this patch fixes that bug correctly Thanks for being persistent. :-) I'll submit the patch to Len, but still if you could do some more testing just to make sure the problem is really gone with this patch, that would be appreciated. Sorry for no responding before. According to all tests I done, 150 cycles is much more that enough. I never seen this fail after 40 cycles. Any reason, this is still not in kernel? Was this patch forgotten.? I reopen this bug The bugzilla protocol is that a bug is marked RESOLVED if a fix exists that is a candidate for upstream. It is CLOSED when the patch makes it upstream. In this report commit d5a64513c6a171262082c250592c062e97a2c693 ACPI / EC / PM: Fix race between EC transactions and system suspend is staged in the acpi tree waiting for inclusion upstream. So this bug is now marked RESOLVED, and will be marked CLOSED when the journey is complete. shipped in 2.6.35-rc2 closed |