Bug 14668 - Resume from disk hangs in acpi_ex_acquire_global_lock
Summary: Resume from disk hangs in acpi_ex_acquire_global_lock
Status: CLOSED CODE_FIX
Alias: None
Product: ACPI
Classification: Unclassified
Component: EC (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Rafael J. Wysocki
URL:
Keywords:
Depends on:
Blocks: 7216
  Show dependency tree
 
Reported: 2009-11-22 21:25 UTC by Maxim Levitsky
Modified: 2010-06-09 04:37 UTC (History)
6 users (show)

See Also:
Kernel Version: 2.6.32-rc8
Subsystem:
Regression: No
Bisected commit-id:


Attachments
acpidump (173.28 KB, text/plain)
2009-11-22 21:28 UTC, Maxim Levitsky
Details
Refuse locking Global Lock if suspended (1.67 KB, patch)
2009-12-27 18:51 UTC, Alexey Starikovskiy
Details | Diff
Failed attempt to fix that bug (4.11 KB, patch)
2010-01-23 09:35 UTC, Maxim Levitsky
Details | Diff
ACPI / EC: Do not carry out EC transactions while suspended (1.71 KB, patch)
2010-01-30 20:19 UTC, Rafael J. Wysocki
Details | Diff
ACPI / EC: Add protection from suspending in the middle of EC transaction (2.34 KB, patch)
2010-01-31 21:15 UTC, Rafael J. Wysocki
Details | Diff
ACPI / EC: Remover race between EC driver and suspend process (6.58 KB, patch)
2010-02-01 23:07 UTC, Rafael J. Wysocki
Details | Diff
ACPI / EC: Remove race between EC driver and suspend process (6.22 KB, patch)
2010-02-02 22:22 UTC, Rafael J. Wysocki
Details | Diff
ACPI / EC: Remove race between EC driver and suspend process (rev. 2) (5.88 KB, patch)
2010-02-03 00:37 UTC, Rafael J. Wysocki
Details | Diff
ACPI / EC: Remove race between EC driver and suspend process (rev. 3) (7.22 KB, patch)
2010-02-04 00:37 UTC, Rafael J. Wysocki
Details | Diff
ACPI / EC / PM: Close race between EC and resume from hibernation (4.29 KB, patch)
2010-03-04 00:44 UTC, Rafael J. Wysocki
Details | Diff
ACPI / EC / PM: Fix race between EC transactions and system suspend (5.57 KB, patch)
2010-04-02 21:10 UTC, Rafael J. Wysocki
Details | Diff
ACPI / EC / PM: Fix race between EC transactions and system suspend (5.75 KB, patch)
2010-04-02 21:18 UTC, Rafael J. Wysocki
Details | Diff

Description Maxim Levitsky 2009-11-22 21:25:08 UTC
I found out that my system will occasionally fail when resuming from hibernation.
Common denominator was that attempt to acquire global acpi lock failed.

According to the source, this lock is only used on access to fields of 'OperationRegion's that have 'Lock' attribute, or in other simpler words, this lock protects access to device registers.

My notebook has:

    OperationRegion (NVST, SystemMemory, 0x7FEBCE89, 0x00000063)
    Field (NVST, AnyAcc, Lock, Preserve)

    OperationRegion (RCRB, SystemMemory, 0xFED1C000, 0x4000)
    Field (RCRB, DWordAcc, Lock, Preserve)

    OperationRegion (ERAM, EmbeddedControl, 0x00, 0xFF)
    Field (ERAM, ByteAcc, Lock, Preserve)


NVST is in memory, containing some bios flags. It is used in some places.
RCRB is a BAR of southbridge, and used I think for pcie hotplug.
I know that because I mapped all unknown memory ranges using intel docs.
(Southbridge/LPC configuration (Root Complex Base Address))

I don't thing that access to these registers hangs in the middle.


However, the last third register, looks very suspicions, 'cause its the EC.
In fact system hangs, while it tries to read from this range.
But it hangs because it can't take the lock, and I suspect that this is due to access to same range.

So, ether global lock isn't released during suspend, or first attempt to touch EC stalls.

Note that I don't see EC GPE storms warning anymore lately, so EC operates in interrupt mode.


For testing, I removed this global lock, and then I did ~230 s2disk cycles, using /sys/power/disk directly, and 'reboot' method.

Then, I did the full test that is ~130 cycles using 'platform' method, pm_hibernate, and using rtc alarm to wake the system. 

I couldn't make the system hang.
Comment 1 Maxim Levitsky 2009-11-22 21:27:21 UTC
Also note that I blacklisted 'battery', 'ac' and 'video' drivers, because the attempts to take lock were from them.

Also, here is the backtrace 

<7>[ 1617.430020] CPU0: Thermal monitoring handled by SMI
<4>[ 1617.430020] Enabling non-boot CPUs ...
<4>[ 1617.450349] lockdep: fixing up alternatives.
<6>[ 1617.450569] SMP alternatives: switching to SMP code
<6>[ 1617.458575] Booting processor 1 APIC 0x1 ip 0x6000
<6>[ 1617.422535] Initializing CPU#1
<6>[ 1617.422535] Calibrating delay using timer specific routine.. 3421.80 BogoMIPS (lpj=17109034)
<6>[ 1617.422535] CPU: L1 I cache: 32K, L1 D cache: 32K
<6>[ 1617.422535] CPU: L2 cache: 2048K
<6>[ 1617.422535] CPU: Physical Processor ID: 0
<6>[ 1617.422535] CPU: Processor Core ID: 1
<6>[ 1617.422535] CPU1: Thermal monitoring enabled (TM1)
<6>[ 1617.610123] CPU1: Intel(R) Core(TM)2 Duo CPU     T5450  @ 1.66GHz stepping 0d
<4>[ 1617.630722] CPU1 is up
<3>[ 1800.680133] INFO: task tee:4850 blocked for more than 120 seconds.
<3>[ 1800.680451] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[ 1800.680840] tee           D 0000000000000000     0  4850   4231 0x00000000
<3>[ 1800.681485] BUG: using smp_processor_id() in preemptible [00000000] code: khungtaskd/369
<4>[ 1800.681881] caller is show_stack_log_lvl+0x32/0x170
<4>[ 1800.682145] Pid: 369, comm: khungtaskd Not tainted 2.6.32-rc8-wl #194
<4>[ 1800.682463] Call Trace:
<4>[ 1800.682648]  [<ffffffff811e44ab>] debug_smp_processor_id+0xcb/0xe0
<4>[ 1800.682961]  [<ffffffff8100f282>] show_stack_log_lvl+0x32/0x170
<4>[ 1800.683272]  [<ffffffff81394803>] ? printk+0x3c/0x41
<4>[ 1800.683538]  [<ffffffff81010977>] show_stack+0x17/0x20
<4>[ 1800.683808]  [<ffffffff8103c952>] sched_show_task+0x92/0x100
<4>[ 1800.684089]  [<ffffffff81086a2d>] watchdog+0x27d/0x2e0
<4>[ 1800.684351]  [<ffffffff8108681d>] ? watchdog+0x6d/0x2e0
<4>[ 1800.684626]  [<ffffffff813982ee>] ? _spin_unlock_irqrestore+0x6e/0x80
<4>[ 1800.684934]  [<ffffffff810867b0>] ? watchdog+0x0/0x2e0
<4>[ 1800.685202]  [<ffffffff8105bbce>] kthread+0x8e/0xa0
<4>[ 1800.685456]  [<ffffffff8100cf1a>] child_rip+0xa/0x20
<4>[ 1800.685727]  [<ffffffff8100c880>] ? restore_args+0x0/0x30
<4>[ 1800.685999]  [<ffffffff8105bb40>] ? kthread+0x0/0xa0
<4>[ 1800.686255]  [<ffffffff8100cf10>] ? child_rip+0x0/0x20
<4>[ 1800.686515]  ffff88007e0cf828 0000000000000046 0000000000000000 0000000000000000
<4>[ 1800.687149]  0000000000000000 0000000000000000 ffff880002213c58 0000000000000002
<4>[ 1800.687784]  ffff88000220efc8 00000001000202b4 ffff88007f932218 ffff88007e0cffd8
<4>[ 1800.688421] Call Trace:
<4>[ 1800.688594]  [<ffffffff81395ba5>] schedule_timeout+0x205/0x270
<4>[ 1800.688888]  [<ffffffff8106d03b>] ? mark_held_locks+0x6b/0xa0
<4>[ 1800.689172]  [<ffffffff8139824b>] ? _spin_unlock_irq+0x2b/0x60
<4>[ 1800.689462]  [<ffffffff8106d335>] ? trace_hardirqs_on_caller+0x145/0x190
<4>[ 1800.689781]  [<ffffffff8139726d>] __down_timeout+0x6d/0xb0
<4>[ 1800.690095]  [<ffffffff810609bf>] ? down_timeout+0x1f/0x60
<4>[ 1800.690387]  [<ffffffff810609f8>] down_timeout+0x58/0x60
<4>[ 1800.690659]  [<ffffffff81213bcc>] acpi_os_wait_semaphore+0x49/0x57
<4>[ 1800.690964]  [<ffffffff81226786>] acpi_ex_system_wait_semaphore+0x3c/0x4f
<4>[ 1800.691285]  [<ffffffff81221339>] acpi_ev_acquire_global_lock+0xbf/0xc3
<4>[ 1800.691600]  [<ffffffff81228354>] acpi_ex_acquire_mutex_object+0x39/0x63
<4>[ 1800.691917]  [<ffffffff81227fd0>] acpi_ex_acquire_global_lock+0x24/0x46
<4>[ 1800.692233]  [<ffffffff81223a72>] acpi_ex_read_data_from_field+0x121/0x16f
<4>[ 1800.692566]  [<ffffffff8106d38d>] ? trace_hardirqs_on+0xd/0x10
<4>[ 1800.692865]  [<ffffffff81228868>] acpi_ex_resolve_node_to_value+0x18c/0x218
<4>[ 1800.693207]  [<ffffffff812244b7>] acpi_ex_resolve_to_value+0x203/0x20c
<4>[ 1800.693521]  [<ffffffff8121eab9>] acpi_ds_evaluate_name_path+0x79/0xf5
<4>[ 1800.693838]  [<ffffffff8121d738>] acpi_ds_exec_end_op+0x96/0x3d6
<4>[ 1800.694132]  [<ffffffff8122f0ef>] acpi_ps_parse_loop+0x7bb/0x940
<4>[ 1800.694427]  [<ffffffff8122e1f9>] acpi_ps_parse_aml+0x9a/0x2d9
<4>[ 1800.694716]  [<ffffffff8122f90a>] acpi_ps_execute_method+0x1e4/0x2b6
<4>[ 1800.695024]  [<ffffffff8122b25d>] acpi_ns_evaluate+0xe1/0x1a8
<4>[ 1800.695309]  [<ffffffff8122acb5>] acpi_evaluate_object+0x13d/0x236
<4>[ 1800.695612]  [<ffffffff8106d03b>] ? mark_held_locks+0x6b/0xa0
<4>[ 1800.695897]  [<ffffffff8121449f>] acpi_evaluate_integer+0x2f/0x4d
<4>[ 1800.696198]  [<ffffffff8106d335>] ? trace_hardirqs_on_caller+0x145/0x190
<4>[ 1800.696516]  [<ffffffff81235815>] acpi_ac_get_state+0x36/0x6a
<4>[ 1800.696806]  [<ffffffff812358e3>] acpi_ac_resume+0x27/0x59
<4>[ 1800.697082]  [<ffffffff81215dd4>] acpi_device_resume+0x25/0x2b
<4>[ 1800.697373]  [<ffffffff8126945a>] dpm_resume_end+0x4aa/0x500
<4>[ 1800.697658]  [<ffffffff8107d9b9>] hibernation_snapshot+0xc9/0x280
<4>[ 1800.697953]  [<ffffffff8107dc5d>] hibernate+0xed/0x1f0
<4>[ 1800.698218]  [<ffffffff8107c46c>] state_store+0xec/0x100
<4>[ 1800.698488]  [<ffffffff811d7ae7>] kobj_attr_store+0x17/0x20
<4>[ 1800.698780]  [<ffffffff81137924>] sysfs_write_file+0xd4/0x150
<4>[ 1800.699066]  [<ffffffff810d3d38>] vfs_write+0xb8/0x1a0
<4>[ 1800.699342]  [<ffffffff810d3efc>] sys_write+0x4c/0x80
<4>[ 1800.699602]  [<ffffffff8100beab>] system_call_fastpath+0x16/0x1b
<4>[ 1800.699914] 2 locks held by tee/4850:
<4>[ 1800.700179]  #0:  (&buffer->mutex){+.+.+.}, at: [<ffffffff8113788f>] sysfs_write_file+0x3f/0x150
<4>[ 1800.700838]  #1:  (pm_mutex){+.+.+.}, at: [<ffffffff8107db87>] hibernate+0x17/0x1f0
<0>[ 1800.701438] Kernel panic - not syncing: hung_task: blocked tasks
<4>[ 1800.701716] Pid: 369, comm: khungtaskd Not tainted 2.6.32-rc8-wl #194
<4>[ 1800.702009] Call Trace:
<4>[ 1800.702169]  [<ffffffff8139470a>] panic+0x7d/0x13a
<4>[ 1800.702408]  [<ffffffff81086a59>] watchdog+0x2a9/0x2e0
<4>[ 1800.702664]  [<ffffffff8108681d>] ? watchdog+0x6d/0x2e0
<4>[ 1800.702945]  [<ffffffff813982ee>] ? _spin_unlock_irqrestore+0x6e/0x80
<4>[ 1800.703238]  [<ffffffff810867b0>] ? watchdog+0x0/0x2e0
<4>[ 1800.703489]  [<ffffffff8105bbce>] kthread+0x8e/0xa0
<4>[ 1800.703731]  [<ffffffff8100cf1a>] child_rip+0xa/0x20
<4>[ 1800.703975]  [<ffffffff8100c880>] ? restore_args+0x0/0x30
<4>[ 1800.704234]  [<ffffffff8105bb40>] ? kthread+0x0/0xa0
<4>[ 1800.704482]  [<ffffffff8100cf10>] ? child_rip+0x0/0x20
<0>[ 1800.704796] Rebooting in 60 seconds..
Comment 2 Maxim Levitsky 2009-11-22 21:28:22 UTC
Created attachment 23875 [details]
acpidump
Comment 3 Maxim Levitsky 2009-11-23 08:07:59 UTC
And yet another 280 cycles, video, ac, battery loaded with Xorg and everything else prove that I am not kidding, this was the last hibernation issue on my laptop.
Comment 4 Lin Ming 2009-12-08 09:54:47 UTC
Please build kernel with "CONFIG_ACPI_DEBUG=y"
Then boot with "acpi.debug_layer=0x00000004 acpi.debug_level=0x00000200"

Suspend/Resume, attach the dmesg to see who acqurie/release the global lock.
Comment 5 Maxim Levitsky 2009-12-26 22:33:02 UTC
Folks, any update on that?

The suggested test is not necessary because I know that lock is taken by accessing the EC configuration space.

I do the test soon though
Comment 6 Alexey Starikovskiy 2009-12-26 23:58:53 UTC
Maxim,
Please clarify comment #1: you say ac driver is blacklisted, yet trace shows ac resume function.
Comment 7 Maxim Levitsky 2009-12-27 16:51:07 UTC
No problem, this was backtrace before I blacklisted it.
Comment 8 Alexey Starikovskiy 2009-12-27 18:50:14 UTC
There are several methods which might work -- we could put 'if' statements into each driver ac/battery to reject procfs/sysfs calls during suspend/resume. We might do it in EC. Or we could disable (as in make it unlockable) global lock. Safest route would be first, most effective -- last. -EAGAIN seems to be appropriate value.
Comment 9 Alexey Starikovskiy 2009-12-27 18:51:46 UTC
Created attachment 24318 [details]
Refuse locking Global Lock if suspended

Please check if such patch works for your system.
Comment 10 Maxim Levitsky 2009-12-28 18:49:56 UTC
Nope, same hang.

However I think I got the idea about what happens.
I enabled the debugging bits as you requested, and I see that during _resume_, while memory image is read, there are few instances of access to global lock, which is not and ordinary lock, but it lives in FACS, and thus is not saved in hibernation image (and this is correct).

What happens is that during resume, ec transaction begins, hw lock is taken, but then in the middle, control is passed to resuming kernel, it tries to take that lock again... boom.

On resume, vga buffer is not cleared, but is overwritten by the resuming kernel.
Partially, message about taking the lock was visable.

Resumed kernel have shown message 'Waiting for hardware Global Lock'
I didn't capture the kernel log through.

One note about how thermal monitoring & fan control work on my system.

CPU is programmed to send an SMI, when temperature reaches a threshold, (40 C), this triggers the handler, and it programs the chipset to issue an SCI, which is handled by AML code, and it writes new temperature to the EC.
This is the path that I think is triggered while resume image is read.
Also EC could send an interrupt, and it will be handled by AML code too, and probably result in EC access.


I think that the right solution is to abort EC transaction, (and release the lock) just before control is passed to resuming kernel (and then refuse to do more transactions, because this kernel is going away anyway)
I think this is possible to do using new pm operations.
Comment 11 Maxim Levitsky 2010-01-12 21:56:49 UTC
After reading the EC driver sources, I understand that the only way to fix that is to release the hardware lock just before the resume.

What do you think.

This problem pretty much affects most notebooks.
Comment 12 Alexey Starikovskiy 2010-01-13 15:57:06 UTC
Maxim,
I'm not sure what you by that...
One option is to have special flag in booting kernel which prevents non-essential battery/thermal(?)/ac from installing.
Second option is again in booting kernel -- release ACPI GL unconditionally (wait for it to become free ?) before transferring control to resume kernel.
Could do elaborate your idea? patch or idea of it would be good :)
Comment 13 Maxim Levitsky 2010-01-13 19:27:09 UTC
One option is to have special flag in booting kernel which prevents
non-essential battery/thermal(?)/ac from installing.

^^^ This won't work because EC can be accessed from GPE handler for example, and it is here.

The second option is exactly what I meant.

I did that locally, but it seems that I am going to have a lot of fun.... because I have here now problems with both suspend part and resume (2.6.33-rc3) I guess that something is broken again.

I restore 2.6.32, apply the patch I did locally, and see how well it performs...
Comment 14 Maxim Levitsky 2010-01-23 09:35:05 UTC
Created attachment 24684 [details]
Failed attempt to fix that bug

for reference this is the latest attempt to fix that bug. Still hangs on resume.
Its is really wierd.

However with the lock busted, I made 590 suspend to disk cycles now with 2.6.33-rc3

One of the problems I had was a bug in my own driver I have here (mtd driver for xD card reader) other was the deadlocks on memory allocations during suspend.
Comment 15 Rafael J. Wysocki 2010-01-30 19:04:04 UTC
I'm not sure the patch does exactly what you wanted.  I thought it would just release the global lock instead of merely preventing it from being acquired.

Also, shouldn't we actually make sure that the global lock is released before suspend to RAM?  As well as in the hibernate .pre_snapshot() phase?  Alex?
Comment 16 Rafael J. Wysocki 2010-01-30 19:11:57 UTC
Really, trying to keep the global hardware lock by _any_ _driver_ over a suspend-resume cycle is a plain bug that needs fixing.  I mean _really_ _fixed_, not worked around by, let's face it, ugly hacks.
Comment 17 Rafael J. Wysocki 2010-01-30 20:19:32 UTC
Created attachment 24808 [details]
ACPI / EC: Do not carry out EC transactions while suspended

I wonder if this patch makes any difference.

It is a modified version of the Alex's patch from comment #9, but (1) it closes the race between acpi_ec_transaction() and acpi_ec_suspend/resume() (so the suspend routine will actually wait for transactions in progress to complete before setting the flag and disabling the GPE) and (2) it prevents transactions from being carried out while suspended regardless of the global lock (so there's no difference in behavior between the systems with and without the global lock).

It should work for you if the theory is correct and if no one else acquires the global lock independently.

_However_, it may introduce regressions, because disabling the EC transactions at this point may be too early (_PTS is still going to be called after that).  So, it may be necessary to move the disabling of EC transactions to the platform prepare callback, right after _PTS has been executed.
Comment 18 Maxim Levitsky 2010-01-30 21:16:00 UTC
>I'm not sure the patch does exactly what you wanted.  I thought it would just
>release the global lock instead of merely preventing it from being acquired.
>
>Also, shouldn't we actually make sure that the global lock is released before
>suspend to RAM?  As well as in the hibernate .pre_snapshot() phase?  Alex?

Why?

My patch (which doesn't work sadly), takes the lock, and then sets global variable (lock_suspended), then it releases it (and does release the _hardware_ lock. At that point all further attempts to take the lock just succeed without touching the global lock (because of that global variable)

Since it doesn't work, I assume that EC is somehow busted, thus on resume, indeed an transaction hangs in the middle, with lock held, or in other words, that resuming kernel takes the lock.

I will test the above patch.
Comment 19 Rafael J. Wysocki 2010-01-30 21:24:23 UTC
BTW, if the patch above doesn't work for you, this means the global lock is acquired from somewhere else, most likely from some AML routine, and not released.  That may be a result of a BIOS bug so we may need to low-level release the global lock unconditionally in the platform .prepare etc. callbacks to avoid that.

That said, I don't really think it should be possible to suspend in the middle of an EC transaction in progress, so something like my patch will be necessary anyway (perhaps with the modification that EC transactions will be blocked after executing _PTS and unblocked before executing _WAK).
Comment 20 Rafael J. Wysocki 2010-01-30 21:35:04 UTC
(In reply to comment #18)
> >I'm not sure the patch does exactly what you wanted.  I thought it would
> just
> >release the global lock instead of merely preventing it from being acquired.
> >
> >Also, shouldn't we actually make sure that the global lock is released
> before
> >suspend to RAM?  As well as in the hibernate .pre_snapshot() phase?  Alex?
> 
> Why?
> 
> My patch (which doesn't work sadly), takes the lock, and then sets global
> variable (lock_suspended), then it releases it (and does release the
> _hardware_
> lock.

I don't see that happen in the patch (perhaps I'm misreading it).  How is the hardware lock released?  You should have called acpi_ev_release_global_lock() (directly or indirectly) for that to happen, but I don't see that in your patch.  You only seem to set acpi_gbl_global_lock_suspended, which doesn't have any effect if the hardware lock is already being held at this point (which we assume is the case, BTW).
Comment 21 Maxim Levitsky 2010-01-30 21:41:43 UTC
But what I do next? I call the acpi_ex_release_mutex_object, and it releases the global lock (since I didn't touch the release path).
Comment 22 Rafael J. Wysocki 2010-01-30 21:51:48 UTC
No, it doesn't.  It only releases the mutex used for protecting the global lock and its reference counter (acpi_ev_global_lock_acquired) from concurrent access.  The global lock is released by ACPI_RELEASE_GLOBAL_LOCK(), but that can only be called if the reference counter is 0, which is what acpi_ev_release_global_lock() does.

[BTW, if I'm not mistaken, there is a bug in acpi_ev_acquire_global_lock() that should do acpi_os_release_mutex(acpi_gbl_global_lock_mutex->mutex.os_mutex) if it failed to acquire the global lock.]
Comment 23 Rafael J. Wysocki 2010-01-30 21:58:22 UTC
Ah, sorry, I overlooked the "Special case for Global Lock" in acpi_ex_acquire_mutex_object() (how broken is this?).
Comment 24 Rafael J. Wysocki 2010-01-30 22:11:54 UTC
Howver, this also has to be done during suspend because of the global lock's reference counter and mutex.  Namely, if we suspended with the global lock held, the reference counter will be 1 and the mutex will be seen as held in the image kernel even if the global lock itself has been released in .pre_restore().
Comment 25 Rafael J. Wysocki 2010-01-30 22:51:58 UTC
(In reply to comment #22)
> [BTW, if I'm not mistaken, there is a bug in acpi_ev_acquire_global_lock()
> that
> should do acpi_os_release_mutex(acpi_gbl_global_lock_mutex->mutex.os_mutex)
> if
> it failed to acquire the global lock.]

Scratch that, I was mistaken. :-)
Comment 26 Maxim Levitsky 2010-01-30 22:55:08 UTC
Sure, My goal was just to be sure that hw lock isn't held by kernel that does the resume.

Now testing both GFP_NOIO and above patch (42 cycles and counting...)
Comment 27 Maxim Levitsky 2010-01-31 20:24:34 UTC
The patch does work (about 70 cycles), but it introduces side effects as such that both 'battery' and 'video' acpi drivers misbehave.
Comment 28 Rafael J. Wysocki 2010-01-31 21:04:46 UTC
What kind of misbehavior is it?
Comment 29 Rafael J. Wysocki 2010-01-31 21:10:16 UTC
Ah, I've just read the e-mails.  Sorry for the bug in the patch, I'll attach fixed version in a minute.
Comment 30 Rafael J. Wysocki 2010-01-31 21:15:05 UTC
Created attachment 24842 [details]
ACPI / EC: Add protection from suspending in the middle of EC transaction

Fixed version of the last patch.
Comment 31 Maxim Levitsky 2010-01-31 21:25:49 UTC
At least after few cycles, both battery and brightness driver works.
Comment 32 Rafael J. Wysocki 2010-01-31 21:55:28 UTC
Great, thanks a lot for testing.

How many cycles does it usually take to hang the box during resume without the patch?
Comment 33 Rafael J. Wysocki 2010-01-31 22:25:10 UTC
In fact the patch from comment #30 fixes resume on my HP nx6325, which was a regression from 2.6.31, so adding this to the regression list.
Comment 34 Maxim Levitsky 2010-01-31 23:12:52 UTC
It takes around 40 cycles to show the problem.

I do more testing tomorrow.
Comment 35 Maxim Levitsky 2010-01-31 23:13:54 UTC
It takes around 40 cycles to show the problem.

I do more testing tomorrow.
Comment 36 Maxim Levitsky 2010-02-01 19:55:30 UTC
This patch survived 123 hibernate cycles.
However, this patch ought to introduce regressions like was described in comment #17
Comment 37 Rafael J. Wysocki 2010-02-01 21:25:20 UTC
(In reply to comment #36)
> This patch survived 123 hibernate cycles.

Good.

> However, this patch ought to introduce regressions like was described in
> comment #17

Bad.  What exactly doesn't work?
Comment 38 Maxim Levitsky 2010-02-01 22:32:37 UTC
Here everything works.
However if DSDT accesses EC after it was suspended, now it won't work.
or if another acpi driver accesses EC in .suspend, .resume or even normally while EC is suspended.
Comment 39 Rafael J. Wysocki 2010-02-01 23:07:24 UTC
Created attachment 24863 [details]
ACPI / EC: Remover race between EC driver and suspend process

Can you check if this patch still fixes the issue for you, please?
Comment 40 Maxim Levitsky 2010-02-02 22:06:00 UTC
Yep, works perfectly.

Posting this from my notebook after 138 hibernate cycles.
Comment 41 Rafael J. Wysocki 2010-02-02 22:17:22 UTC
Great, thanks for testing.  I've send a cleaned-up version of it to linux-acpi for review.  I'll attach it here for easier reference in a minute.

BTW, it appears that my nx6325 issue was only accidentally "fixed" by the previous patch, so clearing the regression flag and removing from the regression list.
Comment 42 Rafael J. Wysocki 2010-02-02 22:22:37 UTC
Created attachment 24884 [details]
ACPI / EC: Remove race between EC driver and suspend process

Cleaned-up version of the previous patch.
Comment 43 Rafael J. Wysocki 2010-02-03 00:37:51 UTC
Created attachment 24886 [details]
ACPI / EC: Remove race between EC driver and suspend process (rev. 2)

One more update.

After discussing the previous patch with Alex I changed it so that the resume behavior is not modified by it, which shouldn't change things in your test case.  Still, some testing would be highly appreciated.
Comment 44 Rafael J. Wysocki 2010-02-04 00:37:33 UTC
Created attachment 24904 [details]
ACPI / EC: Remove race between EC driver and suspend process (rev. 3)

Unfortunately, I have to update it once again.

I overlooked the fact that the previous patch would break the error paths and test modes, so the EC transactions need to be "resumed" in more places than in the previous patch.

This patch is essentially a combination of the two previous patches.  It "resumes" EC transactions as soon as reasonably possible during wake-up and at the same time it takes care of the error paths and suspend test modes as well.
Comment 45 Maxim Levitsky 2010-02-05 13:33:28 UTC
Just for your information, I tested this patch and GFP_NOIO patch for 169 hibernate cycles.

Its so nice to have reliable hibernation again.
Comment 46 Rafael J. Wysocki 2010-02-05 21:59:23 UTC
Thanks for testing.  The GFP_NOIO patch is in the -mm tree and I hope Andrew will push it before the final 2.6.33.

This one is 2.6.34 material if Len accepts it.  I'll add your "tested-by" to it if Len doesn't take it earlier.
Comment 47 Rafael J. Wysocki 2010-03-04 00:44:36 UTC
Created attachment 25350 [details]
ACPI / EC / PM: Close race between EC and resume from hibernation

I've discussed the previous patch with Len and we've concluded that the problem is hibernation-specific and therefore it may be fixed by a simpler patch like the attached one.
Comment 48 Rafael J. Wysocki 2010-03-04 00:45:50 UTC
Maxim, please check if the patch from comment #47 still fixes the problem for you.
Comment 49 Len Brown 2010-03-08 19:17:21 UTC
patch in comment #47 is in the acpi tree
Comment 50 Len Brown 2010-04-01 00:17:50 UTC
commit f6bb13aa1ea3bb26a4c783822347873f085b9000
Author: Rafael J. Wysocki <rjw@sisk.pl>
Date:   Thu Mar 4 01:52:58 2010 +0100

    ACPI / EC / PM: Close race between EC and resume from hibernation


shipped in linux-2.6.34-rc2
closed
Comment 51 Maxim Levitsky 2010-04-02 10:59:28 UTC
Belive me or not, but that bug exploded again here:
This is wireless-testing kernel based on 2.6.34-rc3 + no PM related patches.


<6>[   16.512418] PM: Read 519208 kbytes in 11.32 seconds (45.86 MB/s)
<7>[   16.512780] PM: Image successfully loaded
<5>[   16.513801] sd 0:0:0:0: [sda] Synchronizing SCSI cache
<6>[   16.514878] ata_piix 0000:00:1f.1: PCI INT A disabled
<6>[   16.590170] PM: quiesce of devices complete after 76.674 msecs
<6>[   16.591562] PM: late quiesce of devices complete after 1.054 msecs
<4>[   16.592112] Disabling non-boot CPUs ...
<6>[   16.730028] CPU 1 is now offline
<4>[   16.730221] lockdep: fixing up alternatives.
<6>[   16.730469] SMP alternatives: switching to UP code
<6>[  240.870014] PM: Restoring platform NVS memory
<4>[  240.870014] Enabling non-boot CPUs ...
<4>[  240.890711] lockdep: fixing up alternatives.
<6>[  240.890720] SMP alternatives: switching to SMP code
<6>[  240.896468] Booting Node 0 Processor 1 APIC 0x1
<4>[  241.100023] CPU1 is up
<6>[  241.101089] ACPI: Waking up from system sleep state S4
<3>[  361.850150] INFO: task pm-hibernate:6030 blocked for more than 120 seconds.
<3>[  361.850553] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[  361.851002] pm-hibernate  D 0000000000000000     0  6030   5811 0x00000000
<4>[  361.851019]  ffff880047b67708 0000000000000046 ffff880000000000 ffff880047b67fd8
<4>[  361.851037]  ffff8800217a0000 00000000000121c0 0000000000004000 ffff880047b67fd8
<4>[  361.851054]  0000000000004000 00000000000121c0 ffffffff8160e020 ffff8800217a0000
<4>[  361.851071] Call Trace:
<4>[  361.851085]  [<ffffffff8106a94d>] ? __module_text_address+0xd/0x60
<4>[  361.851101]  [<ffffffff8137007d>] schedule_timeout+0x20d/0x280
<4>[  361.851115]  [<ffffffff8104eac8>] ? __kernel_text_address+0x58/0x80
<4>[  361.851130]  [<ffffffff8137163d>] __down_timeout+0x6d/0xb0
<4>[  361.851143]  [<ffffffff8105697f>] ? down_timeout+0x1f/0x60
<4>[  361.851156]  [<ffffffff810569b8>] down_timeout+0x58/0x60
<4>[  361.851169]  [<ffffffff811f3e60>] acpi_os_wait_semaphore+0x98/0x137
<4>[  361.851185]  [<ffffffff8120b17b>] acpi_ex_system_wait_semaphore+0x40/0x8b
<4>[  361.851200]  [<ffffffff81204f46>] acpi_ev_acquire_global_lock+0x11d/0x121
<4>[  361.851215]  [<ffffffff8120dfec>] acpi_ex_acquire_mutex_object+0x39/0x63
<4>[  361.851230]  [<ffffffff8120d160>] acpi_ex_acquire_global_lock+0x24/0x46
<4>[  361.851244]  [<ffffffff81207b23>] acpi_ex_write_data_to_field+0x257/0x283
<4>[  361.851260]  [<ffffffff8120c6e0>] acpi_ex_store_object_to_node+0x108/0x1fa
<4>[  361.851276]  [<ffffffff8120cdc3>] acpi_ex_store+0xbe/0x29b
<4>[  361.851289]  [<ffffffff8120a31a>] acpi_ex_opcode_1A_1T_1R+0x35c/0x4b7
<4>[  361.851304]  [<ffffffff811ffe13>] acpi_ds_exec_end_op+0xef/0x570
<4>[  361.851318]  [<ffffffff81217386>] acpi_ps_parse_loop+0x8b2/0xa6c
<4>[  361.851333]  [<ffffffff810be8af>] ? kmem_cache_alloc+0x8f/0x110
<4>[  361.851347]  [<ffffffff812162c5>] acpi_ps_parse_aml+0x10a/0x3c1
<4>[  361.851361]  [<ffffffff81217cc0>] acpi_ps_execute_method+0x20e/0x31a
<4>[  361.851375]  [<ffffffff81211c6c>] acpi_ns_evaluate+0x18c/0x2c9
<4>[  361.851389]  [<ffffffff812114e4>] acpi_evaluate_object+0x1ad/0x2f1
<4>[  361.851403]  [<ffffffff81206a81>] ? acpi_ev_walk_gpe_list+0x75/0x87
<4>[  361.851417]  [<ffffffff8120f35f>] acpi_leave_sleep_state+0xa6/0x151
<4>[  361.851431]  [<ffffffff811f548f>] acpi_pm_finish+0x34/0x49
<4>[  361.851444]  [<ffffffff811f54b2>] acpi_hibernation_finish+0xe/0x10
<4>[  361.851458]  [<ffffffff81070ebc>] hibernation_snapshot+0xec/0x2c0
<4>[  361.851471]  [<ffffffff813704e9>] ? mutex_unlock+0x9/0x10
<4>[  361.851483]  [<ffffffff8107118c>] hibernate+0xfc/0x1f0
<4>[  361.851496]  [<ffffffff8106fadc>] state_store+0xec/0x100
<4>[  361.851510]  [<ffffffff811b2ec7>] kobj_attr_store+0x17/0x20
<4>[  361.851524]  [<ffffffff8112c342>] sysfs_write_file+0xf2/0x170
<4>[  361.851538]  [<ffffffff810c6708>] vfs_write+0xb8/0x1a0
<4>[  361.851551]  [<ffffffff8137224a>] ? lockdep_sys_exit_thunk+0x35/0x67
<4>[  361.851564]  [<ffffffff810c68cc>] sys_write+0x4c/0x80
<4>[  361.851578]  [<ffffffff81002deb>] system_call_fastpath+0x16/0x1b
<4>[  361.851590] INFO: lockdep is turned off.
<0>[  361.851599] Kernel panic - not syncing: hung_task: blocked tasks
<4>[  361.851946] Pid: 352, comm: khungtaskd Tainted: P           2.6.34-rc3-wl #38
<4>[  361.851959] Call Trace:
<4>[  361.851967]  [<ffffffff8136ef25>] panic+0x82/0xfc
<4>[  361.851979]  [<ffffffff8107a669>] watchdog+0x2a9/0x2e0
<4>[  361.851991]  [<ffffffff8107a42d>] ? watchdog+0x6d/0x2e0
<4>[  361.852004]  [<ffffffff81372cd3>] ? _raw_spin_unlock_irqrestore+0x33/0x60
<4>[  361.852019]  [<ffffffff8107a3c0>] ? watchdog+0x0/0x2e0
<4>[  361.852031]  [<ffffffff8105102e>] kthread+0x9e/0xb0
<4>[  361.852044]  [<ffffffff81003b74>] kernel_thread_helper+0x4/0x10
<4>[  361.852058]  [<ffffffff81050f90>] ? kthread+0x0/0xb0
<4>[  361.852070]  [<ffffffff81003b70>] ? kernel_thread_helper+0x0/0x10
<0>[  361.852090] Rebooting in 10 seconds..
Comment 52 Maxim Levitsky 2010-04-02 11:23:20 UTC
No wonder its broken.
I see sole call to acpi_ec_suspend_transactions in acpi_pm_pre_restore.
Well it too late to call it from that point, because that code will be executed in the *resuming* kernel. At that hw lock is already taken.
This has to be called from suspending kernel
Comment 53 Rafael J. Wysocki 2010-04-02 16:40:26 UTC
Not exactly.  This means _PTS doesn't prevent this thing from happening as we have assumed.
Comment 54 Rafael J. Wysocki 2010-04-02 21:10:47 UTC
Created attachment 25831 [details]
ACPI / EC / PM: Fix race between EC transactions and system suspend

Please try to reproduce with this patch applied (it's the patch from comment #44 rebased on top of the current mainline).
Comment 55 Rafael J. Wysocki 2010-04-02 21:18:13 UTC
Created attachment 25833 [details]
ACPI / EC / PM: Fix race between EC transactions and system suspend

Sorry, with this one applied, please (the previous one was incomplete).
Comment 56 Maxim Levitsky 2010-04-02 22:22:23 UTC
Sure.
About my comment #52, its totally wrong. the resuming kernel should suspend the transactions. Still something is going on.
I will run my hibernate loop tonight with latest patch applied,
Thanks!
Comment 57 Maxim Levitsky 2010-04-03 13:44:26 UTC
150 cycles tested and still works. I guess this patch fixes that bug correctly
Comment 58 Rafael J. Wysocki 2010-04-03 18:42:02 UTC
Thanks for being persistent. :-)

I'll submit the patch to Len, but still if you could do some more testing just to make sure the problem is really gone with this patch, that would be appreciated.
Comment 59 Maxim Levitsky 2010-04-10 00:22:05 UTC
Sorry for no responding before.
According to all tests I done, 150 cycles is much more that enough.
I never seen this fail after 40 cycles.
Comment 60 Maxim Levitsky 2010-05-07 02:12:00 UTC
Any reason, this is still not in kernel?
Comment 61 Maxim Levitsky 2010-05-21 21:38:34 UTC
Was this patch forgotten.?

I reopen this bug
Comment 62 Len Brown 2010-06-04 18:08:56 UTC
The bugzilla protocol is that a bug is marked RESOLVED
if a fix exists that is a candidate for upstream.
It is CLOSED when the patch makes it upstream.

In this report
commit d5a64513c6a171262082c250592c062e97a2c693
    ACPI / EC / PM: Fix race between EC transactions and system suspend

is staged in the acpi tree waiting for inclusion upstream.

So this bug is now marked RESOLVED, and will be marked CLOSED
when the journey is complete.
Comment 63 Len Brown 2010-06-09 04:37:32 UTC
shipped in 2.6.35-rc2

closed

Note You need to log in before you can comment on or make changes to this bug.