Bug 13245
Summary: | possible circular locking dependency detected | ||
---|---|---|---|
Product: | Networking | Reporter: | Miles Lane (miles.lane) |
Component: | Wireless | Assignee: | networking_wireless (networking_wireless) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | johannes, linville, rjw, tomi |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.30-rc4-git1 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 7216, 13070 | ||
Attachments: |
INFO message text
PM: Do not hold dpm_list_mtx while disabling/enabling nonboot CPUs |
Hi all, This is effectively a copy of http://bugzilla.kernel.org/show_bug.cgi?id=13245 but with a few more people. I hope this works and bugme picks it up. The report is below. The problem here is effectively that something runs off the events workqueue, and ends up, with some indirection that I think doesn't matter much, using cpu_add_remove_lock, either while being on the events workqueue or in this case depending on a lock that was used from the events workqueue. It seems that the problem is in disable_nonboot_cpus, which via the notifier indirection calls cleanup_workqueue_thread() which flushes the workqueue so needs the fake "events" lock. On the other hand, via some indirection, things that run off the events end up using (depending on something that uses) cpu_add_remove_lock. Thoughts? [ 1231.558889] [ INFO: possible circular locking dependency detected ] [ 1231.558895] 2.6.30-rc4-git1 #5 [ 1231.558899] ------------------------------------------------------- [ 1231.558903] pm-suspend/4741 is trying to acquire lock: [ 1231.558908] (events){+.+.+.}, at: [<ffffffff80249feb>] cleanup_workqueue_thread+0x1e/0xf8 [ 1231.558927] [ 1231.558929] but task is already holding lock: [ 1231.558932] (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff8023a8a5>] disable_nonboot_cpus+0x33/0x11d [ 1231.558947] [ 1231.558948] which lock already depends on the new lock. [ 1231.558951] [ 1231.558954] [ 1231.558955] the existing dependency chain (in reverse order) is: [ 1231.558959] [ 1231.558961] -> #5 (cpu_add_remove_lock){+.+.+.}: [ 1231.558969] [<ffffffff8025dc42>] __lock_acquire+0xa88/0xc22 [ 1231.558980] [<ffffffff8025ded2>] lock_acquire+0xf6/0x11a [ 1231.558988] [<ffffffff8058a217>] __mutex_lock_common+0x5a/0x429 [ 1231.558999] [<ffffffff8058a680>] mutex_lock_nested+0x30/0x35 [ 1231.559008] [<ffffffff8023a80e>] cpu_maps_update_begin+0x12/0x14 [ 1231.559017] [<ffffffff8024a34b>] __create_workqueue_key+0x161/0x226 [ 1231.559026] [<ffffffff80275cc8>] stop_machine_create+0x3a/0x9b [ 1231.559035] [<ffffffff80275d42>] stop_machine+0x19/0x4b [ 1231.559044] [<ffffffffa0115238>] 0xffffffffa0115238 [ 1231.559061] [<ffffffff8020906b>] do_one_initcall+0x65/0x153 [ 1231.559072] [<ffffffff80266b50>] sys_init_module+0xab/0x1d3 [ 1231.559080] [<ffffffff8020b26b>] system_call_fastpath+0x16/0x1b [ 1231.559089] [<ffffffffffffffff>] 0xffffffffffffffff [ 1231.559097] [ 1231.559099] -> #4 (setup_lock){+.+.+.}: [ 1231.559105] [<ffffffff8025dc42>] __lock_acquire+0xa88/0xc22 [ 1231.559114] [<ffffffff8025ded2>] lock_acquire+0xf6/0x11a [ 1231.559122] [<ffffffff8058a217>] __mutex_lock_common+0x5a/0x429 [ 1231.559132] [<ffffffff8058a680>] mutex_lock_nested+0x30/0x35 [ 1231.559141] [<ffffffff80275ca0>] stop_machine_create+0x12/0x9b [ 1231.559150] [<ffffffff8023a888>] disable_nonboot_cpus+0x16/0x11d [ 1231.559159] [<ffffffff80267993>] suspend_devices_and_enter+0xd6/0x1b1 [ 1231.559168] [<ffffffff80267bfa>] enter_state+0x163/0x1c9 [ 1231.559176] [<ffffffff80267d17>] state_store+0xb7/0xd8 [ 1231.559184] [<ffffffff803bd78b>] kobj_attr_store+0x17/0x19 [ 1231.559194] [<ffffffff803137b3>] sysfs_write_file+0xe4/0x119 [ 1231.559203] [<ffffffff802c237a>] vfs_write+0xab/0x105 [ 1231.559212] [<ffffffff802c2498>] sys_write+0x47/0x70 [ 1231.559219] [<ffffffff8020b26b>] system_call_fastpath+0x16/0x1b [ 1231.559228] [<ffffffffffffffff>] 0xffffffffffffffff [ 1231.559236] [ 1231.559238] -> #3 (dpm_list_mtx){+.+.+.}: [ 1231.559245] [<ffffffff8025dc42>] __lock_acquire+0xa88/0xc22 [ 1231.559253] [<ffffffff8025ded2>] lock_acquire+0xf6/0x11a [ 1231.559262] [<ffffffff8058a217>] __mutex_lock_common+0x5a/0x429 [ 1231.559272] [<ffffffff8058a680>] mutex_lock_nested+0x30/0x35 [ 1231.559281] [<ffffffff8046002d>] device_pm_add+0x1e/0xb5 [ 1231.559292] [<ffffffff80459fae>] device_add+0x37c/0x557 [ 1231.559301] [<ffffffffa008f453>] wiphy_register+0x148/0x1ec [cfg80211] [ 1231.559320] [<ffffffffa00c62f8>] ieee80211_register_hw+0xe9/0x3aa [mac80211] [ 1231.559348] [<ffffffffa01776e2>] iwl3945_pci_probe+0xb5c/0xcce [iwl3945] [ 1231.559368] [<ffffffff803d1a5f>] local_pci_probe+0x12/0x16 [ 1231.559378] [<ffffffff803d274e>] pci_device_probe+0x5f/0x89 [ 1231.559388] [<ffffffff8045bf49>] driver_probe_device+0x9e/0x143 [ 1231.559398] [<ffffffff8045c046>] __driver_attach+0x58/0x7b [ 1231.559406] [<ffffffff8045b83c>] bus_for_each_dev+0x54/0x8a [ 1231.559415] [<ffffffff8045bdbb>] driver_attach+0x1c/0x1e [ 1231.559423] [<ffffffff8045b178>] bus_add_driver+0xb2/0x1e6 [ 1231.559432] [<ffffffff8045c300>] driver_register+0xb6/0x121 [ 1231.559439] [<ffffffff803d2b76>] __pci_register_driver+0x61/0xcf [ 1231.559448] [<ffffffffa018505c>] 0xffffffffa018505c [ 1231.559456] [<ffffffff8020906b>] do_one_initcall+0x65/0x153 [ 1231.559464] [<ffffffff80266b50>] sys_init_module+0xab/0x1d3 [ 1231.559473] [<ffffffff8020b26b>] system_call_fastpath+0x16/0x1b [ 1231.559481] [<ffffffffffffffff>] 0xffffffffffffffff [ 1231.559489] [ 1231.559491] -> #2 (cfg80211_mutex){+.+.+.}: [ 1231.559498] [<ffffffff8025dc42>] __lock_acquire+0xa88/0xc22 [ 1231.559507] [<ffffffff8025ded2>] lock_acquire+0xf6/0x11a [ 1231.559515] [<ffffffff8058a217>] __mutex_lock_common+0x5a/0x429 [ 1231.559524] [<ffffffff8058a680>] mutex_lock_nested+0x30/0x35 [ 1231.559533] [<ffffffffa0090a0b>] reg_todo+0x5d/0x4d8 [cfg80211] [ 1231.559553] [<ffffffff80249bf9>] worker_thread+0x24b/0x367 [ 1231.559561] [<ffffffff8024d94d>] kthread+0x56/0x83 [ 1231.559570] [<ffffffff8020c37a>] child_rip+0xa/0x20 [ 1231.559577] [<ffffffffffffffff>] 0xffffffffffffffff [ 1231.559585] [ 1231.559587] -> #1 (reg_work){+.+.+.}: [ 1231.559593] [<ffffffff8025dc42>] __lock_acquire+0xa88/0xc22 [ 1231.559603] [<ffffffff8025ded2>] lock_acquire+0xf6/0x11a [ 1231.559611] [<ffffffff80249bf0>] worker_thread+0x242/0x367 [ 1231.559619] [<ffffffff8024d94d>] kthread+0x56/0x83 [ 1231.559627] [<ffffffff8020c37a>] child_rip+0xa/0x20 [ 1231.559634] [<ffffffffffffffff>] 0xffffffffffffffff [ 1231.559658] [ 1231.559659] -> #0 (events){+.+.+.}: [ 1231.559666] [<ffffffff8025db36>] __lock_acquire+0x97c/0xc22 [ 1231.559675] [<ffffffff8025ded2>] lock_acquire+0xf6/0x11a [ 1231.559683] [<ffffffff8024a012>] cleanup_workqueue_thread+0x45/0xf8 [ 1231.559691] [<ffffffff8057c347>] workqueue_cpu_callback+0xc2/0x105 [ 1231.559703] [<ffffffff8058e6a1>] notifier_call_chain+0x5e/0x92 [ 1231.559712] [<ffffffff80251c78>] raw_notifier_call_chain+0xf/0x11 [ 1231.559722] [<ffffffff8057a9db>] _cpu_down+0x278/0x295 [ 1231.559730] [<ffffffff8023a8ea>] disable_nonboot_cpus+0x78/0x11d [ 1231.559740] [<ffffffff80267993>] suspend_devices_and_enter+0xd6/0x1b1 [ 1231.559749] [<ffffffff80267bfa>] enter_state+0x163/0x1c9 [ 1231.559757] [<ffffffff80267d17>] state_store+0xb7/0xd8 [ 1231.559765] [<ffffffff803bd78b>] kobj_attr_store+0x17/0x19 [ 1231.559773] [<ffffffff803137b3>] sysfs_write_file+0xe4/0x119 [ 1231.559781] [<ffffffff802c237a>] vfs_write+0xab/0x105 [ 1231.559788] [<ffffffff802c2498>] sys_write+0x47/0x70 [ 1231.559796] [<ffffffff8020b26b>] system_call_fastpath+0x16/0x1b [ 1231.559804] [<ffffffffffffffff>] 0xffffffffffffffff [ 1231.559813] [ 1231.559814] other info that might help us debug this: [ 1231.559817] [ 1231.559822] 4 locks held by pm-suspend/4741: [ 1231.559825] #0: (&buffer->mutex){+.+.+.}, at: [<ffffffff80313707>] sysfs_write_file+0x38/0x119 [ 1231.559838] #1: (pm_mutex){+.+.+.}, at: [<ffffffff80267c57>] enter_state+0x1c0/0x1c9 [ 1231.559851] #2: (dpm_list_mtx){+.+.+.}, at: [<ffffffff8045f533>] device_pm_lock+0x12/0x14 [ 1231.559864] #3: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff8023a8a5>] disable_nonboot_cpus+0x33/0x11d [ 1231.559878] [ 1231.559880] stack backtrace: [ 1231.559886] Pid: 4741, comm: pm-suspend Not tainted 2.6.30-rc4-git1 #5 [ 1231.559892] Call Trace: [ 1231.559895] [<ffffffff8025ce44>] print_circular_bug_tail+0x71/0x7c [ 1231.559895] [<ffffffff8025db36>] __lock_acquire+0x97c/0xc22 [ 1231.559895] [<ffffffff8025ded2>] lock_acquire+0xf6/0x11a [ 1231.559895] [<ffffffff80249feb>] ? cleanup_workqueue_thread+0x1e/0xf8 [ 1231.559895] [<ffffffff8024a012>] cleanup_workqueue_thread+0x45/0xf8 [ 1231.559895] [<ffffffff80249feb>] ? cleanup_workqueue_thread+0x1e/0xf8 [ 1231.559895] [<ffffffff8057c347>] workqueue_cpu_callback+0xc2/0x105 [ 1231.559895] [<ffffffff8058e6a1>] notifier_call_chain+0x5e/0x92 [ 1231.559895] [<ffffffff80251c78>] raw_notifier_call_chain+0xf/0x11 [ 1231.559895] [<ffffffff8057a9db>] _cpu_down+0x278/0x295 [ 1231.559895] [<ffffffff8023a8ea>] disable_nonboot_cpus+0x78/0x11d [ 1231.559895] [<ffffffff80267993>] suspend_devices_and_enter+0xd6/0x1b1 [ 1231.559895] [<ffffffff80267bfa>] enter_state+0x163/0x1c9 [ 1231.559895] [<ffffffff80267d17>] state_store+0xb7/0xd8 [ 1231.559895] [<ffffffff803bd78b>] kobj_attr_store+0x17/0x19 [ 1231.559895] [<ffffffff803137b3>] sysfs_write_file+0xe4/0x119 [ 1231.559895] [<ffffffff802c237a>] vfs_write+0xab/0x105 [ 1231.559895] [<ffffffff802c2498>] sys_write+0x47/0x70 [ 1231.559895] [<ffffffff8020b26b>] system_call_fastpath+0x16/0x1b Yes, I am still seeing this with 2.6.30-rc6-git1. On Sunday 17 May 2009, Ming Lei wrote:
> I also can reproduce it on 2.6.30-rc5-next-20090515.
>
> 2009/5/17 Rafael J. Wysocki <rjw@sisk.pl>:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.29. Please verify if it still should be listed and let me know
> > (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13245
> > Subject : possible circular locking dependency detected
> > Submitter : Miles Lane <miles.lane@gmail.com>
> > Date : 2009-05-04 16:56 (13 days old)
*** Bug 13296 has been marked as a duplicate of this bug. *** Is it possible that this causes my system to hang after "Suspending consoles" ? (In reply to comment #5) > Is it possible that this causes my system to hang after "Suspending consoles" > ? I'm not quite sure but I don't think so -- this might be a false positive. I have provided a shortened deadlock scenario here: http://thread.gmane.org/gmane.linux.kernel/835569/focus=838839 Oleg and me independently arrived at the conclusion that the commits PM: Change hibernation code ordering 4aecd6718939eb3c4145b248369b65f7483a8a02 PM: Change suspend code ordering 900af0d973856d6feb6fc088c2d0d3fde57707d3 are the ones causing this problem (for hibernate/suspend respectively). Created attachment 21511 [details]
PM: Do not hold dpm_list_mtx while disabling/enabling nonboot CPUs
The attached patch should fix the problem. Miles please test it.
Handled-By : Rafael J. Wysocki <rjw@sisk.pl> Patch : http://patchwork.kernel.org/patch/25557/ Fixed by commit 32bdfac5462d777f35b00838893c4f87baf23efe . |
Created attachment 21214 [details] INFO message text As far as I can tell, the system is running normally after resume. Also, I only see this INFO message on the first suspend/resume cycle. Subsequent suspends do not trigger the message.