Bug 13245

Summary: possible circular locking dependency detected
Product: Networking Reporter: Miles Lane (miles.lane)
Component: WirelessAssignee: networking_wireless (networking_wireless)
Status: CLOSED CODE_FIX    
Severity: normal CC: johannes, linville, rjw, tomi
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.30-rc4-git1 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 7216, 13070    
Attachments: INFO message text
PM: Do not hold dpm_list_mtx while disabling/enabling nonboot CPUs

Description Miles Lane 2009-05-04 16:56:58 UTC
Created attachment 21214 [details]
INFO message text

As far as I can tell, the system is running normally after resume.  
Also, I only see this INFO message on the first suspend/resume cycle.  Subsequent suspends do not trigger the message.
Comment 1 Johannes Berg 2009-05-04 17:22:32 UTC
Hi all,

This is effectively a copy of
http://bugzilla.kernel.org/show_bug.cgi?id=13245 but with a few more
people. I hope this works and bugme picks it up.

The report is below.

The problem here is effectively that something runs off the events
workqueue, and ends up, with some indirection that I think doesn't
matter much, using cpu_add_remove_lock, either while being on the events
workqueue or in this case depending on a lock that was used from the
events workqueue.

It seems that the problem is in disable_nonboot_cpus, which via the
notifier indirection calls cleanup_workqueue_thread() which flushes the
workqueue so needs the fake "events" lock.

On the other hand, via some indirection, things that run off the events
end up using (depending on something that uses) cpu_add_remove_lock.

Thoughts?

[ 1231.558889] [ INFO: possible circular locking dependency detected ]
[ 1231.558895] 2.6.30-rc4-git1 #5
[ 1231.558899] -------------------------------------------------------
[ 1231.558903] pm-suspend/4741 is trying to acquire lock:
[ 1231.558908]  (events){+.+.+.}, at: [<ffffffff80249feb>] cleanup_workqueue_thread+0x1e/0xf8
[ 1231.558927] 
[ 1231.558929] but task is already holding lock:
[ 1231.558932]  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff8023a8a5>] disable_nonboot_cpus+0x33/0x11d
[ 1231.558947] 
[ 1231.558948] which lock already depends on the new lock.
[ 1231.558951] 
[ 1231.558954] 
[ 1231.558955] the existing dependency chain (in reverse order) is:
[ 1231.558959] 
[ 1231.558961] -> #5 (cpu_add_remove_lock){+.+.+.}:
[ 1231.558969]        [<ffffffff8025dc42>] __lock_acquire+0xa88/0xc22
[ 1231.558980]        [<ffffffff8025ded2>] lock_acquire+0xf6/0x11a
[ 1231.558988]        [<ffffffff8058a217>] __mutex_lock_common+0x5a/0x429
[ 1231.558999]        [<ffffffff8058a680>] mutex_lock_nested+0x30/0x35
[ 1231.559008]        [<ffffffff8023a80e>] cpu_maps_update_begin+0x12/0x14
[ 1231.559017]        [<ffffffff8024a34b>] __create_workqueue_key+0x161/0x226
[ 1231.559026]        [<ffffffff80275cc8>] stop_machine_create+0x3a/0x9b
[ 1231.559035]        [<ffffffff80275d42>] stop_machine+0x19/0x4b
[ 1231.559044]        [<ffffffffa0115238>] 0xffffffffa0115238
[ 1231.559061]        [<ffffffff8020906b>] do_one_initcall+0x65/0x153
[ 1231.559072]        [<ffffffff80266b50>] sys_init_module+0xab/0x1d3
[ 1231.559080]        [<ffffffff8020b26b>] system_call_fastpath+0x16/0x1b
[ 1231.559089]        [<ffffffffffffffff>] 0xffffffffffffffff
[ 1231.559097] 
[ 1231.559099] -> #4 (setup_lock){+.+.+.}:
[ 1231.559105]        [<ffffffff8025dc42>] __lock_acquire+0xa88/0xc22
[ 1231.559114]        [<ffffffff8025ded2>] lock_acquire+0xf6/0x11a
[ 1231.559122]        [<ffffffff8058a217>] __mutex_lock_common+0x5a/0x429
[ 1231.559132]        [<ffffffff8058a680>] mutex_lock_nested+0x30/0x35
[ 1231.559141]        [<ffffffff80275ca0>] stop_machine_create+0x12/0x9b
[ 1231.559150]        [<ffffffff8023a888>] disable_nonboot_cpus+0x16/0x11d
[ 1231.559159]        [<ffffffff80267993>] suspend_devices_and_enter+0xd6/0x1b1
[ 1231.559168]        [<ffffffff80267bfa>] enter_state+0x163/0x1c9
[ 1231.559176]        [<ffffffff80267d17>] state_store+0xb7/0xd8
[ 1231.559184]        [<ffffffff803bd78b>] kobj_attr_store+0x17/0x19
[ 1231.559194]        [<ffffffff803137b3>] sysfs_write_file+0xe4/0x119
[ 1231.559203]        [<ffffffff802c237a>] vfs_write+0xab/0x105
[ 1231.559212]        [<ffffffff802c2498>] sys_write+0x47/0x70
[ 1231.559219]        [<ffffffff8020b26b>] system_call_fastpath+0x16/0x1b
[ 1231.559228]        [<ffffffffffffffff>] 0xffffffffffffffff
[ 1231.559236] 
[ 1231.559238] -> #3 (dpm_list_mtx){+.+.+.}:
[ 1231.559245]        [<ffffffff8025dc42>] __lock_acquire+0xa88/0xc22
[ 1231.559253]        [<ffffffff8025ded2>] lock_acquire+0xf6/0x11a
[ 1231.559262]        [<ffffffff8058a217>] __mutex_lock_common+0x5a/0x429
[ 1231.559272]        [<ffffffff8058a680>] mutex_lock_nested+0x30/0x35
[ 1231.559281]        [<ffffffff8046002d>] device_pm_add+0x1e/0xb5
[ 1231.559292]        [<ffffffff80459fae>] device_add+0x37c/0x557
[ 1231.559301]        [<ffffffffa008f453>] wiphy_register+0x148/0x1ec [cfg80211]
[ 1231.559320]        [<ffffffffa00c62f8>] ieee80211_register_hw+0xe9/0x3aa [mac80211]
[ 1231.559348]        [<ffffffffa01776e2>] iwl3945_pci_probe+0xb5c/0xcce [iwl3945]
[ 1231.559368]        [<ffffffff803d1a5f>] local_pci_probe+0x12/0x16
[ 1231.559378]        [<ffffffff803d274e>] pci_device_probe+0x5f/0x89
[ 1231.559388]        [<ffffffff8045bf49>] driver_probe_device+0x9e/0x143
[ 1231.559398]        [<ffffffff8045c046>] __driver_attach+0x58/0x7b
[ 1231.559406]        [<ffffffff8045b83c>] bus_for_each_dev+0x54/0x8a
[ 1231.559415]        [<ffffffff8045bdbb>] driver_attach+0x1c/0x1e
[ 1231.559423]        [<ffffffff8045b178>] bus_add_driver+0xb2/0x1e6
[ 1231.559432]        [<ffffffff8045c300>] driver_register+0xb6/0x121
[ 1231.559439]        [<ffffffff803d2b76>] __pci_register_driver+0x61/0xcf
[ 1231.559448]        [<ffffffffa018505c>] 0xffffffffa018505c
[ 1231.559456]        [<ffffffff8020906b>] do_one_initcall+0x65/0x153
[ 1231.559464]        [<ffffffff80266b50>] sys_init_module+0xab/0x1d3
[ 1231.559473]        [<ffffffff8020b26b>] system_call_fastpath+0x16/0x1b
[ 1231.559481]        [<ffffffffffffffff>] 0xffffffffffffffff
[ 1231.559489] 
[ 1231.559491] -> #2 (cfg80211_mutex){+.+.+.}:
[ 1231.559498]        [<ffffffff8025dc42>] __lock_acquire+0xa88/0xc22
[ 1231.559507]        [<ffffffff8025ded2>] lock_acquire+0xf6/0x11a
[ 1231.559515]        [<ffffffff8058a217>] __mutex_lock_common+0x5a/0x429
[ 1231.559524]        [<ffffffff8058a680>] mutex_lock_nested+0x30/0x35
[ 1231.559533]        [<ffffffffa0090a0b>] reg_todo+0x5d/0x4d8 [cfg80211]
[ 1231.559553]        [<ffffffff80249bf9>] worker_thread+0x24b/0x367
[ 1231.559561]        [<ffffffff8024d94d>] kthread+0x56/0x83
[ 1231.559570]        [<ffffffff8020c37a>] child_rip+0xa/0x20
[ 1231.559577]        [<ffffffffffffffff>] 0xffffffffffffffff
[ 1231.559585] 
[ 1231.559587] -> #1 (reg_work){+.+.+.}:
[ 1231.559593]        [<ffffffff8025dc42>] __lock_acquire+0xa88/0xc22
[ 1231.559603]        [<ffffffff8025ded2>] lock_acquire+0xf6/0x11a
[ 1231.559611]        [<ffffffff80249bf0>] worker_thread+0x242/0x367
[ 1231.559619]        [<ffffffff8024d94d>] kthread+0x56/0x83
[ 1231.559627]        [<ffffffff8020c37a>] child_rip+0xa/0x20
[ 1231.559634]        [<ffffffffffffffff>] 0xffffffffffffffff
[ 1231.559658] 
[ 1231.559659] -> #0 (events){+.+.+.}:
[ 1231.559666]        [<ffffffff8025db36>] __lock_acquire+0x97c/0xc22
[ 1231.559675]        [<ffffffff8025ded2>] lock_acquire+0xf6/0x11a
[ 1231.559683]        [<ffffffff8024a012>] cleanup_workqueue_thread+0x45/0xf8
[ 1231.559691]        [<ffffffff8057c347>] workqueue_cpu_callback+0xc2/0x105
[ 1231.559703]        [<ffffffff8058e6a1>] notifier_call_chain+0x5e/0x92
[ 1231.559712]        [<ffffffff80251c78>] raw_notifier_call_chain+0xf/0x11
[ 1231.559722]        [<ffffffff8057a9db>] _cpu_down+0x278/0x295
[ 1231.559730]        [<ffffffff8023a8ea>] disable_nonboot_cpus+0x78/0x11d
[ 1231.559740]        [<ffffffff80267993>] suspend_devices_and_enter+0xd6/0x1b1
[ 1231.559749]        [<ffffffff80267bfa>] enter_state+0x163/0x1c9
[ 1231.559757]        [<ffffffff80267d17>] state_store+0xb7/0xd8
[ 1231.559765]        [<ffffffff803bd78b>] kobj_attr_store+0x17/0x19
[ 1231.559773]        [<ffffffff803137b3>] sysfs_write_file+0xe4/0x119
[ 1231.559781]        [<ffffffff802c237a>] vfs_write+0xab/0x105
[ 1231.559788]        [<ffffffff802c2498>] sys_write+0x47/0x70
[ 1231.559796]        [<ffffffff8020b26b>] system_call_fastpath+0x16/0x1b
[ 1231.559804]        [<ffffffffffffffff>] 0xffffffffffffffff
[ 1231.559813] 
[ 1231.559814] other info that might help us debug this:
[ 1231.559817] 
[ 1231.559822] 4 locks held by pm-suspend/4741:
[ 1231.559825]  #0:  (&buffer->mutex){+.+.+.}, at: [<ffffffff80313707>] sysfs_write_file+0x38/0x119
[ 1231.559838]  #1:  (pm_mutex){+.+.+.}, at: [<ffffffff80267c57>] enter_state+0x1c0/0x1c9
[ 1231.559851]  #2:  (dpm_list_mtx){+.+.+.}, at: [<ffffffff8045f533>] device_pm_lock+0x12/0x14
[ 1231.559864]  #3:  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff8023a8a5>] disable_nonboot_cpus+0x33/0x11d
[ 1231.559878] 
[ 1231.559880] stack backtrace:
[ 1231.559886] Pid: 4741, comm: pm-suspend Not tainted 2.6.30-rc4-git1 #5
[ 1231.559892] Call Trace:
[ 1231.559895]  [<ffffffff8025ce44>] print_circular_bug_tail+0x71/0x7c
[ 1231.559895]  [<ffffffff8025db36>] __lock_acquire+0x97c/0xc22
[ 1231.559895]  [<ffffffff8025ded2>] lock_acquire+0xf6/0x11a
[ 1231.559895]  [<ffffffff80249feb>] ? cleanup_workqueue_thread+0x1e/0xf8
[ 1231.559895]  [<ffffffff8024a012>] cleanup_workqueue_thread+0x45/0xf8
[ 1231.559895]  [<ffffffff80249feb>] ? cleanup_workqueue_thread+0x1e/0xf8
[ 1231.559895]  [<ffffffff8057c347>] workqueue_cpu_callback+0xc2/0x105
[ 1231.559895]  [<ffffffff8058e6a1>] notifier_call_chain+0x5e/0x92
[ 1231.559895]  [<ffffffff80251c78>] raw_notifier_call_chain+0xf/0x11
[ 1231.559895]  [<ffffffff8057a9db>] _cpu_down+0x278/0x295
[ 1231.559895]  [<ffffffff8023a8ea>] disable_nonboot_cpus+0x78/0x11d
[ 1231.559895]  [<ffffffff80267993>] suspend_devices_and_enter+0xd6/0x1b1
[ 1231.559895]  [<ffffffff80267bfa>] enter_state+0x163/0x1c9
[ 1231.559895]  [<ffffffff80267d17>] state_store+0xb7/0xd8
[ 1231.559895]  [<ffffffff803bd78b>] kobj_attr_store+0x17/0x19
[ 1231.559895]  [<ffffffff803137b3>] sysfs_write_file+0xe4/0x119
[ 1231.559895]  [<ffffffff802c237a>] vfs_write+0xab/0x105
[ 1231.559895]  [<ffffffff802c2498>] sys_write+0x47/0x70
[ 1231.559895]  [<ffffffff8020b26b>] system_call_fastpath+0x16/0x1b
Comment 2 Miles Lane 2009-05-17 06:48:44 UTC
Yes, I am still seeing this with 2.6.30-rc6-git1.
Comment 3 Rafael J. Wysocki 2009-05-17 10:17:12 UTC
On Sunday 17 May 2009, Ming Lei wrote:
> I also can reproduce it on 2.6.30-rc5-next-20090515.
> 
> 2009/5/17 Rafael J. Wysocki <rjw@sisk.pl>:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.29.  Please verify if it still should be listed and let me know
> > (either way).
> >
> >
> > Bug-Entry       : http://bugzilla.kernel.org/show_bug.cgi?id=13245
> > Subject         : possible circular locking dependency detected
> > Submitter       : Miles Lane <miles.lane@gmail.com>
> > Date            : 2009-05-04 16:56 (13 days old)
Comment 4 Rafael J. Wysocki 2009-05-17 10:31:05 UTC
*** Bug 13296 has been marked as a duplicate of this bug. ***
Comment 5 Tomas Janousek 2009-05-17 12:39:28 UTC
Is it possible that this causes my system to hang after "Suspending consoles" ?
Comment 6 Johannes Berg 2009-05-17 12:59:28 UTC
(In reply to comment #5)
> Is it possible that this causes my system to hang after "Suspending consoles"
> ?

I'm not quite sure but I don't think so -- this might be a false positive.
Comment 7 Johannes Berg 2009-05-22 10:49:12 UTC
I have provided a shortened deadlock scenario here:
http://thread.gmane.org/gmane.linux.kernel/835569/focus=838839

Oleg and me independently arrived at the conclusion that the commits

	PM: Change hibernation code ordering
	4aecd6718939eb3c4145b248369b65f7483a8a02

	PM: Change suspend code ordering
	900af0d973856d6feb6fc088c2d0d3fde57707d3

are the ones causing this problem (for hibernate/suspend respectively).
Comment 8 Rafael J. Wysocki 2009-05-23 23:21:30 UTC
Created attachment 21511 [details]
PM: Do not hold dpm_list_mtx while disabling/enabling nonboot CPUs

The attached patch should fix the problem.  Miles please test it.
Comment 9 Rafael J. Wysocki 2009-05-24 21:45:26 UTC
Handled-By : Rafael J. Wysocki <rjw@sisk.pl>
Patch : http://patchwork.kernel.org/patch/25557/
Comment 10 Rafael J. Wysocki 2009-05-30 21:36:44 UTC
Fixed by commit 32bdfac5462d777f35b00838893c4f87baf23efe .