Bug 14999

Summary: possible circular locking dependency detected in rfkill at suspend
Product: Drivers Reporter: Christian Casteyde (casteyde.christian)
Component: network-wirelessAssignee: drivers_network-wireless (drivers_network-wireless)
Status: CLOSED CODE_FIX    
Severity: normal CC: Larry.Finger, linville, rjw
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.33-rc3 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 14885    
Attachments: git bisect log

Description Christian Casteyde 2010-01-06 21:52:58 UTC
Kernel 2.6.33-rc3 64bits
Acer Aspire 1511LMi
Bluewhite 64 13 (64bits port of Slackware 13)

While suspending to ram, I got the following warning:

...
PM: Entering mem sleep
Suspending console(s) (use no_console_suspend to debug)

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.33-rc3 #2
-------------------------------------------------------
hibernate.sh/3424 is trying to acquire lock:
 ((&(&rfkill->poll_work)->work)){+.+...}, at: [<ffffffff81049944>] __cancel_work_timer+0x74
/0x210

but task is already holding lock:
 (s_active){++++.+}, at: [<ffffffff8110ae82>] sysfs_get_active_two+0x22/0x60

which lock already depends on the new lock.

I didn't experienced this warning with previous versions.

the existing dependency chain (in reverse order) is:

-> #2 (s_active){++++.+}:
       [<ffffffff81060f9c>] __lock_acquire+0x165c/0x1d10
       [<ffffffff810616ab>] lock_acquire+0x5b/0x80
       [<ffffffff8110ab41>] sysfs_addrm_finish+0xf1/0x180
       [<ffffffff81108bf6>] sysfs_hash_and_remove+0x56/0x80
       [<ffffffff8110be4f>] sysfs_remove_group+0x4f/0xe0
       [<ffffffff812d53f4>] dpm_sysfs_remove+0x14/0x20
       [<ffffffff812ceb25>] device_del+0x45/0x1a0
       [<ffffffff812cec91>] device_unregister+0x11/0x20
       [<ffffffff812d9539>] _request_firmware+0x2f9/0x4e0
       [<ffffffff812d97ae>] request_firmware+0xe/0x10
       [<ffffffff8132f618>] b43_do_request_fw+0x98/0x230
       [<ffffffff81330ba7>] b43_try_request_fw+0x97/0x360
       [<ffffffff81332b7a>] b43_wireless_core_init+0x1ca/0x10e0
       [<ffffffff81333c40>] b43_op_start+0x1b0/0x1d0
       [<ffffffff81565809>] ieee80211_open+0x3d9/0x6d0
       [<ffffffff814558aa>] dev_open+0xca/0x110
       [<ffffffff81454aed>] dev_change_flags+0x9d/0x1e0
       [<ffffffff814b3879>] devinet_ioctl+0x8e9/0x8f0
       [<ffffffff814b4c14>] inet_ioctl+0x94/0xc0
       [<ffffffff8144010d>] sock_do_ioctl+0x2d/0x60
       [<ffffffff814401b6>] sock_ioctl+0x76/0x280
       [<ffffffff810bff15>] vfs_ioctl+0x35/0xd0
       [<ffffffff810c00ea>] do_vfs_ioctl+0x8a/0x5c0
       [<ffffffff810c066a>] sys_ioctl+0x4a/0x80
       [<ffffffff81002468>] system_call_fastpath+0x16/0x1b

-> #1 (&wl->mutex){+.+.+.}:
       [<ffffffff81060f9c>] __lock_acquire+0x165c/0x1d10
       [<ffffffff810616ab>] lock_acquire+0x5b/0x80
       [<ffffffff8158851b>] mutex_lock_nested+0x5b/0x380
       [<ffffffff8134a244>] b43_rfkill_poll+0x44/0x130
       [<ffffffff815675c2>] ieee80211_rfkill_poll+0x22/0x40
       [<ffffffff8153e204>] cfg80211_rfkill_poll+0x14/0x20
       [<ffffffff81574333>] rfkill_poll+0x23/0x40
       [<ffffffff81048fd5>] worker_thread+0x1e5/0x340
       [<ffffffff8104d21e>] kthread+0x9e/0xb0
       [<ffffffff81003074>] kernel_thread_helper+0x4/0x10

-> #0 ((&(&rfkill->poll_work)->work)){+.+...}:
       [<ffffffff810615ac>] __lock_acquire+0x1c6c/0x1d10
       [<ffffffff810616ab>] lock_acquire+0x5b/0x80
       [<ffffffff81049976>] __cancel_work_timer+0xa6/0x210
       [<ffffffff81049aed>] cancel_delayed_work_sync+0xd/0x10
       [<ffffffff8157422f>] rfkill_pause_polling+0x1f/0x30
       [<ffffffff8157424d>] rfkill_suspend+0xd/0x20
       [<ffffffff812d5b43>] legacy_suspend+0x33/0x80
       [<ffffffff812d6070>] dpm_suspend_start+0x460/0x4f0
       [<ffffffff8106da5f>] suspend_devices_and_enter+0x5f/0x200
       [<ffffffff8106dd29>] enter_state+0x129/0x150
       [<ffffffff8106d33a>] state_store+0xba/0x100
       [<ffffffff8123f057>] kobj_attr_store+0x17/0x20
       [<ffffffff811094c4>] sysfs_write_file+0xd4/0x150
       [<ffffffff810b15e6>] vfs_write+0xc6/0x150
       [<ffffffff810b1760>] sys_write+0x50/0x90
       [<ffffffff81002468>] system_call_fastpath+0x16/0x1b

other info that might help us debug this:

4 locks held by hibernate.sh/3424:
 #0:  (&buffer->mutex){+.+.+.}, at: [<ffffffff81109434>] sysfs_write_file+0x44/0x150
 #1:  (s_active){++++.+}, at: [<ffffffff8110ae9d>] sysfs_get_active_two+0x3d/0x60
 #2:  (s_active){++++.+}, at: [<ffffffff8110ae82>] sysfs_get_active_two+0x22/0x60
 #3:  (pm_mutex){+.+.+.}, at: [<ffffffff8106dc49>] enter_state+0x49/0x150

stack backtrace:
Pid: 3424, comm: hibernate.sh Not tainted 2.6.33-rc3 #2
Call Trace:
 [<ffffffff8105f2d0>] print_circular_bug+0xf0/0x100
 [<ffffffff81049944>] ? __cancel_work_timer+0x74/0x210
 [<ffffffff810615ac>] __lock_acquire+0x1c6c/0x1d10
 [<ffffffff8105e7bf>] ? mark_held_locks+0x6f/0x90
 [<ffffffff810616ab>] lock_acquire+0x5b/0x80
 [<ffffffff81049944>] ? __cancel_work_timer+0x74/0x210
 [<ffffffff81049976>] __cancel_work_timer+0xa6/0x210
 [<ffffffff81049944>] ? __cancel_work_timer+0x74/0x210
 [<ffffffff8105e7bf>] ? mark_held_locks+0x6f/0x90
 [<ffffffff815887b5>] ? mutex_lock_nested+0x2f5/0x380
 [<ffffffff81574240>] ? rfkill_suspend+0x0/0x20
 [<ffffffff81049aed>] cancel_delayed_work_sync+0xd/0x10
 [<ffffffff8157422f>] rfkill_pause_polling+0x1f/0x30
 [<ffffffff8157424d>] rfkill_suspend+0xd/0x20
 [<ffffffff812d5b43>] legacy_suspend+0x33/0x80
 [<ffffffff812d6070>] dpm_suspend_start+0x460/0x4f0
 [<ffffffff8106da5f>] suspend_devices_and_enter+0x5f/0x200
 [<ffffffff8106dd29>] enter_state+0x129/0x150
 [<ffffffff8106d33a>] state_store+0xba/0x100
 [<ffffffff8123f057>] kobj_attr_store+0x17/0x20
 [<ffffffff811094c4>] sysfs_write_file+0xd4/0x150
 [<ffffffff810b15e6>] vfs_write+0xc6/0x150
 [<ffffffff810b1760>] sys_write+0x50/0x90
 [<ffffffff81002468>] system_call_fastpath+0x16/0x1b
parport_pc 00:0b: disabled
b43-pci-bridge 0000:02:08.0: PCI INT A disabled
...
PM: suspend of devices complete after 567.248 msecs
PM: late suspend of devices complete after 39.686 msecs
ACPI: Preparing to enter system sleep state S3
Back to C!
ACPI: Waking up from system sleep state S3
...

The hibernation script called stops the network with:
/etc/rc.d/rc.inet1 stop

before going to sleep.
Comment 1 Christian Casteyde 2010-01-13 19:37:37 UTC
Update: still present in -rc4.
Comment 2 Rafael J. Wysocki 2010-01-26 12:36:47 UTC
On Tuesday 26 January 2010, Christian Casteyde wrote:
> Yes, it's still present in 33-rc5.
Comment 3 Larry Finger 2010-01-26 17:11:45 UTC
When I suspend to RAM or disk with my b43-wireless system, I get a different locking problem. I don't know if the two are related.

=============================================
[ INFO: possible recursive locking detected ]
2.6.33-rc5-Linus-dirty #173                  
---------------------------------------------
sh/3488 is trying to acquire lock:           
 (s_active){++++.+}, at: [<ffffffff81167413>] sysfs_addrm_finish+0x43/0x70

but task is already holding lock:
 (s_active){++++.+}, at: [<ffffffff8116771d>] sysfs_get_active_two+0x3d/0x60

other info that might help us debug this:
4 locks held by sh/3488:                 
 #0:  (&buffer->mutex){+.+.+.}, at: [<ffffffff81165b7f>] sysfs_write_file+0x3f/0x160
 #1:  (s_active){++++.+}, at: [<ffffffff8116771d>] sysfs_get_active_two+0x3d/0x60   
 #2:  (s_active){++++.+}, at: [<ffffffff81167702>] sysfs_get_active_two+0x22/0x60   
 #3:  (dbs_mutex){+.+.+.}, at: [<ffffffff81271517>] cpufreq_governor_dbs+0xe7/0x480 

stack backtrace:
Pid: 3488, comm: sh Not tainted 2.6.33-rc5-Linus-dirty #173
Call Trace:
 [<ffffffff8107c36b>] __lock_acquire+0xf6b/0x1d30
 [<ffffffff81078e9f>] ? lockdep_init_map+0x5f/0x5d0
 [<ffffffff8107d1cb>] lock_acquire+0x9b/0x120
 [<ffffffff81167413>] ? sysfs_addrm_finish+0x43/0x70
 [<ffffffff81166ba3>] sysfs_deactivate+0xc3/0x110
 [<ffffffff81167413>] ? sysfs_addrm_finish+0x43/0x70
 [<ffffffff81167413>] sysfs_addrm_finish+0x43/0x70
 [<ffffffff81165206>] sysfs_hash_and_remove+0x56/0x80
 [<ffffffff8116895f>] sysfs_remove_group+0x4f/0xf0
 [<ffffffff8127152b>] cpufreq_governor_dbs+0xfb/0x480
 [<ffffffff8107a8dd>] ? trace_hardirqs_on_caller+0x14d/0x190
 [<ffffffff8107a92d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff8126e314>] __cpufreq_governor+0x94/0x160
 [<ffffffff8126f84f>] __cpufreq_set_policy+0x11f/0x180
 [<ffffffff8126fc66>] store_scaling_governor+0xc6/0x200
 [<ffffffff81270530>] ? handle_update+0x0/0x10
 [<ffffffff8126f702>] store+0x62/0x90
 [<ffffffff81165c21>] sysfs_write_file+0xe1/0x160
 [<ffffffff8110b0c8>] vfs_write+0xb8/0x180
 [<ffffffff8110b26c>] sys_write+0x4c/0x80
 [<ffffffff81002dab>] system_call_fastpath+0x16/0x1b
Comment 4 Rafael J. Wysocki 2010-01-26 18:06:34 UTC
This is a result of adding some new debug code to sysfs, AFAICS.  It has been reported for several times already.
Comment 5 Christian Casteyde 2010-01-30 15:04:25 UTC
I don't know if Larry's trace is related to the same problem, but indeed mine seems to be due to the new debug code in sysfs.
I bisected it between -rc3 and -rc4 and got this first bad commit:

commit 846f99749ab68bbc7f75c74fec305de675b1a1bf
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Sat Jan 2 13:37:12 2010 -0800

    sysfs: Add lockdep annotations for the sysfs active reference

    Holding locks over device_del -> kobject_del -> sysfs_deactivate can
    cause deadlocks if those same locks are grabbed in sysfs show or store
    methods.

    The I model s_active count + completion as a sleeping read/write lock.
    I describe to lockdep sysfs_get_active as a read_trylock,
    sysfs_put_active as a read_unlock, and sysfs_deactivate as a
    write_lock and write_unlock pair.  This seems to capture the essence
    for purposes of finding deadlocks, and in my testing gives finds real
    issues and ignores non-issues.

    This brings us back to holding locks over kobject_del is a problem
    that ideally we should find a way of addressing, but at least lockdep
    can tell us about the problems instead of requiring developers to debug
    rare strange system deadlocks, that happen when sysfs files are removed
    while being written to.

    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

I append the bisect log.

Maybe this should not be considered as a regression, since showing a potential bug doesn't introduce it: it was latent in previous versions and only visible now. Moreover, this is not blocking since I do not run kernel with debug options normally.
Comment 6 Christian Casteyde 2010-01-30 15:05:22 UTC
Created attachment 24792 [details]
git bisect log
Comment 7 Rafael J. Wysocki 2010-02-01 21:29:38 UTC
On Monday 01 February 2010, Christian Casteyde wrote:
> Yes, still applies to 2.6.32-rc6.
Comment 8 Rafael J. Wysocki 2010-02-08 21:07:24 UTC
On Monday 08 February 2010, Christian Casteyde wrote:
> Still present in 2.6.33-rc7.
> 
> Bonjour,
> Le dimanche 7 février 2010, vous avez écrit :
> > This message has been generated automatically as a part of a summary report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.32.  Please verify if it still should be listed and let the
> > tracking team know (either way).
> >
> >
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=14999
> > Subject             : possible circular locking dependency detected in
> rfkill at
> > suspend Submitter   : Christian Casteyde <casteyde.christian@free.fr>
> > Date                : 2010-01-06 21:52 (33 days old)
Comment 9 Rafael J. Wysocki 2010-02-15 20:44:50 UTC
On Monday 15 February 2010, Eric W. Biederman wrote:
> "Rafael J. Wysocki" <rjw@sisk.pl> writes:
> 
> > This message has been generated automatically as a part of a summary report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.32.  Please verify if it still should be listed and let the
> tracking team
> > know (either way).
> >
> >
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=14999
> > Subject             : possible circular locking dependency detected in
> rfkill at suspend
> > Submitter   : Christian Casteyde <casteyde.christian@free.fr>
> > Date                : 2010-01-06 21:52 (40 days old)
> 
> I have a set of patches to clear up the sysfs lockdep false positives.  
> http://lkml.org/lkml/2010/2/11/329
> 
> It requires sprinking a bunch of sysfs_attr_init on the few dynamically
> allocated sysfs attributes, and I haven't found them all yet.  Otherwise
> it seems to be working well.
> 
> Greg should pick up those patches shortly and they should be appearing
> in linux-next and then ultimately in 2.6.34.
Comment 10 Rafael J. Wysocki 2010-02-27 19:09:49 UTC
On Saturday 27 February 2010, Christian Casteyde wrote:
> Still present in 2.6.33.
Comment 11 John W. Linville 2010-03-03 16:53:37 UTC
Christian, have you tested linux-next?  Or could you test 2.6.34-rc1 when it is available?
Comment 12 Christian Casteyde 2010-03-10 18:24:26 UTC
I've tested 2.6.34-rc1, but I cannot say anything because the wireless adapter doesn't work at all with this kernel. Please see http://bugzilla.kernel.org/show_bug.cgi?id=15505 for more details.

I'll recheck this bug once #15505 is fixed, since I obviously need the driver to be started to check suspend/release correctly.
Comment 13 Christian Casteyde 2010-03-12 12:31:12 UTC
With 2.6.34-rc1 + the fix for #15505, I don't get the warning anymore.
I think it was fixed in -rc1 therefore.