Bug 8054
| Summary: | tipc_ref_discard tipc_deleteport locking dependency | | |
|---|---|---|---|
| Product: | Networking | Reporter: | Eric Sesterhenn (snakebyte) |
| Component: | Other | Assignee: | Arnaldo Carvalho de Melo (acme) |
| Status: | REJECTED DOCUMENTED | | |
| Severity: | normal | CC: | allan.stephens, jon.maloy, protasnb |
| Priority: | P2 | | |
| Hardware: | i386 | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.21-rc1 | Subsystem: | |
| Regression: | --- | Bisected commit-id: | |
Description
Eric Sesterhenn 2007-02-22 04:19:59 UTC
Reply-To: akpm@linux-foundation.org

> On Thu, 22 Feb 2007 04:20:10 -0800 bugme-daemon@bugzilla.kernel.org wrote:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=8054
>
>            Summary: tipc_ref_discard tipc_deleteport locking dependency
>     Kernel Version: 2.6.21-rc1
>             Status: NEW
>           Severity: normal
>              Owner: acme@conectiva.com.br
>          Submitter: snakebyte@gmx.de
>
> Most recent kernel where this bug did *NOT* occur: - also seen this on 2.6.20
> Distribution: Gentoo
> Hardware Environment: AMD-K6
> Software Environment:
> Problem Description:
>
> [ 273.837978] =======================================================
> [ 273.838209] [ INFO: possible circular locking dependency detected ]
> [ 273.838324] 2.6.21-rc1 #5
> [ 273.838427] -------------------------------------------------------
> [ 273.838535] sfuzz/3950 is trying to acquire lock:
> [ 273.838642]  (ref_table_lock){-+..}, at: [<c0531389>] tipc_ref_discard+0x29/0xe0
> [ 273.839116]
> [ 273.839121] but task is already holding lock:
> [ 273.839297]  (&table[i].lock){-+..}, at: [<c052fae0>] tipc_deleteport+0x40/0x180
> [ 273.839728]
> [ 273.839733] which lock already depends on the new lock.
> [ 273.839741]
> [ 273.839998]
> [ 273.840003] the existing dependency chain (in reverse order) is:
> [ 273.840182]
> [ 273.840187] -> #1 (&table[i].lock){-+..}:
> [ 273.840581]        [<c0138994>] __lock_acquire+0xeb4/0x1020
> [ 273.841352]        [<c0138b69>] lock_acquire+0x69/0xa0
> [ 273.842041]        [<c0537f80>] _spin_lock_bh+0x40/0x60
> [ 273.842801]        [<c05314ab>] tipc_ref_acquire+0x6b/0xe0
> [ 273.844131]        [<c052ef93>] tipc_createport_raw+0x33/0x260
> [ 273.844823]        [<c0530121>] tipc_createport+0x41/0x120
> [ 273.845528]        [<c0529b4c>] tipc_subscr_start+0xcc/0x120
> [ 273.846217]        [<c0521d04>] process_signal_queue+0x44/0x80
> [ 273.846926]        [<c011df38>] tasklet_action+0x38/0x80
> [ 273.847626]        [<c011e1db>] __do_softirq+0x5b/0xc0
> [ 273.848374]        [<c0105e48>] do_softirq+0x88/0xe0
> [ 273.849630]        [<ffffffff>] 0xffffffff
> [ 273.850315]
> [ 273.850319] -> #0 (ref_table_lock){-+..}:
> [ 273.850713]        [<c013878d>] __lock_acquire+0xcad/0x1020
> [ 273.851416]        [<c0138b69>] lock_acquire+0x69/0xa0
> [ 273.852127]        [<c0538040>] _write_lock_bh+0x40/0x60
> [ 273.852835]        [<c0531389>] tipc_ref_discard+0x29/0xe0
> [ 273.853552]        [<c052fafa>] tipc_deleteport+0x5a/0x180
> [ 273.854832]        [<c05317d8>] tipc_create+0x58/0x160
> [ 273.855530]        [<c04603d2>] __sock_create+0x112/0x280
> [ 273.856237]        [<c046057a>] sock_create+0x1a/0x20
> [ 273.856931]        [<c04605a3>] sys_socketpair+0x23/0x1a0
> [ 273.857642]        [<c0461257>] sys_socketcall+0x137/0x260
> [ 273.858343]        [<c0102cb0>] syscall_call+0x7/0xb
> [ 273.859052]        [<ffffffff>] 0xffffffff
> [ 273.860333]
> [ 273.860338] other info that might help us debug this:
> [ 273.860345]
> [ 273.860596] 1 lock held by sfuzz/3950:
> [ 273.860697]  #0:  (&table[i].lock){-+..}, at: [<c052fae0>] tipc_deleteport+0x40/0x180
> [ 273.861208]
> [ 273.861213] stack backtrace:
> [ 273.861391]  [<c01042ba>] show_trace_log_lvl+0x1a/0x40
> [ 273.861591]  [<c0104a92>] show_trace+0x12/0x20
> [ 273.861779]  [<c0104b99>] dump_stack+0x19/0x20
> [ 273.861968]  [<c013686e>] print_circular_bug_tail+0x6e/0x80
> [ 273.862162]  [<c013878d>] __lock_acquire+0xcad/0x1020
> [ 273.862354]  [<c0138b69>] lock_acquire+0x69/0xa0
> [ 273.862544]  [<c0538040>] _write_lock_bh+0x40/0x60
> [ 273.862740]  [<c0531389>] tipc_ref_discard+0x29/0xe0
> [ 273.862958]  [<c052fafa>] tipc_deleteport+0x5a/0x180
> [ 273.863153]  [<c05317d8>] tipc_create+0x58/0x160
> [ 273.863347]  [<c04603d2>] __sock_create+0x112/0x280
> [ 273.863541]  [<c046057a>] sock_create+0x1a/0x20
> [ 273.863731]  [<c04605a3>] sys_socketpair+0x23/0x1a0
> [ 273.863922]  [<c0461257>] sys_socketcall+0x137/0x260
> [ 273.864113]  [<c0102cb0>] syscall_call+0x7/0xb
> [ 273.864299] =======================
>
> Steps to reproduce:
>
> enable tipc, recompile, reboot, run sfuzz
> (http://www.digitaldwarf.be/products/sfuzz.c) for some time
>
> ------- You are receiving this mail because: -------
> You are on the CC list for the bug, or are watching someone who is.

Was this tested with a recent kernel (although I haven't seen any new TIPC patches)? Thanks.

In the mail exchange after the issue was raised I explained why this is not a bug. Jarek Poplawski suggested a cosmetic work-around to avoid lockdep triggering on this, which looked workable. I must say this got a low priority, though, both because I was not completely happy with the suggestion and because this is not about fixing a real bug, but rather about pleasing a less-than-perfect external tool. We are right now doing a major overhaul of the TIPC code, and we will look into this issue as part of that work.

Just to keep this up to date: still seeing this with 2.6.24-rc7-g3ce54450 with the TIPC test suite from http://tipc.sourceforge.net/

I also see this lockdep warning. Just wondering, how far along is the rewrite?
[ 2538.994106] =======================================================
[ 2538.994279] [ INFO: possible circular locking dependency detected ]
[ 2538.994350] 2.6.26-rc3-00308-g75d3bce #30
[ 2538.994350] -------------------------------------------------------
[ 2538.994350] tipcTC/7360 is trying to acquire lock:
[ 2538.994350]  (&topsrv.lock){-+..}, at: [<c06abf3c>] subscr_named_msg_event+0xfc/0x180
[ 2538.994350]
[ 2538.994350] but task is already holding lock:
[ 2538.994350]  (&entry->lock){-+..}, at: [<c06b215c>] tipc_ref_lock+0x3c/0x70
[ 2538.994350]
[ 2538.994350] which lock already depends on the new lock.
[ 2538.994350]
[ 2538.994350]
[ 2538.994350] the existing dependency chain (in reverse order) is:
[ 2538.994350]
[ 2538.994350] -> #1 (&entry->lock){-+..}:
[ 2538.994350]        [<c01478a8>] __lock_acquire+0xc78/0x1120
[ 2538.994350]        [<c0147dd0>] lock_acquire+0x80/0xa0
[ 2538.994350]        [<c06c2cee>] _spin_lock_bh+0x3e/0x70
[ 2538.994350]        [<c06b215c>] tipc_ref_lock+0x3c/0x70
[ 2538.994350]        [<c06b05d1>] tipc_createport_raw+0x41/0x1e0
[ 2538.994350]        [<c06b125d>] tipc_createport+0x3d/0xd0
[ 2538.994350]        [<c06ab9c9>] tipc_subscr_start+0xb9/0x110
[ 2538.994350]        [<c06a4414>] process_signal_queue+0x44/0x80
[ 2538.994350]        [<c012a163>] tasklet_action+0x43/0x90
[ 2538.994350]        [<c012a422>] __do_softirq+0x62/0xc0
[ 2538.994350]        [<c010675a>] do_softirq+0x8a/0xd0
[ 2538.994350]        [<ffffffff>] 0xffffffff
[ 2538.994350]
[ 2538.994350] -> #0 (&topsrv.lock){-+..}:
[ 2538.994350]        [<c01476b0>] __lock_acquire+0xa80/0x1120
[ 2538.994350]        [<c0147dd0>] lock_acquire+0x80/0xa0
[ 2538.994350]        [<c06c2cee>] _spin_lock_bh+0x3e/0x70
[ 2538.994350]        [<c06abf3c>] subscr_named_msg_event+0xfc/0x180
[ 2538.994350]        [<c06b1944>] port_dispatcher_sigh+0x174/0x530
[ 2538.994350]        [<c06a4414>] process_signal_queue+0x44/0x80
[ 2538.994350]        [<c012a163>] tasklet_action+0x43/0x90
[ 2538.994350]        [<c012a422>] __do_softirq+0x62/0xc0
[ 2538.994350]        [<c010675a>] do_softirq+0x8a/0xd0
[ 2538.994350]        [<ffffffff>] 0xffffffff
[ 2538.994350]
[ 2538.994350] other info that might help us debug this:
[ 2538.994350]
[ 2538.994350] 2 locks held by tipcTC/7360:
[ 2538.994350]  #0:  (sk_lock-AF_TIPC){--..}, at: [<c06b3bb4>] connect+0x54/0x220
[ 2538.994350]  #1:  (&entry->lock){-+..}, at: [<c06b215c>] tipc_ref_lock+0x3c/0x70
[ 2538.994350]
[ 2538.994350] stack backtrace:
[ 2538.994350] Pid: 7360, comm: tipcTC Not tainted 2.6.26-rc3-00308-g75d3bce #30
[ 2538.994350]  [<c01452cf>] print_circular_bug_tail+0x6f/0x80
[ 2538.994350]  [<c01476b0>] __lock_acquire+0xa80/0x1120
[ 2538.994350]  [<c01441ad>] ? put_lock_stats+0xd/0x30
[ 2538.994350]  [<c01465db>] ? mark_held_locks+0x4b/0x80
[ 2538.994350]  [<c0147dd0>] lock_acquire+0x80/0xa0
[ 2538.994350]  [<c06abf3c>] ? subscr_named_msg_event+0xfc/0x180
[ 2538.994350]  [<c06c2cee>] _spin_lock_bh+0x3e/0x70
[ 2538.994350]  [<c06abf3c>] ? subscr_named_msg_event+0xfc/0x180
[ 2538.994350]  [<c06abf3c>] subscr_named_msg_event+0xfc/0x180
[ 2538.994350]  [<c06ab8f0>] ? subscr_conn_shutdown_event+0x0/0x20
[ 2538.994350]  [<c06abd80>] ? subscr_conn_msg_event+0x0/0x40
[ 2538.994350]  [<c06b1944>] port_dispatcher_sigh+0x174/0x530
[ 2538.994350]  [<c0354001>] ? txLockAlloc+0x1/0x120
[ 2538.994350]  [<c06abe40>] ? subscr_named_msg_event+0x0/0x180
[ 2538.994350]  [<c014672a>] ? trace_hardirqs_on+0x7a/0x140
[ 2538.994350]  [<c0354005>] ? txLockAlloc+0x5/0x120
[ 2538.994350]  [<c06a4414>] process_signal_queue+0x44/0x80
[ 2538.994350]  [<c012a163>] tasklet_action+0x43/0x90
[ 2538.994350]  [<c012a422>] __do_softirq+0x62/0xc0
[ 2538.994350]  [<c010675a>] do_softirq+0x8a/0xd0
[ 2538.994350]  [<c06afa85>] ? tipc_port_recv_sections+0x1f5/0x290
[ 2538.994350]  [<c012a833>] local_bh_enable_ip+0xa3/0xf0
[ 2538.994350]  [<c06c2b2f>] _spin_unlock_bh+0x2f/0x40
[ 2538.994350]  [<c06afa85>] tipc_port_recv_sections+0x1f5/0x290
[ 2538.994350]  [<c06aff02>] tipc_forward2name+0xe2/0x1a0
[ 2538.994350]  [<c06afff5>] tipc_send2name+0x35/0x40
[ 2538.994350]  [<c0354005>] ? txLockAlloc+0x5/0x120
[ 2538.994350]  [<c06b28b0>] send_msg+0x140/0x2c0
[ 2538.994350]  [<c06c357d>] ? _read_unlock_irq+0x2d/0x50
[ 2538.994350]  [<c06b3c31>] connect+0xd1/0x220
[ 2538.994350]  [<c0432166>] ? copy_from_user+0x46/0x80
[ 2538.994350]  [<c058f2ad>] sys_connect+0x5d/0x90
[ 2538.994350]  [<c010974f>] ? native_sched_clock+0x7f/0xb0
[ 2538.994350]  [<c01441ad>] ? put_lock_stats+0xd/0x30
[ 2538.994350]  [<c0590aab>] sys_socketcall+0x21b/0x260
[ 2538.994350]  [<c0103e67>] ? restore_nocheck+0x12/0x15
[ 2538.994350]  [<c0117fc0>] ? do_page_fault+0x0/0x640
[ 2538.994350]  [<c0103e67>] ? restore_nocheck+0x12/0x15
[ 2538.994350]  [<c0103d7d>] sysenter_past_esp+0x6a/0xb1
[ 2538.994350] =======================

This latest problem report appears to be yet another false alarm generated by the lock dependency checking code in the Linux kernel. It also seems to confirm that this issue cannot be resolved in TIPC without changes being made to the lock dependency system as well.

The underlying problem is the tool's assumption that a spinlock, once created, is always used for a single, constant purpose. Consequently, if the lock dependency code sees a "take spinlock A, take spinlock B" sequence and later a "take B, take A" sequence, it reports a problem that needs fixing. Unfortunately, TIPC's reference table violates this assumption: it hands out its spinlocks to govern access to arbitrary objects that are registered with the table, and then reclaims each spinlock for later re-use when the associated object is deleted. Because TIPC registers a variety of objects, each acquiring spinlocks in whatever sequence its own logic requires, the lock dependency code keeps generating false alarms that TIPC cannot silence using the current spinlock API.
What really seems to be missing here is a way for TIPC to tell the lock dependency tool, when a spinlock is first initialized, that it should not track dependencies for that lock. However, I'm not sure what the proper procedure is for requesting such a change, or whether there are alternative approaches that could address the spinlock re-use issue.