Bug 42918

Summary: fcoe: Enabling VN2VN mode triggers a circular locking complaint
Product: IO/Storage
Reporter: Bart Van Assche (bvanassche)
Component: SCSI
Assignee: linux-scsi (linux-scsi)
Status: CLOSED CODE_FIX
Severity: normal
CC: alan, florian, robert.w.love
Priority: P1
Hardware: All
OS: Linux
URL: http://comments.gmane.org/gmane.linux.scsi.open-fcoe.devel/11451?set_lines=100000
Kernel Version: 3.3.0-rc7
Subsystem:
Regression: No
Bisected commit-id:

Description Bart Van Assche 2012-03-13 12:16:56 UTC
Kernel version: 3.3.0-rc7

How to reproduce:

# modprobe fcoe
# echo eth0 >/sys/module/libfcoe/parameters/create_vn2vn

Result:

# dmesg
device eth0 entered promiscuous mode
scsi3 : FCoE Driver
host3: libfc: Link up on port (000000)

======================================================
[ INFO: possible circular locking dependency detected ]
3.3.0-rc7-scst-debug+ #1 Not tainted
-------------------------------------------------------
kworker/2:0/14 is trying to acquire lock:
 (rtnl_mutex){+.+.+.}, at: [<c13a10c4>] rtnl_lock+0x14/0x20

but task is already holding lock:
 (&fip->ctlr_mutex){+.+...}, at: [<f89713e7>] fcoe_ctlr_timer_work+0x3e7/0xb60 [libfcoe]

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (&fip->ctlr_mutex){+.+...}:
       [<c1091f70>] lock_acquire+0x80/0x1b0
       [<c147655d>] mutex_lock_nested+0x6d/0x340
       [<f8970c32>] fcoe_ctlr_link_up+0x22/0x180 [libfcoe]
       [<f894620e>] fcoe_create+0x47e/0x6e0 [fcoe]
       [<f8973dd3>] fcoe_transport_create+0x143/0x250 [libfcoe]
       [<c10527e0>] param_attr_store+0x30/0x60
       [<c1052696>] module_attr_store+0x26/0x40
       [<c11a201e>] sysfs_write_file+0xae/0x100
       [<c11449df>] vfs_write+0x8f/0x160
       [<c1144cbd>] sys_write+0x3d/0x70
       [<c147a0c4>] syscall_call+0x7/0xb

-> #0 (rtnl_mutex){+.+.+.}:
       [<c109164b>] __lock_acquire+0x140b/0x1720
       [<c1091f70>] lock_acquire+0x80/0x1b0
       [<c147655d>] mutex_lock_nested+0x6d/0x340
       [<c13a10c4>] rtnl_lock+0x14/0x20
       [<f89445ac>] fcoe_update_src_mac+0x2c/0xb0 [fcoe]
       [<f8971712>] fcoe_ctlr_timer_work+0x712/0xb60 [libfcoe]
       [<c104fb69>] process_one_work+0x179/0x5d0
       [<c10502f1>] worker_thread+0x121/0x2d0
       [<c10550ed>] kthread+0x7d/0x90
       [<c1481a82>] kernel_thread_helper+0x6/0x10

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&fip->ctlr_mutex);
                               lock(rtnl_mutex);
                               lock(&fip->ctlr_mutex);
  lock(rtnl_mutex);

 *** DEADLOCK ***

3 locks held by kworker/2:0/14:
 #0:  (events){.+.+.+}, at: [<c104faf5>] process_one_work+0x105/0x5d0
 #1:  ((&fip->timer_work)){+.+...}, at: [<c104faf5>] process_one_work+0x105/0x5d0
 #2:  (&fip->ctlr_mutex){+.+...}, at: [<f89713e7>] fcoe_ctlr_timer_work+0x3e7/0xb60 [libfcoe]

stack backtrace:
Pid: 14, comm: kworker/2:0 Not tainted 3.3.0-rc7 #1
Call Trace:
 [<c14714a6>] ? printk+0x1d/0x1f
 [<c1471d4f>] print_circular_bug+0x1b4/0x1be
 [<c109164b>] __lock_acquire+0x140b/0x1720
 [<c1091f70>] lock_acquire+0x80/0x1b0
 [<c13a10c4>] ? rtnl_lock+0x14/0x20
 [<c147655d>] mutex_lock_nested+0x6d/0x340
 [<c13a10c4>] ? rtnl_lock+0x14/0x20
 [<c13a10c4>] ? rtnl_lock+0x14/0x20
 [<f89713e7>] ? fcoe_ctlr_timer_work+0x3e7/0xb60 [libfcoe]
 [<c13a10c4>] rtnl_lock+0x14/0x20
 [<f89445ac>] fcoe_update_src_mac+0x2c/0xb0 [fcoe]
 [<f8971712>] fcoe_ctlr_timer_work+0x712/0xb60 [libfcoe]
 [<c106a2c5>] ? local_clock+0x65/0x70
 [<c104faf5>] ? process_one_work+0x105/0x5d0
 [<c10928e4>] ? trace_hardirqs_on_caller+0xf4/0x180
 [<c104fb69>] process_one_work+0x179/0x5d0
 [<c104faf5>] ? process_one_work+0x105/0x5d0
 [<f8971000>] ? fcoe_ctlr_vn_send_claim+0x40/0x40 [libfcoe]
 [<c10502f1>] worker_thread+0x121/0x2d0
 [<c10501d0>] ? rescuer_thread+0x1d0/0x1d0
 [<c10550ed>] kthread+0x7d/0x90
 [<c1055070>] ? __init_kthread_worker+0x60/0x60
 [<c1481a82>] kernel_thread_helper+0x6/0x10
host3: Assigned Port ID 0092b5
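
The report above boils down to two lock-order edges: chain #1 takes &fip->ctlr_mutex while rtnl_mutex is held (the sysfs write path), and chain #0 takes rtnl_mutex while &fip->ctlr_mutex is held (the timer work). A minimal sketch of how such an inversion can be detected follows. It is a toy model written for illustration, not the kernel's lockdep implementation; the class and method names (LockOrderGraph, acquire) are hypothetical.

```python
from collections import defaultdict


class LockOrderGraph:
    """Toy lock-order tracker: records 'held -> newly acquired' edges."""

    def __init__(self):
        self.edges = defaultdict(set)

    def _reachable(self, src, dst):
        # Iterative depth-first search over the recorded edges.
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.edges[node])
        return False

    def acquire(self, held, new):
        """Record taking `new` while holding `held`.

        Returns True if `held` was already reachable from `new`, i.e. the
        new edge closes a cycle in the lock-order graph.
        """
        closes_cycle = self._reachable(new, held)
        self.edges[held].add(new)
        return closes_cycle


graph = LockOrderGraph()

# Chain #1: fcoe_create path, rtnl_mutex held, then ctlr_mutex taken.
first_edge_cycle = graph.acquire("rtnl_mutex", "fip->ctlr_mutex")

# Chain #0: fcoe_ctlr_timer_work path, ctlr_mutex held, then rtnl_mutex taken.
second_edge_cycle = graph.acquire("fip->ctlr_mutex", "rtnl_mutex")

print(first_edge_cycle, second_edge_cycle)
```

The second call is the one that corresponds to the "possible circular locking dependency" message: neither path deadlocks on its own, but the two orders together admit the CPU0/CPU1 interleaving printed in the log.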

Comment 1 Robert Love 2012-03-15 01:21:48 UTC
Fix was posted to linux-scsi here: http://www.spinics.net/lists/linux-scsi/msg58027.html

Note that I NACKed my own patch, but others have since commented that the patch is correct. Please ignore the NACK; the patch is good.

Can the submitter please verify that this fixes their issue?

Comment 2 Florian Mickler 2012-04-04 14:56:51 UTC
A patch referencing this bug report has been merged in Linux v3.4-rc1:

commit 2280512342ead9a2858b1490b21e5bcaf4f4cfc7
Author: Robert Love <robert.w.love@intel.com>
Date:   Tue Mar 13 18:22:12 2012 -0700

    [SCSI] fcoe: Drop the rtnl_mutex before calling fcoe_ctlr_link_up

Comment 3 Bart Van Assche 2012-04-04 15:21:49 UTC
Please leave this bug report open since the patch referenced in comment #1 hasn't been merged yet.

Comment 4 Florian Mickler 2012-07-01 09:48:29 UTC
A patch referencing this bug report has been merged in Linux v3.5-rc1:

commit 949e71f17d9a5c59fa7b02cce3b548384bff1c92
Author: Robert Love <robert.w.love@intel.com>
Date:   Fri Apr 20 12:16:43 2012 -0700

    [SCSI] fcoe: Don't hold rtnl_mutex in fcoe_update_src_mac
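
Going only by the commit subjects quoted in this report, both changes avoid holding rtnl_mutex and &fip->ctlr_mutex in a nested fashion. The sketch below illustrates the general idea behind the first subject ("Drop the rtnl_mutex before calling fcoe_ctlr_link_up"): releasing the outer lock before calling into code that takes the inner lock means no outer-to-inner ordering edge is ever created. This is a single-threaded user-space illustration, not the actual patch; tracked(), link_up() and the two create_* functions are hypothetical stand-ins.

```python
import threading
from contextlib import contextmanager

held_stack = []          # names of locks currently held (single thread only)
observed_edges = set()   # (outer, inner) pairs seen so far


@contextmanager
def tracked(name, lock):
    """Acquire `lock`, recording an edge from every lock already held."""
    for outer in held_stack:
        observed_edges.add((outer, name))
    with lock:
        held_stack.append(name)
        try:
            yield
        finally:
            held_stack.pop()


rtnl = threading.Lock()   # stand-in for rtnl_mutex
ctlr = threading.Lock()   # stand-in for fip->ctlr_mutex


def link_up():
    # Stand-in for a function that takes the controller mutex internally.
    with tracked("ctlr_mutex", ctlr):
        pass


def create_holding_rtnl():
    # Nested pattern: ctlr_mutex is taken while rtnl_mutex is still held.
    with tracked("rtnl_mutex", rtnl):
        link_up()


def create_dropping_rtnl_first():
    # Pattern named in the commit subject: finish the rtnl-protected work,
    # release the lock, and only then call link_up().
    with tracked("rtnl_mutex", rtnl):
        pass
    link_up()


create_dropping_rtnl_first()
edges_after_fix_pattern = set(observed_edges)

create_holding_rtnl()
edges_with_nesting = set(observed_edges)

print(edges_after_fix_pattern, edges_with_nesting)
```

Only the nested variant records an rtnl_mutex to ctlr_mutex edge. Without that edge, a path that takes the two locks in the opposite order no longer forms a cycle.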