Bug 8377 - "circular locking dependency detected" while doing cp
Summary: "circular locking dependency detected" while doing cp
Status: RESOLVED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: CIFS (show other bugs)
Hardware: i386 Linux
: P2 normal
Assignee: Steve French
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-04-27 07:02 UTC by udo
Modified: 2009-01-24 22:16 UTC (History)
5 users (show)

See Also:
Kernel Version: 2.6.20.1, 2.6.21.4, 2.6.21.5, 2.6.22.5
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
The logging in the messages file (16.27 KB, text/plain)
2007-04-27 07:03 UTC, udo
Details
Another occurrance (4.96 KB, text/plain)
2007-05-05 07:20 UTC, udo
Details
Again... (4.93 KB, text/plain)
2007-05-06 04:29 UTC, udo
Details
Latest .config with proc but without the cifs stuff (46.29 KB, application/octet-stream)
2007-08-23 11:55 UTC, udo
Details
dmesg with crash and cifs debug info (14.20 KB, text/plain)
2007-08-24 10:40 UTC, udo
Details
Latest dmesg with forcedirectio mount option (3.86 KB, application/x-gzip)
2007-09-05 08:56 UTC, udo
Details
Some relevant messages. (24.06 KB, application/octet-stream)
2007-09-05 09:18 UTC, udo
Details
with smb print (58.60 KB, text/plain)
2007-09-07 06:51 UTC, udo
Details
messages with smb print (24.05 KB, text/plain)
2007-09-07 06:53 UTC, udo
Details
messages with bug and some stuff disabled (17.55 KB, text/plain)
2007-09-24 07:56 UTC, udo
Details
dmesg with bug and some stuff disabled (45.37 KB, text/plain)
2007-09-24 07:56 UTC, udo
Details
patch to test recv race (1.06 KB, patch)
2007-09-29 10:51 UTC, Cyrill Gorcunov
Details | Diff
patch to test recv race (1.05 KB, patch)
2007-09-29 10:57 UTC, Cyrill Gorcunov
Details | Diff
dmesg 2.6.22.8 with patch (68.77 KB, text/plain)
2007-09-30 00:25 UTC, udo
Details
messages 2.6.22.8 with patch (17.84 KB, text/plain)
2007-09-30 00:25 UTC, udo
Details

Description udo 2007-04-27 07:02:07 UTC
Most recent kernel where this bug did *NOT* occur: all previous ones, first ever
occurrance ever is this one
Distribution: fedora core 6
Hardware Environment: via epia EK8000
Software Environment: ?
Problem Description: "possible circular locking dependency detected" in messages
log. see attachment.

Steps to reproduce: Start, run for a while (50 days over here) and do a cp over
samba. (!?)
Comment 1 udo 2007-04-27 07:03:25 UTC
Created attachment 11289 [details]
The logging in the messages file
Comment 2 Stefan Richter 2007-04-27 09:16:59 UTC
I don't know either of the TCP code and the filesystem code, but I get the
impression that this may be false alarm from the lockdep debugging code.
Comment 3 udo 2007-04-28 06:19:01 UTC
This is on a VIA Epia EK8000 which was up 50 days and some minutes when it
happened. The machine has 512 megs of RAM and runs as a firewall/webserver/etc.
I can post more info if deemed relevant.
Comment 4 Stefan Richter 2007-04-30 16:04:23 UTC
If this bug ticket gets stale, I suggest you repost on LKML with a good subject
line, the right excerpts from dmesg etc, and backlink to the bugzilla entry. 
This appears to be a meta-bug (networking vs. filesystem) or may just be false
alarm from lockdep.  Therefore Cc the primary author of lockdep, Ingo Molnar
<mingo@elte.hu>.

(Sorry, perhaps it wasn't such a good idea of mine to point you to bugzilla in
the first place.)
Comment 5 udo 2007-05-05 07:20:02 UTC
Created attachment 11395 [details]
Another occurrance

It happened another time
Comment 6 udo 2007-05-05 07:20:49 UTC
It just happened again while I copied a file to the same windoze box.
Comment 7 udo 2007-05-05 10:49:45 UTC
Also 2.6.21.1 does have an issue: when I try a similar cop the system hangs.
What now!?
Comment 8 Stefan Richter 2007-05-05 11:10:04 UTC
What hangs, the whole kernel and all?  If so, try the test with a text console
instead of X11 in front.  Perhaps you get the kernel panic log when the hang
happens.  If you got a camera around, take a photograph of the panic log and
attach it here.
Comment 9 udo 2007-05-05 11:20:46 UTC
What hangs:
System was headless.
Did not respond to ping anymore.
I do not run X on the firewall box.

I did find a start of the locking thing in the logs:

May  5 19:42:38 epia kernel:
May  5 19:42:38 epia kernel: =======================================================
May  5 19:42:38 epia kernel: [ INFO: possible circular locking dependency detected ]
May  5 19:42:38 epia kernel: 2.6.21.1 #4
May  5 19:42:38 epia kernel: -------------------------------------------------------
May  5 19:42:38 epia kernel: cp/5433 is trying to acquire lock:
May  5 19:42:38 epia kernel:  (sk_lock-AF_INET){--..}, at: [tcp_sendmsg+22/2523]
tcp_sendmsg+0x16/0x9db
May  5 19:42:38 epia kernel:  (sk_lock-AF_INET){--..}, at: [<c02c056f>]
tcp_sendmsg+0x16/0x9db
May  5 19:42:38 epia kernel:
May  5 19:42:38 epia kernel: but task is already holding lock:
May  5 19:42:38 epia kernel:  (&inode->i_alloc_sem){--..}, at:
[notify_change+232/720] notify_change+0xe8/0x2d0
May  5 19:42:38 epia kernel:  (&inode->i_alloc_sem){--..}, at: [<c0169684>]
notify_change+0xe8/0x2d0
May  5 19:42:38 epia kernel:
May  5 19:42:38 epia kernel: which lock already depends on the new lock.
May  5 19:42:38 epia kernel:
May  5 19:42:38 epia kernel:
May  5 19:42:38 epia kernel: the existing dependency chain (in reverse order) is:
May  5 19:42:38 epia kernel:
May  5 19:42:38 epia kernel: -> #3 (&inode->i_alloc_sem){--..}:
May  5 19:42:38 epia kernel:        [__lock_acquire+2529/2949]
__lock_acquire+0x9e1/0xb85
May  5 19:42:38 epia kernel:        [<c012d926>] __lock_acquire+0x9e1/0xb85
May  5 19:42:38 epia kernel:        [lock_acquire+104/130] lock_acquire+0x68/0x82
May  5 19:42:38 epia kernel:        [<c012db32>] lock_acquire+0x68/0x82
May  5 19:42:38 epia kernel:        [down_write+58/83] down_write+0x3a/0x53
May  5 19:42:38 epia kernel:        [<c0126d87>] down_write+0x3a/0x53
May  5 19:42:38 epia kernel:        [notify_change+232/720] notify_change+0xe8/0x2d0
May  5 19:42:38 epia kernel:        [<c0169684>] notify_change+0xe8/0x2d0
May  5 19:42:38 epia kernel:        [do_truncate+83/108] do_truncate+0x53/0x6c
May  5 19:42:38 epia kernel:        [<c0156662>] do_truncate+0x53/0x6c
May  5 19:42:38 epia kernel:        [may_open+442/514] may_open+0x1ba/0x202
May  5 19:42:38 epia kernel:        [<c015eff0>] may_open+0x1ba/0x202
May  5 19:42:38 epia kernel:        [open_namei+639/1455] open_namei+0x27f/0x5af
May  5 19:42:38 epia kernel:        [<c0160f5a>] open_namei+0x27f/0x5af
May  5 19:42:38 epia kernel:        [do_filp_open+38/59] do_filp_open+0x26/0x3b
May  5 19:42:38 epia kernel:        [<c01563c7>] do_filp_open+0x26/0x3b
May  5 19:42:38 epia kernel:        [do_sys_open+67/194] do_sys_open+0x43/0xc2
May  5 19:44:58 epia syslogd 1.4.1: restart (remote reception).
May  5 19:44:58 epia kernel: klogd 1.4.1, log source = /proc/kmsg started.

Comment 10 udo 2007-05-06 04:29:14 UTC
Created attachment 11406 [details]
Again...

Just happened again:
[root@epia linux]# cp .config /mnt/samba/download/linux/config.epia.2.6.21.1
[root@epia linux]# cp tarpit.patch /mnt/samba/download/linux/
<booommm!!>

(tarpit.patch is a textfile of allmost 1.5 Kbyte)
Comment 11 udo 2007-06-21 20:16:17 UTC
Also happens in 2.6.21.4 and 2.6.21.5.
Comment 12 udo 2007-06-21 20:17:21 UTC
=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.21.4 #8
-------------------------------------------------------
cp/3088 is trying to acquire lock:
 (sk_lock-AF_INET){--..}, at: [<c02c5c99>] tcp_sendmsg+0x16/0x9cc

but task is already holding lock:
 (&inode->i_alloc_sem){--..}, at: [<c0167e08>] notify_change+0xe8/0x2d0

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #3 (&inode->i_alloc_sem){--..}:
       [<c012c0c8>] __lock_acquire+0x9e1/0xb85
       [<c012c2d4>] lock_acquire+0x68/0x82
       [<c0126503>] down_write+0x3a/0x53
       [<c0167e08>] notify_change+0xe8/0x2d0
       [<c0154dee>] do_truncate+0x53/0x6c
       [<c015d77c>] may_open+0x1ba/0x202
       [<c015f6e6>] open_namei+0x27f/0x5af
       [<c0154b53>] do_filp_open+0x26/0x3b
       [<c0154bab>] do_sys_open+0x43/0xc2
       [<c0154c62>] sys_open+0x1c/0x1e
       [<c01026a6>] sysenter_past_esp+0x5f/0x99
       [<ffffffff>] 0xffffffff

-> #2 (&sysfs_inode_imutex_key){--..}:
       [<c012c0c8>] __lock_acquire+0x9e1/0xb85
       [<c012c2d4>] lock_acquire+0x68/0x82
       [<c02f7c0a>] __mutex_lock_slowpath+0xdc/0x266
       [<c02f7db0>] mutex_lock+0x1c/0x1f
       [<c018c41d>] create_dir+0x20/0x196
       [<c018c5dd>] sysfs_create_dir+0x4a/0x64
       [<c01d2bd9>] kobject_shadow_add+0xd6/0x189
       [<c01d2c96>] kobject_add+0xa/0xc
       [<c0223ee5>] device_add+0xae/0x62b
       [<c02aaed3>] netdev_register_sysfs+0x5a/0x5f
       [<c02a102a>] register_netdevice+0x22c/0x2e4
       [<c02a2175>] register_netdev+0x40/0x4d
       [<c022c124>] rhine_init_one+0x492/0x64f
       [<c01e1ffd>] pci_device_probe+0x39/0x5b
       [<c0225e0b>] really_probe+0xbd/0x145
       [<c0225f28>] driver_probe_device+0x95/0xa1
       [<c022603d>] __driver_attach+0x6a/0xa1
       [<c0225401>] bus_for_each_dev+0x36/0x5b
       [<c0225c7f>] driver_attach+0x19/0x1b
       [<c02256e8>] bus_add_driver+0x6a/0x170
       [<c022624f>] driver_register+0x79/0x7e
       [<c01e2184>] __pci_register_driver+0x7b/0xa8
       [<c040a0de>] rhine_init+0x5d/0x5f
       [<c03f271c>] init+0x95/0x17a
       [<c01038e7>] kernel_thread_helper+0x7/0x10
       [<ffffffff>] 0xffffffff

-> #1 (rtnl_mutex){--..}:
       [<c012c0c8>] __lock_acquire+0x9e1/0xb85
       [<c012c2d4>] lock_acquire+0x68/0x82
       [<c02f7c0a>] __mutex_lock_slowpath+0xdc/0x266
       [<c02f7db0>] mutex_lock+0x1c/0x1f
       [<c02a8913>] rtnl_lock+0xd/0xf
       [<c02e0e9d>] ip_mc_join_group+0x2c/0xc9
       [<c02c2573>] ip_setsockopt+0x64b/0x9a7
       [<c02d7d10>] udp_setsockopt+0x43/0x4a
       [<c029a01e>] sock_common_setsockopt+0x1e/0x24
       [<c02986e7>] sys_setsockopt+0x7b/0x97
       [<c0299d39>] sys_socketcall+0x1e8/0x241
       [<c01026a6>] sysenter_past_esp+0x5f/0x99
       [<ffffffff>] 0xffffffff

-> #0 (sk_lock-AF_INET){--..}:
       [<c012bfa9>] __lock_acquire+0x8c2/0xb85
       [<c012c2d4>] lock_acquire+0x68/0x82
       [<c029a747>] lock_sock_nested+0xba/0xc7
       [<c02c5c99>] tcp_sendmsg+0x16/0x9cc
       [<c02dd764>] inet_sendmsg+0x3e/0x49
       [<c0298b77>] sock_sendmsg+0xe7/0x104
       [<c0299dba>] kernel_sendmsg+0x28/0x37
       [<ded22eba>] smb_send+0x92/0x11c [cifs]
       [<ded230c3>] SendReceive+0x17f/0x3dc [cifs]
       [<ded12817>] CIFSSMBSetEOF+0x1e6/0x229 [cifs]
       [<ded1e4dd>] cifs_setattr+0x25d/0x900 [cifs]
       [<c0167e50>] notify_change+0x130/0x2d0
       [<c0154dee>] do_truncate+0x53/0x6c
       [<c015d77c>] may_open+0x1ba/0x202
       [<c015f6e6>] open_namei+0x27f/0x5af
       [<c0154b53>] do_filp_open+0x26/0x3b
       [<c0154bab>] do_sys_open+0x43/0xc2
       [<c0154c62>] sys_open+0x1c/0x1e
       [<c01026a6>] sysenter_past_esp+0x5f/0x99
       [<ffffffff>] 0xffffffff

other info that might help us debug this:

2 locks held by cp/3088:
 #0:  (&inode->i_mutex){--..}, at: [<c02f7db0>] mutex_lock+0x1c/0x1f
 #1:  (&inode->i_alloc_sem){--..}, at: [<c0167e08>] notify_change+0xe8/0x2d0

stack backtrace:
 [<c0103c43>] show_trace_log_lvl+0x1a/0x2f
 [<c01042bc>] show_trace+0x12/0x14
 [<c0104359>] dump_stack+0x16/0x18
 [<c012a7e1>] print_circular_bug_tail+0x5f/0x68
 [<c012bfa9>] __lock_acquire+0x8c2/0xb85
 [<c012c2d4>] lock_acquire+0x68/0x82
 [<c029a747>] lock_sock_nested+0xba/0xc7
 [<c02c5c99>] tcp_sendmsg+0x16/0x9cc
 [<c02dd764>] inet_sendmsg+0x3e/0x49
 [<c0298b77>] sock_sendmsg+0xe7/0x104
 [<c0299dba>] kernel_sendmsg+0x28/0x37
 [<ded22eba>] smb_send+0x92/0x11c [cifs]
 [<ded230c3>] SendReceive+0x17f/0x3dc [cifs]
 [<ded12817>] CIFSSMBSetEOF+0x1e6/0x229 [cifs]
 [<ded1e4dd>] cifs_setattr+0x25d/0x900 [cifs]
 [<c0167e50>] notify_change+0x130/0x2d0
 [<c0154dee>] do_truncate+0x53/0x6c
 [<c015d77c>] may_open+0x1ba/0x202
 [<c015f6e6>] open_namei+0x27f/0x5af
 [<c0154b53>] do_filp_open+0x26/0x3b
 [<c0154bab>] do_sys_open+0x43/0xc2
 [<c0154c62>] sys_open+0x1c/0x1e
 [<c01026a6>] sysenter_past_esp+0x5f/0x99
 =======================
pwc: Too many ISOC errors, bailing out.
pwc: Too many ISOC errors, bailing out.
pwc: Too many ISOC errors, bailing out.
pwc: Too many ISOC errors, bailing out.
pwc: Too many ISOC errors, bailing out.
pwc: Too many ISOC errors, bailing out.
Clocksource tsc unstable (delta = 18746609409 ns)
Time: acpi_pm clocksource has been installed.
pwc: Too many ISOC errors, bailing out.
pwc: Too many ISOC errors, bailing out.
pwc: Too many ISOC errors, bailing out.
pwc: Too many ISOC errors, bailing out.
pwc: Too many ISOC errors, bailing out.
pwc: Too many ISOC errors, bailing out.
 CIFS VFS: No response for cmd 50 mid 5023
Comment 13 udo 2007-06-21 20:17:42 UTC
=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.21.5 #10
-------------------------------------------------------
cp/3480 is trying to acquire lock:
 (sk_lock-AF_INET){--..}, at: [<c02c5d45>] tcp_sendmsg+0x16/0x9cc

but task is already holding lock:
 (&inode->i_alloc_sem){--..}, at: [<c0167e28>] notify_change+0xe8/0x2d0

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #3 (&inode->i_alloc_sem){--..}:
       [<c012c0e8>] __lock_acquire+0x9e1/0xb85
       [<c012c2f4>] lock_acquire+0x68/0x82
       [<c0126503>] down_write+0x3a/0x53
       [<c0167e28>] notify_change+0xe8/0x2d0
       [<c0154e0e>] do_truncate+0x53/0x6c
       [<c015d79c>] may_open+0x1ba/0x202
       [<c015f706>] open_namei+0x27f/0x5af
       [<c0154b73>] do_filp_open+0x26/0x3b
       [<c0154bcb>] do_sys_open+0x43/0xc2
       [<c0154c82>] sys_open+0x1c/0x1e
       [<c01026a6>] sysenter_past_esp+0x5f/0x99
       [<ffffffff>] 0xffffffff

-> #2 (&sysfs_inode_imutex_key){--..}:
       [<c012c0e8>] __lock_acquire+0x9e1/0xb85
       [<c012c2f4>] lock_acquire+0x68/0x82
       [<c02f7d70>] __mutex_lock_slowpath+0xdc/0x266
       [<c02f7f16>] mutex_lock+0x1c/0x1f
       [<c018c43d>] create_dir+0x20/0x196
       [<c018c5fd>] sysfs_create_dir+0x4a/0x64
       [<c01d2bf9>] kobject_shadow_add+0xd6/0x189
       [<c01d2cb6>] kobject_add+0xa/0xc
       [<c0223f21>] device_add+0xae/0x62b
       [<c02aaf6f>] netdev_register_sysfs+0x5a/0x5f
       [<c02a10ba>] register_netdevice+0x22c/0x2e4
       [<c02a220c>] register_netdev+0x40/0x4d
       [<c022c160>] rhine_init_one+0x492/0x64f
       [<c01e201d>] pci_device_probe+0x39/0x5b
       [<c0225e47>] really_probe+0xbd/0x145
       [<c0225f64>] driver_probe_device+0x95/0xa1
       [<c0226079>] __driver_attach+0x6a/0xa1
       [<c022543d>] bus_for_each_dev+0x36/0x5b
       [<c0225cbb>] driver_attach+0x19/0x1b
       [<c0225724>] bus_add_driver+0x6a/0x170
       [<c022628b>] driver_register+0x79/0x7e
       [<c01e21a4>] __pci_register_driver+0x7b/0xa8
       [<c040a0cf>] rhine_init+0x5d/0x5f
       [<c03f271c>] init+0x95/0x17a
       [<c01038e7>] kernel_thread_helper+0x7/0x10
       [<ffffffff>] 0xffffffff

-> #1 (rtnl_mutex){--..}:
       [<c012c0e8>] __lock_acquire+0x9e1/0xb85
       [<c012c2f4>] lock_acquire+0x68/0x82
       [<c02f7d70>] __mutex_lock_slowpath+0xdc/0x266
       [<c02f7f16>] mutex_lock+0x1c/0x1f
       [<c02a89af>] rtnl_lock+0xd/0xf
       [<c02e0f5d>] ip_mc_join_group+0x2c/0xc9
       [<c02c261f>] ip_setsockopt+0x64b/0x9a7
       [<c02d7db8>] udp_setsockopt+0x43/0x4a
       [<c029a042>] sock_common_setsockopt+0x1e/0x24
       [<c029870b>] sys_setsockopt+0x7b/0x97
       [<c0299d5d>] sys_socketcall+0x1e8/0x241
       [<c01026a6>] sysenter_past_esp+0x5f/0x99
       [<ffffffff>] 0xffffffff

-> #0 (sk_lock-AF_INET){--..}:
       [<c012bfc9>] __lock_acquire+0x8c2/0xb85
       [<c012c2f4>] lock_acquire+0x68/0x82
       [<c029a76b>] lock_sock_nested+0xba/0xc7
       [<c02c5d45>] tcp_sendmsg+0x16/0x9cc
       [<c02dd824>] inet_sendmsg+0x3e/0x49
       [<c0298b9b>] sock_sendmsg+0xe7/0x104
       [<c0299dde>] kernel_sendmsg+0x28/0x37
       [<dec97eba>] smb_send+0x92/0x11c [cifs]
       [<dec980c3>] SendReceive+0x17f/0x3dc [cifs]
       [<dec87817>] CIFSSMBSetEOF+0x1e6/0x229 [cifs]
       [<dec934dd>] cifs_setattr+0x25d/0x900 [cifs]
       [<c0167e70>] notify_change+0x130/0x2d0
       [<c0154e0e>] do_truncate+0x53/0x6c
       [<c015d79c>] may_open+0x1ba/0x202
       [<c015f706>] open_namei+0x27f/0x5af
       [<c0154b73>] do_filp_open+0x26/0x3b
       [<c0154bcb>] do_sys_open+0x43/0xc2
       [<c0154c82>] sys_open+0x1c/0x1e
       [<c01026a6>] sysenter_past_esp+0x5f/0x99
       [<ffffffff>] 0xffffffff

other info that might help us debug this:

2 locks held by cp/3480:
 #0:  (&inode->i_mutex){--..}, at: [<c02f7f16>] mutex_lock+0x1c/0x1f
 #1:  (&inode->i_alloc_sem){--..}, at: [<c0167e28>] notify_change+0xe8/0x2d0

stack backtrace:
 [<c0103c43>] show_trace_log_lvl+0x1a/0x2f
 [<c01042bc>] show_trace+0x12/0x14
 [<c0104359>] dump_stack+0x16/0x18
 [<c012a801>] print_circular_bug_tail+0x5f/0x68
 [<c012bfc9>] __lock_acquire+0x8c2/0xb85
 [<c012c2f4>] lock_acquire+0x68/0x82
 [<c029a76b>] lock_sock_nested+0xba/0xc7
 [<c02c5d45>] tcp_sendmsg+0x16/0x9cc
 [<c02dd824>] inet_sendmsg+0x3e/0x49
 [<c0298b9b>] sock_sendmsg+0xe7/0x104
 [<c0299dde>] kernel_sendmsg+0x28/0x37
 [<dec97eba>] smb_send+0x92/0x11c [cifs]
 [<dec980c3>] SendReceive+0x17f/0x3dc [cifs]
 [<dec87817>] CIFSSMBSetEOF+0x1e6/0x229 [cifs]
 [<dec934dd>] cifs_setattr+0x25d/0x900 [cifs]
 [<c0167e70>] notify_change+0x130/0x2d0
 [<c0154e0e>] do_truncate+0x53/0x6c
 [<c015d79c>] may_open+0x1ba/0x202
 [<c015f706>] open_namei+0x27f/0x5af
 [<c0154b73>] do_filp_open+0x26/0x3b
 [<c0154bcb>] do_sys_open+0x43/0xc2
 [<c0154c82>] sys_open+0x1c/0x1e
 [<c01026a6>] sysenter_past_esp+0x5f/0x99
 =======================
pwc: Too many ISOC errors, bailing out.
pwc: Too many ISOC errors, bailing out.
pwc: Too many ISOC errors, bailing out.
pwc: Too many ISOC errors, bailing out.
pwc: Too many ISOC errors, bailing out.
pwc: Too many ISOC errors, bailing out.
Comment 14 udo 2007-06-23 08:04:13 UTC
While nzbperl was writing to windows share via cifs:

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.21.5 #10
-------------------------------------------------------
uudeview/23403 is trying to acquire lock:
 (sk_lock-AF_INET){--..}, at: [<c02c5d45>] tcp_sendmsg+0x16/0x9cc

but task is already holding lock:
 (&inode->i_alloc_sem){--..}, at: [<c0167e28>] notify_change+0xe8/0x2d0

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #3 (&inode->i_alloc_sem){--..}:
       [<c012c0e8>] __lock_acquire+0x9e1/0xb85
       [<c012c2f4>] lock_acquire+0x68/0x82
       [<c0126503>] down_write+0x3a/0x53
       [<c0167e28>] notify_change+0xe8/0x2d0
       [<c0154e0e>] do_truncate+0x53/0x6c
       [<c015d79c>] may_open+0x1ba/0x202
       [<c015f706>] open_namei+0x27f/0x5af
       [<c0154b73>] do_filp_open+0x26/0x3b
       [<c0154bcb>] do_sys_open+0x43/0xc2
       [<c0154c82>] sys_open+0x1c/0x1e
       [<c01026a6>] sysenter_past_esp+0x5f/0x99
       [<ffffffff>] 0xffffffff

-> #2 (&sysfs_inode_imutex_key){--..}:
       [<c012c0e8>] __lock_acquire+0x9e1/0xb85
       [<c012c2f4>] lock_acquire+0x68/0x82
       [<c02f7d70>] __mutex_lock_slowpath+0xdc/0x266
       [<c02f7f16>] mutex_lock+0x1c/0x1f
       [<c018c43d>] create_dir+0x20/0x196
       [<c018c5fd>] sysfs_create_dir+0x4a/0x64
       [<c01d2bf9>] kobject_shadow_add+0xd6/0x189
       [<c01d2cb6>] kobject_add+0xa/0xc
       [<c0223f21>] device_add+0xae/0x62b
       [<c02aaf6f>] netdev_register_sysfs+0x5a/0x5f
       [<c02a10ba>] register_netdevice+0x22c/0x2e4
       [<c02a220c>] register_netdev+0x40/0x4d
       [<c022c160>] rhine_init_one+0x492/0x64f
       [<c01e201d>] pci_device_probe+0x39/0x5b
       [<c0225e47>] really_probe+0xbd/0x145
       [<c0225f64>] driver_probe_device+0x95/0xa1
       [<c0226079>] __driver_attach+0x6a/0xa1
       [<c022543d>] bus_for_each_dev+0x36/0x5b
       [<c0225cbb>] driver_attach+0x19/0x1b
       [<c0225724>] bus_add_driver+0x6a/0x170
       [<c022628b>] driver_register+0x79/0x7e
       [<c01e21a4>] __pci_register_driver+0x7b/0xa8
       [<c040a0cf>] rhine_init+0x5d/0x5f
       [<c03f271c>] init+0x95/0x17a
       [<c01038e7>] kernel_thread_helper+0x7/0x10
       [<ffffffff>] 0xffffffff

-> #1 (rtnl_mutex){--..}:
       [<c012c0e8>] __lock_acquire+0x9e1/0xb85
       [<c012c2f4>] lock_acquire+0x68/0x82
       [<c02f7d70>] __mutex_lock_slowpath+0xdc/0x266
       [<c02f7f16>] mutex_lock+0x1c/0x1f
       [<c02a89af>] rtnl_lock+0xd/0xf
       [<c02e0f5d>] ip_mc_join_group+0x2c/0xc9
       [<c02c261f>] ip_setsockopt+0x64b/0x9a7
       [<c02d7db8>] udp_setsockopt+0x43/0x4a
       [<c029a042>] sock_common_setsockopt+0x1e/0x24
       [<c029870b>] sys_setsockopt+0x7b/0x97
       [<c0299d5d>] sys_socketcall+0x1e8/0x241
       [<c01026a6>] sysenter_past_esp+0x5f/0x99
       [<ffffffff>] 0xffffffff

-> #0 (sk_lock-AF_INET){--..}:
       [<c012bfc9>] __lock_acquire+0x8c2/0xb85
       [<c012c2f4>] lock_acquire+0x68/0x82
       [<c029a76b>] lock_sock_nested+0xba/0xc7
       [<c02c5d45>] tcp_sendmsg+0x16/0x9cc
       [<c02dd824>] inet_sendmsg+0x3e/0x49
       [<c0298b9b>] sock_sendmsg+0xe7/0x104
       [<c0299dde>] kernel_sendmsg+0x28/0x37
       [<dec9feba>] smb_send+0x92/0x11c [cifs]
       [<deca00c3>] SendReceive+0x17f/0x3dc [cifs]
       [<dec8f817>] CIFSSMBSetEOF+0x1e6/0x229 [cifs]
       [<dec9b4dd>] cifs_setattr+0x25d/0x900 [cifs]
       [<c0167e70>] notify_change+0x130/0x2d0
       [<c0154e0e>] do_truncate+0x53/0x6c
       [<c015d79c>] may_open+0x1ba/0x202
       [<c015f706>] open_namei+0x27f/0x5af
       [<c0154b73>] do_filp_open+0x26/0x3b
       [<c0154bcb>] do_sys_open+0x43/0xc2
       [<c0154c82>] sys_open+0x1c/0x1e
       [<c01026a6>] sysenter_past_esp+0x5f/0x99
       [<ffffffff>] 0xffffffff

other info that might help us debug this:

2 locks held by uudeview/23403:
 #0:  (&inode->i_mutex){--..}, at: [<c02f7f16>] mutex_lock+0x1c/0x1f
 #1:  (&inode->i_alloc_sem){--..}, at: [<c0167e28>] notify_change+0xe8/0x2d0

stack backtrace:
 [<c0103c43>] show_trace_log_lvl+0x1a/0x2f
 [<c01042bc>] show_trace+0x12/0x14
 [<c0104359>] dump_stack+0x16/0x18
 [<c012a801>] print_circular_bug_tail+0x5f/0x68
 [<c012bfc9>] __lock_acquire+0x8c2/0xb85
 [<c012c2f4>] lock_acquire+0x68/0x82
 [<c029a76b>] lock_sock_nested+0xba/0xc7
 [<c02c5d45>] tcp_sendmsg+0x16/0x9cc
 [<c02dd824>] inet_sendmsg+0x3e/0x49
 [<c0298b9b>] sock_sendmsg+0xe7/0x104
 [<c0299dde>] kernel_sendmsg+0x28/0x37
 [<dec9feba>] smb_send+0x92/0x11c [cifs]
 [<deca00c3>] SendReceive+0x17f/0x3dc [cifs]
 [<dec8f817>] CIFSSMBSetEOF+0x1e6/0x229 [cifs]
 [<dec9b4dd>] cifs_setattr+0x25d/0x900 [cifs]
 [<c0167e70>] notify_change+0x130/0x2d0
 [<c0154e0e>] do_truncate+0x53/0x6c
 [<c015d79c>] may_open+0x1ba/0x202
 [<c015f706>] open_namei+0x27f/0x5af
 [<c0154b73>] do_filp_open+0x26/0x3b
 [<c0154bcb>] do_sys_open+0x43/0xc2
 [<c0154c82>] sys_open+0x1c/0x1e
 [<c01026a6>] sysenter_past_esp+0x5f/0x99
 =======================
pwc: Too many ISOC errors, bailing out.
pwc: Too many ISOC errors, bailing out.
pwc: Too many ISOC errors, bailing out.
pwc: Too many ISOC errors, bailing out.
pwc: Too many ISOC errors, bailing out.
pwc: Too many ISOC errors, bailing out.
Clocksource tsc unstable (delta = 18749374053 ns)
Time: acpi_pm clocksource has been installed.
pwc: Too many ISOC errors, bailing out.
pwc: Too many ISOC errors, bailing out.
pwc: Too many ISOC errors, bailing out.
pwc: Too many ISOC errors, bailing out.
pwc: Too many ISOC errors, bailing out.
pwc: Too many ISOC errors, bailing out.
Comment 15 Natalie Protasevich 2007-06-24 12:01:21 UTC
Have you tried turning off lockdep to verify #4 comment about possible problem with lockdep itself.
Also, since this started in 2.6.20.XX should be possible do bisect search for the patch that broke it.
Comment 16 Cyrill Gorcunov 2007-06-27 11:55:25 UTC
Is it possible to see debug messages from CIFS? Or even dmesg? Btw you may try to play with "max_interrupt_work" and "debug" param for via-rhine module. I mean you may try to set up max_interrupt_work=10 and debug=7 and attach dmesg to the bugzilla report.
Comment 17 Andrew Morton 2007-07-25 16:07:28 UTC
Let's cc Steve - this could be a cifs glitch.
Comment 18 udo 2007-07-29 01:16:22 UTC
What I posted is dmesg output.
How can I provide the CIFS debug messages?
I have the module compiled into the kernel as I use it all the time as part of a firewall. How does the via-rhine interact with cifs?

I did not yet try the lockdep check. Should be a good start.
Perhaps it started around 2.6.20 because I enabled various kernel debuggings.
Comment 19 udo 2007-07-29 02:56:53 UTC
Turned off:
Lock debugging: prove locking correctness  (PROVE_LOCKING)
Lock dependency engine debugging (DEBUG_LOCKDEP)
And compiled 2.6.21.6, rebooted.
Did a fair amount of:

# cp .config /mnt/samba/download/linux/config-2.6.21.6-ld.txt
cp: overwrite `/mnt/samba/download/linux/config-2.6.21.6-ld.txt'? y
#

Normally this would trigger the bug easily. So far no problem. 
Thanks for reminding me, Natalie!
Comment 20 Cyrill Gorcunov 2007-08-09 03:22:54 UTC
To enable additional CIFS debug message you have to get "Enable additional CIFS debugging routines" (CIFS_DEBUG2) turned on at moment of cifs module compilation. So just reconfig your kernel and rebuilt the module. After has it done you should type "echo 1 > /proc/fs/cifs/traceSMB" and you'll get a lot of debug messages in system log file.
Comment 21 udo 2007-08-09 10:46:02 UTC
Thanks.
Does the fact that the problem doesn't occur when the locking debugging is turned off say anything about the problem being in the CIFS area?
Or is it a combiantion?
So I should turn on the CIFS debugging and both PROVE_LOCKING and DEBUG_LOCKDEP?
Comment 22 Steve French 2007-08-09 11:23:43 UTC
Note that CONFIG_CIFS_DEBUG2 is not required in order for cifs to log debug messages although it increases the amount of information logged.

I typically do not enable "echo 1 > /proc/fs/cifs/traceSMB" to log all outgoing requests (and responses too) from the cifs client since I would instead run a network analyzer (e.g. wireshark) on the server machine to capture the network traffic instead.  I often do enable debug messages though ie
"echo 7 > /proc/fs/cifs/cifsFYI"

At first glance, I am puzzled what the circular lock problem is (are you sure that that has anything to do with your hang?).   I assume that this is a false alarm from lockdep.
Comment 23 Steve French 2007-08-09 11:38:50 UTC
my guess is that lockdep is confused between the different types of inodes.  The only case in which I would expect sendmsg to ever recurse back into the filesystem and grab a sem, would be when allocating memory in the network adapters send path and a flush of a dirty inode (writing back cached data) is required to be able to complete the alloc.  That does not appear to be the case here.
Comment 24 Cyrill Gorcunov 2007-08-10 13:15:40 UTC
To be sure about Steve's guess I think you may try to play with "echo 7 > /proc/fs/cifs/cifsFYI" and grab the logs. It would be usefull anyway
Comment 25 udo 2007-08-13 10:24:44 UTC
I enabled "Enable additional CIFS debugging routines" but the /proc/fs/cifs path is not there. Am I missing something? (2.6.22.2 in the mean time)

(I will be back next sunday.)
Comment 26 Steve French 2007-08-13 12:01:48 UTC
/proc won't be there if it is not enabled in the kernel (in the kernel configuration it is under filesystems->Pseudo-filesystems in menuconfig) but it is enabled in just about every kernel.   /proc/fs/cifs won't be there (even if /proc is) if the cifs module is not loaded (to fix that /sbin/insmod cifs.ko)
Comment 27 udo 2007-08-21 10:19:46 UTC
I do have /proc and stuff under there but not the cifs stuff.
Comment 28 Cyrill Gorcunov 2007-08-21 11:17:10 UTC
Btw, could you please post your .config there?
Comment 29 Cyrill Gorcunov 2007-08-23 09:03:13 UTC
any news? what about your .config to try to find out why there is no /proc?
Comment 30 udo 2007-08-23 11:55:27 UTC
Created attachment 12508 [details]
Latest .config with proc but without the cifs stuff
Comment 31 udo 2007-08-23 12:04:40 UTC
BTW: I use modules only for the stuff that is possibly not in use. All the rest is mostly compiled in.
Comment 32 udo 2007-08-24 10:40:23 UTC
Created attachment 12520 [details]
dmesg with crash and cifs debug info

I was able to get things working. 2.6.22.5 with lock debuging and the CIFS debugging on. (after a reboot stuff works better). See the attachment.
Comment 33 udo 2007-08-24 10:43:38 UTC
BTW: stuff happened when I copied config-2.6.22.5.txt for the 2nd time to the same location on the Win2K box.
Comment 34 Cyrill Gorcunov 2007-08-24 12:45:13 UTC
Thanks. Give us time to try to handle bug
Comment 35 Cyrill Gorcunov 2007-08-29 11:34:10 UTC
So you're trying to rewrite file on Win2K server, am I right?
Comment 36 Cyrill Gorcunov 2007-08-30 10:46:56 UTC
Well, I've been trying to cause locking problem almost whole day long - and has no result (but with a major defference - I've no Windows machine home so all is going thru samba). Could you please describe in details what you're doing - I mean what is you mount options? Did you try to use pecific cifs options like `direct`? I'm sorry - I'm not cifs specialist...Steeve, you thoughts?
Comment 37 Steve French 2007-08-30 11:43:56 UTC
You could try to disable the Unix protocol extensions (/proc/fs/cifs/LinuxExtensionsEnabled) before mount (or with current mainline cifs, you can use new mount option  "nounix") - that will cause the behavior on mounts to Samba to be more similar to what you would see if you mounted to Windows or the appliances that do not support the Unix extensions.
Comment 38 Cyrill Gorcunov 2007-08-30 12:00:54 UTC
OK, let me try
Comment 39 Cyrill Gorcunov 2007-08-30 12:09:49 UTC
Well, no hangs at all :( (dunno should I be happy or sad). I hope to get real Windows machine tomorrow... but I have a strong feeling that the problem is in via driver which I haven't to test unfortunelly....
Comment 40 Cyrill Gorcunov 2007-08-30 12:27:09 UTC
I mean i have no via card to test
Comment 41 udo 2007-09-02 09:12:22 UTC
fstab says:

//box/d$          /mnt/samba              cifs    noauto,owner,username=udo,file_mode=0770,dir_mode=0770  0 0

(on one line)

The box is a Win2K `pro` box, it is acting as a `server`.

Any info I could supply to make things easier for you? (just got back home, need to recover from lack of sleep)
Comment 42 Cyrill Gorcunov 2007-09-02 12:36:55 UTC
thanks, write you tomorrow (too busy now), Steeve maybe you may suggest anything to catch the bug easier? As I wrote I've been playing with locking almost two days but didn't get anything - moreover I even introduced a special delay in CIFS SendRecieve function in hope to put CIFS to the knees... but even in such situation CIFS worked without problem, but it was for two Linux machines with Samba on one and I still can't get access to Windows machine unfortunelly (my brother warried about it so much :)
Comment 43 Cyrill Gorcunov 2007-09-03 10:52:29 UTC
well, still trying to catch it by reading the code...
Comment 44 udo 2007-09-03 10:54:42 UTC
If you have any ideas about how I can provide more/other info that might help please let me know.
Comment 45 Natalie Protasevich 2007-09-03 19:14:50 UTC
Udo, have you tried to do the git bisect so far? Seems like you have real good case here, since it all worked before 2.6.20, right? Cyrill, do you think that would help?
Comment 46 udo 2007-09-03 20:14:52 UTC
The bug has been there for a while. 2.6.20.1 was the moment I reported it.
Comment 47 Cyrill Gorcunov 2007-09-04 07:09:26 UTC
Thanks for usefull info, unfortunelly we're moving to a new office - so I've too less time for the CIFS research... Anyway one day we'll catch it ;)
Comment 48 Cyrill Gorcunov 2007-09-04 12:41:09 UTC
Hi udo,

I think the following info would help (for some of this questions you've given answers so just repeat them):

1) Which is the last kernel version without that error (probably 2.6.19)
2) Which is the first one where it happens (seems to be 2.6.20.1)
3) Is it only happens on file overwrite on remote machine? I mean do you overwritre thru cp command or the remote file is previously deleted on Windows machine after being copied
4) Is it happens if your remote directory consist of small number of files?
5) Is it happens only for .config file (maybe size of the file being copied does matter)
6) Attach please more biggest dmesg file (maybe it helps to find another traces).

To Steeve: you know even NFS contains lock_/unlock_kernel calls in nfs_setattr() function. Maybe CIFS should too? Btw, most interesting thing in diff btw 2.6.19 and 2.6.21 kernels for CIFS tree is cifs_writepage() function. Take a look please, maybe there is an error? For example cifs_partialpagewrite() calls for kmap() that could sleep (is it important?).

P.S. sorry for my English ;)
Comment 49 Steve French 2007-09-04 14:25:03 UTC
doing a lock kernel in setattr seems like a performance mistake (expensive call - not just common but it can block).  setattr is called with the i_mutex on the inode already.

BKL is held on put_super and remount and mount and umount (and probably bmap) but clearly it is not meant to be held on common file and inode operations if we can avoid it.

If you want to eliminate writepage being a possible source of problem, you thry the test case over a cifs mount which was mounted with mount option "forcedirectio" (in which case writepage and writepages are avoided, and writes go directly to cifs_write without going through the pagecache).
Comment 50 Cyrill Gorcunov 2007-09-04 22:11:20 UTC
Udo, you should try "forcedirectio" then, 'cause I still have no windows machine.
Comment 51 udo 2007-09-05 08:35:59 UTC
How is the exact method to use 'foredirectio' ?
(call me stupid).

BTW: I can test until saturday, will be away after that.
Comment 52 Cyrill Gorcunov 2007-09-05 08:42:10 UTC
actually there is "direct" option... try it ;)
Comment 53 Cyrill Gorcunov 2007-09-05 08:52:34 UTC
"forcedirectio" option just hasn't been documented yet, so use it as an usual option
Comment 54 udo 2007-09-05 08:54:02 UTC
fstab holds:

//box/d$          /mnt/samba              cifs    forcedirectio,noauto,owner,username=udo,file_mode=0770,dir_mode=0770    0 0
(all on one line)

Remounted. 
Did  echo 7 > /proc/fs/cifs/cifsFYI
Did a similar cp action as last time.
Bug happened.
See attached log.
Comment 55 udo 2007-09-05 08:56:21 UTC
Created attachment 12713 [details]
Latest dmesg with forcedirectio mount option
Comment 56 Cyrill Gorcunov 2007-09-05 08:58:19 UTC
thanks, let me see...
Comment 57 Cyrill Gorcunov 2007-09-05 09:13:05 UTC
may be more long `messages` file does bring some clarity... could you attach that part of /var/log/messages where the mouning begin (and to the end of connection)?
Just cut it from /var/log/messages. Dunno does it help or not but who knows ;)
Comment 58 udo 2007-09-05 09:18:44 UTC
Created attachment 12714 [details]
Some relevant messages.
Comment 59 Cyrill Gorcunov 2007-09-05 09:19:51 UTC
thanks
Comment 60 Cyrill Gorcunov 2007-09-05 09:25:17 UTC
Steeve, your comments?
Comment 61 Cyrill Gorcunov 2007-09-05 09:32:01 UTC
sorry for double 'e' :(
Comment 62 Cyrill Gorcunov 2007-09-05 10:32:02 UTC
Btw, via-rhine module may be loaded with debug=7 option and could get us some additional debug info (in case if an error not in cifs module).
Comment 63 udo 2007-09-05 10:40:27 UTC
Network is not compiled as module here, but can try that if details hint in that direction. Maybe tomorrow.
Comment 64 Cyrill Gorcunov 2007-09-05 10:44:16 UTC
ok, I have access to windowz machine myself so will try to reveal the bug
Comment 65 Cyrill Gorcunov 2007-09-05 10:53:55 UTC
Well, I've been hard playing with WinXP machine - nothing hangs at all. So it seems to be via-rhine bug. (and another differece there is network speed - I've a quite speed net but you seems to lose some UDP packets - it does matter possibly)
Comment 66 Cyrill Gorcunov 2007-09-06 04:07:58 UTC
via-rhine was changed a bit but almost related to nic sleep state, so it couldn't be a source of problem.
Comment 67 Cyrill Gorcunov 2007-09-06 06:23:22 UTC
udo, can you confirm that 2.6.19 (or even 2.6.20) does work without hang? Or you've no other version to answer the question?
Comment 68 udo 2007-09-06 08:36:43 UTC
Oldest kernel around here is 2.6.21.2.
(I keep the list in grub within limits to avoid filling up the root partition with modules)

About via-rhine: maybe ask people on linux-kernel? Or Roger Luethi?
via-rhine also has issues when running half-duplex (full-duplex here), see the linux-kernel archives for my reports.

BTW: What specifically is happening so that the problem occurs? What exactly makes it point towards via-rhine? (I like to understand)
Comment 69 Cyrill Gorcunov 2007-09-06 09:03:28 UTC
Well, my only suspect about via-rhine is caused by experience in file copying procedure in WinXP. I mean I've tested cifs with '2.6.22-gentoo-r5' kernel -> WindowsXP shared folder and couldn't cause any lock problem (even under massive copying). Also I've been playing hard with CIFS and Linux/Samba machine (under massive file copying too) and still no errors. As I wrote I even got a special delay introduced to CIFS to cause locking problem with multiuser access to the files are being copied - and still can't get any errors. I saw the sources for via-rhine and there is too small part is changed (related to NIC sleep state at most) so I don't think if it bring errors to us. It's a good idea to link in kernel-list. But for firts I think we shoul ask Steve for his opinion. Btw Udo it seems I messed a bit with about _last_ kernel version being worked _without_ locking problem. Is it 2.6.21.2? That means - 2.6.21.2 - _is_ the last kernel that works _without_ problem, right?
Comment 70 udo 2007-09-06 09:14:42 UTC
First kernel I mention (see at the top of the page) is 2.6.20.1.
That is when I really noticed the problem.
I don't know about earlier versions of the kernel except that I did not notice the problem. (kernel freezes for a while, beeps, ups etc is upset)
Comment 71 Cyrill Gorcunov 2007-09-06 09:25:27 UTC
hm-m-m, I see... let me think a bit
Comment 72 Cyrill Gorcunov 2007-09-06 09:32:34 UTC
I think we could call for linux-kernel list. So I'm sending request to the list ;)
Comment 73 udo 2007-09-06 09:33:40 UTC
Every now and then I backup the .config to the windoze box. Sometimes more
often than other times. This action triggered the problem. So it is no 100%
hard indication that it starts with 2.6.20.1.
Comment 74 udo 2007-09-06 09:35:08 UTC
BTW: there is a Sweex LA200020 8-port LAN-switch between the firewall box and the Windoze box; maybe we should check this one as well?
I did not really notice any corruption yet.
Comment 75 Cyrill Gorcunov 2007-09-06 09:42:06 UTC
I don't think if that is a problem. Saying "backup" you mean - file overwritting action? (btw what FS's you're using on Linux and windows machines?) maybe it could help the developers who will to look this BUG?
Comment 76 udo 2007-09-06 09:49:32 UTC
ext3 on the Linux
ntfs on the win2K box.

The backup indeed sometimes overwrites an older file.
Comment 77 Cyrill Gorcunov 2007-09-06 10:16:24 UTC
Well udo, it seems I have only two possible things to do to reveal a bug:

1) Cause this locking with 'echo 7 > /proc/fs/cifs/cifsFYI' and 'echo 1 > /proc/fs/cifs/traceSMB' to see raw packets (I know Steve mentoined that he prefers to use normal sniffers)
2) Try to use kernel_lock/kernel_unlock() in cifs_setattr function. Yes Steve I know that is a performance hit but that could serialize cif_setattr access...

Udo, I'm not mature kernel developer and unfortunelly can't tell you what is a real problem here :(
Comment 78 Cyrill Gorcunov 2007-09-06 10:29:13 UTC
So udo try to trace SMB packets then - maybe it show us something interesting ;)
Comment 79 Cyrill Gorcunov 2007-09-07 02:45:51 UTC
Hi udo did you try anithing (I mean SMB tracing)?
Comment 80 udo 2007-09-07 06:51:41 UTC
Created attachment 12754 [details]
with smb print
Comment 81 udo 2007-09-07 06:53:25 UTC
Created attachment 12755 [details]
messages with smb print
Comment 82 udo 2007-09-07 06:54:03 UTC
I did try to trace SMB packets. See attachments.
Comment 83 Cyrill Gorcunov 2007-09-07 08:11:40 UTC
Thanks a lot and sorry for boring things you've been had to do ;)
Comment 84 Cyrill Gorcunov 2007-09-07 08:39:33 UTC
I wrote to Steve about this and waiting for response.
Comment 85 Steve French 2007-09-07 09:21:24 UTC
Seems like the key issue is still going to be wading through the lock dependency data a little more before posting to the Linux network (or lkml) mailing
lists for suggestions on why it is a problem that tcp grabs the lock in

tcp_sendmsg (net/ipv4/tcp.c) line 673 does a lock_sock call on the sk

Entry #3 has a truncate on an inode getting the i_alloc_sem (in fs/attr.c) - note that cifs does not grab the i_alloc_sem anywhere.

Not sure why sysfs (in particular sysfs create dir) shows up in the dependency chain, entry #2) - cifs does not use sysfs - perhaps their tcp driver or vpn does.

Any idea what the messages about "pptp_gre.c" are about?  I don't seem
to have that c file (it seems to be part of a vpn).  This may be worth
digging into.

Is the cifs mount over a vpn?

I don't see much cifs related here, either in the original problem or
in the log (Comment #58).   I expect that the circular lock dependency
either does not cause the hang mentioned in comment #7 or that the
hang is due to a unique interaction of tcp locking and the eth drivers
under it (not due to the cifs or vfs code above it in this case).
Comment 86 Cyrill Gorcunov 2007-09-07 09:28:16 UTC
Udo, what about VPN - do you use it? if yes, could you please try to cp wthout it.
Comment 87 udo 2007-09-07 09:35:51 UTC
When the machine freezes for a while the ppp/pptp connection to my ISP gives the pptp error messages. They are a side effect and have nothign to do with the smb cp issue since pptp is running on the other interface of the EK8000.
Comment 88 Cyrill Gorcunov 2007-09-07 09:58:37 UTC
I see... Well, let me digg via code then (btw, if you have a machine with other LAN nic - you may try to play with another eth driver to eliminate via as an issue source)
Comment 89 Natalie Protasevich 2007-09-07 11:07:59 UTC
Cyrill, and Udo,
Just wanted to mention - thanks much for being such a great team and being so very proactive in your parts. I wish all linux gurus and users and enthusiasts were this way :) Good luck..
Comment 90 Cyrill Gorcunov 2007-09-07 11:34:48 UTC
Thanks Natalie for so kind-hearted words ;) Unfortunelly we still can't eliminate the problem :( I have too small linux kernel experience. All usefull things (such as comments and advices) are done by Steve, and all tests are done by Udo ;)
Comment 91 Cyrill Gorcunov 2007-09-08 12:51:31 UTC
Hi Udo, could you please try to do the following:
- unplug your Webcam (it seems to be some Philips mod)
- boot with selinux=0 bootparam
- unplug second eth wire (so only one interface that do connect to win share is remained)

If neither help we could try to produce more debug messages with cifs module (by adding that keypoint markers into cifs sources)
Comment 92 Cyrill Gorcunov 2007-09-08 13:01:57 UTC
Btw, it could be even better if you tried to play with win share while all services (like firewall and so on) are tured on - just the kernel with minimum services.
Comment 93 Cyrill Gorcunov 2007-09-11 11:57:00 UTC
Hi Udo, what news? ;)
Comment 94 Cyrill Gorcunov 2007-09-18 04:53:16 UTC
Hi Udo, did you try to play with selinux disabled, firewall turned off and etc? I'm still trying to isolate the problem but it's not as simple as I thought ;)
Comment 95 udo 2007-09-22 05:52:11 UTC
I just got back from London.
I do not use selinux. (selinux=0)
What influence can tehfirewall have? The ads4linux firewall works (only?) on the external interface and this bug occurs on the internal interface. (trying to understand here)
But this weekend I cna give that a try.
Comment 96 Cyrill Gorcunov 2007-09-22 09:17:30 UTC
Btw, could you post _whole_ msg history starting from a bootstrap. Why I'm doing notice about firewall and webcam etc... 'cause I'm trying to compress an area of error searching circle.
Comment 97 Cyrill Gorcunov 2007-09-23 11:49:13 UTC
Hi Udo, what news? Still hangs happens (I mean with _all_stuff_turned_off_)?
Comment 98 udo 2007-09-24 07:56:12 UTC
Created attachment 12916 [details]
messages with bug and some stuff disabled
Comment 99 udo 2007-09-24 07:56:41 UTC
Created attachment 12917 [details]
dmesg with bug and some stuff disabled
Comment 100 udo 2007-09-24 07:57:27 UTC
I attached some new logging, created by running this set of commands:

adsl stop
ifconfig eth1 down
service motion stop
rmmod pwc
echo 7 > /proc/fs/cifs/cifsFYI
echo 1 > /proc/fs/cifs/traceSMB
cd /usr/src/linux
echo - - - - - -
cp .config /mnt/samba/download/linux/config-epia-2.6.22.6
echo - - - - - -
dmesg
echo klaar
Comment 101 Cyrill Gorcunov 2007-09-24 08:17:48 UTC
Thanks Udo! Give me some time to process this stuff.
Comment 102 Cyrill Gorcunov 2007-09-29 10:49:33 UTC
Hi Udo, I think you could try the patch enveloped. Actually it _does_ really slowdown file reading/listing procedure so test it with *cp only* and don't use it for everyday work. If the patch eliminate your problem it let us to know in which direction we should digg.

Steve, I know the patch is quite silly, but even _old_ smb driver uses semaphores to recv data from socket (and I think we have a race there). But I could be _just_wrong_.

The patch is over 2.6.22.5 tree.
Comment 103 Cyrill Gorcunov 2007-09-29 10:51:56 UTC
Created attachment 12988 [details]
patch to test recv race

Patch is over 2.6.22.5 tree
Comment 104 Cyrill Gorcunov 2007-09-29 10:57:36 UTC
Created attachment 12989 [details]
patch to test recv race

The patch is over 2.6.22.5 tree.
Comment 105 udo 2007-09-30 00:25:26 UTC
Created attachment 12996 [details]
dmesg 2.6.22.8 with patch
Comment 106 udo 2007-09-30 00:25:54 UTC
Created attachment 12997 [details]
messages 2.6.22.8 with patch
Comment 107 udo 2007-09-30 00:27:08 UTC
Applied the patch to 2.6.22.8.
started the following small script:
adsl stop
ifconfig eth1 down
service motion stop
rmmod pwc
echo 7 > /proc/fs/cifs/cifsFYI
echo 1 > /proc/fs/cifs/traceSMB
cd /usr/src/linux
echo - - - - - -
cp .config /mnt/samba/download/linux/config-epia-2.6.22.8
cp .config /mnt/samba/download/linux/config-epia-2.6.22.8
echo - - - - - -
dmesg
echo klaar

Attached results.
Comment 108 Cyrill Gorcunov 2007-09-30 01:42:38 UTC
Thanks Udo for testing. Still digging... Write you any news as only get it.
Comment 109 Cyrill Gorcunov 2007-10-26 08:35:08 UTC
Hi, Udo,
it seems I found the patch we need, please test it

http://programming.kicks-ass.net/kernel-patches/lockdep/nfs-socket-lockdep.patch

and write me if it help
Comment 110 udo 2007-10-26 09:21:00 UTC
Thanks.
The patch does not apply well on 2.6.23.1. :-(
Comment 111 Cyrill Gorcunov 2007-10-26 10:31:59 UTC
Oops, will come with fixed patch a bit later
Comment 112 Cyrill Gorcunov 2007-10-26 10:50:57 UTC
Oops for second time, Udo. This patch is already applied in 2.6.23. Forget what I proposed... sorry for disturbing you :(
Comment 113 Cyrill Gorcunov 2007-11-25 10:09:36 UTC
Hi Udo, since we haven't fixed that bug lets try to continue ;) A few questions. Does the problem appear with the latest kernel? Did you try to recompile the kernel with via-rhine as module to load it with debugging enabled (that should produce debug prints that may help)?
Comment 114 udo 2007-11-25 10:39:00 UTC
I am currently using 2.6.23.1.
I do have 2.6.23.8 compiled for next boot.
Just did a test and the bug did NOT appear. (yes, lockdep debugging is on)
# gzip -dc /proc/config.gz |grep DEBUG|grep -v ^#
CONFIG_SLUB_DEBUG=y
CONFIG_USB_DEBUG=y
CONFIG_USB_STORAGE_DEBUG=y
CONFIG_CIFS_DEBUG2=y
CONFIG_DEBUG_KERNEL=y
CONFIG_SCHED_DEBUG=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_DEBUG_SPINLOCK_SLEEP=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_STACKOVERFLOW=y
Comment 115 Cyrill Gorcunov 2007-11-25 10:42:56 UTC
Whoa!!! Even I'm completely stupid and couldn't have helped you but real linux programmers are so good ;) So can we close the bug then?
Comment 116 Steve French 2009-01-24 20:22:37 UTC
Fixed since 2.6.23
Comment 117 Steve French 2009-01-24 20:22:53 UTC
Fixed now
Comment 118 udo 2009-01-24 22:16:46 UTC
Thanks for reminding.
My setup here has changed but will try to test to be able to confirm.

Note You need to log in before you can comment on or make changes to this bug.