Bug 8377
Summary: | "circular locking dependency detected" while doing cp | ||
---|---|---|---|
Product: | File System | Reporter: | udo (udovdh) |
Component: | CIFS | Assignee: | Steve French (sfrench) |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | akpm, bunk, gorcunov, protasnb, sfrench |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.20.1, 2.6.21.4, 2.6.21.5, 2.6.22.5 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: |
The logging in the messages file
Another occurrance Again... Latest .config with proc but without the cifs stuff dmesg with crash and cifs debug info Latest dmesg with forcedirectio mount option Some relevant messages. with smb print messages with smb print messages with bug and some stuff disabled dmesg with bug and some stuff disabled patch to test recv race patch to test recv race dmesg 2.6.22.8 with patch messages 2.6.22.8 with patch |
Description
udo
2007-04-27 07:02:07 UTC
Created attachment 11289 [details]
The logging in the messages file
I don't know either of the TCP code and the filesystem code, but I get the impression that this may be false alarm from the lockdep debugging code. This is on a VIA Epia EK8000 which was up 50 days and some minutes when it happened. The machine has 512 megs of RAM and runs as a firewall/webserver/etc. I can post more info if deemed relevant. If this bug ticket gets stale, I suggest you repost on LKML with a good subject line, the right excerpts from dmesg etc, and backlink to the bugzilla entry. This appears to be a meta-bug (networking vs. filesystem) or may just be false alarm from lockdep. Therefore Cc the primary author of lockdep, Ingo Molnar <mingo@elte.hu>. (Sorry, perhaps it wasn't such a good idea of mine to point you to bugzilla in the first place.) Created attachment 11395 [details]
Another occurrance
It happened another time
It just happened again while I copied a file to the same windoze box. Also 2.6.21.1 does have an issue: when I try a similar cop the system hangs. What now!? What hangs, the whole kernel and all? If so, try the test with a text console instead of X11 in front. Perhaps you get the kernel panic log when the hang happens. If you got a camera around, take a photograph of the panic log and attach it here. What hangs: System was headless. Did not respond to ping anymore. I do not run X on the firewall box. I did find a start of the locking thing in the logs: May 5 19:42:38 epia kernel: May 5 19:42:38 epia kernel: ======================================================= May 5 19:42:38 epia kernel: [ INFO: possible circular locking dependency detected ] May 5 19:42:38 epia kernel: 2.6.21.1 #4 May 5 19:42:38 epia kernel: ------------------------------------------------------- May 5 19:42:38 epia kernel: cp/5433 is trying to acquire lock: May 5 19:42:38 epia kernel: (sk_lock-AF_INET){--..}, at: [tcp_sendmsg+22/2523] tcp_sendmsg+0x16/0x9db May 5 19:42:38 epia kernel: (sk_lock-AF_INET){--..}, at: [<c02c056f>] tcp_sendmsg+0x16/0x9db May 5 19:42:38 epia kernel: May 5 19:42:38 epia kernel: but task is already holding lock: May 5 19:42:38 epia kernel: (&inode->i_alloc_sem){--..}, at: [notify_change+232/720] notify_change+0xe8/0x2d0 May 5 19:42:38 epia kernel: (&inode->i_alloc_sem){--..}, at: [<c0169684>] notify_change+0xe8/0x2d0 May 5 19:42:38 epia kernel: May 5 19:42:38 epia kernel: which lock already depends on the new lock. May 5 19:42:38 epia kernel: May 5 19:42:38 epia kernel: May 5 19:42:38 epia kernel: the existing dependency chain (in reverse order) is: May 5 19:42:38 epia kernel: May 5 19:42:38 epia kernel: -> #3 (&inode->i_alloc_sem){--..}: May 5 19:42:38 epia kernel: [__lock_acquire+2529/2949] __lock_acquire+0x9e1/0xb85 May 5 19:42:38 epia kernel: [<c012d926>] __lock_acquire+0x9e1/0xb85 May 5 19:42:38 epia kernel: [lock_acquire+104/130] lock_acquire+0x68/0x82 May 5 19:42:38 epia kernel: [<c012db32>] lock_acquire+0x68/0x82 May 5 19:42:38 epia kernel: [down_write+58/83] down_write+0x3a/0x53 May 5 19:42:38 epia kernel: [<c0126d87>] down_write+0x3a/0x53 May 5 19:42:38 epia kernel: [notify_change+232/720] notify_change+0xe8/0x2d0 May 5 19:42:38 epia kernel: [<c0169684>] notify_change+0xe8/0x2d0 May 5 19:42:38 epia kernel: [do_truncate+83/108] do_truncate+0x53/0x6c May 5 19:42:38 epia kernel: [<c0156662>] do_truncate+0x53/0x6c May 5 19:42:38 epia kernel: [may_open+442/514] may_open+0x1ba/0x202 May 5 19:42:38 epia kernel: [<c015eff0>] may_open+0x1ba/0x202 May 5 19:42:38 epia kernel: [open_namei+639/1455] open_namei+0x27f/0x5af May 5 19:42:38 epia kernel: [<c0160f5a>] open_namei+0x27f/0x5af May 5 19:42:38 epia kernel: [do_filp_open+38/59] do_filp_open+0x26/0x3b May 5 19:42:38 epia kernel: [<c01563c7>] do_filp_open+0x26/0x3b May 5 19:42:38 epia kernel: [do_sys_open+67/194] do_sys_open+0x43/0xc2 May 5 19:44:58 epia syslogd 1.4.1: restart (remote reception). May 5 19:44:58 epia kernel: klogd 1.4.1, log source = /proc/kmsg started. Created attachment 11406 [details]
Again...
Just happened again:
[root@epia linux]# cp .config /mnt/samba/download/linux/config.epia.2.6.21.1
[root@epia linux]# cp tarpit.patch /mnt/samba/download/linux/
<booommm!!>
(tarpit.patch is a textfile of allmost 1.5 Kbyte)
Also happens in 2.6.21.4 and 2.6.21.5. ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.21.4 #8 ------------------------------------------------------- cp/3088 is trying to acquire lock: (sk_lock-AF_INET){--..}, at: [<c02c5c99>] tcp_sendmsg+0x16/0x9cc but task is already holding lock: (&inode->i_alloc_sem){--..}, at: [<c0167e08>] notify_change+0xe8/0x2d0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&inode->i_alloc_sem){--..}: [<c012c0c8>] __lock_acquire+0x9e1/0xb85 [<c012c2d4>] lock_acquire+0x68/0x82 [<c0126503>] down_write+0x3a/0x53 [<c0167e08>] notify_change+0xe8/0x2d0 [<c0154dee>] do_truncate+0x53/0x6c [<c015d77c>] may_open+0x1ba/0x202 [<c015f6e6>] open_namei+0x27f/0x5af [<c0154b53>] do_filp_open+0x26/0x3b [<c0154bab>] do_sys_open+0x43/0xc2 [<c0154c62>] sys_open+0x1c/0x1e [<c01026a6>] sysenter_past_esp+0x5f/0x99 [<ffffffff>] 0xffffffff -> #2 (&sysfs_inode_imutex_key){--..}: [<c012c0c8>] __lock_acquire+0x9e1/0xb85 [<c012c2d4>] lock_acquire+0x68/0x82 [<c02f7c0a>] __mutex_lock_slowpath+0xdc/0x266 [<c02f7db0>] mutex_lock+0x1c/0x1f [<c018c41d>] create_dir+0x20/0x196 [<c018c5dd>] sysfs_create_dir+0x4a/0x64 [<c01d2bd9>] kobject_shadow_add+0xd6/0x189 [<c01d2c96>] kobject_add+0xa/0xc [<c0223ee5>] device_add+0xae/0x62b [<c02aaed3>] netdev_register_sysfs+0x5a/0x5f [<c02a102a>] register_netdevice+0x22c/0x2e4 [<c02a2175>] register_netdev+0x40/0x4d [<c022c124>] rhine_init_one+0x492/0x64f [<c01e1ffd>] pci_device_probe+0x39/0x5b [<c0225e0b>] really_probe+0xbd/0x145 [<c0225f28>] driver_probe_device+0x95/0xa1 [<c022603d>] __driver_attach+0x6a/0xa1 [<c0225401>] bus_for_each_dev+0x36/0x5b [<c0225c7f>] driver_attach+0x19/0x1b [<c02256e8>] bus_add_driver+0x6a/0x170 [<c022624f>] driver_register+0x79/0x7e [<c01e2184>] __pci_register_driver+0x7b/0xa8 [<c040a0de>] rhine_init+0x5d/0x5f [<c03f271c>] init+0x95/0x17a [<c01038e7>] kernel_thread_helper+0x7/0x10 [<ffffffff>] 0xffffffff -> #1 (rtnl_mutex){--..}: [<c012c0c8>] __lock_acquire+0x9e1/0xb85 [<c012c2d4>] lock_acquire+0x68/0x82 [<c02f7c0a>] __mutex_lock_slowpath+0xdc/0x266 [<c02f7db0>] mutex_lock+0x1c/0x1f [<c02a8913>] rtnl_lock+0xd/0xf [<c02e0e9d>] ip_mc_join_group+0x2c/0xc9 [<c02c2573>] ip_setsockopt+0x64b/0x9a7 [<c02d7d10>] udp_setsockopt+0x43/0x4a [<c029a01e>] sock_common_setsockopt+0x1e/0x24 [<c02986e7>] sys_setsockopt+0x7b/0x97 [<c0299d39>] sys_socketcall+0x1e8/0x241 [<c01026a6>] sysenter_past_esp+0x5f/0x99 [<ffffffff>] 0xffffffff -> #0 (sk_lock-AF_INET){--..}: [<c012bfa9>] __lock_acquire+0x8c2/0xb85 [<c012c2d4>] lock_acquire+0x68/0x82 [<c029a747>] lock_sock_nested+0xba/0xc7 [<c02c5c99>] tcp_sendmsg+0x16/0x9cc [<c02dd764>] inet_sendmsg+0x3e/0x49 [<c0298b77>] sock_sendmsg+0xe7/0x104 [<c0299dba>] kernel_sendmsg+0x28/0x37 [<ded22eba>] smb_send+0x92/0x11c [cifs] [<ded230c3>] SendReceive+0x17f/0x3dc [cifs] [<ded12817>] CIFSSMBSetEOF+0x1e6/0x229 [cifs] [<ded1e4dd>] cifs_setattr+0x25d/0x900 [cifs] [<c0167e50>] notify_change+0x130/0x2d0 [<c0154dee>] do_truncate+0x53/0x6c [<c015d77c>] may_open+0x1ba/0x202 [<c015f6e6>] open_namei+0x27f/0x5af [<c0154b53>] do_filp_open+0x26/0x3b [<c0154bab>] do_sys_open+0x43/0xc2 [<c0154c62>] sys_open+0x1c/0x1e [<c01026a6>] sysenter_past_esp+0x5f/0x99 [<ffffffff>] 0xffffffff other info that might help us debug this: 2 locks held by cp/3088: #0: (&inode->i_mutex){--..}, at: [<c02f7db0>] mutex_lock+0x1c/0x1f #1: (&inode->i_alloc_sem){--..}, at: [<c0167e08>] notify_change+0xe8/0x2d0 stack backtrace: [<c0103c43>] show_trace_log_lvl+0x1a/0x2f [<c01042bc>] show_trace+0x12/0x14 [<c0104359>] dump_stack+0x16/0x18 [<c012a7e1>] print_circular_bug_tail+0x5f/0x68 [<c012bfa9>] __lock_acquire+0x8c2/0xb85 [<c012c2d4>] lock_acquire+0x68/0x82 [<c029a747>] lock_sock_nested+0xba/0xc7 [<c02c5c99>] tcp_sendmsg+0x16/0x9cc [<c02dd764>] inet_sendmsg+0x3e/0x49 [<c0298b77>] sock_sendmsg+0xe7/0x104 [<c0299dba>] kernel_sendmsg+0x28/0x37 [<ded22eba>] smb_send+0x92/0x11c [cifs] [<ded230c3>] SendReceive+0x17f/0x3dc [cifs] [<ded12817>] CIFSSMBSetEOF+0x1e6/0x229 [cifs] [<ded1e4dd>] cifs_setattr+0x25d/0x900 [cifs] [<c0167e50>] notify_change+0x130/0x2d0 [<c0154dee>] do_truncate+0x53/0x6c [<c015d77c>] may_open+0x1ba/0x202 [<c015f6e6>] open_namei+0x27f/0x5af [<c0154b53>] do_filp_open+0x26/0x3b [<c0154bab>] do_sys_open+0x43/0xc2 [<c0154c62>] sys_open+0x1c/0x1e [<c01026a6>] sysenter_past_esp+0x5f/0x99 ======================= pwc: Too many ISOC errors, bailing out. pwc: Too many ISOC errors, bailing out. pwc: Too many ISOC errors, bailing out. pwc: Too many ISOC errors, bailing out. pwc: Too many ISOC errors, bailing out. pwc: Too many ISOC errors, bailing out. Clocksource tsc unstable (delta = 18746609409 ns) Time: acpi_pm clocksource has been installed. pwc: Too many ISOC errors, bailing out. pwc: Too many ISOC errors, bailing out. pwc: Too many ISOC errors, bailing out. pwc: Too many ISOC errors, bailing out. pwc: Too many ISOC errors, bailing out. pwc: Too many ISOC errors, bailing out. CIFS VFS: No response for cmd 50 mid 5023 ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.21.5 #10 ------------------------------------------------------- cp/3480 is trying to acquire lock: (sk_lock-AF_INET){--..}, at: [<c02c5d45>] tcp_sendmsg+0x16/0x9cc but task is already holding lock: (&inode->i_alloc_sem){--..}, at: [<c0167e28>] notify_change+0xe8/0x2d0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&inode->i_alloc_sem){--..}: [<c012c0e8>] __lock_acquire+0x9e1/0xb85 [<c012c2f4>] lock_acquire+0x68/0x82 [<c0126503>] down_write+0x3a/0x53 [<c0167e28>] notify_change+0xe8/0x2d0 [<c0154e0e>] do_truncate+0x53/0x6c [<c015d79c>] may_open+0x1ba/0x202 [<c015f706>] open_namei+0x27f/0x5af [<c0154b73>] do_filp_open+0x26/0x3b [<c0154bcb>] do_sys_open+0x43/0xc2 [<c0154c82>] sys_open+0x1c/0x1e [<c01026a6>] sysenter_past_esp+0x5f/0x99 [<ffffffff>] 0xffffffff -> #2 (&sysfs_inode_imutex_key){--..}: [<c012c0e8>] __lock_acquire+0x9e1/0xb85 [<c012c2f4>] lock_acquire+0x68/0x82 [<c02f7d70>] __mutex_lock_slowpath+0xdc/0x266 [<c02f7f16>] mutex_lock+0x1c/0x1f [<c018c43d>] create_dir+0x20/0x196 [<c018c5fd>] sysfs_create_dir+0x4a/0x64 [<c01d2bf9>] kobject_shadow_add+0xd6/0x189 [<c01d2cb6>] kobject_add+0xa/0xc [<c0223f21>] device_add+0xae/0x62b [<c02aaf6f>] netdev_register_sysfs+0x5a/0x5f [<c02a10ba>] register_netdevice+0x22c/0x2e4 [<c02a220c>] register_netdev+0x40/0x4d [<c022c160>] rhine_init_one+0x492/0x64f [<c01e201d>] pci_device_probe+0x39/0x5b [<c0225e47>] really_probe+0xbd/0x145 [<c0225f64>] driver_probe_device+0x95/0xa1 [<c0226079>] __driver_attach+0x6a/0xa1 [<c022543d>] bus_for_each_dev+0x36/0x5b [<c0225cbb>] driver_attach+0x19/0x1b [<c0225724>] bus_add_driver+0x6a/0x170 [<c022628b>] driver_register+0x79/0x7e [<c01e21a4>] __pci_register_driver+0x7b/0xa8 [<c040a0cf>] rhine_init+0x5d/0x5f [<c03f271c>] init+0x95/0x17a [<c01038e7>] kernel_thread_helper+0x7/0x10 [<ffffffff>] 0xffffffff -> #1 (rtnl_mutex){--..}: [<c012c0e8>] __lock_acquire+0x9e1/0xb85 [<c012c2f4>] lock_acquire+0x68/0x82 [<c02f7d70>] __mutex_lock_slowpath+0xdc/0x266 [<c02f7f16>] mutex_lock+0x1c/0x1f [<c02a89af>] rtnl_lock+0xd/0xf [<c02e0f5d>] ip_mc_join_group+0x2c/0xc9 [<c02c261f>] ip_setsockopt+0x64b/0x9a7 [<c02d7db8>] udp_setsockopt+0x43/0x4a [<c029a042>] sock_common_setsockopt+0x1e/0x24 [<c029870b>] sys_setsockopt+0x7b/0x97 [<c0299d5d>] sys_socketcall+0x1e8/0x241 [<c01026a6>] sysenter_past_esp+0x5f/0x99 [<ffffffff>] 0xffffffff -> #0 (sk_lock-AF_INET){--..}: [<c012bfc9>] __lock_acquire+0x8c2/0xb85 [<c012c2f4>] lock_acquire+0x68/0x82 [<c029a76b>] lock_sock_nested+0xba/0xc7 [<c02c5d45>] tcp_sendmsg+0x16/0x9cc [<c02dd824>] inet_sendmsg+0x3e/0x49 [<c0298b9b>] sock_sendmsg+0xe7/0x104 [<c0299dde>] kernel_sendmsg+0x28/0x37 [<dec97eba>] smb_send+0x92/0x11c [cifs] [<dec980c3>] SendReceive+0x17f/0x3dc [cifs] [<dec87817>] CIFSSMBSetEOF+0x1e6/0x229 [cifs] [<dec934dd>] cifs_setattr+0x25d/0x900 [cifs] [<c0167e70>] notify_change+0x130/0x2d0 [<c0154e0e>] do_truncate+0x53/0x6c [<c015d79c>] may_open+0x1ba/0x202 [<c015f706>] open_namei+0x27f/0x5af [<c0154b73>] do_filp_open+0x26/0x3b [<c0154bcb>] do_sys_open+0x43/0xc2 [<c0154c82>] sys_open+0x1c/0x1e [<c01026a6>] sysenter_past_esp+0x5f/0x99 [<ffffffff>] 0xffffffff other info that might help us debug this: 2 locks held by cp/3480: #0: (&inode->i_mutex){--..}, at: [<c02f7f16>] mutex_lock+0x1c/0x1f #1: (&inode->i_alloc_sem){--..}, at: [<c0167e28>] notify_change+0xe8/0x2d0 stack backtrace: [<c0103c43>] show_trace_log_lvl+0x1a/0x2f [<c01042bc>] show_trace+0x12/0x14 [<c0104359>] dump_stack+0x16/0x18 [<c012a801>] print_circular_bug_tail+0x5f/0x68 [<c012bfc9>] __lock_acquire+0x8c2/0xb85 [<c012c2f4>] lock_acquire+0x68/0x82 [<c029a76b>] lock_sock_nested+0xba/0xc7 [<c02c5d45>] tcp_sendmsg+0x16/0x9cc [<c02dd824>] inet_sendmsg+0x3e/0x49 [<c0298b9b>] sock_sendmsg+0xe7/0x104 [<c0299dde>] kernel_sendmsg+0x28/0x37 [<dec97eba>] smb_send+0x92/0x11c [cifs] [<dec980c3>] SendReceive+0x17f/0x3dc [cifs] [<dec87817>] CIFSSMBSetEOF+0x1e6/0x229 [cifs] [<dec934dd>] cifs_setattr+0x25d/0x900 [cifs] [<c0167e70>] notify_change+0x130/0x2d0 [<c0154e0e>] do_truncate+0x53/0x6c [<c015d79c>] may_open+0x1ba/0x202 [<c015f706>] open_namei+0x27f/0x5af [<c0154b73>] do_filp_open+0x26/0x3b [<c0154bcb>] do_sys_open+0x43/0xc2 [<c0154c82>] sys_open+0x1c/0x1e [<c01026a6>] sysenter_past_esp+0x5f/0x99 ======================= pwc: Too many ISOC errors, bailing out. pwc: Too many ISOC errors, bailing out. pwc: Too many ISOC errors, bailing out. pwc: Too many ISOC errors, bailing out. pwc: Too many ISOC errors, bailing out. pwc: Too many ISOC errors, bailing out. While nzbperl was writing to windows share via cifs: ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.21.5 #10 ------------------------------------------------------- uudeview/23403 is trying to acquire lock: (sk_lock-AF_INET){--..}, at: [<c02c5d45>] tcp_sendmsg+0x16/0x9cc but task is already holding lock: (&inode->i_alloc_sem){--..}, at: [<c0167e28>] notify_change+0xe8/0x2d0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&inode->i_alloc_sem){--..}: [<c012c0e8>] __lock_acquire+0x9e1/0xb85 [<c012c2f4>] lock_acquire+0x68/0x82 [<c0126503>] down_write+0x3a/0x53 [<c0167e28>] notify_change+0xe8/0x2d0 [<c0154e0e>] do_truncate+0x53/0x6c [<c015d79c>] may_open+0x1ba/0x202 [<c015f706>] open_namei+0x27f/0x5af [<c0154b73>] do_filp_open+0x26/0x3b [<c0154bcb>] do_sys_open+0x43/0xc2 [<c0154c82>] sys_open+0x1c/0x1e [<c01026a6>] sysenter_past_esp+0x5f/0x99 [<ffffffff>] 0xffffffff -> #2 (&sysfs_inode_imutex_key){--..}: [<c012c0e8>] __lock_acquire+0x9e1/0xb85 [<c012c2f4>] lock_acquire+0x68/0x82 [<c02f7d70>] __mutex_lock_slowpath+0xdc/0x266 [<c02f7f16>] mutex_lock+0x1c/0x1f [<c018c43d>] create_dir+0x20/0x196 [<c018c5fd>] sysfs_create_dir+0x4a/0x64 [<c01d2bf9>] kobject_shadow_add+0xd6/0x189 [<c01d2cb6>] kobject_add+0xa/0xc [<c0223f21>] device_add+0xae/0x62b [<c02aaf6f>] netdev_register_sysfs+0x5a/0x5f [<c02a10ba>] register_netdevice+0x22c/0x2e4 [<c02a220c>] register_netdev+0x40/0x4d [<c022c160>] rhine_init_one+0x492/0x64f [<c01e201d>] pci_device_probe+0x39/0x5b [<c0225e47>] really_probe+0xbd/0x145 [<c0225f64>] driver_probe_device+0x95/0xa1 [<c0226079>] __driver_attach+0x6a/0xa1 [<c022543d>] bus_for_each_dev+0x36/0x5b [<c0225cbb>] driver_attach+0x19/0x1b [<c0225724>] bus_add_driver+0x6a/0x170 [<c022628b>] driver_register+0x79/0x7e [<c01e21a4>] __pci_register_driver+0x7b/0xa8 [<c040a0cf>] rhine_init+0x5d/0x5f [<c03f271c>] init+0x95/0x17a [<c01038e7>] kernel_thread_helper+0x7/0x10 [<ffffffff>] 0xffffffff -> #1 (rtnl_mutex){--..}: [<c012c0e8>] __lock_acquire+0x9e1/0xb85 [<c012c2f4>] lock_acquire+0x68/0x82 [<c02f7d70>] __mutex_lock_slowpath+0xdc/0x266 [<c02f7f16>] mutex_lock+0x1c/0x1f [<c02a89af>] rtnl_lock+0xd/0xf [<c02e0f5d>] ip_mc_join_group+0x2c/0xc9 [<c02c261f>] ip_setsockopt+0x64b/0x9a7 [<c02d7db8>] udp_setsockopt+0x43/0x4a [<c029a042>] sock_common_setsockopt+0x1e/0x24 [<c029870b>] sys_setsockopt+0x7b/0x97 [<c0299d5d>] sys_socketcall+0x1e8/0x241 [<c01026a6>] sysenter_past_esp+0x5f/0x99 [<ffffffff>] 0xffffffff -> #0 (sk_lock-AF_INET){--..}: [<c012bfc9>] __lock_acquire+0x8c2/0xb85 [<c012c2f4>] lock_acquire+0x68/0x82 [<c029a76b>] lock_sock_nested+0xba/0xc7 [<c02c5d45>] tcp_sendmsg+0x16/0x9cc [<c02dd824>] inet_sendmsg+0x3e/0x49 [<c0298b9b>] sock_sendmsg+0xe7/0x104 [<c0299dde>] kernel_sendmsg+0x28/0x37 [<dec9feba>] smb_send+0x92/0x11c [cifs] [<deca00c3>] SendReceive+0x17f/0x3dc [cifs] [<dec8f817>] CIFSSMBSetEOF+0x1e6/0x229 [cifs] [<dec9b4dd>] cifs_setattr+0x25d/0x900 [cifs] [<c0167e70>] notify_change+0x130/0x2d0 [<c0154e0e>] do_truncate+0x53/0x6c [<c015d79c>] may_open+0x1ba/0x202 [<c015f706>] open_namei+0x27f/0x5af [<c0154b73>] do_filp_open+0x26/0x3b [<c0154bcb>] do_sys_open+0x43/0xc2 [<c0154c82>] sys_open+0x1c/0x1e [<c01026a6>] sysenter_past_esp+0x5f/0x99 [<ffffffff>] 0xffffffff other info that might help us debug this: 2 locks held by uudeview/23403: #0: (&inode->i_mutex){--..}, at: [<c02f7f16>] mutex_lock+0x1c/0x1f #1: (&inode->i_alloc_sem){--..}, at: [<c0167e28>] notify_change+0xe8/0x2d0 stack backtrace: [<c0103c43>] show_trace_log_lvl+0x1a/0x2f [<c01042bc>] show_trace+0x12/0x14 [<c0104359>] dump_stack+0x16/0x18 [<c012a801>] print_circular_bug_tail+0x5f/0x68 [<c012bfc9>] __lock_acquire+0x8c2/0xb85 [<c012c2f4>] lock_acquire+0x68/0x82 [<c029a76b>] lock_sock_nested+0xba/0xc7 [<c02c5d45>] tcp_sendmsg+0x16/0x9cc [<c02dd824>] inet_sendmsg+0x3e/0x49 [<c0298b9b>] sock_sendmsg+0xe7/0x104 [<c0299dde>] kernel_sendmsg+0x28/0x37 [<dec9feba>] smb_send+0x92/0x11c [cifs] [<deca00c3>] SendReceive+0x17f/0x3dc [cifs] [<dec8f817>] CIFSSMBSetEOF+0x1e6/0x229 [cifs] [<dec9b4dd>] cifs_setattr+0x25d/0x900 [cifs] [<c0167e70>] notify_change+0x130/0x2d0 [<c0154e0e>] do_truncate+0x53/0x6c [<c015d79c>] may_open+0x1ba/0x202 [<c015f706>] open_namei+0x27f/0x5af [<c0154b73>] do_filp_open+0x26/0x3b [<c0154bcb>] do_sys_open+0x43/0xc2 [<c0154c82>] sys_open+0x1c/0x1e [<c01026a6>] sysenter_past_esp+0x5f/0x99 ======================= pwc: Too many ISOC errors, bailing out. pwc: Too many ISOC errors, bailing out. pwc: Too many ISOC errors, bailing out. pwc: Too many ISOC errors, bailing out. pwc: Too many ISOC errors, bailing out. pwc: Too many ISOC errors, bailing out. Clocksource tsc unstable (delta = 18749374053 ns) Time: acpi_pm clocksource has been installed. pwc: Too many ISOC errors, bailing out. pwc: Too many ISOC errors, bailing out. pwc: Too many ISOC errors, bailing out. pwc: Too many ISOC errors, bailing out. pwc: Too many ISOC errors, bailing out. pwc: Too many ISOC errors, bailing out. Have you tried turning off lockdep to verify #4 comment about possible problem with lockdep itself. Also, since this started in 2.6.20.XX should be possible do bisect search for the patch that broke it. Is it possible to see debug messages from CIFS? Or even dmesg? Btw you may try to play with "max_interrupt_work" and "debug" param for via-rhine module. I mean you may try to set up max_interrupt_work=10 and debug=7 and attach dmesg to the bugzilla report. Let's cc Steve - this could be a cifs glitch. What I posted is dmesg output. How can I provide the CIFS debug messages? I have the module compiled into the kernel as I use it all the time as part of a firewall. How does the via-rhine interact with cifs? I did not yet try the lockdep check. Should be a good start. Perhaps it started around 2.6.20 because I enabled various kernel debuggings. Turned off: Lock debugging: prove locking correctness (PROVE_LOCKING) Lock dependency engine debugging (DEBUG_LOCKDEP) And compiled 2.6.21.6, rebooted. Did a fair amount of: # cp .config /mnt/samba/download/linux/config-2.6.21.6-ld.txt cp: overwrite `/mnt/samba/download/linux/config-2.6.21.6-ld.txt'? y # Normally this would trigger the bug easily. So far no problem. Thanks for reminding me, Natalie! To enable additional CIFS debug message you have to get "Enable additional CIFS debugging routines" (CIFS_DEBUG2) turned on at moment of cifs module compilation. So just reconfig your kernel and rebuilt the module. After has it done you should type "echo 1 > /proc/fs/cifs/traceSMB" and you'll get a lot of debug messages in system log file. Thanks. Does the fact that the problem doesn't occur when the locking debugging is turned off say anything about the problem being in the CIFS area? Or is it a combiantion? So I should turn on the CIFS debugging and both PROVE_LOCKING and DEBUG_LOCKDEP? Note that CONFIG_CIFS_DEBUG2 is not required in order for cifs to log debug messages although it increases the amount of information logged. I typically do not enable "echo 1 > /proc/fs/cifs/traceSMB" to log all outgoing requests (and responses too) from the cifs client since I would instead run a network analyzer (e.g. wireshark) on the server machine to capture the network traffic instead. I often do enable debug messages though ie "echo 7 > /proc/fs/cifs/cifsFYI" At first glance, I am puzzled what the circular lock problem is (are you sure that that has anything to do with your hang?). I assume that this is a false alarm from lockdep. my guess is that lockdep is confused between the different types of inodes. The only case in which I would expect sendmsg to ever recurse back into the filesystem and grab a sem, would be when allocating memory in the network adapters send path and a flush of a dirty inode (writing back cached data) is required to be able to complete the alloc. That does not appear to be the case here. To be sure about Steve's guess I think you may try to play with "echo 7 > /proc/fs/cifs/cifsFYI" and grab the logs. It would be usefull anyway I enabled "Enable additional CIFS debugging routines" but the /proc/fs/cifs path is not there. Am I missing something? (2.6.22.2 in the mean time) (I will be back next sunday.) /proc won't be there if it is not enabled in the kernel (in the kernel configuration it is under filesystems->Pseudo-filesystems in menuconfig) but it is enabled in just about every kernel. /proc/fs/cifs won't be there (even if /proc is) if the cifs module is not loaded (to fix that /sbin/insmod cifs.ko) I do have /proc and stuff under there but not the cifs stuff. Btw, could you please post your .config there? any news? what about your .config to try to find out why there is no /proc? Created attachment 12508 [details]
Latest .config with proc but without the cifs stuff
BTW: I use modules only for the stuff that is possibly not in use. All the rest is mostly compiled in. Created attachment 12520 [details]
dmesg with crash and cifs debug info
I was able to get things working. 2.6.22.5 with lock debuging and the CIFS debugging on. (after a reboot stuff works better). See the attachment.
BTW: stuff happened when I copied config-2.6.22.5.txt for the 2nd time to the same location on the Win2K box. Thanks. Give us time to try to handle bug So you're trying to rewrite file on Win2K server, am I right? Well, I've been trying to cause locking problem almost whole day long - and has no result (but with a major defference - I've no Windows machine home so all is going thru samba). Could you please describe in details what you're doing - I mean what is you mount options? Did you try to use pecific cifs options like `direct`? I'm sorry - I'm not cifs specialist...Steeve, you thoughts? You could try to disable the Unix protocol extensions (/proc/fs/cifs/LinuxExtensionsEnabled) before mount (or with current mainline cifs, you can use new mount option "nounix") - that will cause the behavior on mounts to Samba to be more similar to what you would see if you mounted to Windows or the appliances that do not support the Unix extensions. OK, let me try Well, no hangs at all :( (dunno should I be happy or sad). I hope to get real Windows machine tomorrow... but I have a strong feeling that the problem is in via driver which I haven't to test unfortunelly.... I mean i have no via card to test fstab says: //box/d$ /mnt/samba cifs noauto,owner,username=udo,file_mode=0770,dir_mode=0770 0 0 (on one line) The box is a Win2K `pro` box, it is acting as a `server`. Any info I could supply to make things easier for you? (just got back home, need to recover from lack of sleep) thanks, write you tomorrow (too busy now), Steeve maybe you may suggest anything to catch the bug easier? As I wrote I've been playing with locking almost two days but didn't get anything - moreover I even introduced a special delay in CIFS SendRecieve function in hope to put CIFS to the knees... but even in such situation CIFS worked without problem, but it was for two Linux machines with Samba on one and I still can't get access to Windows machine unfortunelly (my brother warried about it so much :) well, still trying to catch it by reading the code... If you have any ideas about how I can provide more/other info that might help please let me know. Udo, have you tried to do the git bisect so far? Seems like you have real good case here, since it all worked before 2.6.20, right? Cyrill, do you think that would help? The bug has been there for a while. 2.6.20.1 was the moment I reported it. Thanks for usefull info, unfortunelly we're moving to a new office - so I've too less time for the CIFS research... Anyway one day we'll catch it ;) Hi udo, I think the following info would help (for some of this questions you've given answers so just repeat them): 1) Which is the last kernel version without that error (probably 2.6.19) 2) Which is the first one where it happens (seems to be 2.6.20.1) 3) Is it only happens on file overwrite on remote machine? I mean do you overwritre thru cp command or the remote file is previously deleted on Windows machine after being copied 4) Is it happens if your remote directory consist of small number of files? 5) Is it happens only for .config file (maybe size of the file being copied does matter) 6) Attach please more biggest dmesg file (maybe it helps to find another traces). To Steeve: you know even NFS contains lock_/unlock_kernel calls in nfs_setattr() function. Maybe CIFS should too? Btw, most interesting thing in diff btw 2.6.19 and 2.6.21 kernels for CIFS tree is cifs_writepage() function. Take a look please, maybe there is an error? For example cifs_partialpagewrite() calls for kmap() that could sleep (is it important?). P.S. sorry for my English ;) doing a lock kernel in setattr seems like a performance mistake (expensive call - not just common but it can block). setattr is called with the i_mutex on the inode already. BKL is held on put_super and remount and mount and umount (and probably bmap) but clearly it is not meant to be held on common file and inode operations if we can avoid it. If you want to eliminate writepage being a possible source of problem, you thry the test case over a cifs mount which was mounted with mount option "forcedirectio" (in which case writepage and writepages are avoided, and writes go directly to cifs_write without going through the pagecache). Udo, you should try "forcedirectio" then, 'cause I still have no windows machine. How is the exact method to use 'foredirectio' ? (call me stupid). BTW: I can test until saturday, will be away after that. actually there is "direct" option... try it ;) "forcedirectio" option just hasn't been documented yet, so use it as an usual option fstab holds: //box/d$ /mnt/samba cifs forcedirectio,noauto,owner,username=udo,file_mode=0770,dir_mode=0770 0 0 (all on one line) Remounted. Did echo 7 > /proc/fs/cifs/cifsFYI Did a similar cp action as last time. Bug happened. See attached log. Created attachment 12713 [details]
Latest dmesg with forcedirectio mount option
thanks, let me see... may be more long `messages` file does bring some clarity... could you attach that part of /var/log/messages where the mouning begin (and to the end of connection)? Just cut it from /var/log/messages. Dunno does it help or not but who knows ;) Created attachment 12714 [details]
Some relevant messages.
thanks Steeve, your comments? sorry for double 'e' :( Btw, via-rhine module may be loaded with debug=7 option and could get us some additional debug info (in case if an error not in cifs module). Network is not compiled as module here, but can try that if details hint in that direction. Maybe tomorrow. ok, I have access to windowz machine myself so will try to reveal the bug Well, I've been hard playing with WinXP machine - nothing hangs at all. So it seems to be via-rhine bug. (and another differece there is network speed - I've a quite speed net but you seems to lose some UDP packets - it does matter possibly) via-rhine was changed a bit but almost related to nic sleep state, so it couldn't be a source of problem. udo, can you confirm that 2.6.19 (or even 2.6.20) does work without hang? Or you've no other version to answer the question? Oldest kernel around here is 2.6.21.2. (I keep the list in grub within limits to avoid filling up the root partition with modules) About via-rhine: maybe ask people on linux-kernel? Or Roger Luethi? via-rhine also has issues when running half-duplex (full-duplex here), see the linux-kernel archives for my reports. BTW: What specifically is happening so that the problem occurs? What exactly makes it point towards via-rhine? (I like to understand) Well, my only suspect about via-rhine is caused by experience in file copying procedure in WinXP. I mean I've tested cifs with '2.6.22-gentoo-r5' kernel -> WindowsXP shared folder and couldn't cause any lock problem (even under massive copying). Also I've been playing hard with CIFS and Linux/Samba machine (under massive file copying too) and still no errors. As I wrote I even got a special delay introduced to CIFS to cause locking problem with multiuser access to the files are being copied - and still can't get any errors. I saw the sources for via-rhine and there is too small part is changed (related to NIC sleep state at most) so I don't think if it bring errors to us. It's a good idea to link in kernel-list. But for firts I think we shoul ask Steve for his opinion. Btw Udo it seems I messed a bit with about _last_ kernel version being worked _without_ locking problem. Is it 2.6.21.2? That means - 2.6.21.2 - _is_ the last kernel that works _without_ problem, right? First kernel I mention (see at the top of the page) is 2.6.20.1. That is when I really noticed the problem. I don't know about earlier versions of the kernel except that I did not notice the problem. (kernel freezes for a while, beeps, ups etc is upset) hm-m-m, I see... let me think a bit I think we could call for linux-kernel list. So I'm sending request to the list ;) Every now and then I backup the .config to the windoze box. Sometimes more often than other times. This action triggered the problem. So it is no 100% hard indication that it starts with 2.6.20.1. BTW: there is a Sweex LA200020 8-port LAN-switch between the firewall box and the Windoze box; maybe we should check this one as well? I did not really notice any corruption yet. I don't think if that is a problem. Saying "backup" you mean - file overwritting action? (btw what FS's you're using on Linux and windows machines?) maybe it could help the developers who will to look this BUG? ext3 on the Linux ntfs on the win2K box. The backup indeed sometimes overwrites an older file. Well udo, it seems I have only two possible things to do to reveal a bug: 1) Cause this locking with 'echo 7 > /proc/fs/cifs/cifsFYI' and 'echo 1 > /proc/fs/cifs/traceSMB' to see raw packets (I know Steve mentoined that he prefers to use normal sniffers) 2) Try to use kernel_lock/kernel_unlock() in cifs_setattr function. Yes Steve I know that is a performance hit but that could serialize cif_setattr access... Udo, I'm not mature kernel developer and unfortunelly can't tell you what is a real problem here :( So udo try to trace SMB packets then - maybe it show us something interesting ;) Hi udo did you try anithing (I mean SMB tracing)? Created attachment 12754 [details]
with smb print
Created attachment 12755 [details]
messages with smb print
I did try to trace SMB packets. See attachments. Thanks a lot and sorry for boring things you've been had to do ;) I wrote to Steve about this and waiting for response. Seems like the key issue is still going to be wading through the lock dependency data a little more before posting to the Linux network (or lkml) mailing lists for suggestions on why it is a problem that tcp grabs the lock in tcp_sendmsg (net/ipv4/tcp.c) line 673 does a lock_sock call on the sk Entry #3 has a truncate on an inode getting the i_alloc_sem (in fs/attr.c) - note that cifs does not grab the i_alloc_sem anywhere. Not sure why sysfs (in particular sysfs create dir) shows up in the dependency chain, entry #2) - cifs does not use sysfs - perhaps their tcp driver or vpn does. Any idea what the messages about "pptp_gre.c" are about? I don't seem to have that c file (it seems to be part of a vpn). This may be worth digging into. Is the cifs mount over a vpn? I don't see much cifs related here, either in the original problem or in the log (Comment #58). I expect that the circular lock dependency either does not cause the hang mentioned in comment #7 or that the hang is due to a unique interaction of tcp locking and the eth drivers under it (not due to the cifs or vfs code above it in this case). Udo, what about VPN - do you use it? if yes, could you please try to cp wthout it. When the machine freezes for a while the ppp/pptp connection to my ISP gives the pptp error messages. They are a side effect and have nothign to do with the smb cp issue since pptp is running on the other interface of the EK8000. I see... Well, let me digg via code then (btw, if you have a machine with other LAN nic - you may try to play with another eth driver to eliminate via as an issue source) Cyrill, and Udo, Just wanted to mention - thanks much for being such a great team and being so very proactive in your parts. I wish all linux gurus and users and enthusiasts were this way :) Good luck.. Thanks Natalie for so kind-hearted words ;) Unfortunelly we still can't eliminate the problem :( I have too small linux kernel experience. All usefull things (such as comments and advices) are done by Steve, and all tests are done by Udo ;) Hi Udo, could you please try to do the following: - unplug your Webcam (it seems to be some Philips mod) - boot with selinux=0 bootparam - unplug second eth wire (so only one interface that do connect to win share is remained) If neither help we could try to produce more debug messages with cifs module (by adding that keypoint markers into cifs sources) Btw, it could be even better if you tried to play with win share while all services (like firewall and so on) are tured on - just the kernel with minimum services. Hi Udo, what news? ;) Hi Udo, did you try to play with selinux disabled, firewall turned off and etc? I'm still trying to isolate the problem but it's not as simple as I thought ;) I just got back from London. I do not use selinux. (selinux=0) What influence can tehfirewall have? The ads4linux firewall works (only?) on the external interface and this bug occurs on the internal interface. (trying to understand here) But this weekend I cna give that a try. Btw, could you post _whole_ msg history starting from a bootstrap. Why I'm doing notice about firewall and webcam etc... 'cause I'm trying to compress an area of error searching circle. Hi Udo, what news? Still hangs happens (I mean with _all_stuff_turned_off_)? Created attachment 12916 [details]
messages with bug and some stuff disabled
Created attachment 12917 [details]
dmesg with bug and some stuff disabled
I attached some new logging, created by running this set of commands: adsl stop ifconfig eth1 down service motion stop rmmod pwc echo 7 > /proc/fs/cifs/cifsFYI echo 1 > /proc/fs/cifs/traceSMB cd /usr/src/linux echo - - - - - - cp .config /mnt/samba/download/linux/config-epia-2.6.22.6 echo - - - - - - dmesg echo klaar Thanks Udo! Give me some time to process this stuff. Hi Udo, I think you could try the patch enveloped. Actually it _does_ really slowdown file reading/listing procedure so test it with *cp only* and don't use it for everyday work. If the patch eliminate your problem it let us to know in which direction we should digg. Steve, I know the patch is quite silly, but even _old_ smb driver uses semaphores to recv data from socket (and I think we have a race there). But I could be _just_wrong_. The patch is over 2.6.22.5 tree. Created attachment 12988 [details]
patch to test recv race
Patch is over 2.6.22.5 tree
Created attachment 12989 [details]
patch to test recv race
The patch is over 2.6.22.5 tree.
Created attachment 12996 [details]
dmesg 2.6.22.8 with patch
Created attachment 12997 [details]
messages 2.6.22.8 with patch
Applied the patch to 2.6.22.8. started the following small script: adsl stop ifconfig eth1 down service motion stop rmmod pwc echo 7 > /proc/fs/cifs/cifsFYI echo 1 > /proc/fs/cifs/traceSMB cd /usr/src/linux echo - - - - - - cp .config /mnt/samba/download/linux/config-epia-2.6.22.8 cp .config /mnt/samba/download/linux/config-epia-2.6.22.8 echo - - - - - - dmesg echo klaar Attached results. Thanks Udo for testing. Still digging... Write you any news as only get it. Hi, Udo, it seems I found the patch we need, please test it http://programming.kicks-ass.net/kernel-patches/lockdep/nfs-socket-lockdep.patch and write me if it help Thanks. The patch does not apply well on 2.6.23.1. :-( Oops, will come with fixed patch a bit later Oops for second time, Udo. This patch is already applied in 2.6.23. Forget what I proposed... sorry for disturbing you :( Hi Udo, since we haven't fixed that bug lets try to continue ;) A few questions. Does the problem appear with the latest kernel? Did you try to recompile the kernel with via-rhine as module to load it with debugging enabled (that should produce debug prints that may help)? I am currently using 2.6.23.1. I do have 2.6.23.8 compiled for next boot. Just did a test and the bug did NOT appear. (yes, lockdep debugging is on) # gzip -dc /proc/config.gz |grep DEBUG|grep -v ^# CONFIG_SLUB_DEBUG=y CONFIG_USB_DEBUG=y CONFIG_USB_STORAGE_DEBUG=y CONFIG_CIFS_DEBUG2=y CONFIG_DEBUG_KERNEL=y CONFIG_SCHED_DEBUG=y CONFIG_DEBUG_SPINLOCK=y CONFIG_DEBUG_MUTEXES=y CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_DEBUG_LOCKDEP=y CONFIG_DEBUG_SPINLOCK_SLEEP=y CONFIG_DEBUG_BUGVERBOSE=y CONFIG_DEBUG_INFO=y CONFIG_DEBUG_STACKOVERFLOW=y Whoa!!! Even I'm completely stupid and couldn't have helped you but real linux programmers are so good ;) So can we close the bug then? Fixed since 2.6.23 Fixed now Thanks for reminding. My setup here has changed but will try to test to be able to confirm. |