Bug 48421

Summary: BUG: __rpc_pipefs_event: failed to remove clnt dir
Product: File System Reporter: Cyril B. (cbay)
Component: NFSAssignee: Trond Myklebust (trondmy)
Status: RESOLVED CODE_FIX    
Severity: normal CC: alan, florian, skinsbursky
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.6.0, 3.5.x Subsystem:
Regression: Yes Bisected commit-id:
Attachments: SUNRPC: Fix validity issues with rpc_pipefs sb->s_fs_info
[PATCH v2] SUNRPC: Fix validity issues with rpc_pipefs sb->s_fs_info
SUNRPC: continue over the look of client instead of break on PiepFS event

Description Cyril B. 2012-10-05 11:49:35 UTC
When stopping my server, I get the trace below (every single time). It also happens on 3.5 (I've just tested it), and I'm pretty confident it worked fine in 3.4 and below.

The kernel is then stuck (sync is in D state) and I have to hard reboot (although I can continue working on the machine).

If I manually stop the services before rebooting (nfs-kernel-server, nfs-common, portmap), I don't get the bug.

Thanks,


Oct  5 13:07:35 kernel: [  187.753827] NFS: __rpc_pipefs_event: failed to remove clnt dir!
Oct  5 13:07:35 kernel: [  187.753914] BUG: Dentry ffff8805fdd3b2c0{i=6e,n=info} still in use (1) [unmount of rpc_pipefs rpc_pipefs]
Oct  5 13:07:35 kernel: [  187.754012] ------------[ cut here ]------------
Oct  5 13:07:35 kernel: [  187.754075] Kernel BUG at ffffffff81123d7e [verbose debug info unavailable]
Oct  5 13:07:35 kernel: [  187.754139] invalid opcode: 0000 [#1] SMP 
Oct  5 13:07:35 kernel: [  187.754270] Modules linked in: xt_NFLOG nfnetlink_log nfnetlink nf_conntrack_ftp nf_conntrack_ipv6 nf_defrag_ipv6 xt_recent nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_multiport coretemp ioatdma i2c_i801 hwmon usbhid dca ehci_hcd uhci_hcd usbcore usb_common netconsole e1000e
Oct  5 13:07:35 kernel: [  187.755318] CPU 4 
Oct  5 13:07:35 kernel: [  187.755361] Pid: 5864, comm: umount Not tainted 3.6.0 #2 Supermicro X8STi/X8STi
Oct  5 13:07:35 kernel: [  187.755489] RIP: 0010:[<ffffffff81123d7e>]  [<ffffffff81123d7e>] shrink_dcache_for_umount_subtree+0x1ee/0x200
Oct  5 13:07:35 kernel: [  187.755612] RSP: 0018:ffff8805f8d23d88  EFLAGS: 00010292
Oct  5 13:07:35 kernel: [  187.755672] RAX: 000000000000005d RBX: ffff8805fdd3b2c0 RCX: 0000000000000086
Oct  5 13:07:35 kernel: [  187.755735] RDX: 000000000000004f RSI: 0000000000000046 RDI: 0000000000000246
Oct  5 13:07:35 kernel: [  187.755798] RBP: ffff8805f8d23da8 R08: 0000000000000000 R09: 0000000000000000
Oct  5 13:07:35 kernel: [  187.755862] R10: ffff8806004c06c0 R11: 0000000000000000 R12: ffff88061cf58e00
Oct  5 13:07:35 kernel: [  187.755925] R13: ffffffff81904f80 R14: ffff8805fe890000 R15: ffff8805fe890000
Oct  5 13:07:35 kernel: [  187.755988] FS:  00007f74dac07730(0000) GS:ffff88061fc80000(0000) knlGS:0000000000000000
Oct  5 13:07:35 kernel: [  187.756067] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Oct  5 13:07:35 kernel: [  187.756128] CR2: 00007f74da2d4ad5 CR3: 00000005f8c92000 CR4: 00000000000007e0
Oct  5 13:07:35 kernel: [  187.756191] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct  5 13:07:35 kernel: [  187.756254] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct  5 13:07:35 kernel: [  187.756318] Process umount (pid: 5864, threadinfo ffff8805f8d22000, task ffff8805f8d745c0)
Oct  5 13:07:35 kernel: [  187.756397] Stack:
Oct  5 13:07:35 kernel: [  187.756450]  ffff8805fe890350 ffffffff8112083a ffff8805fe890000 ffff8806004e13c0
Oct  5 13:07:35 kernel: [  187.756671]  ffff8805f8d23dc8 ffffffff81123dc2 ffff88061cf58ea0 ffff8805fe890000
Oct  5 13:07:35 kernel: [  187.756892]  ffff8805f8d23df8 ffffffff8110ded7 ffff88061cf588c0 0000000000000011
Oct  5 13:07:35 kernel: [  187.757114] Call Trace:
Oct  5 13:07:35 kernel: [  187.757172]  [<ffffffff8112083a>] ? try_to_ascend+0x3a/0x90
Oct  5 13:07:35 kernel: [  187.757234]  [<ffffffff81123dc2>] shrink_dcache_for_umount+0x32/0x60
Oct  5 13:07:35 kernel: [  187.757297]  [<ffffffff8110ded7>] generic_shutdown_super+0x27/0xe0
Oct  5 13:07:35 kernel: [  187.757360]  [<ffffffff8110e021>] kill_anon_super+0x11/0x20
Oct  5 13:07:35 kernel: [  187.757421]  [<ffffffff8110e052>] kill_litter_super+0x22/0x30
Oct  5 13:07:35 kernel: [  187.757484]  [<ffffffff815bab14>] rpc_kill_sb+0xb4/0xd0
Oct  5 13:07:35 kernel: [  187.757545]  [<ffffffff8110e2e5>] deactivate_locked_super+0x45/0x70
Oct  5 13:07:35 kernel: [  187.757608]  [<ffffffff8110f1f1>] deactivate_super+0x61/0x70
Oct  5 13:07:35 kernel: [  187.757669]  [<ffffffff81129f67>] mntput_no_expire+0x147/0x1b0
Oct  5 13:07:35 kernel: [  187.757731]  [<ffffffff8112ad66>] sys_umount+0x76/0x3a0
Oct  5 13:07:35 kernel: [  187.757792]  [<ffffffff8160b839>] system_call_fastpath+0x16/0x1b
Oct  5 13:07:35 kernel: [  187.757854] Code: 8b 08 48 8b 43 30 48 85 c0 74 04 48 8b 50 40 48 8d 86 50 03 00 00 48 c7 c7 88 9d a2 81 48 89 de 48 89 04 24 31 c0 e8 b9 e2 4d 00 <0f> 0b eb fe 0f 0b eb fe 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 
Oct  5 13:07:35 kernel: [  187.760310] RIP  [<ffffffff81123d7e>] shrink_dcache_for_umount_subtree+0x1ee/0x200
Oct  5 13:07:35 kernel: [  187.760425]  RSP <ffff8805f8d23d88>
Oct  5 13:07:35 kernel: [  187.760483] ---[ end trace 780521204c9e167e ]---
Oct  5 13:07:38 kernel: [  190.408072] nfsd: last server has exited, flushing export cache
Comment 1 Cyril B. 2012-11-08 14:28:28 UTC
Happens on 3.4.18 as well.
Comment 2 Cyril B. 2012-11-08 15:19:35 UTC
I don't get the bug on 3.3.8, and still get it on 3.4.4. Anything I can do to help?
Comment 3 Trond Myklebust 2012-11-08 15:21:24 UTC
Created attachment 85941 [details]
SUNRPC: Fix validity issues with rpc_pipefs sb->s_fs_info
Comment 4 Trond Myklebust 2012-11-08 15:23:14 UTC
Does the above patch help? It fixes a couple of pointer validity issues
that may be pertinent to the above Oops.
Comment 5 Trond Myklebust 2012-11-08 16:50:07 UTC
Created attachment 85951 [details]
[PATCH v2] SUNRPC: Fix validity issues with rpc_pipefs sb->s_fs_info

The previous patch had a theoretical race if 2 processes within the same
net namespace are trying to mount the rpc_pipefs filesystem, and the first
attempt fails.
Comment 6 Cyril B. 2012-11-08 18:33:14 UTC
Thanks, I can confirm that it fixes the problem (tested on 3.4.18).
Comment 7 Trond Myklebust 2012-11-08 19:07:33 UTC
Thanks for testing! Was this the second patch that you tested?
Comment 8 Cyril B. 2012-11-08 19:38:56 UTC
Yes, I didn't even try the first one.
Comment 9 Trond Myklebust 2012-11-08 19:48:30 UTC
Perfect. Thanks again!

Marking this bug as resolved. I will close it once the patch is upstream.
Comment 10 Cyril B. 2012-11-09 14:13:25 UTC
Whoa, I've just hit the bug again. Either I tested something wrong yesterday, or the bug disappeared for no reason and reappeared.

I'll have to recompile kernels and retry everything. I'm very sorry.
Comment 11 Cyril B. 2012-11-09 14:49:50 UTC
Okay, so the bug is definitely not fixed, but it doesn't happen every single time I reboot, as I initially thought. When I tested the patched kernel yesterday for the first time, the bug didn't occur, which is why I thought it was fixed.

I've just rebooted a patched kernel (3.4.18) several times in a row, and sometimes I get the bug, sometimes I don't. I couldn't find a pattern; all I'm doing is logging in over SSH and typing 'reboot'.

More details: I'm on a Debian Lenny with the following versions:

portmap 6.0-9
nfs-common 1:1.2.0-2
nfs-kernel-server 1:1.2.0-2

This machine is both an NFS server and client. It actually mounts 4 different NFSv4 filesystems, and I get:

NFS: __rpc_pipefs_event: failed to remove clnt dir!

actually 4 times before the stack trace appears.
Comment 12 Cyril B. 2012-11-09 15:24:34 UTC
I think I have found a pattern: since I normally use kexec, 'reboot' will actually do a kexec-reboot.

After a cold reboot (going through POST), a 'reboot' will make the kernel crash.
After a kexec-reboot, a 'reboot' won't hit the bug.

Manually umounting the NFS filesystems before rebooting will always avoid the bug.

I have no idea why kexec doesn't trigger the bug. Maybe because of delays? A kexec-reboot is very fast (~ 20 seconds), whereas a cold reboot is longer (~ 2 minutes).

At least I have a way to reliably test new patches (unless my pattern theory is pure luck, but I have tested it more than 4 times).
Comment 13 Cyril B. 2012-11-29 12:30:54 UTC
Oops, I hadn't changed the status.
Comment 14 Cyril B. 2012-12-03 23:28:15 UTC
I've finally managed to bisect it, here's the culprit:

7aab449e5a2ebfa9c5116e87e16536bc4195e4de
SUNRPC: skip clients with program without PipeFS entries
Comment 15 Stanislav Kinsbursky 2012-12-12 10:05:05 UTC
This is most probably a race...
Could you provide more crash traces here?
Comment 16 Cyril B. 2012-12-12 10:32:04 UTC
Traces look quite similar to me. Here are a few more, from my bisect session:

Dec  3 11:10:54 spare1 kernel: [   83.562889] Kernel BUG at ffffffff81119c1e [verbose debug info unavailable]
Dec  3 11:10:54 spare1 kernel: [   83.562961] invalid opcode: 0000 [#1] SMP 
Dec  3 11:10:54 spare1 kernel: [   83.563119] CPU 5 
Dec  3 11:10:54 spare1 kernel: [   83.563169] Modules linked in: xt_NFLOG nfnetlink_log nfnetlink nf_conntrack_ftp nf_conntrack_ipv6 nf_defrag_ipv6 xt_recent n
f_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_multiport sb_edac edac_core i2c_i801 coretemp hwmon ioatdma usbhid ehci_hcd usbcore usb_common netcons
ole igb dca [last unloaded: scsi_wait_scan]
Dec  3 11:10:54 spare1 kernel: [   83.564621] 
Dec  3 11:10:54 spare1 kernel: [   83.564686] Pid: 13083, comm: umount Not tainted 3.4.18-alwaysdata #1 Supermicro X9SRE/X9SRE-3F/X9SRi/X9SRi-3F/X9SRE/X9SRE-3F
/X9SRi/X9SRi-3F
Dec  3 11:10:54 spare1 kernel: [   83.564917] RIP: 0010:[<ffffffff81119c1e>]  [<ffffffff81119c1e>] shrink_dcache_for_umount_subtree+0x1de/0x1f0
Dec  3 11:10:54 spare1 kernel: [   83.565056] RSP: 0018:ffff88102d6fbd78  EFLAGS: 00010296
Dec  3 11:10:54 spare1 kernel: [   83.565124] RAX: 0000000000000073 RBX: ffff881078cffcc0 RCX: 0000000000000082
Dec  3 11:10:54 spare1 kernel: [   83.565196] RDX: 000000000000006b RSI: 0000000000000046 RDI: 0000000000000246
Dec  3 11:10:54 spare1 kernel: [   83.565267] RBP: ffff88102d6fbd98 R08: 0000000000000000 R09: 0000000000000000
Dec  3 11:10:54 spare1 kernel: [   83.565338] R10: 0000000000000000 R11: 0000000000000000 R12: ffff881078f84c80
Dec  3 11:10:54 spare1 kernel: [   83.565409] R13: ffffffff81701cc0 R14: ffff88102e7ca400 R15: ffff88102e7ca400
Dec  3 11:10:54 spare1 kernel: [   83.565481] FS:  00007fcec3b19730(0000) GS:ffff88107fd40000(0000) knlGS:0000000000000000
Dec  3 11:10:54 spare1 kernel: [   83.565567] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec  3 11:10:54 spare1 kernel: [   83.565637] CR2: 00007fcec31e6ad5 CR3: 0000001027ced000 CR4: 00000000000407e0
Dec  3 11:10:54 spare1 kernel: [   83.565708] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec  3 11:10:54 spare1 kernel: [   83.565779] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec  3 11:10:54 spare1 kernel: [   83.565851] Process umount (pid: 13083, threadinfo ffff88102d6fa000, task ffff88102c19e840)
Dec  3 11:10:54 spare1 kernel: [   83.565938] Stack:
Dec  3 11:10:54 spare1 kernel: [   83.566001]  ffff88102e7ca6c0 ffffffff8111625a ffff88102e7ca400 ffff88102fede6f0
Dec  3 11:10:54 spare1 kernel: [   83.566266]  ffff88102d6fbdb8 ffffffff81119c62 ffff881078f84d20 ffff88102e7ca400
Dec  3 11:10:54 spare1 kernel: [   83.566530]  ffff88102d6fbde8 ffffffff81104577 ffff881078ccd200 0000000000000011
Dec  3 11:10:54 spare1 kernel: [   83.566795] Call Trace:
Dec  3 11:10:54 spare1 kernel: [   83.566863]  [<ffffffff8111625a>] ? try_to_ascend+0x3a/0x90
Dec  3 11:10:54 spare1 kernel: [   83.566933]  [<ffffffff81119c62>] shrink_dcache_for_umount+0x32/0x60
Dec  3 11:10:54 spare1 kernel: [   83.567006]  [<ffffffff81104577>] generic_shutdown_super+0x27/0xe0
Dec  3 11:10:54 spare1 kernel: [   83.567076]  [<ffffffff811046c1>] kill_anon_super+0x11/0x20
Dec  3 11:10:54 spare1 kernel: [   83.567144]  [<ffffffff811046f2>] kill_litter_super+0x22/0x30
Dec  3 11:10:54 spare1 kernel: [   83.567216]  [<ffffffff8159c09e>] rpc_kill_sb+0x6e/0xf0
Dec  3 11:10:54 spare1 kernel: [   83.567284]  [<ffffffff811049a5>] deactivate_locked_super+0x45/0x70
Dec  3 11:10:54 spare1 kernel: [   83.567354]  [<ffffffff81105721>] deactivate_super+0x61/0x70
Dec  3 11:10:54 spare1 kernel: [   83.567425]  [<ffffffff8111fe5f>] mntput_no_expire+0x16f/0x1b0
Dec  3 11:10:54 spare1 kernel: [   83.567495]  [<ffffffff81120c36>] sys_umount+0x76/0x380
Dec  3 11:10:54 spare1 kernel: [   83.567566]  [<ffffffff815eb939>] system_call_fastpath+0x16/0x1b
Dec  3 11:10:54 spare1 kernel: [   83.567635] Code: 8b 08 48 8b 43 30 48 85 c0 74 04 48 8b 50 40 48 8d 86 c0 02 00 00 48 c7 c7 a8 0b 82 81 48 89 de 48 89 04 24 31 c0 e8 0f 86 4c 00 <0f> 0b eb fe 0f 0b eb fe 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 
Dec  3 11:10:54 spare1 kernel: [   83.570649] RIP  [<ffffffff81119c1e>] shrink_dcache_for_umount_subtree+0x1de/0x1f0
Dec  3 11:10:54 spare1 kernel: [   83.570780]  RSP <ffff88102d6fbd78>
Dec  3 11:10:54 spare1 kernel: [   83.570865] ---[ end trace 5aed5d79590e2681 ]---


Dec  3 18:11:09 spare1 kernel: [ 1142.183765] Kernel BUG at ffffffff811192ee [verbose debug info unavailable]
Dec  3 18:11:09 spare1 kernel: [ 1142.183836] invalid opcode: 0000 [#1] SMP 
Dec  3 18:11:09 spare1 kernel: [ 1142.183996] CPU 4 
Dec  3 18:11:09 spare1 kernel: [ 1142.184045] Modules linked in: xt_recent nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_multiport coretemp hwmon u
sbhid sb_edac edac_core ioatdma i2c_i801 ehci_hcd usbcore usb_common netconsole igb dca [last unloaded: scsi_wait_scan]
Dec  3 18:11:09 spare1 kernel: [ 1142.185251] 
Dec  3 18:11:09 spare1 kernel: [ 1142.185315] Pid: 15410, comm: umount Not tainted 3.4.0-rc4-alwaysdata-bug-48421-5+ #5 Supermicro X9SRE/X9SRE-3F/X9SRi/X9SRi-3
F/X9SRE/X9SRE-3F/X9SRi/X9SRi-3F
Dec  3 18:11:09 spare1 kernel: [ 1142.185545] RIP: 0010:[<ffffffff811192ee>]  [<ffffffff811192ee>] shrink_dcache_for_umount_subtree+0x1de/0x1f0
Dec  3 18:11:09 spare1 kernel: [ 1142.185684] RSP: 0018:ffff88102c5e1d88  EFLAGS: 00010292
Dec  3 18:11:09 spare1 kernel: [ 1142.185751] RAX: 0000000000000073 RBX: ffff88102c871b40 RCX: 0000000000000082
Dec  3 18:11:09 spare1 kernel: [ 1142.185822] RDX: 00000000000000b2 RSI: 0000000000000046 RDI: 0000000000000246
Dec  3 18:11:09 spare1 kernel: [ 1142.185893] RBP: ffff88102c5e1da8 R08: 0000000000000000 R09: 0000000000000000
Dec  3 18:11:09 spare1 kernel: [ 1142.185963] R10: 0000000000000000 R11: 0000000000000000 R12: ffff881078f6dc00
Dec  3 18:11:09 spare1 kernel: [ 1142.186034] R13: ffffffff81700000 R14: ffff88102ba3a400 R15: ffff88102e921ca0
Dec  3 18:11:09 spare1 kernel: [ 1142.186105] FS:  00007fb2bd9ac730(0000) GS:ffff88107fd00000(0000) knlGS:0000000000000000
Dec  3 18:11:09 spare1 kernel: [ 1142.186191] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec  3 18:11:09 spare1 kernel: [ 1142.186260] CR2: 00007fb2bd079ad5 CR3: 000000102d45f000 CR4: 00000000000407e0
Dec  3 18:11:09 spare1 kernel: [ 1142.186330] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec  3 18:11:09 spare1 kernel: [ 1142.186401] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec  3 18:11:09 spare1 kernel: [ 1142.186472] Process umount (pid: 15410, threadinfo ffff88102c5e0000, task ffff88102fe78900)
Dec  3 18:11:09 spare1 kernel: [ 1142.186558] Stack:
Dec  3 18:11:09 spare1 kernel: [ 1142.186620]  ffff88102ba3a6c0 ffffffff8111599a ffff88102ba3a400 ffff88102fd9b6c0
Dec  3 18:11:09 spare1 kernel: [ 1142.186887]  ffff88102c5e1dc8 ffffffff81119332 ffff881078f6dca0 ffff88102ba3a400
Dec  3 18:11:09 spare1 kernel: [ 1142.187204]  ffff88102c5e1df8 ffffffff81103d27 ffff881078f2db40 0000000000000011
Dec  3 18:11:09 spare1 kernel: [ 1142.187465] Call Trace:
Dec  3 18:11:09 spare1 kernel: [ 1142.187531]  [<ffffffff8111599a>] ? try_to_ascend+0x3a/0x90
Dec  3 18:11:09 spare1 kernel: [ 1142.187600]  [<ffffffff81119332>] shrink_dcache_for_umount+0x32/0x60
Dec  3 18:11:09 spare1 kernel: [ 1142.187672]  [<ffffffff81103d27>] generic_shutdown_super+0x27/0xe0
Dec  3 18:11:09 spare1 kernel: [ 1142.187741]  [<ffffffff81103e71>] kill_anon_super+0x11/0x20
Dec  3 18:11:09 spare1 kernel: [ 1142.187810]  [<ffffffff81103ea2>] kill_litter_super+0x22/0x30
Dec  3 18:11:09 spare1 kernel: [ 1142.187880]  [<ffffffff81589b4c>] rpc_kill_sb+0x7c/0xa0
Dec  3 18:11:09 spare1 kernel: [ 1142.187948]  [<ffffffff81104155>] deactivate_locked_super+0x45/0x70
Dec  3 18:11:09 spare1 kernel: [ 1142.188017]  [<ffffffff81104ed1>] deactivate_super+0x61/0x70
Dec  3 18:11:09 spare1 kernel: [ 1142.188088]  [<ffffffff8111f4af>] mntput_no_expire+0x16f/0x1b0
Dec  3 18:11:09 spare1 kernel: [ 1142.188157]  [<ffffffff81120286>] sys_umount+0x76/0x380
Dec  3 18:11:09 spare1 kernel: [ 1142.188228]  [<ffffffff815d8539>] system_call_fastpath+0x16/0x1b
Dec  3 18:11:09 spare1 kernel: [ 1142.188297] Code: 8b 08 48 8b 43 30 48 85 c0 74 04 48 8b 50 40 48 8d 86 c0 02 00 00 48 c7 c7 80 d4 81 81 48 89 de 48 89 04 24 31 c0 e8 df 5a 4b 00 <0f> 0b eb fe 0f 0b eb fe 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 
Dec  3 18:11:09 spare1 kernel: [ 1142.191314] RIP  [<ffffffff811192ee>] shrink_dcache_for_umount_subtree+0x1de/0x1f0
Dec  3 18:11:09 spare1 kernel: [ 1142.191445]  RSP <ffff88102c5e1d88>


Dec  3 18:17:06 spare1 kernel: [  148.145579] Kernel BUG at ffffffff811192ee [verbose debug info unavailable]
Dec  3 18:17:06 spare1 kernel: [  148.145650] invalid opcode: 0000 [#1] SMP 
Dec  3 18:17:06 spare1 kernel: [  148.145806] CPU 4 
Dec  3 18:17:06 spare1 kernel: [  148.145854] Modules linked in: xt_recent nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_multiport coretemp sb_edac
 hwmon edac_core i2c_i801 usbhid ioatdma ehci_hcd usbcore usb_common netconsole igb dca [last unloaded: scsi_wait_scan]
Dec  3 18:17:06 spare1 kernel: [  148.147006] 
Dec  3 18:17:06 spare1 kernel: [  148.147070] Pid: 13090, comm: umount Not tainted 3.4.0-rc4-alwaysdata-bug-48421-6+ #6 Supermicro X9SRE/X9SRE-3F/X9SRi/X9SRi-3
F/X9SRE/X9SRE-3F/X9SRi/X9SRi-3F
Dec  3 18:17:06 spare1 kernel: [  148.147302] RIP: 0010:[<ffffffff811192ee>]  [<ffffffff811192ee>] shrink_dcache_for_umount_subtree+0x1de/0x1f0
Dec  3 18:17:06 spare1 kernel: [  148.147443] RSP: 0018:ffff88102b6c7d88  EFLAGS: 00010292
Dec  3 18:17:06 spare1 kernel: [  148.147512] RAX: 0000000000000073 RBX: ffff88102c519840 RCX: 0000000000000082
Dec  3 18:17:06 spare1 kernel: [  148.147583] RDX: 000000000000009e RSI: 0000000000000046 RDI: 0000000000000246
Dec  3 18:17:06 spare1 kernel: [  148.147654] RBP: ffff88102b6c7da8 R08: 0000000000000000 R09: 0000000000000000
Dec  3 18:17:06 spare1 kernel: [  148.147728] R10: 0000000000000000 R11: 0000000000000000 R12: ffff881078c27e00
Dec  3 18:17:06 spare1 kernel: [  148.147799] R13: ffffffff81700000 R14: ffff88102fd24c00 R15: ffff88102e77dce0
Dec  3 18:17:06 spare1 kernel: [  148.147870] FS:  00007f7fefc24730(0000) GS:ffff88107fd00000(0000) knlGS:0000000000000000
Dec  3 18:17:06 spare1 kernel: [  148.147957] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec  3 18:17:06 spare1 kernel: [  148.148026] CR2: 00007f7fef2f1ad5 CR3: 000000102d67f000 CR4: 00000000000407e0
Dec  3 18:17:06 spare1 kernel: [  148.148097] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec  3 18:17:06 spare1 kernel: [  148.148168] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec  3 18:17:06 spare1 kernel: [  148.148239] Process umount (pid: 13090, threadinfo ffff88102b6c6000, task ffff88102cdb6300)
Dec  3 18:17:06 spare1 kernel: [  148.148326] Stack:
Dec  3 18:17:06 spare1 kernel: [  148.148388]  ffff88102fd24ec0 ffffffff8111599a ffff88102fd24c00 ffff88102fd4e6c0
Dec  3 18:17:06 spare1 kernel: [  148.148651]  ffff88102b6c7dc8 ffffffff81119332 ffff881078c27ea0 ffff88102fd24c00
Dec  3 18:17:06 spare1 kernel: [  148.148916]  ffff88102b6c7df8 ffffffff81103d27 ffff881078c2b740 0000000000000011
Dec  3 18:17:06 spare1 kernel: [  148.149189] Call Trace:
Dec  3 18:17:06 spare1 kernel: [  148.149258]  [<ffffffff8111599a>] ? try_to_ascend+0x3a/0x90
Dec  3 18:17:06 spare1 kernel: [  148.149327]  [<ffffffff81119332>] shrink_dcache_for_umount+0x32/0x60
Dec  3 18:17:06 spare1 kernel: [  148.149398]  [<ffffffff81103d27>] generic_shutdown_super+0x27/0xe0
Dec  3 18:17:06 spare1 kernel: [  148.149470]  [<ffffffff81103e71>] kill_anon_super+0x11/0x20
Dec  3 18:17:06 spare1 kernel: [  148.149539]  [<ffffffff81103ea2>] kill_litter_super+0x22/0x30
Dec  3 18:17:06 spare1 kernel: [  148.149610]  [<ffffffff81589b8c>] rpc_kill_sb+0x7c/0xa0
Dec  3 18:17:06 spare1 kernel: [  148.149678]  [<ffffffff81104155>] deactivate_locked_super+0x45/0x70
Dec  3 18:17:06 spare1 kernel: [  148.149748]  [<ffffffff81104ed1>] deactivate_super+0x61/0x70
Dec  3 18:17:06 spare1 kernel: [  148.149818]  [<ffffffff8111f4af>] mntput_no_expire+0x16f/0x1b0
Dec  3 18:17:06 spare1 kernel: [  148.149888]  [<ffffffff81120286>] sys_umount+0x76/0x380
Dec  3 18:17:06 spare1 kernel: [  148.149959]  [<ffffffff815d8579>] system_call_fastpath+0x16/0x1b
Dec  3 18:17:06 spare1 kernel: [  148.150027] Code: 8b 08 48 8b 43 30 48 85 c0 74 04 48 8b 50 40 48 8d 86 c0 02 00 00 48 c7 c7 a8 d4 81 81 48 89 de 48 89 04 24 31 c0 e8 1f 5b 4b 00 <0f> 0b eb fe 0f 0b eb fe 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 
Dec  3 18:17:06 spare1 kernel: [  148.153054] RIP  [<ffffffff811192ee>] shrink_dcache_for_umount_subtree+0x1de/0x1f0
Dec  3 18:17:06 spare1 kernel: [  148.153186]  RSP <ffff88102b6c7d88>


Dec  3 18:21:08 spare1 kernel: [  131.681864] Kernel BUG at ffffffff811192ee [verbose debug info unavailable]
Dec  3 18:21:08 spare1 kernel: [  131.681936] invalid opcode: 0000 [#1] SMP 
Dec  3 18:21:08 spare1 kernel: [  131.682094] CPU 6 
Dec  3 18:21:08 spare1 kernel: [  131.682144] Modules linked in: ehci_hcd usbcore usb_common netconsole igb dca [last unloaded: scsi_wait_scan]
Dec  3 18:21:08 spare1 kernel: [  131.682655] 
Dec  3 18:21:08 spare1 kernel: [  131.682718] Pid: 12982, comm: umount Not tainted 3.4.0-rc4-alwaysdata-bug-48421-6+ #6 Supermicro X9SRE/X9SRE-3F/X9SRi/X9SRi-3
F/X9SRE/X9SRE-3F/X9SRi/X9SRi-3F
Dec  3 18:21:08 spare1 kernel: [  131.682948] RIP: 0010:[<ffffffff811192ee>]  [<ffffffff811192ee>] shrink_dcache_for_umount_subtree+0x1de/0x1f0
Dec  3 18:21:08 spare1 kernel: [  131.683090] RSP: 0018:ffff88102b9e7d88  EFLAGS: 00010292
Dec  3 18:21:08 spare1 kernel: [  131.683159] RAX: 0000000000000073 RBX: ffff88102ca7b2c0 RCX: 0000000000000082
Dec  3 18:21:08 spare1 kernel: [  131.683230] RDX: 00000000000000e9 RSI: 0000000000000046 RDI: 0000000000000246
Dec  3 18:21:08 spare1 kernel: [  131.683302] RBP: ffff88102b9e7da8 R08: 0000000000000000 R09: 0000000000000000
Dec  3 18:21:08 spare1 kernel: [  131.683372] R10: 0000000000000000 R11: 0000000000000000 R12: ffff881078f8d600
Dec  3 18:21:08 spare1 kernel: [  131.683444] R13: ffffffff81700000 R14: ffff88102f20e800 R15: ffff881030bcfbe0
Dec  3 18:21:08 spare1 kernel: [  131.683515] FS:  00007ff6970bd730(0000) GS:ffff88107fd80000(0000) knlGS:0000000000000000
Dec  3 18:21:08 spare1 kernel: [  131.683602] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec  3 18:21:08 spare1 kernel: [  131.683673] CR2: 00007ff69678aad5 CR3: 000000102c38e000 CR4: 00000000000407e0
Dec  3 18:21:08 spare1 kernel: [  131.683744] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec  3 18:21:08 spare1 kernel: [  131.683815] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec  3 18:21:08 spare1 kernel: [  131.683886] Process umount (pid: 12982, threadinfo ffff88102b9e6000, task ffff88102bf84140)
Dec  3 18:21:08 spare1 kernel: [  131.683973] Stack:
Dec  3 18:21:08 spare1 kernel: [  131.684035]  ffff88102f20eac0 ffffffff8111599a ffff88102f20e800 ffff88102fee26c0
Dec  3 18:21:08 spare1 kernel: [  131.684297]  ffff88102b9e7dc8 ffffffff81119332 ffff881078f8d6a0 ffff88102f20e800
Dec  3 18:21:08 spare1 kernel: [  131.684560]  ffff88102b9e7df8 ffffffff81103d27 ffff881078c56200 0000000000000011
Dec  3 18:21:08 spare1 kernel: [  131.684823] Call Trace:
Dec  3 18:21:08 spare1 kernel: [  131.684889]  [<ffffffff8111599a>] ? try_to_ascend+0x3a/0x90
Dec  3 18:21:08 spare1 kernel: [  131.684957]  [<ffffffff81119332>] shrink_dcache_for_umount+0x32/0x60
Dec  3 18:21:08 spare1 kernel: [  131.685030]  [<ffffffff81103d27>] generic_shutdown_super+0x27/0xe0
Dec  3 18:21:08 spare1 kernel: [  131.685100]  [<ffffffff81103e71>] kill_anon_super+0x11/0x20
Dec  3 18:21:08 spare1 kernel: [  131.685171]  [<ffffffff81103ea2>] kill_litter_super+0x22/0x30
Dec  3 18:21:08 spare1 kernel: [  131.685243]  [<ffffffff81589b8c>] rpc_kill_sb+0x7c/0xa0
Dec  3 18:21:08 spare1 kernel: [  131.685311]  [<ffffffff81104155>] deactivate_locked_super+0x45/0x70
Dec  3 18:21:08 spare1 kernel: [  131.685381]  [<ffffffff81104ed1>] deactivate_super+0x61/0x70
Dec  3 18:21:08 spare1 kernel: [  131.685451]  [<ffffffff8111f4af>] mntput_no_expire+0x16f/0x1b0
Dec  3 18:21:08 spare1 kernel: [  131.685520]  [<ffffffff81120286>] sys_umount+0x76/0x380
Dec  3 18:21:08 spare1 kernel: [  131.685591]  [<ffffffff815d8579>] system_call_fastpath+0x16/0x1b
Dec  3 18:21:08 spare1 kernel: [  131.685660] Code: 8b 08 48 8b 43 30 48 85 c0 74 04 48 8b 50 40 48 8d 86 c0 02 00 00 48 c7 c7 a8 d4 81 81 48 89 de 48 89 04 24 31 c0 e8 1f 5b 4b 00 <0f> 0b eb fe 0f 0b eb fe 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 
Dec  3 18:21:08 spare1 kernel: [  131.688672] RIP  [<ffffffff811192ee>] shrink_dcache_for_umount_subtree+0x1de/0x1f0
Dec  3 18:21:08 spare1 kernel: [  131.688802]  RSP <ffff88102b9e7d88>
Comment 17 Stanislav Kinsbursky 2012-12-12 10:37:47 UTC
Yes, this is understandable.
Could you build a kernel if I'll provide you a debug patch?
Look like we have some tiny races here, which I can't find in the code for now.
If yes, please, specify your kernel sources version: is it still 3.4.4?
Comment 18 Cyril B. 2012-12-12 10:43:03 UTC
Yes you can send me patches. I was working on a 3.4.18, but you can send me patches for other versions if you prefer. I also tested reverting the commit on a recent 3.6, it worked fine.
Comment 19 Stanislav Kinsbursky 2012-12-12 11:21:10 UTC
Created attachment 88971 [details]
SUNRPC: continue over the look of client instead of break on PiepFS event

Please, try this patch (not debug one).
Comment 20 Stanislav Kinsbursky 2012-12-12 11:43:17 UTC
This patch is against current 3.7 kernel.
Comment 21 Cyril B. 2012-12-17 15:19:02 UTC
Yes, it fixes the bug. I first made sure that the vanilla 3.7 still exhibited the bug, and after patching it, I cannot hit it again, despite multiple reboots and coldreboots.
Comment 22 Trond Myklebust 2012-12-17 15:59:56 UTC
OK. Stanislav, can you please resend that patch with a fixed up comment
(so that we know why it is correct)?

Also, should we Cc: stable? If so, which kernel versions are affected?
Comment 23 Stanislav Kinsbursky 2012-12-17 17:22:50 UTC
Patch titled "SUNRPC: continue run over clients list on PipeFS event instead of break" in nfs-lkml.
Comment 24 Florian Mickler 2012-12-22 09:31:13 UTC
A patch referencing this bug report has been merged in Linux v3.8-rc1:

commit 642fe4d00db56d65060ce2fd4c105884414acb16
Author: Trond Myklebust <Trond.Myklebust@netapp.com>
Date:   Thu Nov 8 10:01:26 2012 -0500

    SUNRPC: Fix validity issues with rpc_pipefs sb->s_fs_info