When stopping my server, I get the trace below (every single time). It also happens on 3.5 (I've just tested it), and I'm pretty confident it worked fine in 3.4 and below. The kernel is then stuck (sync is in D state) and I have to hard reboot (although I can continue working on the machine). If I manually stop the services before rebooting (nfs-kernel-server, nfs-common, portmap), I don't get the bug. Thanks, Oct 5 13:07:35 kernel: [ 187.753827] NFS: __rpc_pipefs_event: failed to remove clnt dir! Oct 5 13:07:35 kernel: [ 187.753914] BUG: Dentry ffff8805fdd3b2c0{i=6e,n=info} still in use (1) [unmount of rpc_pipefs rpc_pipefs] Oct 5 13:07:35 kernel: [ 187.754012] ------------[ cut here ]------------ Oct 5 13:07:35 kernel: [ 187.754075] Kernel BUG at ffffffff81123d7e [verbose debug info unavailable] Oct 5 13:07:35 kernel: [ 187.754139] invalid opcode: 0000 [#1] SMP Oct 5 13:07:35 kernel: [ 187.754270] Modules linked in: xt_NFLOG nfnetlink_log nfnetlink nf_conntrack_ftp nf_conntrack_ipv6 nf_defrag_ipv6 xt_recent nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_multiport coretemp ioatdma i2c_i801 hwmon usbhid dca ehci_hcd uhci_hcd usbcore usb_common netconsole e1000e Oct 5 13:07:35 kernel: [ 187.755318] CPU 4 Oct 5 13:07:35 kernel: [ 187.755361] Pid: 5864, comm: umount Not tainted 3.6.0 #2 Supermicro X8STi/X8STi Oct 5 13:07:35 kernel: [ 187.755489] RIP: 0010:[<ffffffff81123d7e>] [<ffffffff81123d7e>] shrink_dcache_for_umount_subtree+0x1ee/0x200 Oct 5 13:07:35 kernel: [ 187.755612] RSP: 0018:ffff8805f8d23d88 EFLAGS: 00010292 Oct 5 13:07:35 kernel: [ 187.755672] RAX: 000000000000005d RBX: ffff8805fdd3b2c0 RCX: 0000000000000086 Oct 5 13:07:35 kernel: [ 187.755735] RDX: 000000000000004f RSI: 0000000000000046 RDI: 0000000000000246 Oct 5 13:07:35 kernel: [ 187.755798] RBP: ffff8805f8d23da8 R08: 0000000000000000 R09: 0000000000000000 Oct 5 13:07:35 kernel: [ 187.755862] R10: ffff8806004c06c0 R11: 0000000000000000 R12: ffff88061cf58e00 Oct 5 13:07:35 kernel: [ 187.755925] R13: ffffffff81904f80 R14: ffff8805fe890000 R15: ffff8805fe890000 Oct 5 13:07:35 kernel: [ 187.755988] FS: 00007f74dac07730(0000) GS:ffff88061fc80000(0000) knlGS:0000000000000000 Oct 5 13:07:35 kernel: [ 187.756067] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Oct 5 13:07:35 kernel: [ 187.756128] CR2: 00007f74da2d4ad5 CR3: 00000005f8c92000 CR4: 00000000000007e0 Oct 5 13:07:35 kernel: [ 187.756191] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Oct 5 13:07:35 kernel: [ 187.756254] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Oct 5 13:07:35 kernel: [ 187.756318] Process umount (pid: 5864, threadinfo ffff8805f8d22000, task ffff8805f8d745c0) Oct 5 13:07:35 kernel: [ 187.756397] Stack: Oct 5 13:07:35 kernel: [ 187.756450] ffff8805fe890350 ffffffff8112083a ffff8805fe890000 ffff8806004e13c0 Oct 5 13:07:35 kernel: [ 187.756671] ffff8805f8d23dc8 ffffffff81123dc2 ffff88061cf58ea0 ffff8805fe890000 Oct 5 13:07:35 kernel: [ 187.756892] ffff8805f8d23df8 ffffffff8110ded7 ffff88061cf588c0 0000000000000011 Oct 5 13:07:35 kernel: [ 187.757114] Call Trace: Oct 5 13:07:35 kernel: [ 187.757172] [<ffffffff8112083a>] ? try_to_ascend+0x3a/0x90 Oct 5 13:07:35 kernel: [ 187.757234] [<ffffffff81123dc2>] shrink_dcache_for_umount+0x32/0x60 Oct 5 13:07:35 kernel: [ 187.757297] [<ffffffff8110ded7>] generic_shutdown_super+0x27/0xe0 Oct 5 13:07:35 kernel: [ 187.757360] [<ffffffff8110e021>] kill_anon_super+0x11/0x20 Oct 5 13:07:35 kernel: [ 187.757421] [<ffffffff8110e052>] kill_litter_super+0x22/0x30 Oct 5 13:07:35 kernel: [ 187.757484] [<ffffffff815bab14>] rpc_kill_sb+0xb4/0xd0 Oct 5 13:07:35 kernel: [ 187.757545] [<ffffffff8110e2e5>] deactivate_locked_super+0x45/0x70 Oct 5 13:07:35 kernel: [ 187.757608] [<ffffffff8110f1f1>] deactivate_super+0x61/0x70 Oct 5 13:07:35 kernel: [ 187.757669] [<ffffffff81129f67>] mntput_no_expire+0x147/0x1b0 Oct 5 13:07:35 kernel: [ 187.757731] [<ffffffff8112ad66>] sys_umount+0x76/0x3a0 Oct 5 13:07:35 kernel: [ 187.757792] [<ffffffff8160b839>] system_call_fastpath+0x16/0x1b Oct 5 13:07:35 kernel: [ 187.757854] Code: 8b 08 48 8b 43 30 48 85 c0 74 04 48 8b 50 40 48 8d 86 50 03 00 00 48 c7 c7 88 9d a2 81 48 89 de 48 89 04 24 31 c0 e8 b9 e2 4d 00 <0f> 0b eb fe 0f 0b eb fe 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 Oct 5 13:07:35 kernel: [ 187.760310] RIP [<ffffffff81123d7e>] shrink_dcache_for_umount_subtree+0x1ee/0x200 Oct 5 13:07:35 kernel: [ 187.760425] RSP <ffff8805f8d23d88> Oct 5 13:07:35 kernel: [ 187.760483] ---[ end trace 780521204c9e167e ]--- Oct 5 13:07:38 kernel: [ 190.408072] nfsd: last server has exited, flushing export cache
Happens on 3.4.18 as well.
I don't get the bug on 3.3.8, and still get it on 3.4.4. Anything I can do to help?
Created attachment 85941 [details] SUNRPC: Fix validity issues with rpc_pipefs sb->s_fs_info
Does the above patch help? It fixes a couple of pointer validity issues that may be pertinent to the above Oops.
Created attachment 85951 [details] [PATCH v2] SUNRPC: Fix validity issues with rpc_pipefs sb->s_fs_info The previous patch had a theoretical race if 2 processes within the same net namespace are trying to mount the rpc_pipefs filesystem, and the first attempt fails.
Thanks, I can confirm that it fixes the problem (tested on 3.4.18).
Thanks for testing! Was this the second patch that you tested?
Yes, I didn't even try the first one.
Perfect. Thanks again! Marking this bug as resolved. I will close it once the patch is upstream.
Whoa, I've just hit the bug again. Either I tested something wrong yesterday, or the bug disappeared for no reason and reappeared. I'll have to recompile kernels and retry everything. I'm very sorry.
Okay, so the bug is definitely not fixed, but it doesn't happen every single time I reboot, as I initially thought. When I tested the patched kernel yesterday for the first time, the bug didn't occur, which is why I thought it was fixed. I've just rebooted a patched kernel (3.4.18) several times in a row, and sometimes I get the bug, sometimes I don't. I couldn't find a pattern; all I'm doing is logging in over SSH and typing 'reboot'. More details: I'm on a Debian Lenny with the following versions: portmap 6.0-9 nfs-common 1:1.2.0-2 nfs-kernel-server 1:1.2.0-2 This machine is both an NFS server and client. It actually mounts 4 different NFSv4 filesystems, and I get: NFS: __rpc_pipefs_event: failed to remove clnt dir! actually 4 times before the stack trace appears.
I think I have found a pattern: since I normally use kexec, 'reboot' will actually do a kexec-reboot. After a cold reboot (going through POST), a 'reboot' will make the kernel crash. After a kexec-reboot, a 'reboot' won't hit the bug. Manually umounting the NFS filesystems before rebooting will always avoid the bug. I have no idea why kexec doesn't trigger the bug. Maybe because of delays? A kexec-reboot is very fast (~ 20 seconds), whereas a cold reboot is longer (~ 2 minutes). At least I have a way to reliably test new patches (unless my pattern theory is pure luck, but I have tested it more than 4 times).
Oops, I hadn't changed the status.
I've finally managed to bisect it, here's the culprit: 7aab449e5a2ebfa9c5116e87e16536bc4195e4de SUNRPC: skip clients with program without PipeFS entries
This is most probably a race... Could you provide more crash traces here?
Traces look quite similar to me. Here are a few more, from my bisect session: Dec 3 11:10:54 spare1 kernel: [ 83.562889] Kernel BUG at ffffffff81119c1e [verbose debug info unavailable] Dec 3 11:10:54 spare1 kernel: [ 83.562961] invalid opcode: 0000 [#1] SMP Dec 3 11:10:54 spare1 kernel: [ 83.563119] CPU 5 Dec 3 11:10:54 spare1 kernel: [ 83.563169] Modules linked in: xt_NFLOG nfnetlink_log nfnetlink nf_conntrack_ftp nf_conntrack_ipv6 nf_defrag_ipv6 xt_recent n f_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_multiport sb_edac edac_core i2c_i801 coretemp hwmon ioatdma usbhid ehci_hcd usbcore usb_common netcons ole igb dca [last unloaded: scsi_wait_scan] Dec 3 11:10:54 spare1 kernel: [ 83.564621] Dec 3 11:10:54 spare1 kernel: [ 83.564686] Pid: 13083, comm: umount Not tainted 3.4.18-alwaysdata #1 Supermicro X9SRE/X9SRE-3F/X9SRi/X9SRi-3F/X9SRE/X9SRE-3F /X9SRi/X9SRi-3F Dec 3 11:10:54 spare1 kernel: [ 83.564917] RIP: 0010:[<ffffffff81119c1e>] [<ffffffff81119c1e>] shrink_dcache_for_umount_subtree+0x1de/0x1f0 Dec 3 11:10:54 spare1 kernel: [ 83.565056] RSP: 0018:ffff88102d6fbd78 EFLAGS: 00010296 Dec 3 11:10:54 spare1 kernel: [ 83.565124] RAX: 0000000000000073 RBX: ffff881078cffcc0 RCX: 0000000000000082 Dec 3 11:10:54 spare1 kernel: [ 83.565196] RDX: 000000000000006b RSI: 0000000000000046 RDI: 0000000000000246 Dec 3 11:10:54 spare1 kernel: [ 83.565267] RBP: ffff88102d6fbd98 R08: 0000000000000000 R09: 0000000000000000 Dec 3 11:10:54 spare1 kernel: [ 83.565338] R10: 0000000000000000 R11: 0000000000000000 R12: ffff881078f84c80 Dec 3 11:10:54 spare1 kernel: [ 83.565409] R13: ffffffff81701cc0 R14: ffff88102e7ca400 R15: ffff88102e7ca400 Dec 3 11:10:54 spare1 kernel: [ 83.565481] FS: 00007fcec3b19730(0000) GS:ffff88107fd40000(0000) knlGS:0000000000000000 Dec 3 11:10:54 spare1 kernel: [ 83.565567] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Dec 3 11:10:54 spare1 kernel: [ 83.565637] CR2: 00007fcec31e6ad5 CR3: 0000001027ced000 CR4: 00000000000407e0 Dec 3 11:10:54 spare1 kernel: [ 83.565708] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Dec 3 11:10:54 spare1 kernel: [ 83.565779] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Dec 3 11:10:54 spare1 kernel: [ 83.565851] Process umount (pid: 13083, threadinfo ffff88102d6fa000, task ffff88102c19e840) Dec 3 11:10:54 spare1 kernel: [ 83.565938] Stack: Dec 3 11:10:54 spare1 kernel: [ 83.566001] ffff88102e7ca6c0 ffffffff8111625a ffff88102e7ca400 ffff88102fede6f0 Dec 3 11:10:54 spare1 kernel: [ 83.566266] ffff88102d6fbdb8 ffffffff81119c62 ffff881078f84d20 ffff88102e7ca400 Dec 3 11:10:54 spare1 kernel: [ 83.566530] ffff88102d6fbde8 ffffffff81104577 ffff881078ccd200 0000000000000011 Dec 3 11:10:54 spare1 kernel: [ 83.566795] Call Trace: Dec 3 11:10:54 spare1 kernel: [ 83.566863] [<ffffffff8111625a>] ? try_to_ascend+0x3a/0x90 Dec 3 11:10:54 spare1 kernel: [ 83.566933] [<ffffffff81119c62>] shrink_dcache_for_umount+0x32/0x60 Dec 3 11:10:54 spare1 kernel: [ 83.567006] [<ffffffff81104577>] generic_shutdown_super+0x27/0xe0 Dec 3 11:10:54 spare1 kernel: [ 83.567076] [<ffffffff811046c1>] kill_anon_super+0x11/0x20 Dec 3 11:10:54 spare1 kernel: [ 83.567144] [<ffffffff811046f2>] kill_litter_super+0x22/0x30 Dec 3 11:10:54 spare1 kernel: [ 83.567216] [<ffffffff8159c09e>] rpc_kill_sb+0x6e/0xf0 Dec 3 11:10:54 spare1 kernel: [ 83.567284] [<ffffffff811049a5>] deactivate_locked_super+0x45/0x70 Dec 3 11:10:54 spare1 kernel: [ 83.567354] [<ffffffff81105721>] deactivate_super+0x61/0x70 Dec 3 11:10:54 spare1 kernel: [ 83.567425] [<ffffffff8111fe5f>] mntput_no_expire+0x16f/0x1b0 Dec 3 11:10:54 spare1 kernel: [ 83.567495] [<ffffffff81120c36>] sys_umount+0x76/0x380 Dec 3 11:10:54 spare1 kernel: [ 83.567566] [<ffffffff815eb939>] system_call_fastpath+0x16/0x1b Dec 3 11:10:54 spare1 kernel: [ 83.567635] Code: 8b 08 48 8b 43 30 48 85 c0 74 04 48 8b 50 40 48 8d 86 c0 02 00 00 48 c7 c7 a8 0b 82 81 48 89 de 48 89 04 24 31 c0 e8 0f 86 4c 00 <0f> 0b eb fe 0f 0b eb fe 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 Dec 3 11:10:54 spare1 kernel: [ 83.570649] RIP [<ffffffff81119c1e>] shrink_dcache_for_umount_subtree+0x1de/0x1f0 Dec 3 11:10:54 spare1 kernel: [ 83.570780] RSP <ffff88102d6fbd78> Dec 3 11:10:54 spare1 kernel: [ 83.570865] ---[ end trace 5aed5d79590e2681 ]--- Dec 3 18:11:09 spare1 kernel: [ 1142.183765] Kernel BUG at ffffffff811192ee [verbose debug info unavailable] Dec 3 18:11:09 spare1 kernel: [ 1142.183836] invalid opcode: 0000 [#1] SMP Dec 3 18:11:09 spare1 kernel: [ 1142.183996] CPU 4 Dec 3 18:11:09 spare1 kernel: [ 1142.184045] Modules linked in: xt_recent nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_multiport coretemp hwmon u sbhid sb_edac edac_core ioatdma i2c_i801 ehci_hcd usbcore usb_common netconsole igb dca [last unloaded: scsi_wait_scan] Dec 3 18:11:09 spare1 kernel: [ 1142.185251] Dec 3 18:11:09 spare1 kernel: [ 1142.185315] Pid: 15410, comm: umount Not tainted 3.4.0-rc4-alwaysdata-bug-48421-5+ #5 Supermicro X9SRE/X9SRE-3F/X9SRi/X9SRi-3 F/X9SRE/X9SRE-3F/X9SRi/X9SRi-3F Dec 3 18:11:09 spare1 kernel: [ 1142.185545] RIP: 0010:[<ffffffff811192ee>] [<ffffffff811192ee>] shrink_dcache_for_umount_subtree+0x1de/0x1f0 Dec 3 18:11:09 spare1 kernel: [ 1142.185684] RSP: 0018:ffff88102c5e1d88 EFLAGS: 00010292 Dec 3 18:11:09 spare1 kernel: [ 1142.185751] RAX: 0000000000000073 RBX: ffff88102c871b40 RCX: 0000000000000082 Dec 3 18:11:09 spare1 kernel: [ 1142.185822] RDX: 00000000000000b2 RSI: 0000000000000046 RDI: 0000000000000246 Dec 3 18:11:09 spare1 kernel: [ 1142.185893] RBP: ffff88102c5e1da8 R08: 0000000000000000 R09: 0000000000000000 Dec 3 18:11:09 spare1 kernel: [ 1142.185963] R10: 0000000000000000 R11: 0000000000000000 R12: ffff881078f6dc00 Dec 3 18:11:09 spare1 kernel: [ 1142.186034] R13: ffffffff81700000 R14: ffff88102ba3a400 R15: ffff88102e921ca0 Dec 3 18:11:09 spare1 kernel: [ 1142.186105] FS: 00007fb2bd9ac730(0000) GS:ffff88107fd00000(0000) knlGS:0000000000000000 Dec 3 18:11:09 spare1 kernel: [ 1142.186191] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Dec 3 18:11:09 spare1 kernel: [ 1142.186260] CR2: 00007fb2bd079ad5 CR3: 000000102d45f000 CR4: 00000000000407e0 Dec 3 18:11:09 spare1 kernel: [ 1142.186330] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Dec 3 18:11:09 spare1 kernel: [ 1142.186401] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Dec 3 18:11:09 spare1 kernel: [ 1142.186472] Process umount (pid: 15410, threadinfo ffff88102c5e0000, task ffff88102fe78900) Dec 3 18:11:09 spare1 kernel: [ 1142.186558] Stack: Dec 3 18:11:09 spare1 kernel: [ 1142.186620] ffff88102ba3a6c0 ffffffff8111599a ffff88102ba3a400 ffff88102fd9b6c0 Dec 3 18:11:09 spare1 kernel: [ 1142.186887] ffff88102c5e1dc8 ffffffff81119332 ffff881078f6dca0 ffff88102ba3a400 Dec 3 18:11:09 spare1 kernel: [ 1142.187204] ffff88102c5e1df8 ffffffff81103d27 ffff881078f2db40 0000000000000011 Dec 3 18:11:09 spare1 kernel: [ 1142.187465] Call Trace: Dec 3 18:11:09 spare1 kernel: [ 1142.187531] [<ffffffff8111599a>] ? try_to_ascend+0x3a/0x90 Dec 3 18:11:09 spare1 kernel: [ 1142.187600] [<ffffffff81119332>] shrink_dcache_for_umount+0x32/0x60 Dec 3 18:11:09 spare1 kernel: [ 1142.187672] [<ffffffff81103d27>] generic_shutdown_super+0x27/0xe0 Dec 3 18:11:09 spare1 kernel: [ 1142.187741] [<ffffffff81103e71>] kill_anon_super+0x11/0x20 Dec 3 18:11:09 spare1 kernel: [ 1142.187810] [<ffffffff81103ea2>] kill_litter_super+0x22/0x30 Dec 3 18:11:09 spare1 kernel: [ 1142.187880] [<ffffffff81589b4c>] rpc_kill_sb+0x7c/0xa0 Dec 3 18:11:09 spare1 kernel: [ 1142.187948] [<ffffffff81104155>] deactivate_locked_super+0x45/0x70 Dec 3 18:11:09 spare1 kernel: [ 1142.188017] [<ffffffff81104ed1>] deactivate_super+0x61/0x70 Dec 3 18:11:09 spare1 kernel: [ 1142.188088] [<ffffffff8111f4af>] mntput_no_expire+0x16f/0x1b0 Dec 3 18:11:09 spare1 kernel: [ 1142.188157] [<ffffffff81120286>] sys_umount+0x76/0x380 Dec 3 18:11:09 spare1 kernel: [ 1142.188228] [<ffffffff815d8539>] system_call_fastpath+0x16/0x1b Dec 3 18:11:09 spare1 kernel: [ 1142.188297] Code: 8b 08 48 8b 43 30 48 85 c0 74 04 48 8b 50 40 48 8d 86 c0 02 00 00 48 c7 c7 80 d4 81 81 48 89 de 48 89 04 24 31 c0 e8 df 5a 4b 00 <0f> 0b eb fe 0f 0b eb fe 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 Dec 3 18:11:09 spare1 kernel: [ 1142.191314] RIP [<ffffffff811192ee>] shrink_dcache_for_umount_subtree+0x1de/0x1f0 Dec 3 18:11:09 spare1 kernel: [ 1142.191445] RSP <ffff88102c5e1d88> Dec 3 18:17:06 spare1 kernel: [ 148.145579] Kernel BUG at ffffffff811192ee [verbose debug info unavailable] Dec 3 18:17:06 spare1 kernel: [ 148.145650] invalid opcode: 0000 [#1] SMP Dec 3 18:17:06 spare1 kernel: [ 148.145806] CPU 4 Dec 3 18:17:06 spare1 kernel: [ 148.145854] Modules linked in: xt_recent nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_multiport coretemp sb_edac hwmon edac_core i2c_i801 usbhid ioatdma ehci_hcd usbcore usb_common netconsole igb dca [last unloaded: scsi_wait_scan] Dec 3 18:17:06 spare1 kernel: [ 148.147006] Dec 3 18:17:06 spare1 kernel: [ 148.147070] Pid: 13090, comm: umount Not tainted 3.4.0-rc4-alwaysdata-bug-48421-6+ #6 Supermicro X9SRE/X9SRE-3F/X9SRi/X9SRi-3 F/X9SRE/X9SRE-3F/X9SRi/X9SRi-3F Dec 3 18:17:06 spare1 kernel: [ 148.147302] RIP: 0010:[<ffffffff811192ee>] [<ffffffff811192ee>] shrink_dcache_for_umount_subtree+0x1de/0x1f0 Dec 3 18:17:06 spare1 kernel: [ 148.147443] RSP: 0018:ffff88102b6c7d88 EFLAGS: 00010292 Dec 3 18:17:06 spare1 kernel: [ 148.147512] RAX: 0000000000000073 RBX: ffff88102c519840 RCX: 0000000000000082 Dec 3 18:17:06 spare1 kernel: [ 148.147583] RDX: 000000000000009e RSI: 0000000000000046 RDI: 0000000000000246 Dec 3 18:17:06 spare1 kernel: [ 148.147654] RBP: ffff88102b6c7da8 R08: 0000000000000000 R09: 0000000000000000 Dec 3 18:17:06 spare1 kernel: [ 148.147728] R10: 0000000000000000 R11: 0000000000000000 R12: ffff881078c27e00 Dec 3 18:17:06 spare1 kernel: [ 148.147799] R13: ffffffff81700000 R14: ffff88102fd24c00 R15: ffff88102e77dce0 Dec 3 18:17:06 spare1 kernel: [ 148.147870] FS: 00007f7fefc24730(0000) GS:ffff88107fd00000(0000) knlGS:0000000000000000 Dec 3 18:17:06 spare1 kernel: [ 148.147957] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Dec 3 18:17:06 spare1 kernel: [ 148.148026] CR2: 00007f7fef2f1ad5 CR3: 000000102d67f000 CR4: 00000000000407e0 Dec 3 18:17:06 spare1 kernel: [ 148.148097] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Dec 3 18:17:06 spare1 kernel: [ 148.148168] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Dec 3 18:17:06 spare1 kernel: [ 148.148239] Process umount (pid: 13090, threadinfo ffff88102b6c6000, task ffff88102cdb6300) Dec 3 18:17:06 spare1 kernel: [ 148.148326] Stack: Dec 3 18:17:06 spare1 kernel: [ 148.148388] ffff88102fd24ec0 ffffffff8111599a ffff88102fd24c00 ffff88102fd4e6c0 Dec 3 18:17:06 spare1 kernel: [ 148.148651] ffff88102b6c7dc8 ffffffff81119332 ffff881078c27ea0 ffff88102fd24c00 Dec 3 18:17:06 spare1 kernel: [ 148.148916] ffff88102b6c7df8 ffffffff81103d27 ffff881078c2b740 0000000000000011 Dec 3 18:17:06 spare1 kernel: [ 148.149189] Call Trace: Dec 3 18:17:06 spare1 kernel: [ 148.149258] [<ffffffff8111599a>] ? try_to_ascend+0x3a/0x90 Dec 3 18:17:06 spare1 kernel: [ 148.149327] [<ffffffff81119332>] shrink_dcache_for_umount+0x32/0x60 Dec 3 18:17:06 spare1 kernel: [ 148.149398] [<ffffffff81103d27>] generic_shutdown_super+0x27/0xe0 Dec 3 18:17:06 spare1 kernel: [ 148.149470] [<ffffffff81103e71>] kill_anon_super+0x11/0x20 Dec 3 18:17:06 spare1 kernel: [ 148.149539] [<ffffffff81103ea2>] kill_litter_super+0x22/0x30 Dec 3 18:17:06 spare1 kernel: [ 148.149610] [<ffffffff81589b8c>] rpc_kill_sb+0x7c/0xa0 Dec 3 18:17:06 spare1 kernel: [ 148.149678] [<ffffffff81104155>] deactivate_locked_super+0x45/0x70 Dec 3 18:17:06 spare1 kernel: [ 148.149748] [<ffffffff81104ed1>] deactivate_super+0x61/0x70 Dec 3 18:17:06 spare1 kernel: [ 148.149818] [<ffffffff8111f4af>] mntput_no_expire+0x16f/0x1b0 Dec 3 18:17:06 spare1 kernel: [ 148.149888] [<ffffffff81120286>] sys_umount+0x76/0x380 Dec 3 18:17:06 spare1 kernel: [ 148.149959] [<ffffffff815d8579>] system_call_fastpath+0x16/0x1b Dec 3 18:17:06 spare1 kernel: [ 148.150027] Code: 8b 08 48 8b 43 30 48 85 c0 74 04 48 8b 50 40 48 8d 86 c0 02 00 00 48 c7 c7 a8 d4 81 81 48 89 de 48 89 04 24 31 c0 e8 1f 5b 4b 00 <0f> 0b eb fe 0f 0b eb fe 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 Dec 3 18:17:06 spare1 kernel: [ 148.153054] RIP [<ffffffff811192ee>] shrink_dcache_for_umount_subtree+0x1de/0x1f0 Dec 3 18:17:06 spare1 kernel: [ 148.153186] RSP <ffff88102b6c7d88> Dec 3 18:21:08 spare1 kernel: [ 131.681864] Kernel BUG at ffffffff811192ee [verbose debug info unavailable] Dec 3 18:21:08 spare1 kernel: [ 131.681936] invalid opcode: 0000 [#1] SMP Dec 3 18:21:08 spare1 kernel: [ 131.682094] CPU 6 Dec 3 18:21:08 spare1 kernel: [ 131.682144] Modules linked in: ehci_hcd usbcore usb_common netconsole igb dca [last unloaded: scsi_wait_scan] Dec 3 18:21:08 spare1 kernel: [ 131.682655] Dec 3 18:21:08 spare1 kernel: [ 131.682718] Pid: 12982, comm: umount Not tainted 3.4.0-rc4-alwaysdata-bug-48421-6+ #6 Supermicro X9SRE/X9SRE-3F/X9SRi/X9SRi-3 F/X9SRE/X9SRE-3F/X9SRi/X9SRi-3F Dec 3 18:21:08 spare1 kernel: [ 131.682948] RIP: 0010:[<ffffffff811192ee>] [<ffffffff811192ee>] shrink_dcache_for_umount_subtree+0x1de/0x1f0 Dec 3 18:21:08 spare1 kernel: [ 131.683090] RSP: 0018:ffff88102b9e7d88 EFLAGS: 00010292 Dec 3 18:21:08 spare1 kernel: [ 131.683159] RAX: 0000000000000073 RBX: ffff88102ca7b2c0 RCX: 0000000000000082 Dec 3 18:21:08 spare1 kernel: [ 131.683230] RDX: 00000000000000e9 RSI: 0000000000000046 RDI: 0000000000000246 Dec 3 18:21:08 spare1 kernel: [ 131.683302] RBP: ffff88102b9e7da8 R08: 0000000000000000 R09: 0000000000000000 Dec 3 18:21:08 spare1 kernel: [ 131.683372] R10: 0000000000000000 R11: 0000000000000000 R12: ffff881078f8d600 Dec 3 18:21:08 spare1 kernel: [ 131.683444] R13: ffffffff81700000 R14: ffff88102f20e800 R15: ffff881030bcfbe0 Dec 3 18:21:08 spare1 kernel: [ 131.683515] FS: 00007ff6970bd730(0000) GS:ffff88107fd80000(0000) knlGS:0000000000000000 Dec 3 18:21:08 spare1 kernel: [ 131.683602] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Dec 3 18:21:08 spare1 kernel: [ 131.683673] CR2: 00007ff69678aad5 CR3: 000000102c38e000 CR4: 00000000000407e0 Dec 3 18:21:08 spare1 kernel: [ 131.683744] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Dec 3 18:21:08 spare1 kernel: [ 131.683815] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Dec 3 18:21:08 spare1 kernel: [ 131.683886] Process umount (pid: 12982, threadinfo ffff88102b9e6000, task ffff88102bf84140) Dec 3 18:21:08 spare1 kernel: [ 131.683973] Stack: Dec 3 18:21:08 spare1 kernel: [ 131.684035] ffff88102f20eac0 ffffffff8111599a ffff88102f20e800 ffff88102fee26c0 Dec 3 18:21:08 spare1 kernel: [ 131.684297] ffff88102b9e7dc8 ffffffff81119332 ffff881078f8d6a0 ffff88102f20e800 Dec 3 18:21:08 spare1 kernel: [ 131.684560] ffff88102b9e7df8 ffffffff81103d27 ffff881078c56200 0000000000000011 Dec 3 18:21:08 spare1 kernel: [ 131.684823] Call Trace: Dec 3 18:21:08 spare1 kernel: [ 131.684889] [<ffffffff8111599a>] ? try_to_ascend+0x3a/0x90 Dec 3 18:21:08 spare1 kernel: [ 131.684957] [<ffffffff81119332>] shrink_dcache_for_umount+0x32/0x60 Dec 3 18:21:08 spare1 kernel: [ 131.685030] [<ffffffff81103d27>] generic_shutdown_super+0x27/0xe0 Dec 3 18:21:08 spare1 kernel: [ 131.685100] [<ffffffff81103e71>] kill_anon_super+0x11/0x20 Dec 3 18:21:08 spare1 kernel: [ 131.685171] [<ffffffff81103ea2>] kill_litter_super+0x22/0x30 Dec 3 18:21:08 spare1 kernel: [ 131.685243] [<ffffffff81589b8c>] rpc_kill_sb+0x7c/0xa0 Dec 3 18:21:08 spare1 kernel: [ 131.685311] [<ffffffff81104155>] deactivate_locked_super+0x45/0x70 Dec 3 18:21:08 spare1 kernel: [ 131.685381] [<ffffffff81104ed1>] deactivate_super+0x61/0x70 Dec 3 18:21:08 spare1 kernel: [ 131.685451] [<ffffffff8111f4af>] mntput_no_expire+0x16f/0x1b0 Dec 3 18:21:08 spare1 kernel: [ 131.685520] [<ffffffff81120286>] sys_umount+0x76/0x380 Dec 3 18:21:08 spare1 kernel: [ 131.685591] [<ffffffff815d8579>] system_call_fastpath+0x16/0x1b Dec 3 18:21:08 spare1 kernel: [ 131.685660] Code: 8b 08 48 8b 43 30 48 85 c0 74 04 48 8b 50 40 48 8d 86 c0 02 00 00 48 c7 c7 a8 d4 81 81 48 89 de 48 89 04 24 31 c0 e8 1f 5b 4b 00 <0f> 0b eb fe 0f 0b eb fe 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 Dec 3 18:21:08 spare1 kernel: [ 131.688672] RIP [<ffffffff811192ee>] shrink_dcache_for_umount_subtree+0x1de/0x1f0 Dec 3 18:21:08 spare1 kernel: [ 131.688802] RSP <ffff88102b9e7d88>
Yes, this is understandable. Could you build a kernel if I'll provide you a debug patch? Look like we have some tiny races here, which I can't find in the code for now. If yes, please, specify your kernel sources version: is it still 3.4.4?
Yes you can send me patches. I was working on a 3.4.18, but you can send me patches for other versions if you prefer. I also tested reverting the commit on a recent 3.6, it worked fine.
Created attachment 88971 [details] SUNRPC: continue over the look of client instead of break on PiepFS event Please, try this patch (not debug one).
This patch is against current 3.7 kernel.
Yes, it fixes the bug. I first made sure that the vanilla 3.7 still exhibited the bug, and after patching it, I cannot hit it again, despite multiple reboots and coldreboots.
OK. Stanislav, can you please resend that patch with a fixed up comment (so that we know why it is correct)? Also, should we Cc: stable? If so, which kernel versions are affected?
Patch titled "SUNRPC: continue run over clients list on PipeFS event instead of break" in nfs-lkml.
A patch referencing this bug report has been merged in Linux v3.8-rc1: commit 642fe4d00db56d65060ce2fd4c105884414acb16 Author: Trond Myklebust <Trond.Myklebust@netapp.com> Date: Thu Nov 8 10:01:26 2012 -0500 SUNRPC: Fix validity issues with rpc_pipefs sb->s_fs_info