Bug 20262

Summary: nfsd crashed every 3 minutes with kernel BUG at fs/nfsd/nfsfh.h:199
Product: File System Reporter: Marius Tolzmann (marius)
Component: NFSAssignee: bfields
Status: CLOSED CODE_FIX    
Severity: high CC: chuck.lever, florian, rjw, trondmy, wwwutz
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.36-rc6 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 16444    
Attachments: (ponk) Packet-Of-NFS-Kernelbug tcpdump format
nfsd: fix BUG at fs/nfsd/nfsfh.h:199 on unlink

Description Marius Tolzmann 2010-10-13 13:02:56 UTC
hi..

we are exporting very large (16 TB) XFS filesystems via nfs.. 

starting with linux-2.6.36-rc4 we keep getting the following kernel BUG messages. currently 2.6.36-rc6 is running on the server.

this BUG was not triggered with linux-2.6.35..

when the BUG was triggered once it returns (exactly) every 3 Minutes..

we will try 2.6.36-rc7 in the next minutes..

regards,
     marius tolzmann & peter marquardt
 

------------[ cut here ]------------
kernel BUG at fs/nfsd/nfsfh.h:199!
invalid opcode: 0000 [#13] PREEMPT SMP
last sysfs file: /sys/devices/pci0000:80/0000:80:0d.0/0000:8b:00.0/host5/scsi_host/host5/stats
CPU 0
Modules linked in: 3w_9xxx

Pid: 3841, comm: nfsd Tainted: G      D     2.6.36-rc6.x86_64.51 #1 Tyan Transport VX50-B4985/CK8
RIP: 0010:[<ffffffff81226172>]  [<ffffffff81226172>] nfsd3_proc_remove+0x122/0x130
RSP: 0018:ffff880434f25de0  EFLAGS: 00010246
RAX: 0000000046000000 RBX: ffff8814365d6c00 RCX: ffffc9000da920a8
RDX: ffffffff818aab00 RSI: ffffffff81220e50 RDI: ffff880432254d98
RBP: ffff880434f25e10 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000004000 R11: dead000000200200 R12: ffff8814365d6c08
R13: ffff881434760000 R14: ffff8814365da000 R15: ffff8803fe48d000
FS:  00007f7cb66586f0(0000) GS:ffff880002000000(0000) knlGS:00000000f76156c0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000514c58 CR3: 0000000c3596d000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process nfsd (pid: 3841, threadinfo ffff880434f24000, task ffff880435a7b340)
Stack:
 ffff881434760000 ffff8814365da000 ffff881434760000 ffffffff81c9d000
<0> ffff8803fe48d018 000000000000001c ffff880434f25e50 ffffffff81218b15
<0> ffffffff81c9d000 ffff881434760000 ffffffff81c9d000 ffff881435a69bc0
Call Trace:
 [<ffffffff81218b15>] nfsd_dispatch+0xb5/0x250
 [<ffffffff81811d26>] svc_process+0x4b6/0x820
 [<ffffffff81219180>] ? nfsd+0x0/0x160
 [<ffffffff8121923d>] nfsd+0xbd/0x160
 [<ffffffff810588f6>] kthread+0x96/0xa0
 [<ffffffff81003154>] kernel_thread_helper+0x4/0x10
 [<ffffffff81058860>] ? kthread+0x0/0xa0
 [<ffffffff81003150>] ? kernel_thread_helper+0x0/0x10
Code: 00 00 48 89 f7 48 89 4d d8 e8 eb 55 ff ff 48 8b 4d d8 48 89 c6 44 89 e2 48 c7 c7 74 86 b0 81 31 c0 e8 1a ed 64 00 e9 0c ff ff ff <0f> 0b eb fe 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec
RIP  [<ffffffff81226172>] nfsd3_proc_remove+0x122/0x130
 RSP <ffff880434f25de0>
---[ end trace 066d5aa4346b0b50 ]---

# egrep "(: Linux version|Tainted)" /var/log/messages
Sep  9 10:26:47 tiaotiao kernel: Linux version 2.6.35.x86_64.48 (root@venustraphobie.molgen.mpg.de) (gcc version 4.4.2 (GCC) ) #1 SMP PREEMPT Mon Aug 2 16:35:40 CEST 2010
Sep 15 08:47:15 tiaotiao kernel: Linux version 2.6.36-rc4.x86_64.50 (root@venustraphobie.molgen.mpg.de) (gcc version 4.4.2 (GCC) ) #1 SMP PREEMPT Mon Sep 13 12:41:33 CEST 2010
Sep 24 09:25:08 tiaotiao kernel: Linux version 2.6.36-rc4.x86_64.50 (root@venustraphobie.molgen.mpg.de) (gcc version 4.4.2 (GCC) ) #1 SMP PREEMPT Mon Sep 13 12:41:33 CEST 2010
Oct 12 14:41:57 tiaotiao kernel: Pid: 3857, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 14:44:57 tiaotiao kernel: Pid: 3861, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 14:47:57 tiaotiao kernel: Pid: 3868, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 14:50:57 tiaotiao kernel: Pid: 3846, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 14:53:57 tiaotiao kernel: Pid: 3849, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 14:56:58 tiaotiao kernel: Pid: 3852, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 14:59:58 tiaotiao kernel: Pid: 3847, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 15:02:58 tiaotiao kernel: Pid: 3876, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 15:05:58 tiaotiao kernel: Pid: 3867, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 15:08:58 tiaotiao kernel: Pid: 3856, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 15:11:59 tiaotiao kernel: Pid: 3871, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 15:14:59 tiaotiao kernel: Pid: 3872, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 15:17:59 tiaotiao kernel: Pid: 3848, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 15:20:59 tiaotiao kernel: Pid: 3870, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 15:23:59 tiaotiao kernel: Pid: 3851, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 15:27:00 tiaotiao kernel: Pid: 3863, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 15:30:00 tiaotiao kernel: Pid: 3866, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 15:33:00 tiaotiao kernel: Pid: 3864, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 15:36:00 tiaotiao kernel: Pid: 3877, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 15:39:00 tiaotiao kernel: Pid: 3853, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 15:42:01 tiaotiao kernel: Pid: 3873, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 15:45:01 tiaotiao kernel: Pid: 3858, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 15:48:01 tiaotiao kernel: Pid: 3854, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 15:51:01 tiaotiao kernel: Pid: 3850, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 15:54:01 tiaotiao kernel: Pid: 3862, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 15:57:01 tiaotiao kernel: Pid: 3874, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 16:00:02 tiaotiao kernel: Pid: 3875, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 16:03:02 tiaotiao kernel: Pid: 3855, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 16:06:02 tiaotiao kernel: Pid: 3859, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 16:09:02 tiaotiao kernel: Pid: 3865, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 12 16:12:02 tiaotiao kernel: Pid: 3869, comm: nfsd Tainted: G      D     2.6.36-rc4.x86_64.50 #1 Tyan Transport VX50-B4985/CK8
Oct 13 14:08:35 tiaotiao kernel: Linux version 2.6.36-rc6.x86_64.51 (root@venustraphobie.molgen.mpg.de) (gcc version 4.4.2 (GCC) ) #1 SMP PREEMPT Fri Oct 1 09:55:18 CEST 2010
Oct 13 14:11:40 tiaotiao kernel: Pid: 3854, comm: nfsd Tainted: G      D     2.6.36-rc6.x86_64.51 #1 Tyan Transport VX50-B4985/CK8
Oct 13 14:14:40 tiaotiao kernel: Pid: 3851, comm: nfsd Tainted: G      D     2.6.36-rc6.x86_64.51 #1 Tyan Transport VX50-B4985/CK8
Oct 13 14:17:40 tiaotiao kernel: Pid: 3855, comm: nfsd Tainted: G      D     2.6.36-rc6.x86_64.51 #1 Tyan Transport VX50-B4985/CK8
Oct 13 14:20:40 tiaotiao kernel: Pid: 3850, comm: nfsd Tainted: G      D     2.6.36-rc6.x86_64.51 #1 Tyan Transport VX50-B4985/CK8
Oct 13 14:23:40 tiaotiao kernel: Pid: 3856, comm: nfsd Tainted: G      D     2.6.36-rc6.x86_64.51 #1 Tyan Transport VX50-B4985/CK8
Oct 13 14:26:41 tiaotiao kernel: Pid: 3853, comm: nfsd Tainted: G      D     2.6.36-rc6.x86_64.51 #1 Tyan Transport VX50-B4985/CK8
Oct 13 14:29:41 tiaotiao kernel: Pid: 3849, comm: nfsd Tainted: G      D     2.6.36-rc6.x86_64.51 #1 Tyan Transport VX50-B4985/CK8
Oct 13 14:32:41 tiaotiao kernel: Pid: 3846, comm: nfsd Tainted: G      D     2.6.36-rc6.x86_64.51 #1 Tyan Transport VX50-B4985/CK8
Oct 13 14:35:41 tiaotiao kernel: Pid: 3847, comm: nfsd Tainted: G      D     2.6.36-rc6.x86_64.51 #1 Tyan Transport VX50-B4985/CK8
Oct 13 14:38:41 tiaotiao kernel: Pid: 3840, comm: nfsd Tainted: G      D     2.6.36-rc6.x86_64.51 #1 Tyan Transport VX50-B4985/CK8
Oct 13 14:41:41 tiaotiao kernel: Pid: 3839, comm: nfsd Tainted: G      D     2.6.36-rc6.x86_64.51 #1 Tyan Transport VX50-B4985/CK8
Oct 13 14:44:42 tiaotiao kernel: Pid: 3841, comm: nfsd Tainted: G      D     2.6.36-rc6.x86_64.51 #1 Tyan Transport VX50-B4985/CK8
Oct 13 14:47:42 tiaotiao kernel: Pid: 3838, comm: nfsd Tainted: G      D     2.6.36-rc6.x86_64.51 #1 Tyan Transport VX50-B4985/CK8
Comment 1 Trond Myklebust 2010-10-13 13:06:35 UTC
Server issue. Reassigning to Bruce...
Comment 2 Peter Marquardt 2010-10-13 13:25:53 UTC
It looks like it is related to a different server trying to delete a nonexistant .nfs-file:

at least thats what we found repeatedly on the network

14:47:42.295359 IP (tos 0x0, ttl  64, id 62547, offset 0, flags [DF], proto: TCP (6), length: 220) floccinaucinihi.1463474626 > tiaotiao.molgen.nfs: 168 remove fh[1000001:20000800:1000100:5b00f0cf:4e99ca52] ".nfs000000005b00f0d70000022a"
14:47:42.295383 IP (tos 0x0, ttl  64, id 22783, offset 0, flags [DF], proto: TCP (6), length: 52) tiaotiao.molgen.mpg.de.nfs > floccinaucinihilipilification.molgen.mpg.de.831: ., cksum 0x51e4 (incorrect (-> 0x19d7), 3624702097:3624702097(0) ack 1703194159 win 213 <nop,nop,timestamp 2066835 3809351168>
14:50:42.490048 IP (tos 0x0, ttl  64, id 62549, offset 0, flags [DF], proto: TCP (6), length: 220) floccinaucinihi.1463474626 > tiaotiao.molgen.nfs: 168 remove fh[1000001:20000800:1000100:5b00f0cf:4e99ca52] ".nfs000000005b00f0d70000022a"
14:50:42.490067 IP (tos 0x0, ttl  64, id 22785, offset 0, flags [DF], proto: TCP (6), length: 52) tiaotiao.molgen.mpg.de.nfs > floccinaucinihilipilification.molgen.mpg.de.831: ., cksum 0x51e4 (incorrect (-> 0x986c), 3624702097:3624702097(0) ack 1703194495 win 230 <nop,nop,timestamp 2247063 3809531392>
14:53:42.684327 IP (tos 0x0, ttl  64, id 62550, offset 0, flags [DF], proto: TCP (6), length: 220) floccinaucinihi.1463474626 > tiaotiao.molgen.nfs: 168 remove fh[1000001:20000800:1000100:5b00f0cf:4e99ca52] ".nfs000000005b00f0d70000022a"
14:53:42.684353 IP (tos 0x0, ttl  64, id 22786, offset 0, flags [DF], proto: TCP (6), length: 52) tiaotiao.molgen.mpg.de.nfs > floccinaucinihilipilification.molgen.mpg.de.831: ., cksum 0x51e4 (incorrect (-> 0x17b3), 3624702097:3624702097(0) ack 1703194663 win 238 <nop,nop,timestamp 2427291 3809711616>
14:59:43.074311 IP (tos 0x0, ttl  64, id 62553, offset 0, flags [DF], proto: TCP (6), length: 220) floccinaucinihi.1463474626 > tiaotiao.molgen.nfs: 168 remove fh[1000001:20000800:1000100:5b00f0cf:4e99ca52] ".nfs000000005b00f0d70000022a"
14:59:43.074328 IP (tos 0x0, ttl  64, id 22789, offset 0, flags [DF], proto: TCP (6), length: 52) tiaotiao.molgen.mpg.de.nfs > floccinaucinihilipilification.molgen.mpg.de.831: ., cksum 0x51e4 (incorrect (-> 0x158d), 3624702097:3624702097(0
Comment 3 Peter Marquardt 2010-10-13 13:54:02 UTC
Created attachment 33482 [details]
(ponk) Packet-Of-NFS-Kernelbug  tcpdump format

We suspect this network packet to trigger the bug. decoded it tries to remove the file ".nfs000000005b00f0d70000022a"
Comment 4 bfields 2010-10-13 14:16:38 UTC
This was probably introduced by 43a9aa64a2f4330a9cb59aaf5c5636566bce067c "NFSD: Fill in WCC data for REMOVE, RMDIR, MKNOD, and MKDIR".

We either need to be more careful about when we call unlock (maybe move it inside nfsd_create() and nfsd_unlink()?), or maybe move the BUG_ON() inside the fhp->fh_locked check.
Comment 5 bfields 2010-10-13 18:54:52 UTC
Created attachment 33492 [details]
nfsd: fix BUG at fs/nfsd/nfsfh.h:199 on unlink

This should be a fix, but I'd like to confirm that.  It's more difficult than expected to reproduce the problem--I may need an artificial NFSv3 client.  If someone else can confirm it that would be great.
Comment 6 bfields 2010-10-13 19:20:24 UTC
OK, I found that

client# touch FOO
server# exportfs -ua
client# rm FOO

in rapid succession would reproduce the problem, and didn't see any bad effects
from removing the BUG().

So I'll run a few more tests and then pass this along to Linus.

Thanks for catching this!
Comment 7 Rafael J. Wysocki 2010-10-13 19:57:13 UTC
Handled-By : Bruce Fields <bfields@fieldses.org>
Patch : https://bugzilla.kernel.org/attachment.cgi?id=33492
Comment 8 Peter Marquardt 2010-10-15 12:39:19 UTC
just for the files: every call to this assertion killed a [nfsd]. so with 32 nfsd procs running our system stopped serving nfs after exactly 96 minutes 8)

thanks for fixing it so fast.
Comment 9 Rafael J. Wysocki 2010-10-17 19:45:20 UTC
Fixed by commit b1e86db1de2e8bc2be9fb94fae3451c2a776e8c1 .