Bug 20262
Summary: | nfsd crashed every 3 minutes with kernel BUG at fs/nfsd/nfsfh.h:199 | ||
---|---|---|---|
Product: | File System | Reporter: | Marius Tolzmann (marius) |
Component: | NFS | Assignee: | bfields |
Status: | CLOSED CODE_FIX | ||
Severity: | high | CC: | chuck.lever, florian, rjw, trondmy, wwwutz |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.36-rc6 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 16444 | ||
Attachments: | (ponk) Packet-Of-NFS-Kernelbug tcpdump format; nfsd: fix BUG at fs/nfsd/nfsfh.h:199 on unlink | ||
Description
Marius Tolzmann
2010-10-13 13:02:56 UTC
Server issue. Reassigning to Bruce...
It looks like it is related to a different server trying to delete a nonexistent .nfs file; at least that's what we found repeatedly on the network:
14:47:42.295359 IP (tos 0x0, ttl 64, id 62547, offset 0, flags [DF], proto: TCP (6), length: 220) floccinaucinihi.1463474626 > tiaotiao.molgen.nfs: 168 remove fh[1000001:20000800:1000100:5b00f0cf:4e99ca52] ".nfs000000005b00f0d70000022a"
14:47:42.295383 IP (tos 0x0, ttl 64, id 22783, offset 0, flags [DF], proto: TCP (6), length: 52) tiaotiao.molgen.mpg.de.nfs > floccinaucinihilipilification.molgen.mpg.de.831: ., cksum 0x51e4 (incorrect (-> 0x19d7), 3624702097:3624702097(0) ack 1703194159 win 213 <nop,nop,timestamp 2066835 3809351168>
14:50:42.490048 IP (tos 0x0, ttl 64, id 62549, offset 0, flags [DF], proto: TCP (6), length: 220) floccinaucinihi.1463474626 > tiaotiao.molgen.nfs: 168 remove fh[1000001:20000800:1000100:5b00f0cf:4e99ca52] ".nfs000000005b00f0d70000022a"
14:50:42.490067 IP (tos 0x0, ttl 64, id 22785, offset 0, flags [DF], proto: TCP (6), length: 52) tiaotiao.molgen.mpg.de.nfs > floccinaucinihilipilification.molgen.mpg.de.831: ., cksum 0x51e4 (incorrect (-> 0x986c), 3624702097:3624702097(0) ack 1703194495 win 230 <nop,nop,timestamp 2247063 3809531392>
14:53:42.684327 IP (tos 0x0, ttl 64, id 62550, offset 0, flags [DF], proto: TCP (6), length: 220) floccinaucinihi.1463474626 > tiaotiao.molgen.nfs: 168 remove fh[1000001:20000800:1000100:5b00f0cf:4e99ca52] ".nfs000000005b00f0d70000022a"
14:53:42.684353 IP (tos 0x0, ttl 64, id 22786, offset 0, flags [DF], proto: TCP (6), length: 52) tiaotiao.molgen.mpg.de.nfs > floccinaucinihilipilification.molgen.mpg.de.831: ., cksum 0x51e4 (incorrect (-> 0x17b3), 3624702097:3624702097(0) ack 1703194663 win 238 <nop,nop,timestamp 2427291 3809711616>
14:59:43.074311 IP (tos 0x0, ttl 64, id 62553, offset 0, flags [DF], proto: TCP (6), length: 220) floccinaucinihi.1463474626 > tiaotiao.molgen.nfs: 168 remove fh[1000001:20000800:1000100:5b00f0cf:4e99ca52] ".nfs000000005b00f0d70000022a"
14:59:43.074328 IP (tos 0x0, ttl 64, id 22789, offset 0, flags [DF], proto: TCP (6), length: 52) tiaotiao.molgen.mpg.de.nfs > floccinaucinihilipilification.molgen.mpg.de.831: ., cksum 0x51e4 (incorrect (-> 0x158d), 3624702097:3624702097(0
Created attachment 33482
(ponk) Packet-Of-NFS-Kernelbug tcpdump format
We suspect that this network packet triggers the bug. Decoded, it tries to remove the file ".nfs000000005b00f0d70000022a".
This was probably introduced by 43a9aa64a2f4330a9cb59aaf5c5636566bce067c "NFSD: Fill in WCC data for REMOVE, RMDIR, MKNOD, and MKDIR". We either need to be more careful about when we call unlock (maybe move it inside nfsd_create() and nfsd_unlink()?), or maybe move the BUG_ON() inside the fhp->fh_locked check.
Created attachment 33492
nfsd: fix BUG at fs/nfsd/nfsfh.h:199 on unlink
This should be a fix, but I'd like to confirm that. It's more difficult than expected to reproduce the problem--I may need an artificial NFSv3 client. If someone else can confirm it that would be great.
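The second option floated above (moving the BUG_ON() inside the fhp->fh_locked check) can be illustrated with a small user-space model. This is only a sketch, not the kernel source or the attached patch: struct model_fh, its fields, and all model_* functions are hypothetical stand-ins, and BUG() is replaced by a counter so the two orderings can be compared. It assumes the failing REMOVE reaches the unlock helper with a file handle that was never resolved to a dentry and never locked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the server-side file handle. */
struct model_fh {
    const void *dentry;  /* NULL if the handle was never resolved */
    bool locked;         /* true only while the parent directory is locked */
    bool post_saved;     /* stands in for saving post-operation WCC data */
    int bug_hits;        /* counts where the real code would hit BUG() */
};

/* Strict ordering: the dentry assertion runs before the locked check,
 * so an unresolved, unlocked handle still trips it. */
static void model_unlock_strict(struct model_fh *fh)
{
    if (fh->dentry == NULL) {
        fh->bug_hits++;
        return;
    }
    if (fh->locked) {
        fh->post_saved = true;
        fh->locked = false;
    }
}

/* Relaxed ordering: insist on a dentry only when there is a lock to drop. */
static void model_unlock_relaxed(struct model_fh *fh)
{
    if (!fh->locked)
        return;
    if (fh->dentry == NULL) {
        fh->bug_hits++;
        return;
    }
    fh->post_saved = true;
    fh->locked = false;
}

/* Simulates a REMOVE whose handle lookup failed (assumed failure path):
 * nothing was resolved or locked, yet the caller unlocks unconditionally.
 * Returns how many times the modelled BUG() would have fired. */
static int model_failed_remove(void (*unlock)(struct model_fh *))
{
    struct model_fh fh = { NULL, false, false, 0 };

    assert(unlock != NULL);
    unlock(&fh);
    return fh.bug_hits;
}
```

Under these assumptions, model_failed_remove(model_unlock_strict) reports one hit while model_failed_remove(model_unlock_relaxed) reports none, which is the difference the suggestion is after.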
OK, I found that running
client# touch FOO
server# exportfs -ua
client# rm FOO
in rapid succession would reproduce the problem, and didn't see any bad effects from removing the BUG(). So I'll run a few more tests and then pass this along to Linus. Thanks for catching this!
Handled-By : Bruce Fields <bfields@fieldses.org>
Patch : https://bugzilla.kernel.org/attachment.cgi?id=33492
Just for the record: every call to this assertion killed one [nfsd] process, so with 32 nfsd processes running our system stopped serving NFS after exactly 96 minutes 8) Thanks for fixing it so fast.
Fixed by commit b1e86db1de2e8bc2be9fb94fae3451c2a776e8c1.
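The 96-minute figure follows from the numbers in this report: the REMOVE was retried roughly every 3 minutes (see the bug summary and the tcpdump timestamps), and each retry took down one of the 32 nfsd processes. A minimal sketch of that arithmetic, with a hypothetical helper name and the assumption that every retry kills exactly one process:

```c
#include <assert.h>

/* Hypothetical helper: minutes until no nfsd process is left, assuming each
 * triggering request kills exactly one process and the requests arrive at a
 * fixed interval. */
static unsigned minutes_until_outage(unsigned nfsd_procs, unsigned interval_min)
{
    assert(interval_min > 0);
    return nfsd_procs * interval_min;
}
```

With the values reported here, minutes_until_outage(32, 3) gives 96.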