Created attachment 283121
Stack trace for kernel 5.1.6

After updating our production servers to FC30 and recent kernels, we're still having reliability issues with NFS, which unfortunately makes it unusable.

1. It takes longer to crash the server, but it still crashes; on production this happens after 2-3 minutes. The server is then totally unresponsive and requires a restart. Sometimes the load climbs indefinitely to several hundred even though there is no I/O or CPU activity taking place.
2. I'm attaching the crash log. It was also crashing on 5.0.14 with a different log, but I was unable to capture it.
3. There still seem to be no issues with NFS 4.0 and NFS 3.
4. The reproducer ( https://bugzilla.kernel.org/show_bug.cgi?id=203363 ) is no longer able to crash the server or make the kernel segfault. Instead it makes the NFS server stall, requiring an nfsd restart, and produces a ton of "RPC request reserved 116 but used 320" errors in dmesg (on 4.2 and 4.1 only). It takes anywhere from 20-30 seconds to several minutes to do so; the timing seems random.
5. On the clients we started to get "FS-Cache: Duplicate cookie detected" errors in dmesg (not sure if that's bad; it doesn't seem to cause any issues).
6. It seems to be a different bug than in the previous report, because the stack trace is different and I'm not able to reproduce it on a VM.

Maybe it would be possible to introduce some client options to disable specific 4.1 features, so the problem could be tracked to a single feature.

In "Linux version 4.10.9-200.fc25.x86_64" the NFS server seems to be working fine.

Any suggestions for production?

On 5.0.14 the stack trace ended at something like "list_del corruption. next->prev should be XXX, was YYY". Unfortunately I can't provide more details; will try to replicate...
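To put numbers on how often the "RPC request reserved ... but used ..." message from item 4 shows up, something like the following could be run over saved dmesg output. This is only a rough sketch: the function name and the sample lines (including the timestamps) are made up for illustration, and only the message text itself comes from the report.

```python
import re
from collections import Counter

# Pattern built from the message text quoted in the report.
RESERVED_RE = re.compile(r"RPC request reserved (\d+) but used (\d+)")


def tally_reserve_overruns(dmesg_text):
    """Count each (reserved, used) pair found in the given dmesg output."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = RESERVED_RE.search(line)
        if match:
            counts[(int(match.group(1)), int(match.group(2)))] += 1
    return counts


# Hypothetical sample; timestamps are invented for illustration only.
sample = """\
[  812.101] RPC request reserved 116 but used 320
[  812.105] RPC request reserved 116 but used 320
[  812.230] FS-Cache: Duplicate cookie detected
"""

if __name__ == "__main__":
    for (reserved, used), n in tally_reserve_overruns(sample).items():
        print(f"reserved={reserved} used={used} count={n}")
```

Feeding it the output of dmesg captured before and after a stall would show whether the same reserved/used pair repeats or whether the sizes vary.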
The lock callback encoding crashed, so this is probably related to that work, which went in after 4.10 (I think) and is an NFSv4.1-only feature.

If you have some debugging-fu, it would be nice to nail down the line where it crashed. In my case:

    $ gdb fs/nfsd/nfsd.ko
    ...
    (gdb) list *(nfs4_xdr_enc_cb_notify_lock+0x9a)
    0x2c84a is in nfs4_xdr_enc_cb_notify_lock (fs/nfsd/nfs4callback.c:648).
    643             encode_cb_sequence4args(xdr, cb, &hdr);
    644
    645             p = xdr_reserve_space(xdr, 4);
    646             *p = cpu_to_be32(OP_CB_NOTIFY_LOCK);
    647             encode_nfs_fh4(xdr, &nbl->nbl_fh);
    648             encode_stateowner(xdr, &lo->lo_owner);
    649             hdr.nops++;
    650
    651             encode_cb_nops(&hdr);
    652     }

...but the offsets in your kernel may be different. My guess is that "lo" turned out to be NULL here for some reason, and that led to the crash. Maybe a refcounting bug of some sort?
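As a convenience for the gdb step above, here is a small sketch that pulls the symbol and offset out of an oops line and prints the matching "list" command. It assumes the RIP line has the shape "RIP: 0010:symbol+0xOFF/0xSIZE [module]"; the helper name is hypothetical, and the 0x1b0 size in the example line is invented, not taken from the attached trace.

```python
import re

# Assumed oops format: "RIP: 0010:symbol+0xOFF/0xSIZE [module]".
RIP_RE = re.compile(
    r"RIP:\s+\w+:(?P<sym>[\w.]+)\+(?P<off>0x[0-9a-f]+)/0x[0-9a-f]+"
)


def gdb_list_command(oops_line):
    """Return a gdb 'list' command for the faulting symbol+offset, or None."""
    match = RIP_RE.search(oops_line)
    if match is None:
        return None
    return f"list *({match.group('sym')}+{match.group('off')})"


# Illustrative line only; the function size (0x1b0) is made up.
example = "RIP: 0010:nfs4_xdr_enc_cb_notify_lock+0x9a/0x1b0 [nfsd]"

if __name__ == "__main__":
    print(gdb_list_command(example))
```

The printed command can then be pasted into a gdb session opened on the nfsd.ko that matches the running kernel, since the offsets differ between builds.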
If this happens again, it might be nice to get a vmcore via kdump. That might help track down the cause.
Thanks for the update. Unfortunately this is production, and it crashed twice very shortly after enabling it. The reproducer no longer works, so I will try to figure out some other way to reproduce it, because testing on production is too dangerous and always results in downtime. Will see about adding some other operations to the reproducer...
I filed a bug in debian for kernel package 4.19.0-18. The system crashes after writing to an nfs mounted filesystem. See: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1004070
(In reply to Friedrich Beckmann from comment #4)
> I filed a bug in debian for kernel package 4.19.0-18. The system crashes
> after writing to an nfs mounted filesystem.
>
> See: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1004070

That debian bug seems to indicate a problem on the client, where FSCACHE operates. This bug (203827) is a server bug. These two issues appear to be unrelated.
(In reply to Chuck Lever from comment #5)
> That debian bug seems to indicate a problem on the client, where FSCACHE
> operates. This bug (203827) is a server bug. These two issues appear to be
> unrelated.

Thanks for looking at the bug. It is indeed on the client side. Sorry for the confusion.
No activity on the original report in a couple years, so I'm assuming this is no longer an issue. Reopen if it is.