Bug 203827 - NFS v4.1 and v4.2 still unstable
Summary: NFS v4.1 and v4.2 still unstable
Status: RESOLVED UNREPRODUCIBLE
Alias: None
Product: File System
Classification: Unclassified
Component: NFS
Hardware: All Linux
Importance: P1 high
Assignee: bfields
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-06-05 22:19 UTC by Slawomir Pryczek
Modified: 2022-01-21 21:40 UTC

See Also:
Kernel Version: 5.1.6
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Stack trace for kernel 5.1.6 (3.97 KB, text/plain)
2019-06-05 22:19 UTC, Slawomir Pryczek

Description Slawomir Pryczek 2019-06-05 22:19:56 UTC
Created attachment 283121
Stack trace for kernel 5.1.6

After updating our production servers to FC30 and recent kernels, we're still having reliability issues with NFS, which unfortunately makes it unusable.

1. It takes longer to crash the server, but it still crashes; on production this happens after 2-3 minutes. The server is then totally unresponsive and requires a restart. Sometimes the load climbs indefinitely to several hundred even though there is no I/O or CPU activity taking place.
2. Attaching the crash log. It was also crashing on 5.0.14 with a different log, but I was unable to capture it.
3. There still seem to be no issues with NFS 4.0 and NFS 3.
4. The reproducer ( https://bugzilla.kernel.org/show_bug.cgi?id=203363 ) can no longer crash the server or make the kernel segfault. Instead it makes the NFS server stall, requiring an nfsd restart and producing a ton of "RPC request reserved 116 but used 320" errors in dmesg (on 4.2 and 4.1 only). It needs anywhere from 20-30 seconds to several minutes to do so; the timing seems random.
5. On the clients we started to get "FS-Cache: Duplicate cookie detected" errors in dmesg (not sure if that's bad; it doesn't seem to cause any issues).
6. It seems to be a different bug than the one in the previous report, because the stack trace is different and I'm not able to reproduce it on a VM.

Maybe it would be possible to introduce some client options to disable specific 4.1 features, so the problem could be tracked down to a single feature. On "Linux version 4.10.9-200.fc25.x86_64" the NFS server seems to work fine. Any suggestions for production?

On 5.0.14 the stack trace ended with something like "list_del corruption. next->prev should be XXX, was YYY". Unfortunately I can't provide more details; I will try to replicate it.
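
To put a number on how often the stall in item 4 shows up, one option is to tally the "RPC request reserved ... but used ..." messages from saved dmesg output. The following is only a minimal sketch: the function name count_reserve_overruns is made up for illustration, and the pattern assumes the message text looks exactly as quoted in the report.

```python
import re
from collections import Counter

# Pattern for the warning quoted in the report, e.g.
# "RPC request reserved 116 but used 320".
# Assumption: the message text in dmesg matches this wording exactly.
RESERVED_RE = re.compile(r"RPC request reserved (\d+) but used (\d+)")


def count_reserve_overruns(log_text):
    """Count each (reserved, used) pair found in kernel log text.

    Hypothetical helper for illustration; not part of any NFS tooling.
    """
    counts = Counter()
    for line in log_text.splitlines():
        match = RESERVED_RE.search(line)
        if match:
            pair = (int(match.group(1)), int(match.group(2)))
            counts[pair] += 1
    return counts
```

Feeding it the text of a dmesg capture taken before and after a stall would show whether the same (reserved, used) pair repeats or whether the sizes vary.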
Comment 1 Jeff Layton 2019-07-10 15:08:39 UTC
The lock callback encoding crashed, so this is probably related to that work, which went in after 4.10 (I think) and is an NFSv4.1-only feature. If you have some debugging-fu, it would be nice to nail down the line where it crashed. In my case:

$ gdb fs/nfsd/nfsd.ko
...
(gdb) list *(nfs4_xdr_enc_cb_notify_lock+0x9a)
0x2c84a is in nfs4_xdr_enc_cb_notify_lock (fs/nfsd/nfs4callback.c:648).
643		encode_cb_sequence4args(xdr, cb, &hdr);
644	
645		p = xdr_reserve_space(xdr, 4);
646		*p = cpu_to_be32(OP_CB_NOTIFY_LOCK);
647		encode_nfs_fh4(xdr, &nbl->nbl_fh);
648		encode_stateowner(xdr, &lo->lo_owner);
649		hdr.nops++;
650	
651		encode_cb_nops(&hdr);
652	}

...but the offsets in your kernel may be different. My guess is that "lo" turned out to be NULL here for some reason, and that led to the crash. Maybe a refcounting bug of some sort?
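
Since the offsets differ per kernel build, the gdb command has to be built from the symbol+offset pair in the reporter's own stack trace. As a rough sketch, the helper below turns a trace frame into the matching "list" command. The name gdb_list_command is hypothetical, and the frame layout (symbol+0xoffset/0xsize, optionally followed by a module name in brackets) is an assumption about how the oops is printed; the 0x1b0 size in the comment is a made-up value.

```python
import re

# Assumed frame layout in a kernel oops:
#   " nfs4_xdr_enc_cb_notify_lock+0x9a/0x1b0 [nfsd]"
# i.e. symbol+offset/size. The 0x1b0 size above is a made-up example value.
FRAME_RE = re.compile(
    r"(?P<symbol>[A-Za-z_][\w.]*)"
    r"\+(?P<offset>0x[0-9a-fA-F]+)"
    r"/(?P<size>0x[0-9a-fA-F]+)"
)


def gdb_list_command(frame_line):
    """Return a gdb 'list' command for one stack-trace frame, or None.

    Hypothetical helper for illustration only.
    """
    match = FRAME_RE.search(frame_line)
    if match is None:
        return None
    return f"list *({match.group('symbol')}+{match.group('offset')})"
```

The resulting string can be pasted at the (gdb) prompt after loading the nfsd.ko that matches the crashed kernel, provided that module was built with debug info.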
Comment 2 Jeff Layton 2019-07-10 15:11:32 UTC
If this happens again, it might be nice to get a vmcore via kdump. That might help track down the cause.
Comment 3 Slawomir Pryczek 2019-07-10 15:50:36 UTC
Thanks for the update. Unfortunately this is production, and it crashed twice very shortly after enabling it. The reproducer no longer works, so I will try to figure out some other way to reproduce it, because doing so on production is too dangerous and always results in downtime.

Will see about adding some other operations to the reproducer...
Comment 4 Friedrich Beckmann 2022-01-20 12:24:29 UTC
I filed a bug in debian for kernel package 4.19.0-18. The system crashes after writing to an nfs mounted filesystem.

See: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1004070
Comment 5 Chuck Lever 2022-01-20 16:55:05 UTC
(In reply to Friedrich Beckmann from comment #4)
> I filed a bug in debian for kernel package 4.19.0-18. The system crashes
> after writing to an nfs mounted filesystem.
> 
> See: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1004070

That debian bug seems to indicate a problem on the client, where FSCACHE operates. This bug (203827) is a server bug. These two issues appear to be unrelated.
Comment 6 Friedrich Beckmann 2022-01-20 17:49:25 UTC
(In reply to Chuck Lever from comment #5)
> That debian bug seems to indicate a problem on the client, where FSCACHE
> operates. This bug (203827) is a server bug. These two issues appear to be
> unrelated.

Thanks for looking at the bug. It is indeed on the client side. Sorry for the confusion.
Comment 7 bfields 2022-01-21 21:40:54 UTC
No activity on the original report in a couple years, so I'm assuming this is no longer an issue.  Reopen if it is.
