The nfsd kernel helper OOPSes on shutdown in kernel 5.9. Works fine in kernels 5.8.15 and earlier. Alt-SysRq-E,U,S,B is my standard response. The call trace is the same in each case, i.e. not a heisenbug, and thus should be simple to bisect if the author of the regression doesn't guess the commit involved. Here's a typical oops: Oops: 0002 [#1] SMP CPU: 1 PID: 1760 Comm: nfsd Not tainted 5.9.0 #1 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X470 Taichi, BIOS P3.50 07/18/2019 RIP: 0010:unregister_shrinker+0x2c/0x80 Code: 7f 38 00 74 64 53 f6 47 1c 02 48 89 fb 75 50 48 c7 c7 20 ea 42 82 e8 e3 89 94 00 48 8b 53 20 48 8b 43 28 48 c7 c7 20 ea 42 82 <48> 89 42 08 48 89 10 48 b8 00 01 00 00 00 00 ad de 48 89 43 20 48 RSP: 0018:ffffc90000fb3e78 EFLAGS: 00010246 RAX: ffffffffa0772140 RBX: ffffffffa073b520 RCX: 0000000000000000 RDX: ffffffffa08648e0 RSI: ffffc90000fb3db8 RDI: ffffffff8242ea20 RBP: ffffffff82473100 R08: 0000000000000001 R09: 0000000000000000 R10: ffff8887fec60fc0 R11: 0000000000000400 R12: ffff8887f9bd6900 R13: ffffc9000065fcd8 R14: ffff8887dfeb0000 R15: ffff8887e4ee1a00 FS: 0000000000000000(0000) GS:ffff8887fec40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffa08648e8 CR3: 0000000251524000 CR4: 00000000003506e0 Call Trace: nfsd_file_cache_shutdown+0x20/0xc0 [nfsd] nfsd_last_thread+0x115/0x130 [nfsd] nfsd_destroy+0x2d/0x50 [nfsd] nfsd+0x11a/0x140 [nfsd] ? nfsd_destroy+0x50/0x50 [nfsd] kthread+0xf2/0x130 ? kthread_park+0x80/0x80 ret_from_fork+0x1f/0x30 Modules linked in: rpcsec_gss_krb5 md4 cifs dns_resolver libdes nfs(-) nfsd auth_rpcgss nfs_acl lockd grace sunrpc amdgpu snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio iwlmvm snd_hda_intel snd_intel_dspcfg snd_hda_codec iwlwifi snd_hwdep mfd_core snd_hda_core gpu_sched ttm snd_pcm [last unloaded: nfsv4] CR2: ffffffffa08648e8 ---[ end trace bd7faba3b52c6ad2 ]--- The shutdown code that triggers this (effectively the stop() function in /etc/init.d/nfs) is embarrassingly old and crusty, something I wrote a decade ago or thereabouts: # Bump any NFS connections off. umount -at nfs &>/dev/null for mp in $(findmnt -nt nfs -o TARGET); do [ -n "$mp" -a -d "$mp" ] && fuser -k -m "$mp" &>/dev/null done umount -at nfs &>/dev/null # First, stop the userland helper callback daemons killall rpc.statd rpc.rquotad rpc.nfsd \ rpc.mountd rpc.idmapd rpcbind &>/dev/null for x in /usr/sbin/rpc.{statd,rquotad,nfsd,mountd,idmapd} \ /sbin/rpcbind do pidof ${x##*/} >/dev/null && killit $x -TERM -9 done # Next, try to unmount the filesystem that the kernel daemons support. if grep -q proc/fs/nfsd /proc/mounts; then fuser -k -m /proc/fs/nfsd &>/dev/null umount /proc/fs/nfsd &>/dev/null fi if grep -q var/lib/nfs/rpc_pipefs /proc/mounts; then fuser -k -m /var/lib/nfs/rpc_pipefs &>/dev/null umount /var/lib/nfs/rpc_pipefs &>/dev/null fi killall -q -9 nfsd rpciod &>/dev/null # Lastly, if the modules are loaded, try to unload them. for x in nfsv{2,3,4} nfs nfsd lockd nfs_acl auth_rpcgss sunrpc; do lsmod | grep -q "^${x}\>" && modprobe -r $x &>/dev/null done A bit repetitive there, for sure; but each fragment solved a "the foo is still there after I shut it down" issue at the time.
That's weird, shrinker registration and unregistration looks pretty simple and I haven't spotted any relevant nfsd changes in the git logs between v5.8 and v5.9.... Is there anything else interesting in the logs before that "Oops: ..." line?
Ah, I should have looked for [T1760] in the dmesg, as there are a few additional lines: BUG: unable to handle page fault for address: ffffffffa08648e8 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 240b067 P4D 240b067 PUD 240c063 PMD 7c20ee067 PTE 0 But nothing that interesting (except perhaps line #2) [Bugzilla's "preview" button shows this comment without any <CRLF>, making me wonder if I should reply via email instead of using this interface?] I rebuilt a clean 5.9.1 (distro uses GCC 9.3.0, binutils 2.33.1), booted into console mode, /etc/init.d/nfs stop. Seemed to stop OK. Then /etc/init.d/nfs start. That triggered an immediate oops in "modprobe nfs" which then hung in D/io-wait. I'll add that output as an attachment here, without any trimming, as it is considerably more verbose than the earlier oops-on-shutdown.
Created attachment 293069 [details] dmesg output of oops during "modprobe nfs"
And the winner is... 95ad37f90c338e3fd4abf61cecfe02b6f3e080f0 is the first bad commit commit 95ad37f90c338e3fd4abf61cecfe02b6f3e080f0 Author: Frank van der Linden <fllinden@amazon.com> Date: Tue Jun 23 22:39:04 2020 +0000 NFSv4.2: add client side xattr caching. Implement client side caching for NFSv4.2 extended attributes. The cache is a per-inode hashtable, with name/value entries. There is one special entry for the listxattr cache. NFS inodes have a pointer to a cache structure. The cache structure is allocated on demand, freed when the cache is invalidated. Memory shrinkers keep the size in check. Large entries (> PAGE_SIZE) are collected by a separate shrinker, and freed more aggressively than others. Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> fs/nfs/Makefile | 2 +- fs/nfs/inode.c | 9 +- fs/nfs/nfs42proc.c | 12 + fs/nfs/nfs42xattr.c | 1083 +++++++++++++++++++++++++++++++++++++++++++ fs/nfs/nfs4_fs.h | 22 + fs/nfs/nfs4proc.c | 42 +- fs/nfs/nfs4super.c | 10 + include/linux/nfs_fs.h | 6 + include/uapi/linux/nfs_fs.h | 1 + 9 files changed, 1179 insertions(+), 8 deletions(-) create mode 100644 fs/nfs/nfs42xattr.c Just to verify, I backed out that single commit on a clean linux-5.9.tar.xz (imperfectly due to later commits), and am running it now as I type this. No NFS oops. As I look at the patch itself, it's reasonably substantial. Happy to test a sub-patch if, perhaps, the author would care to suggest a test case.
Oh, interesting, that patch registers three new shrinkers, then only unregisters two, which probably leaves the global list of shrinkers corrupted after the nfs module unloads, and then nfsd stumbles across that corruption.
So, untested, but maybe all you need is: diff --git a/fs/nfs/nfs42xattr.c b/fs/nfs/nfs42xattr.c index 86777996cfec..efe938fac96e 100644 --- a/fs/nfs/nfs42xattr.c +++ b/fs/nfs/nfs42xattr.c @@ -1048,6 +1048,7 @@ int __init nfs4_xattr_cache_init(void) void nfs4_xattr_cache_exit(void) { + unregister_shrinker(&nfs4_xattr_large_entry_shrinker); unregister_shrinker(&nfs4_xattr_entry_shrinker); unregister_shrinker(&nfs4_xattr_cache_shrinker); list_lru_destroy(&nfs4_xattr_entry_lru);
Created attachment 293093 [details] Oops after applying test-patch from comment 6 That one-liner patch looked promising, but alas, triggers a different oops, this one when NFS is being restarted. (The bug may be due to corruption introduced in the shutdown, but it only shows itself on the restart.) This oops contains do_mount(), which suggests it's occurring roughly here: mount -t nfsd none /proc/fs/nfsd
Whoops, how about this: commit 25c328172348 (HEAD -> for-trond) Author: J. Bruce Fields <bfields@redhat.com> Date: Tue Oct 20 11:45:10 2020 -0400 NFSv4.2: fix failure to unregister shrinker We forgot to unregister the nfs4_xattr_large_entry_shrinker. That leaves the global list of shrinkers corrupted after unload of the nfs module, after which possibly unrelated code that calls register_shrinker() or unregister_shrinker() gets a BUG() with "supervisor write access in kernel mode". Reported-by: Kris Karas <bugs-a17@moonlit-rail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> diff --git a/fs/nfs/nfs42xattr.c b/fs/nfs/nfs42xattr.c index 86777996cfec..55b44a42d625 100644 --- a/fs/nfs/nfs42xattr.c +++ b/fs/nfs/nfs42xattr.c @@ -1048,8 +1048,10 @@ int __init nfs4_xattr_cache_init(void) void nfs4_xattr_cache_exit(void) { + unregister_shrinker(&nfs4_xattr_large_entry_shrinker); unregister_shrinker(&nfs4_xattr_entry_shrinker); unregister_shrinker(&nfs4_xattr_cache_shrinker); + list_lru_destroy(&nfs4_xattr_large_entry_lru); list_lru_destroy(&nfs4_xattr_entry_lru); list_lru_destroy(&nfs4_xattr_cache_lru); kmem_cache_destroy(nfs4_xattr_cache_cachep);
I have beaten on the patch in comment 8 quite a bit, and cannot make it fail. Has my blessing. Feel free to close with resolved/code-fix if no other feedback needed. Tested-By: Kris Karas <bugs-a17@moonlit-rail.com> As an aside, I am idly curious why others running 5.9 have not encountered this, too. I suppose most people are running systemd-based distros, which might step around the bug somehow without triggering it.
Thanks for the confirmation. You only hit the bug if you unload the nfs module. I bet most standard scripts don't bother to--the module probably gets loaded automatically on mount and never unmounted.
Has the fix for this been sent upstream? I've had to hand-patch all the 5.9.x shipped out by Greg KH, still no sign of the fix there.
Thanks for checking, looks like it may have gotten lost somewhere, I've sent a reminder.