Bug 209739

Summary: NFSd oops on shutdown, regression in 5.9 vs 5.8
Product: File System Reporter: Kris Karas (bugs-a21)
Component: NFSAssignee: bfields
Status: RESOLVED CODE_FIX    
Severity: normal CC: trondmy
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 5.9 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: dmesg output of oops during "modprobe nfs"
Oops after applying test-patch from comment 6

Description Kris Karas 2020-10-19 05:23:07 UTC
The nfsd kernel helper OOPSes on shutdown in kernel 5.9.  Works fine in kernels 5.8.15 and earlier.  Alt-SysRq-E,U,S,B is my standard response.  The call trace is the same in each case, i.e. not a heisenbug, and thus should be simple to bisect if the author of the regression doesn't guess the commit involved.

Here's a typical oops:
 Oops: 0002 [#1] SMP
 CPU: 1 PID: 1760 Comm: nfsd Not tainted 5.9.0 #1
 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X470 Taichi, BIOS P3.50 07/18/2019
 RIP: 0010:unregister_shrinker+0x2c/0x80
 Code: 7f 38 00 74 64 53 f6 47 1c 02 48 89 fb 75 50 48 c7 c7 20 ea 42 82 e8 e3 89 94 00 48 8b 53 20 48 8b 43 28 48 c7 c7 20 ea 42 82 <48> 89 42 08 48 89 10 48 b8 00 01 00 00 00 00 ad de 48 89 43 20 48
 RSP: 0018:ffffc90000fb3e78 EFLAGS: 00010246
 RAX: ffffffffa0772140 RBX: ffffffffa073b520 RCX: 0000000000000000
 RDX: ffffffffa08648e0 RSI: ffffc90000fb3db8 RDI: ffffffff8242ea20
 RBP: ffffffff82473100 R08: 0000000000000001 R09: 0000000000000000
 R10: ffff8887fec60fc0 R11: 0000000000000400 R12: ffff8887f9bd6900
 R13: ffffc9000065fcd8 R14: ffff8887dfeb0000 R15: ffff8887e4ee1a00
 FS:  0000000000000000(0000) GS:ffff8887fec40000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: ffffffffa08648e8 CR3: 0000000251524000 CR4: 00000000003506e0
 Call Trace:
  nfsd_file_cache_shutdown+0x20/0xc0 [nfsd]
  nfsd_last_thread+0x115/0x130 [nfsd]
  nfsd_destroy+0x2d/0x50 [nfsd]
  nfsd+0x11a/0x140 [nfsd]
  ? nfsd_destroy+0x50/0x50 [nfsd]
  kthread+0xf2/0x130
  ? kthread_park+0x80/0x80
  ret_from_fork+0x1f/0x30
 Modules linked in: rpcsec_gss_krb5 md4 cifs dns_resolver libdes nfs(-) nfsd auth_rpcgss nfs_acl lockd grace sunrpc amdgpu snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio iwlmvm snd_hda_intel snd_intel_dspcfg snd_hda_codec iwlwifi snd_hwdep mfd_core snd_hda_core gpu_sched ttm snd_pcm [last unloaded: nfsv4]
 CR2: ffffffffa08648e8
 ---[ end trace bd7faba3b52c6ad2 ]---


The shutdown code that triggers this (effectively the stop() function in /etc/init.d/nfs) is embarrassingly old and crusty, something I wrote a decade ago or thereabouts:
  # Bump any NFS connections off.
  umount -at nfs &>/dev/null
  for mp in $(findmnt -nt nfs -o TARGET); do
    [ -n "$mp" -a -d "$mp" ] && fuser -k -m "$mp" &>/dev/null
  done
  umount -at nfs &>/dev/null
  # First, stop the userland helper callback daemons
  killall rpc.statd rpc.rquotad rpc.nfsd \
	  rpc.mountd rpc.idmapd rpcbind &>/dev/null
  for x in /usr/sbin/rpc.{statd,rquotad,nfsd,mountd,idmapd} \
	   /sbin/rpcbind
  do
    pidof ${x##*/} >/dev/null && killit $x -TERM -9
  done
  # Next, try to unmount the filesystem that the kernel daemons support.
  if grep -q proc/fs/nfsd /proc/mounts; then
    fuser -k -m /proc/fs/nfsd &>/dev/null
    umount /proc/fs/nfsd &>/dev/null
  fi
  if grep -q var/lib/nfs/rpc_pipefs /proc/mounts; then
    fuser -k -m /var/lib/nfs/rpc_pipefs &>/dev/null
    umount /var/lib/nfs/rpc_pipefs &>/dev/null
  fi
  killall -q -9 nfsd rpciod &>/dev/null
  # Lastly, if the modules are loaded, try to unload them.
  for x in nfsv{2,3,4} nfs nfsd lockd nfs_acl auth_rpcgss sunrpc; do
    lsmod | grep -q "^${x}\>" && modprobe -r $x &>/dev/null
  done

A bit repetitive there, for sure; but each fragment solved a "the foo is still there after I shut it down" issue at the time.
Comment 1 bfields 2020-10-19 15:54:52 UTC
That's weird, shrinker registration and unregistration looks pretty simple and I haven't spotted any relevant nfsd changes in the git logs between v5.8 and v5.9....  Is there anything else interesting in the logs before that "Oops: ..." line?
Comment 2 Kris Karas 2020-10-19 19:36:47 UTC
Ah, I should have looked for [T1760] in the dmesg, as there are a few additional lines:

    BUG: unable to handle page fault for address: ffffffffa08648e8
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 240b067 P4D 240b067 PUD 240c063 PMD 7c20ee067 PTE 0

But nothing that interesting (except perhaps line #2)
[Bugzilla's "preview" button shows this comment without any <CRLF>, making me wonder if I should reply via email instead of using this interface?]

I rebuilt a clean 5.9.1 (distro uses GCC 9.3.0, binutils 2.33.1), booted into console mode, /etc/init.d/nfs stop.  Seemed to stop OK.  Then /etc/init.d/nfs start.  That triggered an immediate oops in "modprobe nfs" which then hung in D/io-wait.  I'll add that output as an attachment here, without any trimming, as it is considerably more verbose than the earlier oops-on-shutdown.
Comment 3 Kris Karas 2020-10-19 19:38:03 UTC
Created attachment 293069 [details]
dmesg output of oops during "modprobe nfs"
Comment 4 Kris Karas 2020-10-20 04:32:39 UTC
And the winner is...

95ad37f90c338e3fd4abf61cecfe02b6f3e080f0 is the first bad commit
commit 95ad37f90c338e3fd4abf61cecfe02b6f3e080f0
Author: Frank van der Linden <fllinden@amazon.com>
Date:   Tue Jun 23 22:39:04 2020 +0000

    NFSv4.2: add client side xattr caching.

    Implement client side caching for NFSv4.2 extended attributes. The cache
    is a per-inode hashtable, with name/value entries. There is one special
    entry for the listxattr cache.

    NFS inodes have a pointer to a cache structure. The cache structure is
    allocated on demand, freed when the cache is invalidated.

    Memory shrinkers keep the size in check. Large entries (> PAGE_SIZE)
    are collected by a separate shrinker, and freed more aggressively
    than others.

    Signed-off-by: Frank van der Linden <fllinden@amazon.com>
    Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

 fs/nfs/Makefile             |    2 +-
 fs/nfs/inode.c              |    9 +-
 fs/nfs/nfs42proc.c          |   12 +
 fs/nfs/nfs42xattr.c         | 1083 +++++++++++++++++++++++++++++++++++++++++++
 fs/nfs/nfs4_fs.h            |   22 +
 fs/nfs/nfs4proc.c           |   42 +-
 fs/nfs/nfs4super.c          |   10 +
 include/linux/nfs_fs.h      |    6 +
 include/uapi/linux/nfs_fs.h |    1 +
 9 files changed, 1179 insertions(+), 8 deletions(-)
 create mode 100644 fs/nfs/nfs42xattr.c


Just to verify, I backed out that single commit on a clean linux-5.9.tar.xz (imperfectly due to later commits), and am running it now as I type this.  No NFS oops.  As I look at the patch itself, it's reasonably substantial.  Happy to test a sub-patch if, perhaps, the author would care to suggest a test case.
Comment 5 bfields 2020-10-20 15:43:09 UTC
Oh, interesting, that patch registers three new shrinkers, then only unregisters two, which probably leaves the global list of shrinkers corrupted after the nfs module unloads, and then nfsd stumbles across that corruption.
Comment 6 bfields 2020-10-20 15:45:00 UTC
So, untested, but maybe all you need is:

diff --git a/fs/nfs/nfs42xattr.c b/fs/nfs/nfs42xattr.c
index 86777996cfec..efe938fac96e 100644
--- a/fs/nfs/nfs42xattr.c
+++ b/fs/nfs/nfs42xattr.c
@@ -1048,6 +1048,7 @@ int __init nfs4_xattr_cache_init(void)
 
 void nfs4_xattr_cache_exit(void)
 {
+       unregister_shrinker(&nfs4_xattr_large_entry_shrinker);
        unregister_shrinker(&nfs4_xattr_entry_shrinker);
        unregister_shrinker(&nfs4_xattr_cache_shrinker);
        list_lru_destroy(&nfs4_xattr_entry_lru);
Comment 7 Kris Karas 2020-10-20 20:51:52 UTC
Created attachment 293093 [details]
Oops after applying test-patch from comment 6

That one-liner patch looked promising, but alas, triggers a different oops, this one when NFS is being restarted.  (The bug may be due to corruption introduced in the shutdown, but it only shows itself on the restart.)

This oops contains do_mount(), which suggests it's occurring roughly here:
     mount -t nfsd none /proc/fs/nfsd
Comment 8 bfields 2020-10-21 02:19:29 UTC
Whoops, how about this:

commit 25c328172348 (HEAD -> for-trond)
Author: J. Bruce Fields <bfields@redhat.com>
Date:   Tue Oct 20 11:45:10 2020 -0400

    NFSv4.2: fix failure to unregister shrinker
    
    We forgot to unregister the nfs4_xattr_large_entry_shrinker.
    
    That leaves the global list of shrinkers corrupted after unload of the
    nfs module, after which possibly unrelated code that calls
    register_shrinker() or unregister_shrinker() gets a BUG() with
    "supervisor write access in kernel mode".
    
    Reported-by: Kris Karas <bugs-a17@moonlit-rail.com>
    Signed-off-by: J. Bruce Fields <bfields@redhat.com>

diff --git a/fs/nfs/nfs42xattr.c b/fs/nfs/nfs42xattr.c
index 86777996cfec..55b44a42d625 100644
--- a/fs/nfs/nfs42xattr.c
+++ b/fs/nfs/nfs42xattr.c
@@ -1048,8 +1048,10 @@ int __init nfs4_xattr_cache_init(void)
 
 void nfs4_xattr_cache_exit(void)
 {
+       unregister_shrinker(&nfs4_xattr_large_entry_shrinker);
        unregister_shrinker(&nfs4_xattr_entry_shrinker);
        unregister_shrinker(&nfs4_xattr_cache_shrinker);
+       list_lru_destroy(&nfs4_xattr_large_entry_lru);
        list_lru_destroy(&nfs4_xattr_entry_lru);
        list_lru_destroy(&nfs4_xattr_cache_lru);
        kmem_cache_destroy(nfs4_xattr_cache_cachep);
Comment 9 Kris Karas 2020-10-21 06:08:38 UTC
I have beaten on the patch in comment 8 quite a bit, and cannot make it fail. Has my blessing.  Feel free to close with resolved/code-fix if no other feedback needed.

Tested-By: Kris Karas <bugs-a17@moonlit-rail.com>

As an aside, I am idly curious why others running 5.9 have not encountered this, too.  I suppose most people are running systemd-based distros, which might step around the bug somehow without triggering it.
Comment 10 bfields 2020-10-21 14:37:34 UTC
Thanks for the confirmation.

You only hit the bug if you unload the nfs module.  I bet most standard scripts don't bother to--the module probably gets loaded automatically on mount and never unmounted.
Comment 11 Kris Karas 2020-11-12 06:53:46 UTC
Has the fix for this been sent upstream?  I've had to hand-patch all the 5.9.x shipped out by Greg KH, still no sign of the fix there.
Comment 12 bfields 2020-11-12 14:04:52 UTC
Thanks for checking, looks like it may have gotten lost somewhere, I've sent a reminder.