Bug 208451

Summary: NFS server occasionally spontaneously reboots when client mounts exported directory
Product: Networking
Reporter: Robert Dinse (nanook)
Component: Other
Assignee: Stephen Hemminger (stephen)
Status: RESOLVED CODE_FIX
Severity: normal
CC: bfields
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 5.7.7
Subsystem:
Regression: No
Bisected commit-id:

Description Robert Dinse 2020-07-06 02:11:42 UTC
This may be related to bug #208157.  From 5.7.0 through 5.7.4, nfs-server
would not start upon boot on one of my servers.

     With 5.7.7 this was resolved BUT now when I reboot one of the NFS clients
or unmount and remount an NFS partition on the client, the NFS server will
sometimes spontaneously reboot.

     I get these messages in /var/log/dmesg.0:

     (I assume the old dmesg log is the relevant one, since it was the
active log before the last boot.)

[   40.302192] systemd[1]: Mounting NFSD configuration filesystem...
[   40.688313] kernel: RPC: Registered tcp NFSv4.1 backchannel transport module.
[   69.899630] kernel: NFSD: Using UMH upcall client tracking operations.
[   69.899635] kernel: NFSD: starting 90-second grace period (net f00000a8)

     After the NFS server reboots I see these NFS related messages in dmesg:

[   53.810062] systemd[1]: Mounting NFSD configuration filesystem...
[   54.254326] RPC: Registered tcp NFSv4.1 backchannel transport module.
[  106.468779] NFSD: Using UMH upcall client tracking operations.
[  106.468781] NFSD: starting 90-second grace period (net f00000a8)
[  107.631713] NFS: Registering the id_resolver key type
[  110.815312] NFS4: Couldn't follow remote path
[  113.935404] NFS4: Couldn't follow remote path
[  117.055421] NFS4: Couldn't follow remote path
[  120.175488] NFS4: Couldn't follow remote path
[  123.295611] NFS4: Couldn't follow remote path
[  126.415625] NFS4: Couldn't follow remote path
[  129.545752] NFS4: Couldn't follow remote path
[  132.655844] NFS4: Couldn't follow remote path

     So pretty much the same thing, except for the "NFS4: Couldn't follow
remote path" messages, which I've read are caused by old nfs-utils not using
the new system calls, so they are probably not relevant.
Comment 1 bfields 2020-07-07 14:31:05 UTC
(In reply to Robert Dinse from comment #0)
> This may be related to bug #208157.  From 5.7.0 through 5.7.4, nfs-server
> would not start upon boot on one of my servers.
> 
>      With 5.7.7 this was resolved BUT now when I reboot one of the NFS
> clients
> or unmount and remount an NFS partition on the client, the NFS server will
> sometimes spontaneously reboot.

Well, that's not good.  Too bad the dmesg has nothing interesting in it.  Can you capture console output to see if there are messages that aren't making it to disk before the reboot?

Do you have CONFIG_PANIC_ON_OOPS set?
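
One way to check this, as a rough sketch: the helper below parses kernel config text and reports how a given option is set. The function name kconfig_state is made up for illustration, and where the config text comes from is an assumption (many distributions ship it as /boot/config-<release>, but that location is not guaranteed).

```python
def kconfig_state(config_text, option):
    """Report how `option` appears in kernel config text.

    Returns the assigned value (for example 'y' or 'm') when the option
    is set, 'n' when it is explicitly marked "is not set", and None when
    the option does not appear at all.
    """
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options are recorded as a comment line.
        if line == f"# {option} is not set":
            return "n"
        # Require the '=' so CONFIG_FOO does not match CONFIG_FOO_VALUE.
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None
```

Calling kconfig_state(text, "CONFIG_PANIC_ON_OOPS") on the running kernel's config text would answer the question for the build. The runtime setting can differ from the build default and is exposed separately in /proc/sys/kernel/panic_on_oops.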

Is it possible for you to build kernels between 5.7.4 and 5.7.7 to figure out where exactly the server started crashing?
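
The narrowing-down step can be sketched as a binary search over an ordered list of releases. This is only an illustration: first_bad and is_bad are hypothetical names, and is_bad stands in for the manual step of building and booting a kernel, remounting from a client, and seeing whether the server reboots. It also assumes the oldest entry is known good and the newest is known bad. In practice git bisect applies the same idea across individual commits.

```python
def first_bad(versions, is_bad):
    """Binary search for the first version where is_bad(version) is True.

    Assumes `versions` is ordered oldest to newest, the first entry is
    known good, the last entry is known bad, and every version after the
    first bad one is also bad.
    """
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid  # the first bad version is at mid or earlier
        else:
            lo = mid  # the first bad version is after mid
    return versions[hi]
```

With only 5.7.5 and 5.7.6 sitting between the two releases named above, at most two test builds would be needed.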
Comment 2 Robert Dinse 2020-07-07 18:40:50 UTC
On Tue, 7 Jul 2020, bugzilla-daemon@bugzilla.kernel.org wrote:
>
> Well, that's not good.  Too bad the dmesg has nothing interesting in it.  Can
> you capture console output to see if there are messages that aren't making it
> to disk before the reboot?
>
> Do you have CONFIG_PANIC_ON_OOPS set?
>
> Is it possible for you to build kernels between 5.7.4 and 5.7.7 to figure out
> where exactly the server started crashing?

      It started crashing at 5.7.6, and I'm not sure how to get 5.7.5 now.  I
don't have access to the console because the machine is 21 miles from me, and
when it reboots the console is overwritten by the login screen.

      However, I discovered another problem that may be related: nouveau is
allowing the nvidia 210 video card to DMA without having allocated the memory,
so who knows what it is randomly overwriting.  I can't rule that out as the
cause, but the crashes always occur when I try to mount a file system on a
client.

      Presently I've reverted to a stock Ubuntu 5.4.0 kernel just to make
sure I haven't got a hardware issue.
Comment 3 Robert Dinse 2020-08-09 01:28:16 UTC
With a recent patch applied, this appears to be totally fixed in 5.8, so I am closing this ticket.