Bug 217815 - kernel 6.4/6.5 nfs 4.1 unresponsive
Status: RESOLVED MOVED
Alias: None
Product: File System
Classification: Unclassified
Component: NFS
Hardware: All Linux
Importance: P3 high
Assignee: Trond Myklebust
 
Reported: 2023-08-23 02:07 UTC by greg
Modified: 2023-09-02 06:51 UTC

Regression: No


Description greg 2023-08-23 02:07:49 UTC
I have two Synology DiskStation NAS devices whose NFS exports are mounted on Gentoo servers with the following fstab configuration:

10.200.1.247:/volume1/filer02-sata      /mnt/filer02-sata       nfs     vers=4.1,tcp,rsize=32768,wsize=32768,nolock,noatime,nodiratime,hard,timeo=60,retry=6,retrans=6,nconnect=4 0 0
10.200.1.247:/volume1/filer03-sata      /mnt/filer03-sata       nfs     vers=4.1,tcp,rsize=32768,wsize=32768,nolock,noatime,nodiratime,hard,timeo=60,retry=6,retrans=6,nconnect=4 0 0
10.200.1.246:/volume1/filer04-sata      /mnt/filer04-sata       nfs     vers=4.1,tcp,rsize=32768,wsize=32768,nolock,noatime,nodiratime,hard,timeo=60,retry=6,retrans=6,nconnect=4 0 0
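
For reference, the retry-related options in these entries can be unpacked as follows. This is a minimal sketch and not part of the original report: the helper names are hypothetical, and the timing assumes the behaviour described in nfs(5), where timeo is given in tenths of a second and, over TCP, the timeout grows linearly by timeo after each retransmission (capped at 600 seconds).

```python
# Option string copied from the fstab entries above.
OPTIONS = ("vers=4.1,tcp,rsize=32768,wsize=32768,nolock,noatime,nodiratime,"
           "hard,timeo=60,retry=6,retrans=6,nconnect=4")


def parse_mount_options(options: str) -> dict:
    """Split a comma-separated option string; bare flags map to True."""
    parsed = {}
    for item in options.split(","):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def retry_schedule(timeo_deciseconds: int, retrans: int,
                   cap_seconds: float = 600.0) -> list:
    """Per-attempt timeouts in seconds for the initial try plus `retrans`
    retries, assuming the linear backoff described in nfs(5)."""
    step = timeo_deciseconds / 10.0
    return [min(step * (attempt + 1), cap_seconds)
            for attempt in range(retrans + 1)]


opts = parse_mount_options(OPTIONS)
schedule = retry_schedule(int(opts["timeo"]), int(opts["retrans"]))
print(schedule)
```

With timeo=60 the first timeout is 6 seconds, well below the TCP default of 600 (60 seconds), so a slow network path surfaces as "not responding" messages sooner than it would with the defaults.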


On Linux Kernel 6.3.6 these work perfectly fine.

As soon as I upgrade to 6.4 (tested 6.4.7 through 6.4.11) or 6.5-rc7, the NFS mounts randomly hang and block system operation. The load average climbs and the system eventually freezes.

dmesg/syslog:

Aug 22 18:13:49 sjc-www2 kernel: nfs: server 10.200.1.247 OK
Aug 22 18:13:49 sjc-www2 kernel: nfs: server 10.200.1.247 OK
Aug 22 18:13:49 sjc-www2 kernel: nfs: server 10.200.1.247 OK
Aug 22 18:13:49 sjc-www2 kernel: nfs: server 10.200.1.247 OK
Aug 22 18:14:35 sjc-www2 kernel: nfs: server 10.200.1.247 not responding, still trying
Aug 22 18:14:35 sjc-www2 kernel: nfs: server 10.200.1.247 not responding, still trying
Aug 22 18:14:35 sjc-www2 kernel: nfs: server 10.200.1.247 not responding, still trying
Aug 22 18:14:35 sjc-www2 kernel: nfs: server 10.200.1.247 not responding, still trying
Aug 22 18:14:35 sjc-www2 kernel: nfs: server 10.200.1.247 not responding, still trying
Aug 22 18:14:35 sjc-www2 kernel: nfs: server 10.200.1.247 not responding, still trying
Aug 22 18:14:35 sjc-www2 kernel: nfs: server 10.200.1.247 not responding, still trying
Aug 22 18:14:35 sjc-www2 kernel: nfs: server 10.200.1.247 not responding, still trying
Aug 22 18:14:35 sjc-www2 kernel: nfs: server 10.200.1.247 not responding, still trying
Aug 22 18:14:35 sjc-www2 kernel: nfs: server 10.200.1.247 not responding, still trying
Aug 22 18:15:23 sjc-www2 kernel: nfs: server 10.200.1.247 OK
Aug 22 18:15:23 sjc-www2 kernel: nfs: server 10.200.1.247 OK
Aug 22 18:15:23 sjc-www2 kernel: nfs: server 10.200.1.247 OK
Aug 22 18:15:23 sjc-www2 kernel: nfs: server 10.200.1.247 OK
Aug 22 18:15:23 sjc-www2 kernel: nfs: server 10.200.1.247 OK
Aug 22 18:15:23 sjc-www2 kernel: nfs: server 10.200.1.247 OK
Aug 22 18:15:23 sjc-www2 kernel: nfs: server 10.200.1.247 OK
Aug 22 18:15:23 sjc-www2 kernel: nfs: server 10.200.1.247 OK
Aug 22 18:15:23 sjc-www2 kernel: nfs: server 10.200.1.247 OK
Aug 22 18:15:23 sjc-www2 kernel: nfs: server 10.200.1.247 OK
Aug 22 18:16:05 sjc-www2 kernel: nfs: server 10.200.1.247 not responding, still trying
Aug 22 18:16:05 sjc-www2 kernel: nfs: server 10.200.1.247 not responding, still trying
Aug 22 18:16:05 sjc-www2 kernel: nfs: server 10.200.1.247 not responding, still trying
Aug 22 18:16:05 sjc-www2 kernel: nfs: server 10.200.1.247 not responding, still trying
Aug 22 18:16:05 sjc-www2 kernel: nfs: server 10.200.1.247 not responding, still trying
Aug 22 18:16:05 sjc-www2 kernel: nfs: server 10.200.1.247 not responding, still trying
Aug 22 18:16:05 sjc-www2 kernel: nfs: server 10.200.1.247 not responding, still trying
Aug 22 18:16:05 sjc-www2 kernel: nfs: server 10.200.1.247 not responding, still trying
Aug 22 18:16:05 sjc-www2 kernel: nfs: server 10.200.1.247 not responding, still trying
Aug 22 18:16:05 sjc-www2 kernel: nfs: server 10.200.1.247 not responding, still trying
Aug 22 18:16:54 sjc-www2 kernel: nfs: server 10.200.1.247 OK
Aug 22 18:16:54 sjc-www2 kernel: nfs: server 10.200.1.247 OK
Aug 22 18:16:54 sjc-www2 kernel: nfs: server 10.200.1.247 OK
Aug 22 18:16:54 sjc-www2 kernel: nfs: server 10.200.1.247 OK
Aug 22 18:16:54 sjc-www2 kernel: nfs: server 10.200.1.247 OK
Aug 22 18:16:54 sjc-www2 kernel: nfs: server 10.200.1.247 OK
Aug 22 18:16:54 sjc-www2 kernel: nfs: server 10.200.1.247 OK
Aug 22 18:16:54 sjc-www2 kernel: nfs: server 10.200.1.247 OK
Aug 22 18:16:54 sjc-www2 kernel: nfs: server 10.200.1.247 OK
Aug 22 18:16:54 sjc-www2 kernel: nfs: server 10.200.1.247 OK
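
The length of each stall can be read off the timestamps above. A minimal sketch, assuming the syslog layout shown in this report; the function name is hypothetical and the excerpt keeps one line per state change (the year 2023 is taken from the report date, since syslog omits it):

```python
from datetime import datetime

# One line per distinct state change, taken from the log above.
LOG = """\
Aug 22 18:13:49 sjc-www2 kernel: nfs: server 10.200.1.247 OK
Aug 22 18:14:35 sjc-www2 kernel: nfs: server 10.200.1.247 not responding, still trying
Aug 22 18:15:23 sjc-www2 kernel: nfs: server 10.200.1.247 OK
Aug 22 18:16:05 sjc-www2 kernel: nfs: server 10.200.1.247 not responding, still trying
Aug 22 18:16:54 sjc-www2 kernel: nfs: server 10.200.1.247 OK
"""


def stall_durations(log_text: str, year: int = 2023) -> list:
    """Return the seconds between each 'not responding' and the next 'OK'."""
    stalls = []
    stall_start = None
    for line in log_text.splitlines():
        if "nfs: server" not in line:
            continue
        # The first 15 characters hold the syslog timestamp, e.g. "Aug 22 18:14:35".
        stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        if "not responding" in line:
            if stall_start is None:  # repeated messages belong to the same stall
                stall_start = stamp
        elif line.rstrip().endswith(" OK") and stall_start is not None:
            stalls.append(int((stamp - stall_start).total_seconds()))
            stall_start = None
    return stalls


print(stall_durations(LOG))
```

For the excerpt above this gives stalls of 48 and 49 seconds.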


The box I have been testing the kernel upgrades on has one 10G NIC configured with MTU 9000 for the NFS volumes, and I can successfully ping the NFS host with 9000-byte packets:

sjc-www2 ~ # ping -4 -s 9000 10.200.1.247
PING 10.200.1.247 (10.200.1.247) 9000(9028) bytes of data.
9008 bytes from 10.200.1.247: icmp_seq=1 ttl=64 time=0.205 ms
9008 bytes from 10.200.1.247: icmp_seq=2 ttl=64 time=0.279 ms
9008 bytes from 10.200.1.247: icmp_seq=3 ttl=64 time=0.402 ms
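
One caveat on this check: -s 9000 sets the ICMP payload size, so each request is a 9028-byte IP packet (as the first output line shows) and exceeds a 9000-byte MTU. Without the don't-fragment flag it is sent as fragments, so a reply does not by itself prove that jumbo frames pass end to end. A minimal sketch of the arithmetic, assuming IPv4 without IP options and iputils ping, whose -M do option forbids fragmentation; the helper names are hypothetical:

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo header


def max_icmp_payload(mtu: int) -> int:
    """Largest ping payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def df_ping_command(host: str, mtu: int, count: int = 3) -> list:
    """Build a ping command line with fragmentation forbidden (-M do)."""
    return ["ping", "-4", "-M", "do", "-c", str(count),
            "-s", str(max_icmp_payload(mtu)), host]


print(max_icmp_payload(9000))
print(" ".join(df_ping_command("10.200.1.247", 9000)))
```

If the printed command succeeds, every hop on the path accepts 9000-byte frames; if it fails while the fragmented ping works, some hop has a smaller MTU.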
Comment 1 Bagas Sanjaya 2023-08-23 10:59:37 UTC
(In reply to greg from comment #0)
> I have two Synology Disk station NAS devices with NFS mounts present on
> Gentoo servers with the following fstab mount configuration:
> 
> 10.200.1.247:/volume1/filer02-sata      /mnt/filer02-sata       nfs    
> vers=4.1,tcp,rsize=32768,wsize=32768,nolock,noatime,nodiratime,hard,timeo=60,
> retry=6,retrans=6,nconnect=4 0 0
> 10.200.1.247:/volume1/filer03-sata      /mnt/filer03-sata       nfs    
> vers=4.1,tcp,rsize=32768,wsize=32768,nolock,noatime,nodiratime,hard,timeo=60,
> retry=6,retrans=6,nconnect=4 0 0
> 10.200.1.246:/volume1/filer04-sata      /mnt/filer04-sata       nfs    
> vers=4.1,tcp,rsize=32768,wsize=32768,nolock,noatime,nodiratime,hard,timeo=60,
> retry=6,retrans=6,nconnect=4 0 0
> 
> 
> On Linux Kernel 6.3.6 these work perfectly fine.
> 
> As soon as I upgrade to 6.4 (tested 6.4.7 through 6.4.11) or 6.5-rc7 NFS
> mounts randomly hang and block system operation with high load times
> eventually resulting in a system freeze.
> 

Can you then perform a bisection (see Documentation/admin-guide/bug-bisect.rst
in the kernel sources) to find the culprit?
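
For readers unfamiliar with the idea, bisection is a binary search over the commit history between a known-good and a known-bad kernel. The sketch below only illustrates the search itself: the commit names and the location of the regression are hypothetical, and in practice `git bisect` does this bookkeeping while each "test" is a kernel build, boot, and attempt to reproduce the hang.

```python
def bisect_first_bad(commits, is_bad):
    """Return (first bad commit, number of tests run).

    Assumes commits[0] is known good and commits[-1] is known bad.
    """
    good, bad = 0, len(commits) - 1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1
        if is_bad(commits[mid]):
            bad = mid   # regression is at or before mid
        else:
            good = mid  # regression is after mid
    return commits[bad], tests


# Hypothetical history of 1000 commits with the regression introduced at c0637.
commits = [f"c{i:04d}" for i in range(1000)]
culprit, tests = bisect_first_bad(commits, lambda c: c >= "c0637")
print(culprit, tests)
```

The number of tests grows with the logarithm of the number of commits, which is why even a full release cycle can be bisected in a manageable number of builds.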
Comment 2 Chuck Lever 2023-08-23 13:47:33 UTC
(In reply to greg from comment #0)
> I have two Synology Disk station NAS devices with NFS mounts present on
> Gentoo servers with the following fstab mount configuration:
> 
> On Linux Kernel 6.3.6 these work perfectly fine.
> 
> As soon as I upgrade to 6.4 (tested 6.4.7 through 6.4.11) or 6.5-rc7 NFS
> mounts randomly hang and block system operation with high load times
> eventually resulting in a system freeze.

It is not clear from the initial description: are you changing the kernel on your Synology devices (NFS servers) or on your Gentoo servers (NFS clients)?
Comment 3 greg 2023-08-25 06:06:16 UTC
This occurs with the Gentoo Servers (NFS clients).

The Gentoo server is stable on kernel 6.3.
As soon as I upgrade to any kernel from 6.4.x up to 6.5-rc (including a kernel built from git), I get a reproducible high-load event and an unstable NFS mount.

There are no changes made to the Synology NAS units (NFS servers).

I will attempt to build the Linux kernel from git and follow the bisect instructions, but this will take time I do not currently have.
Comment 4 Bagas Sanjaya 2023-08-27 02:17:32 UTC
On 25/08/2023 13:06, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=217815
> 
> --- Comment #3 from greg@greg.net.au ---
> This occurs with the Gentoo Servers (NFS clients).
> 
> Gentoo Server running kernel 6.3 - stable
> As soon as I upgrade to kernel 6.4.x => 6.5-rc / linux-kernel from git I get
> a reproducible high-load event and unstable NFS mount.
> 
> There are no changes made to the Synology NAS units (NFS servers).
> 
> I will attempt to build the linux kernel from git and use the bisect
> instructions but this will take time I do not currently have.
> 

FYI, see Documentation/admin-guide/quickly-build-trimmed-linux.rst
in the kernel sources for how to build a kernel with localmodconfig.

Thanks.
Comment 5 greg 2023-09-02 06:50:54 UTC
An update from my end: this appears to be an issue with the Intel i40e NIC driver and not with NFS. Further testing with iperf revealed very poor general network performance.

I am now able to run 6.4.13 stable after downloading the latest i40e driver from SourceForge and compiling it manually as a module.

This bug can probably be closed here; I will open a ticket with the i40e driver team.
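
To confirm which driver build a NIC is actually using after such a swap, the output of `ethtool -i` can be inspected. A minimal sketch, assuming the driver in question is Intel's i40e and that ethtool is installed; the helper names are hypothetical, and the sample text uses placeholder values rather than output from this system:

```python
import subprocess


def parse_ethtool_info(text: str) -> dict:
    """Parse the 'key: value' lines printed by `ethtool -i`."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def driver_info(interface: str) -> dict:
    """Run `ethtool -i <interface>` and return its fields as a dict."""
    result = subprocess.run(["ethtool", "-i", interface],
                            capture_output=True, text=True, check=True)
    return parse_ethtool_info(result.stdout)


# Hypothetical sample in the `ethtool -i` layout; values are placeholders.
SAMPLE = """\
driver: i40e
version: <driver version>
firmware-version: <firmware version>
bus-info: 0000:00:00.0
"""
print(parse_ethtool_info(SAMPLE)["driver"])
```

Calling driver_info() with the real interface name before and after loading the out-of-tree module shows whether the "version" field actually changed.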
