Bug 115521 - NFS Server waiting for data causing client hanging
Summary: NFS Server waiting for data causing client hanging
Status: RESOLVED UNREPRODUCIBLE
Alias: None
Product: File System
Classification: Unclassified
Component: NFS (show other bugs)
Hardware: x86-64 Linux
: P1 high
Assignee: bfields
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-03-29 23:08 UTC by piknik
Modified: 2022-01-21 17:43 UTC (History)
2 users (show)

See Also:
Kernel Version: 4.4.5
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description piknik 2016-03-29 23:08:23 UTC
I have configured an NFS server on the host with the following export:

"/srv/nfs4/proto" 192.168.1.70(rw,insecure,no_subtree_check,all_squash,anonuid=1000,anongid=1000)

I intend to have this export available to any anonymous user. I have a qemu VM with the exact same kernel version on a different IP mounting this export to /srv/http.

Unfortunately, this bug does not occur immediately. It takes time before the NFS server becomes unresponsive to any and all clients, but when it does hang, it requires a complete reboot of the host. I have confirmed that this bug exists at least 4 times on my machine.

I have enabled rpcdebug and have posted relevant logs below:

During the 'hang' (repeated many times):
Mar 29 18:54:29 PIKNIK-SRV kernel: --> nfsd4_store_cache_entry slot ffff8800ac6f3000
Mar 29 18:54:29 PIKNIK-SRV kernel: renewing client (clientid 56f9c078/00000001)
Mar 29 18:54:29 PIKNIK-SRV kernel: svc: socket ffff8800ac6f4000 sendto([ffff8803d5d60000 92... ], 92) = 92 (addr 192.168.1.70, port=1013)
Mar 29 18:54:29 PIKNIK-SRV kernel: svc: server ffff88041cd14000 waiting for data (to = 1080000)
Mar 29 18:54:44 PIKNIK-SRV kernel: svc: socket ffff88006326d280 TCP data ready (svsk ffff8800ac6f4000)
Mar 29 18:54:44 PIKNIK-SRV kernel: svc: server ffff88041cd14000, pool 0, transport ffff8800ac6f4000, inuse=4
Mar 29 18:54:44 PIKNIK-SRV kernel: svc: tcp_recv ffff8800ac6f4000 data 1 conn 0 close 0
Mar 29 18:54:44 PIKNIK-SRV kernel: svc: socket ffff8800ac6f4000 recvfrom(ffff8800ac6f42b8, 4) = 4
Mar 29 18:54:44 PIKNIK-SRV kernel: svc: TCP record, 212 bytes
Mar 29 18:54:44 PIKNIK-SRV kernel: svc: socket ffff8800ac6f4000 recvfrom(ffff8802f8cdc000, 4096) = 212
Mar 29 18:54:44 PIKNIK-SRV kernel: svc: TCP final record (212 bytes)
Mar 29 18:54:44 PIKNIK-SRV kernel: svc: got len=212
Mar 29 18:54:44 PIKNIK-SRV kernel: svc: svc_authenticate (1)
Mar 29 18:54:44 PIKNIK-SRV kernel: RPC:       Want update, refage=120, age=120
Mar 29 18:54:44 PIKNIK-SRV kernel: svc: server ffff88034c1ce000, pool 0, transport ffff8800ac6f4000, inuse=5
Mar 29 18:54:44 PIKNIK-SRV kernel: svc: calling dispatcher
Mar 29 18:54:44 PIKNIK-SRV kernel: nfsd_dispatch: vers 4 proc 1
Mar 29 18:54:44 PIKNIK-SRV kernel: svc: tcp_recv ffff8800ac6f4000 data 1 conn 0 close 0
Mar 29 18:54:44 PIKNIK-SRV kernel: nfsv4 compound op #1/3: 53 (OP_SEQUENCE)
Mar 29 18:54:44 PIKNIK-SRV kernel: __find_in_sessionid_hashtbl: 1459208312:1:2:0
Mar 29 18:54:44 PIKNIK-SRV kernel: nfsd4_sequence: slotid 0
Mar 29 18:54:44 PIKNIK-SRV kernel: svc: socket ffff8800ac6f4000 recvfrom(ffff8800ac6f42b8, 4) = -11
Mar 29 18:54:44 PIKNIK-SRV kernel: RPC: TCP recv_record got -11
Mar 29 18:54:44 PIKNIK-SRV kernel: RPC: TCP recvfrom got EAGAIN
Mar 29 18:54:44 PIKNIK-SRV kernel: svc: got len=0
Mar 29 18:54:44 PIKNIK-SRV kernel: svc: server ffff88034c1ce000 waiting for data (to = 1080000)
Mar 29 18:54:44 PIKNIK-SRV kernel: check_slot_seqid enter. seqid 69 slot_seqid 68
Mar 29 18:54:44 PIKNIK-SRV kernel: nfsv4 compound op ffff880345333080 opcnt 3 #1: 53: status 0
Mar 29 18:54:44 PIKNIK-SRV kernel: nfsv4 compound op #2/3: 22 (OP_PUTFH)
Mar 29 18:54:44 PIKNIK-SRV kernel: nfsd: fh_verify(28: 00070001 00762703 00000000 9b82b229 a1414d6c 2a240087)
Mar 29 18:54:44 PIKNIK-SRV kernel: RPC:       Want update, refage=120, age=100
Mar 29 18:54:49 PIKNIK-SRV kernel: nfsv4 compound op ffff880345333080 opcnt 3 #2: 22: status 10008
Mar 29 18:54:49 PIKNIK-SRV kernel: nfsv4 compound returned 10008

When nfsd has been shut down (could be normal, but why have activity?):
Mar 29 18:59:59 PIKNIK-SRV kernel: nfsd: initializing export module (net: ffff880353fe98c0).
Mar 29 18:59:59 PIKNIK-SRV kernel: RPC:       registering /proc/net/rpc
Mar 29 18:59:59 PIKNIK-SRV kernel: nfsd: initializing export module (net: ffff880066d318c0).
Mar 29 18:59:59 PIKNIK-SRV kernel: nfsd: shutting down export module (net: ffff880066d318c0).
Mar 29 18:59:59 PIKNIK-SRV kernel: nfsd: export shutdown complete (net: ffff880066d318c0).
Mar 29 18:59:59 PIKNIK-SRV kernel: RPC:       unregistering /proc/net/rpc
Mar 29 18:59:59 PIKNIK-SRV kernel: RPC:       registering /proc/net/rpc
Mar 29 18:59:59 PIKNIK-SRV kernel: nfsd: initializing export module (net: ffff880066d34a40).
Mar 29 18:59:59 PIKNIK-SRV kernel: nfsd: shutting down export module (net: ffff880066d34a40).
Mar 29 18:59:59 PIKNIK-SRV kernel: nfsd: export shutdown complete (net: ffff880066d34a40).
Mar 29 18:59:59 PIKNIK-SRV kernel: RPC:       unregistering /proc/net/rpc
Mar 29 19:00:12 PIKNIK-SRV kernel: nfsd: shutting down export module (net: ffff880353feb180).
Mar 29 19:00:12 PIKNIK-SRV kernel: nfsd: export shutdown complete (net: ffff880353feb180).
Mar 29 19:00:12 PIKNIK-SRV kernel: RPC:       unregistering /proc/net/rpc
Mar 29 19:00:12 PIKNIK-SRV kernel: nfsd: shutting down export module (net: ffff880353fe98c0).
Mar 29 19:00:12 PIKNIK-SRV kernel: nfsd: export shutdown complete (net: ffff880353fe98c0).
Mar 29 19:00:12 PIKNIK-SRV kernel: RPC:       unregistering /proc/net/rpc

Let me know if there's more info I could provide.
Comment 1 piknik 2016-03-30 19:27:24 UTC
I gave up just the other day trying to fix anything, but today I've found that it works again after trying my setup without restarting any services. I haven't restarted the host at any time, but the only thing I can think of that may have fixed the issue is a nightly (03:00) pagecache, dentries and inodes clear with echo 3 > /proc/sys/vm/drop_caches.
Comment 2 piknik 2016-04-01 19:48:06 UTC
I noticed I had many binds to /srv/nfs4/proto when using the mount command. Maybe this was the problem. One of my scripts had repeatedly binded the same location. I'll see if fixing that fixes my problem.
Comment 3 bfields 2022-01-21 17:43:08 UTC
Not sure what was going on there, but I assume it's no longer a problem.  Closing for now.

Note You need to log in before you can comment on or make changes to this bug.