Bug 215975

Summary: NFSD stops serving clients
Product: File System
Component: NFS
Status: RESOLVED CODE_FIX
Severity: normal
Priority: P1
Hardware: Intel
OS: Linux
Kernel Version: 5.13.1
Regression: No
Reporter: Ho Ming Shun (cyph1984)
Assignee: Chuck Lever (chuck.lever)
CC: mgorman, regressions
Attachments:
Pcap of it transitioning from working to non-working state
trace-cmd record -e nfsd -e sunrpc of deadlock
crash #2 on same machine
crash #3 at a different site
crash #4 at yet another site
crash #5
sysreq dump #1
PCAP of crash on 5.18.0
sysreq dump #2
Patch to always allocate at least one page
trace-cmd record -p function_graph -g __alloc_pages_bulk --max-graph-depth=4
Livelock #1
Livelock #2

Description Ho Ming Shun 2022-05-13 18:48:19 UTC
Starting with kernel v5.13, NFSD gets into a state after about 24 hours where it stops serving calls to all clients and does not respond to new mount requests. This looks suspiciously like bug #213887, given when the issue first started appearing.

My setup at multiple sites is an NFS server exporting read-only root filesystems to about 10-30 Raspberry Pis. The protocol used is NFS v4.0.

When the issue happens, I try to recover by stopping nfs-server and running rmmod nfsd auth_rpcgss nfs_acl lockd grace sunrpc. However, upon starting the NFS server again, the issue is not resolved. The only way to recover is a reboot.

Wireshark shows the client sending a NULL call and the server not responding.
Comment 1 Ho Ming Shun 2022-05-13 18:54:13 UTC
Created attachment 300954 [details]
Pcap of it transitioning from working to non-working state
Comment 2 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-05-20 05:26:46 UTC
Hi, I'm not one of the developers that should handle it, but consider forwarding it to them. But beforehand, a quick question: does this problem still happen with the latest kernel versions? (Trying 5.18-rc would be ideal, but testing the latest 5.17.y will do, too.)
Comment 3 Ho Ming Shun 2022-05-20 09:31:09 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #2)
> Hi, I'm not one of the developers that should handle it, but consider
> forwarding it to them. But beforehand, a quick question: does this problem
> still happen with the latest kernel versions? (Trying 5.18-rc would be ideal,
> but testing the latest 5.17.y will do, too.)

It still happens on 5.17.7. I will try to reproduce it on a newer kernel, but it is quite rare and annoyingly only seems to happen in production.

It usually seems to happen after the creation and deletion of large (> 1 GB) files, when ffmpeg transmuxes a file and a script deletes the original video file.
Comment 4 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-05-20 13:28:30 UTC
(In reply to Ho Ming Shun from comment #3)
> It still happens on 5.17.7. I will try to reproduce it on a newer kernel, but
> it is quite rare and annoyingly only seems to happen in production.

okay, thx, that's good enough for me/now!
Comment 5 Chuck Lever 2022-05-20 15:00:38 UTC
Are there any relevant kernel log entries about the time it stops responding? The symptomology is similar to an oops/BUG that occurs while holding an NFSD-related spin lock or mutex. Evidence of that might appear in the kernel log.
Comment 6 Ho Ming Shun 2022-05-20 15:23:47 UTC
(In reply to Chuck Lever from comment #5)
> Are there any relevant kernel log entries about the time it stops
> responding? The symptomology is similar to an oops/BUG that occurs while
> holding an NFSD-related spin lock or mutex. Evidence of that might appear in
> the kernel log.

No oops/BUG when it happens. The weird thing is that it does not recover even if I unload the nfsd module. So it seems to happen outside of NFSD, but is still NFSD-related.
Comment 7 Chuck Lever 2022-05-20 15:30:35 UTC
I'll operate on the assumption that this is a kernel bug and not a failure of an external service like DNS (ie, a failure that might leave the server's user space in some kind of weird state). You could rebuild your server kernel with some debugging:

 -- CONFIG_PROVE_LOCKING
 -- CONFIG_DETECT_HUNG_TASK

When the server hangs, it might produce some kernel log information about what is stuck after a few minutes.
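
For example, with the kernel source at hand, the scripts/config helper can flip those options before a rebuild (a sketch; adapt it to however you build your Arch kernels):

  $ cd /path/to/linux
  $ scripts/config --enable CONFIG_PROVE_LOCKING \
                   --enable CONFIG_DETECT_HUNG_TASK
  $ make olddefconfig
  $ make -j$(nproc)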
Comment 8 Ho Ming Shun 2022-05-20 15:47:40 UTC
(In reply to Chuck Lever from comment #7)
> I'll operate on the assumption that this is a kernel bug and not a failure
> of an external service like DNS (ie, a failure that might leave the server's

IIRC if /etc/exports doesn't use hostnames, it should not rely on DNS?

Furthermore, an already established NFS connection should not fail because of DNS.


/etc/exports is pretty simple, with no auth:

/       *(ro,async,crossmnt,no_subtree_check,no_root_squash,fsid=0)
/mnt/       *(ro,async,no_subtree_check,fsid=1)
/srv/       *(ro,async,crossmnt,no_root_squash,no_subtree_check,fsid=3)
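
If it helps, exportfs can also show the options the server is actually applying, with defaults filled in:

  # exportfs -v     # active exports with effective options
  # exportfs -s     # same, printed in /etc/exports syntax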


> user space in some kind of weird state). You could rebuild your server
> kernel with some debugging:
> 
>  -- CONFIG_PROVE_LOCKING
>  -- CONFIG_DETECT_HUNG_TASK
> 
> When the server hangs, it might produce some kernel log information about
> what is stuck after a few minutes.


ok let me get this rolled out on a few servers
Comment 9 Ho Ming Shun 2022-05-23 05:00:07 UTC
(In reply to Chuck Lever from comment #7)
> I'll operate on the assumption that this is a kernel bug and not a failure
> of an external service like DNS (ie, a failure that might leave the server's
> user space in some kind of weird state). You could rebuild your server
> kernel with some debugging:
> 
>  -- CONFIG_PROVE_LOCKING
>  -- CONFIG_DETECT_HUNG_TASK
> 
> When the server hangs, it might produce some kernel log information about
> what is stuck after a few minutes.


Just had an NFSD crash on a 5.17.9-arch1 kernel with CONFIG_PROVE_LOCKING and CONFIG_DETECT_HUNG_TASK enabled. After 30 minutes of waiting, no kernel errors.
Comment 10 Ho Ming Shun 2022-05-23 15:31:38 UTC
I tried tracing a crashed NFS server with the following:

cd /sys/kernel/tracing
echo 'nfsd*' > set_ftrace_filter
echo 'function' > current_tracer

However it produces no lines. Is there something else I should set set_ftrace_filter to?
Comment 11 Chuck Lever 2022-05-23 15:51:03 UTC
(In reply to Ho Ming Shun from comment #10)
> I tried tracing a crashed NFS server with the following:
> 
> cd /sys/kernel/tracing
> echo 'nfsd*' > set_ftrace_filter
> echo 'function' > current_tracer
> 
> However it produces no lines. Is there something else I should set
> set_ftrace_filter to?

I find the raw tracefs interface to be unwieldy, so I use "trace-cmd" instead. "trace-cmd record -e nfsd" will capture all NFSD trace events, and adding "-e sunrpc" will capture RPC server-side tracepoints as well.

But note that tracing the NFS server can result in an avalanche of data. You will need to reduce the blanket capture parameters I just gave above to monitor just the events you think will provide the breadcrumb trail needed to identify where the deadlock occurs. Maybe:

  # trace-cmd record -e nfsd:nfsd_compound_status
    ... wait until the deadlock ...
   ^C
  # trace-cmd report | tail -100
Comment 12 Ho Ming Shun 2022-05-24 07:10:22 UTC
Thanks for pointing out the trace-cmd tool!

I wrote some scripts to detect an NFS hang and preserve a ~90s window of trace events using -e nfsd -e sunrpc. I am not familiar enough with the nfsd codepaths to figure out a narrower filter.

Attached is the output of trace-cmd report.
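
The wrapper is roughly the following (a minimal sketch; check_nfs_alive is a placeholder for whatever probe your monitoring already uses, and the buffer size and timings are guesses rather than the exact values in my script):

  #!/bin/sh
  # Placeholder probe: a timed NULL/portmap query against the server.
  check_nfs_alive() {
      timeout 10 rpcinfo -T tcp localhost nfs 4 >/dev/null 2>&1
  }

  # Bounded per-CPU buffers, so only the most recent events are retained.
  trace-cmd record -e nfsd -e sunrpc -b 20000 -o /var/tmp/nfsd-trace.dat &
  tracer=$!

  while check_nfs_alive; do sleep 15; done    # loop exits once NFSD stops responding
  sleep 75                                    # keep a ~90s window around the hang
  kill -INT "$tracer"
  wait "$tracer"
  trace-cmd report -i /var/tmp/nfsd-trace.dat > /var/tmp/nfsd-trace.txt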
Comment 13 Ho Ming Shun 2022-05-24 07:11:40 UTC
Created attachment 301023 [details]
trace-cmd record -e nfsd -e sunrpc of deadlock
Comment 14 Ho Ming Shun 2022-05-24 07:16:24 UTC
It seems like the error happens at around 22324.220112. 

I am running 6 sites with this tracing setup using kernel 5.17.9-arch1-1. I should be able to get another deadlock in the next 48 hours.
Comment 15 Ho Ming Shun 2022-05-24 08:39:50 UTC
Created attachment 301024 [details]
crash #2 on same machine

Generated with trace-cmd record -e nfsd -e sunrpc
Comment 16 Ho Ming Shun 2022-05-24 10:08:51 UTC
Created attachment 301025 [details]
crash #3 at a different site

crash starts at 61340.557100
Comment 17 Ho Ming Shun 2022-05-24 13:35:43 UTC
Created attachment 301028 [details]
crash #4 at yet another site

crash starts at around 46195.454833
Comment 18 Chuck Lever 2022-05-24 14:07:50 UTC
I looked at the first trace attachment above. The reason for work stoppage is obvious and unexpected. Up to this point, normal operation, and then:

            nfsd-460   [000] 22324.180946: svc_alloc_arg_err:    pages=67

occasionally, then boom:

            nfsd-453   [000] 22324.220113: svc_alloc_arg_err:    pages=67
            nfsd-454   [001] 22324.220780: svc_alloc_arg_err:    pages=67
            nfsd-455   [000] 22324.220780: svc_alloc_arg_err:    pages=67
            nfsd-460   [000] 22324.220782: svc_alloc_arg_err:    pages=67
            nfsd-456   [001] 22324.224111: svc_alloc_arg_err:    pages=67
            nfsd-457   [001] 22324.224113: svc_alloc_arg_err:    pages=67
            nfsd-459   [000] 22324.224120: svc_alloc_arg_err:    pages=67
            nfsd-458   [000] 22324.224122: svc_alloc_arg_err:    pages=67
            nfsd-453   [001] 22324.224123: svc_alloc_arg_err:    pages=67

No nfsd thread is able to make forward progress because alloc_pages_bulk_array() is failing every time it is called. That's a sign of some kind of memory exhaustion.

alloc_pages_bulk_array() is supposed to call the normal page allocator if it can't make progress, so even during pool exhaustion, it should return at least one free page.

This issue does not seem related to nfsd. It also possibly explains why a system reboot is necessary to clear the condition.

When this deadlocked state occurs again, capture a task dump on that server and attach the kernel log with that information to this bug.

  # echo t > /proc/sysrq-trigger
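
If it's easier, the resulting dump can be pulled straight from the kernel ring buffer (assuming util-linux dmesg; a serial console or netconsole capture works just as well):

  # dmesg --clear                     # optional: start from an empty ring buffer
  # echo t > /proc/sysrq-trigger
  # dmesg -T > nfsd-task-dump.txt     # attach this file to the bug

If the dump is truncated, booting with a larger log_buf_len= helps.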
Comment 19 Ho Ming Shun 2022-05-24 14:13:53 UTC
Created attachment 301029 [details]
crash #5

deadlocks starts at 14115.775954
Comment 20 Ho Ming Shun 2022-05-25 06:01:04 UTC
Created attachment 301037 [details]
sysreq dump #1
Comment 21 Ho Ming Shun 2022-05-25 08:30:55 UTC
(In reply to Chuck Lever from comment #18)
> 
> No nfsd thread is able to make forward progress because
> alloc_pages_bulk_array() is failing every time it is called. That's a sign
> of some kind of memory exhaustion.
> 

If the kernel is having trouble finding order-0 pages, isn't that a pretty catastrophic state in which many other things should be failing? It shouldn't be happening on a 64-bit system with 2 GB RAM and 1 GB swap.

> alloc_pages_bulk_array() is supposed to call the normal page allocator if it
> can't make progress, so even during pool exhaustion, it should return at
> least one free page.
> 
> This issue does not seem related to nfsd. It also possibly explains why a
> system reboot is necessary to clear the condition.

It seems like I am the only one running into this problem. Such a bug should have manifested on other systems over such a long time (5.13 - 5.17). Could it be due to my use of BTRFS? Or my use of link-local IPv6 addresses?

> 
> When this deadlocked state occurs again, capture a task dump on that server
> and attach the kernel log with that information to this bug.
> 
>   # echo t > /proc/sysrq-trigger
Comment 22 Ho Ming Shun 2022-05-25 09:19:49 UTC
Created attachment 301040 [details]
PCAP of crash on 5.18.0

I updated one site to 5.18.0 to see what happens and just observed a crash with different behaviour. It seems that in this case NFSD does not even SYN-ACK the connection attempt.

Similar output with trace-cmd: a flood of svc_alloc_arg_err:    pages=67
Comment 23 Ho Ming Shun 2022-05-25 09:57:02 UTC
Created attachment 301041 [details]
sysreq dump #2

sysreq dump on 5.17.9
Comment 24 Chuck Lever 2022-05-25 14:12:57 UTC
(In reply to Ho Ming Shun from comment #21)
> (In reply to Chuck Lever from comment #18)
> > 
> > No nfsd thread is able to make forward progress because
> > alloc_pages_bulk_array() is failing every time it is called. That's a sign
> > of some kind of memory exhaustion.
> > 
> 
> If the kernel is having trouble finding order-0 pages then isn't it a pretty
> catastrophic state where many other things should be failing?

alloc_pages_bulk_array() is supposed to fall back to an order-0 allocation when it can't find a pool of pages. There might be a bug where it isn't doing that, or isn't able to do it. I'm still investigating.


> It shouldn't be happening on a 64-bit system with 2GB ram, 1GB swap.

2GB is tiny, barely adequate I'd say. Try adding some memory to this system as a workaround.

The task dump was not probative (or I'm not smart enough to get information from it).
Comment 25 Ho Ming Shun 2022-05-25 15:16:18 UTC
(In reply to Chuck Lever from comment #24)

> 
> 2GB is tiny, barely adequate I'd say. Try adding some memory to this system
> as a workaround.

Yeap it's tiny, but it works with adequate performance for its workload. 

This is the free -m output when it locks up. There seems to be plenty of memory left for kernel usage.

               total        used        free      shared  buff/cache   available
Mem:            1729         535          90           8        1103         960
Swap:           1023         236         787

I can add more RAM as a test, but that'll take some time (physical servers in various locations around the country). The easiest workaround is to downgrade to 5.12.
Comment 26 Chuck Lever 2022-05-25 15:19:44 UTC
I've examined the GETATTR requests on the wire before the issue begins, and they are not remarkable.

There isn't much observability in __alloc_pages_bulk(), so it's difficult to say why it is not making forward progress. You might try this:

# trace-cmd record -p function_graph -g __alloc_pages_bulk --max-graph-depth=4

With the usual mechanism of capturing just the events adjacent to the hang.

I have tried reproducing here with a 2GB x86 guest. I haven't seen an issue, so we might lean on your patience to help us continue to diagnose the issue.
Comment 27 Chuck Lever 2022-05-25 15:27:58 UTC
(In reply to Ho Ming Shun from comment #25)
> (In reply to Chuck Lever from comment #24)
> 
> > 
> > 2GB is tiny, barely adequate I'd say. Try adding some memory to this system
> > as a workaround.
> 
> Yeap it's tiny, but it works with adequate performance for its workload. 
> 
> This is the free -m when locks up. Seems to have plenty of memory left for
> kernel usage.
> 
>                total        used        free      shared  buff/cache   available
> Mem:            1729         535          90           8        1103         960
> Swap:           1023         236         787

Since you're pushing into swap, the system is memory-constrained IME. In this situation, memory fragmentation can reduce the amount of memory that is usable.

Since this is RPi, I think your page size is 16KB. I'm not sure if that makes memory fragmentation a deeper issue than it is on x86.


> I can add more RAM as a test, but that'll take some time (physical servers
> in various locations around the country). Easiest workaround is to downgrade
> to 5.12.

Noted.
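
For what it's worth, whether order-0 pages are genuinely scarce or merely fragmented at the moment of a hang can be read off the per-zone free lists; capturing these alongside the trace would help:

  # cat /proc/buddyinfo                            # free blocks per order, per zone
  # grep -E 'MemFree|MemAvailable' /proc/meminfo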
Comment 28 Ho Ming Shun 2022-05-25 16:29:26 UTC
(In reply to Chuck Lever from comment #27)

> 
> Since you're pushing into swap, the system is memory-constrained IME. In
> this situation, memory fragmentation can reduce the amount of memory that is
> usable.

Hmmm. I am probably horribly misunderstanding this. NFS is trying to get 67 pages here, not 67 contiguous pages? So fragmentation is probably not an issue here, just the amount of memory available?

> 
> Since this is RPi, I think your page size is 16KB. I'm not sure if that
> makes memory fragmentation a deeper issue than it is on x86.

To be clear, the NFS server (which is the side locking up) is x86.

Clients are RPis having page size 4096.
Comment 29 Chuck Lever 2022-05-25 16:35:21 UTC
(In reply to Ho Ming Shun from comment #28)
> (In reply to Chuck Lever from comment #27)
> 
> > 
> > Since you're pushing into swap, the system is memory-constrained IME. In
> > this situation, memory fragmentation can reduce the amount of memory that is
> > usable.
> 
> Hmmm. I am probably horribly misunderstanding this. NFS is trying to get 67
> pages here, not 67 contiguous pages? So fragmentation is probably not an
> issue here, just the amount of memory available?
> 
> > 
> > Since this is RPi, I think your page size is 16KB. I'm not sure if that
> > makes memory fragmentation a deeper issue than it is on x86.
> 
> To be clear, the NFS server (which is the side locking up) is x86.
> 
> Clients are RPis having page size 4096.

Then I don't understand why an x86 NFS server would request only 67 order-0 pages from __alloc_pages_bulk(). The svc_rqst::rq_pages array is supposed to contain nearly 260 pages. Is the sv_max_mesg somehow reduced on this server to, say, 256KB?
Comment 30 Ho Ming Shun 2022-05-25 16:46:38 UTC
(In reply to Chuck Lever from comment #29)
> (In reply to Ho Ming Shun from comment #28)
> > (In reply to Chuck Lever from comment #27)
> > 
> > > 
> > > Since you're pushing into swap, the system is memory-constrained IME. In
> > > this situation, memory fragmentation can reduce the amount of memory that is
> > > usable.
> > 
> > Hmmm. I am probably horribly misunderstanding this. NFS is trying to get 67
> > pages here, not 67 contiguous pages? So fragmentation is probably not an
> > issue here, just the amount of memory available?
> > 
> > > 
> > > Since this is RPi, I think your page size is 16KB. I'm not sure if that
> > > makes memory fragmentation a deeper issue than it is on x86.
> > 
> > To be clear, the NFS server (which is the side locking up) is x86.
> > 
> > Clients are RPis having page size 4096.
> 
> Then I don't understand why an x86 NFS server would request only 67 order-0
> pages from __alloc_bulk_pages(). The svc_rqst::rq_pages array is supposed to
> contain nearly 260 pages. Is the sv_max_mesg somehow reduced on this server
> to, say, 256KB?

I've no idea how to set sv_max_mesg, so it's probably the default. Could this be affected by client-side mount options?
Comment 31 Chuck Lever 2022-05-25 16:57:07 UTC
(In reply to Ho Ming Shun from comment #30)
> (In reply to Chuck Lever from comment #29)
> > Then I don't understand why an x86 NFS server would request only 67 order-0
> > pages from __alloc_pages_bulk(). The svc_rqst::rq_pages array is supposed to
> > contain nearly 260 pages. Is the sv_max_mesg somehow reduced on this server
> > to, say, 256KB?
> 
> I've no idea how to set sv_max_mesg so it's probably the default. Could this
> be affected by client side mount options?

It's set based on the network transport. For TCP it should be 1024KB plus a little extra. But that's probably not important. I'm trying to make sense of the information available in the bug report.

Give the new trace-cmd a try. That will tell us if __alloc_pages() is being invoked or something else is happening. I will continue trying to reproduce here.
Comment 32 Ho Ming Shun 2022-05-25 16:59:52 UTC
(In reply to Chuck Lever from comment #29)
> (In reply to Ho Ming Shun from comment #28)
> > (In reply to Chuck Lever from comment #27)
> > 
> > > 
> > > Since you're pushing into swap, the system is memory-constrained IME. In
> > > this situation, memory fragmentation can reduce the amount of memory that is
> > > usable.
> > 
> > Hmmm. I am probably horribly misunderstanding this. NFS is trying to get 67
> > pages here, not 67 contiguous pages? So fragmentation is probably not an
> > issue here, just the amount of memory available?
> > 
> > > 
> > > Since this is RPi, I think your page size is 16KB. I'm not sure if that
> > > makes memory fragmentation a deeper issue than it is on x86.
> > 
> > To be clear, the NFS server (which is the side locking up) is x86.
> > 
> > Clients are RPis having page size 4096.
> 
> Then I don't understand why an x86 NFS server would request only 67 order-0
> pages from __alloc_pages_bulk(). The svc_rqst::rq_pages array is supposed to
> contain nearly 260 pages. Is the sv_max_mesg somehow reduced on this server
> to, say, 256KB?

After some code-digging

(In reply to Chuck Lever from comment #31)
> (In reply to Ho Ming Shun from comment #30)
> > (In reply to Chuck Lever from comment #29)
> > > Then I don't understand why an x86 NFS server would request only 67 order-0
> > > pages from __alloc_pages_bulk(). The svc_rqst::rq_pages array is supposed to
> > > contain nearly 260 pages. Is the sv_max_mesg somehow reduced on this server
> > > to, say, 256KB?
> > 
> > I've no idea how to set sv_max_mesg, so it's probably the default. Could this
> > be affected by client-side mount options?
> 
> It's set based on the network transport. For TCP it should be 1024KB plus a
> little extra. But, probably not important. I'm trying to make sense of the
> information available in the bug report.

Just to be sure:

# cat /proc/fs/nfsd/max_block_size 
262144
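
If I'm reading the sunrpc code right, that value actually lines up with the pages=67 figure in the traces (my own arithmetic, so take it with a grain of salt):

  262144 / 4096 = 64 payload pages
  sv_max_mesg   = payload + 1 page            = 65 pages
  svc_alloc_arg = sv_max_mesg + 2 extra pages = 67 pages per request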

> 
> Give the new trace-cmd a try. That will tell us if __alloc_pages() is being
> invoked or something else is happening. I will continue trying to reproduce
> here.


Already rolled out. Waiting for a crash :)
Comment 33 Mel Gorman 2022-05-25 17:15:31 UTC
Created attachment 301045 [details]
Patch to always allocate at least one page

I think this bug might be the same as 216007 in that a partially populated array potentially never makes progress unless something unrelated wakes kswapd. Please test the attached patch against 5.18 if possible.
Comment 34 Ho Ming Shun 2022-05-25 17:55:55 UTC
(In reply to Mel Gorman from comment #33)
> Created attachment 301045 [details]
> Patch to always allocate at least one page
> 
> I think this bug might be the same as 216007 in that a partially populated
> array potentially never makes progress unless something unrelated wakes
> kswapd. Please test the attached patch against 5.18 if possible.

Rebuilding kernel with this patch. Will roll out and test tomorrow.

Does this explain the persistence of my issue, though? Once NFSD hangs, it hangs permanently until a reboot.
Comment 35 Ho Ming Shun 2022-05-26 00:40:37 UTC
Created attachment 301048 [details]
trace-cmd record -p function_graph -g __alloc_pages_bulk --max-graph-depth=4

This was captured without Mel Gorman's patches.

As if in reply to my previous comment (Comment #34), this is the first time I've seen the NFSD lockup resolve itself! Luckily, it happened while everyone was asleep :)

About 1.5 hours after the lockup, my logs show the NFS clients recovering and operating normally.

I'll proceed to deploy my patched 5.18.0 kernels.
Comment 36 Mel Gorman 2022-05-26 08:53:41 UTC
(In reply to Ho Ming Shun from comment #34)
> (In reply to Mel Gorman from comment #33)
> > Created attachment 301045 [details]
> > Patch to always allocate at least one page
> > 
> > I think this bug might be the same as 216007 in that a partially populated
> > array potentially never makes progress unless something unrelated wakes
> > kswapd. Please test the attached patch against 5.18 if possible.
> 
> Rebuilding kernel with this patch. Will roll out and test tomorrow.
> 
> Does this explain the persistence of my issue though? Once NFSD hangs it
> hangs permanently until reboot.

No, it doesn't really explain why a reboot is necessary. The XFS bug was clear-cut in that the situation that triggers the condition could easily persist. With nfs and sunrpc unloaded and reloaded I would expect any existing partial arrays would be freed and it would at least partially work for a while given that the page cache would also be freed. However, I would still like to eliminate a known problem to see if the "shape" of the problem changes.
Comment 37 Ho Ming Shun 2022-05-26 09:01:14 UTC
(In reply to Mel Gorman from comment #36)
> (In reply to Ho Ming Shun from comment #34)
> > (In reply to Mel Gorman from comment #33)
> > > Created attachment 301045 [details]
> > > Patch to always allocate at least one page
> > > 
> > > I think this bug might be the same as 216007 in that a partially populated
> > > array potentially never makes progress unless something unrelated wakes
> > > kswapd. Please test the attached patch against 5.18 if possible.
> > 
> > Rebuilding kernel with this patch. Will roll out and test tomorrow.
> > 
> > Does this explain the persistence of my issue though? Once NFSD hangs it
> > hangs permanently until reboot.
> 
> No, it doesn't really explain why a reboot is necessary. The XFS bug was
> clear-cut in that the situation that triggers the condition could easily
> persist. With nfs and sunrpc unloaded and reloaded I would expect any
> existing partial arrays would be freed and it would at least partially work
> for a while given that the page cache would also be freed. However, I would
> still like to eliminate a known problem to see if the "shape" of the problem
> changes.

As I mentioned in Comment #35, I just realized I am not looking at a permanent deadlock. I observed it resolving itself after 1.5 hours. 

We were usually too quick to reboot a server, without waiting to see whether it would revive itself.

It's been 8 hours since I rolled out your patch; usually I would have seen something by now. Let's give it another 24 hours.
Comment 38 Ho Ming Shun 2022-05-26 15:26:16 UTC
(In reply to Mel Gorman from comment #36)
> (In reply to Ho Ming Shun from comment #34)
> > (In reply to Mel Gorman from comment #33)
> > > Created attachment 301045 [details]
> > > Patch to always allocate at least one page
> > > 
> > > > I think this bug might be the same as 216007 in that a partially populated
> > > array potentially never makes progress unless something unrelated wakes
> > > kswapd. Please test the attached patch against 5.18 if possible.
> > 
> > Rebuilding kernel with this patch. Will roll out and test tomorrow.
> > 
> > Does this explain the persistence of my issue though? Once NFSD hangs it
> > hangs permanently until reboot.
> 
> No, it doesn't really explain why a reboot is necessary. The XFS bug was
> clear-cut in that the situation that triggers the condition could easily
> persist. With nfs and sunrpc unloaded and reloaded I would expect any
> existing partial arrays would be freed and it would at least partially work
> for a while given that the page cache would also be freed. However, I would
> still like to eliminate a known problem to see if the "shape" of the problem
> changes.

I just got another deadlock with your patch applied on top of 5.18.0. It feels like it's taking longer between crashes, but that could just be placebo.

Unfortunately I fat-fingered a command and overwrote the traces :(
Comment 39 Mel Gorman 2022-05-27 09:04:35 UTC
(In reply to Ho Ming Shun from comment #38)
> (In reply to Mel Gorman from comment #36)
> > (In reply to Ho Ming Shun from comment #34)
> > > (In reply to Mel Gorman from comment #33)
> > > > Created attachment 301045 [details]
> > > > Patch to always allocate at least one page
> > > > 
> > > > I think this bug might be the same as 216007 in that a partially populated
> > > > array potentially never makes progress unless something unrelated wakes
> > > > kswapd. Please test the attached patch against 5.18 if possible.
> > > 
> > > Rebuilding kernel with this patch. Will roll out and test tomorrow.
> > > 
> > > Does this explain the persistence of my issue though? Once NFSD hangs it
> > > hangs permanently until reboot.
> > 
> > No, it doesn't really explain why a reboot is necessary. The XFS bug was
> > clear-cut in that the situation that triggers the condition could easily
> > persist. With nfs and sunrpc unloaded and reloaded I would expect any
> > existing partial arrays would be freed and it would at least partially work
> > for a while given that the page cache would also be freed. However, I would
> > still like to eliminate a known problem to see if the "shape" of the problem
> > changes.
> 
> I just got another deadlock with your patch applied onto 5.18.0. It feels
> like it's taking longer between crashes. But that could just be placebo.
> 
> Unfortunately I fat-fingered a command and overwrote the traces :(

That's unfortunate, because I need to see the trace to determine whether svc_alloc_arg_err appears during the deadlock, reporting multiple attempts to allocate the same number of pages.
Comment 40 Chuck Lever 2022-05-27 15:37:14 UTC
(In reply to Mel Gorman from comment #39) 
> That's unfortunate because I need to see the trace to determine if
> svc_alloc_arg_err appears in the trace during the deadlock reporting
> multiple attempts to allocate the same number of pages.

Mel, with this issue we have an opportunity to fine-tune the information gathered by the svc_alloc_arg_err trace point. I have already started a patch that adds the returned number of pages as a point of comparison. Let me know if there is other useful information that might be added.
Comment 41 Ho Ming Shun 2022-05-28 12:06:56 UTC
(In reply to Mel Gorman from comment #33)
> Created attachment 301045 [details]
> Patch to always allocate at least one page
> 
> I think this bug might be the same as 216007 in that a partially populated
> array potentially never makes progress unless something unrelated wakes
> kswapd. Please test the attached patch against 5.18 if possible.

There is definitely a marked improvement. 

It's been more than 48 hours since I rolled out the patch to 10 sites, and there hasn't been a single deadlock. In contrast, with unpatched 5.17.9 I was experiencing 2 to 5 deadlocks a day across 6 sites.

There is still that one worrying deadlock that I lost the traces for. I will continue running this setup on more sites and report what I find.
Comment 42 Mel Gorman 2022-05-30 08:04:20 UTC
(In reply to Chuck Lever from comment #40)
> (In reply to Mel Gorman from comment #39) 
> > That's unfortunate because I need to see the trace to determine if
> > svc_alloc_arg_err appears in the trace during the deadlock reporting
> > multiple attempts to allocate the same number of pages.
> 
> Mel, with this issue, we have an opportunity to fine-tune the information
> gathered by the svc_alloc_arg_err trace point. I already have started a
> patch that adds the returned number of pages, as a point of comparison. Let
> me know if there is other useful information that might be added.

That would be a definite improvement to svc_alloc_arg_err. In terms of debugging, adding vmscan:mm_vmscan_wakeup_kswapd would be helpful to see if kswapd gets woken during a stall, and kmem:mm_page_alloc to guess when bulk allocation fails. Otherwise, the big hole is the lack of tracing in __alloc_pages_bulk. In retrospect, it would have been appropriate to trace when alloc_pages_bulk filled the array normally and when it fell back to allocating a single page.
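
Until __alloc_pages_bulk has tracepoints of its own, something along these lines should capture those events next to the sunrpc failures (a sketch; buffer sizing and extra filtering to taste, since mm_page_alloc is very chatty):

  # trace-cmd record -e sunrpc:svc_alloc_arg_err \
                     -e vmscan:mm_vmscan_wakeup_kswapd \
                     -e kmem:mm_page_alloc
    ... wait for a stall ...
   ^C
  # trace-cmd report | tail -200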
Comment 43 Mel Gorman 2022-05-30 08:06:43 UTC
(In reply to Ho Ming Shun from comment #41)
> (In reply to Mel Gorman from comment #33)
> > Created attachment 301045 [details]
> > Patch to always allocate at least one page
> > 
> > I think this bug might be the same as 216007 in that a partially populated
> > array potentially never makes progress unless something unrelated wakes
> > kswapd. Please test the attached patch against 5.18 if possible.
> 
> There is definitely a marked improvement. 
> 
> It's been more than 48 hours since I rolled out the patch to 10 sites, and
> there hasn't been a single deadlock. In contrast with unpatched 5.17.9, I
> was experiencing 2 to 5 deadlocks a day on 6 sites.
> 
> There is still that worrying deadlock that happened that I lost the traces
> for. Will continue running this setup on more sites and report what I find.

Thanks. It is possible the deadlock with the lost traces was for some other reason. While the bulk allocation failure may have been the cause of many failures, it may not be the only reason. If and when it occurs again, hopefully the traces will have a clue.
Comment 44 Ho Ming Shun 2022-06-02 12:11:06 UTC
I recorded two NFS livelocks today on the patched 5.18.0 kernel.

Judging from the traces and the symptoms, this is definitely not the bulk allocation issue. Unlike the bulk allocation issue, this one seems to resolve itself after about 30 seconds. It's also much rarer.

It's entirely possible that I have been facing this issue even before 5.12, and that it only came to light with better monitoring of the NFS service. Since it tends to recover by itself, it might just have been written off as a transient power problem.
Comment 45 Ho Ming Shun 2022-06-02 12:11:46 UTC
Created attachment 301089 [details]
Livelock #1
Comment 46 Ho Ming Shun 2022-06-02 12:12:07 UTC
Created attachment 301090 [details]
Livelock #2
Comment 47 Chuck Lever 2022-06-02 14:45:06 UTC
(In reply to Ho Ming Shun from comment #44)
> I recorded two NFS livelocks today on the patched 5.18.0 kernel.
> 
> This is definitely not the bulk allocation issue from the traces and the
> symptoms. Unlike the bulk allocation issue, this seems to resolve itself
> after about 30s. It's also much rarer.
> 
> It's entirely possible that I have been facing this issue, even before 5.12,
> but this came to light only with better monitoring of NFS service. Since it
> tends to recover itself, it might just been written off as a transient power
> problem.

This is indeed not a problem allocating memory. The NFS server cannot connect to a client's NFSv4.0 backchannel service:

       mount.nfs-466093 [001] 177512.296658: xprt_ping:            peer=[127.0.0.1]:2049 status=-110
       mount.nfs-466093 [001] 177512.296659: rpc_call_status:      task:00000001@00000003 status=-110
       mount.nfs-466093 [001] 177512.296660: rpc_timeout_status:   task:00000001@00000003 status=0
       mount.nfs-466093 [001] 177512.296661: rpc_call_rpcerror:    task:00000001@00000003 tk_status=-5 rpc_status=-110

It gives up after a few moments (-110 is ETIMEDOUT). You should figure out whether that client has actually gone away (i.e., powered off, or its container was destroyed) or whether you have an actual network issue. The server is trying to connect to that client because the client holds some delegations, and those are preventing other clients from accessing files that client was using.
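
One way to see which client that is and what it is holding: recent kernels expose per-client NFSv4 state under /proc/fs/nfsd/clients (assuming the nfsd filesystem is mounted there, which it normally is):

  # cat /proc/fs/nfsd/clients/*/info      # client address and identity
  # cat /proc/fs/nfsd/clients/*/states    # opens, locks, and delegations held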

Meanwhile, housekeeping: I'm marking this bug report FIXED, and if the backchannel issue appears to be a software defect, please create a new bug report since it's a different issue. Thank you for helping us run this one down.