Bug 215975
| Summary: | NFSD stops serving clients | | |
|---|---|---|---|
| Product: | File System | Reporter: | Ho Ming Shun (cyph1984) |
| Component: | NFS | Assignee: | Chuck Lever (chuck.lever) |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | normal | CC: | mgorman, mgorman, regressions |
| Priority: | P1 | | |
| Hardware: | Intel | | |
| OS: | Linux | | |
| Kernel Version: | 5.13.1 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Attachments:
- Pcap of it transitioning from working to non-working state
- trace-cmd record -e nfsd -e sunrpc of deadlock
- crash #2 on same machine
- crash #3 at a different site
- crash #4 at yet another site
- crash #5
- sysreq dump #1
- PCAP of crash on 5.18.0
- sysreq dump #2
- Patch to always allocate at least one page
- trace-cmd record -p function_graph -g __alloc_pages_bulk --max-graph-depth=4
- Livelock #1
- Livelock #2
Description
Ho Ming Shun
2022-05-13 18:48:19 UTC
Created attachment 300954 [details]
Pcap of it transitioning from working to non-working state
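A capture like the attached pcap can be taken on the server with stock tcpdump; a minimal example (the interface name and output path are placeholders, not taken from the report):

```sh
# Record all NFS traffic (NFSv4 uses only TCP port 2049) until the hang
# has been observed, then stop the capture with Ctrl-C.
tcpdump -i eth0 -s 0 -w /var/tmp/nfsd-hang.pcap port 2049
```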
The Linux kernel's regression tracker (Thorsten Leemhuis):
Hi, I'm not one of the developers that should handle this, but consider forwarding it to them. Beforehand, a quick question: does this problem still happen with the latest kernel versions? (Trying 5.18-rc would be ideal, but testing the latest 5.17.y will do, too.)

Ho Ming Shun:
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #2)
> does this problem still happen with the latest kernel versions?

It still happens on 5.17.7. I will try to reproduce it on a newer kernel, but it is quite rare and, annoyingly, only seems to happen in production. It usually happens after creation and deletion of large (> 1 GB) files, e.g. when ffmpeg transmuxes a file and a script deletes the original video file.

The Linux kernel's regression tracker (Thorsten Leemhuis):
(In reply to Ho Ming Shun from comment #3)
> It still happens on 5.17.7, will try to reproduce it on a newer kernel

Okay, thx, that's good enough for me/now!

Chuck Lever:
Are there any relevant kernel log entries around the time it stops responding? The symptomology is similar to an oops/BUG that occurs while holding an NFSD-related spin lock or mutex. Evidence of that might appear in the kernel log.

Ho Ming Shun:
(In reply to Chuck Lever from comment #5)
> Are there any relevant kernel log entries around the time it stops responding?

No oops/BUG when it happens. The weird thing is that it does not recover even if I unload the nfsd module. So it seems to happen outside of NFSD, but is still NFSD related.

Chuck Lever:
I'll operate on the assumption that this is a kernel bug and not a failure of an external service like DNS (i.e., a failure that might leave the server's user space in some kind of weird state). You could rebuild your server kernel with some debugging:

-- CONFIG_PROVE_LOCKING
-- CONFIG_DETECT_HUNG_TASK

When the server hangs, it might produce some kernel log information about what is stuck after a few minutes.

Ho Ming Shun:
(In reply to Chuck Lever from comment #7)
> I'll operate on the assumption that this is a kernel bug and not a failure
> of an external service like DNS

IIRC, if /etc/exports doesn't use hostnames, it should not rely on DNS? Furthermore, an already established NFS connection should not fail because of DNS. /etc/exports is pretty simple, with no auth:

    /      *(ro,async,crossmnt,no_subtree_check,no_root_squash,fsid=0)
    /mnt/  *(ro,async,no_subtree_check,fsid=1)
    /srv/  *(ro,async,crossmnt,no_root_squash,no_subtree_check,fsid=3)

> You could rebuild your server kernel with some debugging:
> -- CONFIG_PROVE_LOCKING
> -- CONFIG_DETECT_HUNG_TASK

OK, let me get this rolled out on a few servers.
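Before waiting for the next hang, it is worth confirming that the rebuilt kernel actually has those options enabled; a minimal check, assuming the kernel config is exposed in one of the usual places:

```sh
# Check the running kernel's configuration for the two debug options.
# Arch-style kernels usually expose /proc/config.gz (CONFIG_IKCONFIG_PROC);
# most other distros install /boot/config-$(uname -r).
if [ -r /proc/config.gz ]; then
    zgrep -E 'CONFIG_(PROVE_LOCKING|DETECT_HUNG_TASK)=' /proc/config.gz
else
    grep -E 'CONFIG_(PROVE_LOCKING|DETECT_HUNG_TASK)=' "/boot/config-$(uname -r)"
fi
```

Note that the hung-task detector only reports tasks stuck in uninterruptible sleep, so a thread that sleeps briefly and retries in a loop will not trigger it; that is consistent with the lack of hung-task output reported below.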
Ho Ming Shun:
(In reply to Chuck Lever from comment #7)
> You could rebuild your server kernel with some debugging:
> -- CONFIG_PROVE_LOCKING
> -- CONFIG_DETECT_HUNG_TASK

Just had an NFSD crash on a 5.17.9-arch1 kernel with CONFIG_PROVE_LOCKING and CONFIG_DETECT_HUNG_TASK. After 30 minutes of waiting, no kernel errors.

I tried tracing a crashed NFS server with the following:

    cd /sys/kernel/tracing
    echo 'nfsd*' > set_ftrace_filter
    echo 'function' > current_tracer

However, it produces no lines. Is there something else I should set set_ftrace_filter to?

Chuck Lever:
(In reply to Ho Ming Shun from comment #10)
> Is there something else I should set set_ftrace_filter to?

I find the sysfs interface to be unwieldy, so I use "trace-cmd" instead. "trace-cmd record -e nfsd" will capture all NFSD trace events, and adding "-e sunrpc" will capture RPC server-side tracepoints as well.

But note that tracing the NFS server can result in an avalanche of data. You will need to reduce the blanket capture parameters I just gave above to monitor just the events you think will provide the breadcrumb trail needed to identify where the deadlock occurs. Maybe:

    # trace-cmd record -e nfsd:nfsd_compound_status
    ... wait until the deadlock ...
    ^C
    # trace-cmd report | tail -100

Ho Ming Shun:
Thanks for pointing out the trace-cmd tool! I wrote some scripts to detect an NFS hang and preserve a ~90s window of trace events using -e nfsd -e sunrpc. I am not familiar enough with the nfsd code paths to figure out a narrower filter. Attached is the output of trace-cmd report.

Created attachment 301023 [details]
trace-cmd record -e nfsd -e sunrpc of deadlock
It seems like the error happens at around 22324.220112. I am running 6 sites with this tracing setup using kernel 5.17.9-arch1-1. I should be able to get another deadlock in the next 48 hours.

Created attachment 301024 [details]
crash #2 on same machine
Generated with trace-cmd record -e nfsd -e sunrpc
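The hang-detection scripts mentioned above are not attached; one possible shape for such a wrapper is sketched below. The rpcinfo health check, NFS version, and timings are illustrative assumptions, not the reporter's actual setup; the kernel ring buffer wraps, so it retains roughly the last minute or two of events when the capture is frozen.

```sh
#!/bin/sh
# Enable nfsd/sunrpc tracing into the kernel ring buffer only (no file),
# so it can run for days without filling the disk.
trace-cmd start -e nfsd -e sunrpc

# Poll the local NFS service; treat a timed-out NULL ping as a hang.
while timeout 15 rpcinfo -T tcp localhost nfs 3 >/dev/null 2>&1; do
    sleep 30
done

# NFS stopped answering: freeze the buffer and save what it still holds.
trace-cmd stop
trace-cmd extract -o /var/tmp/nfsd-hang.dat
trace-cmd report -i /var/tmp/nfsd-hang.dat > /var/tmp/nfsd-hang.txt
```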
Created attachment 301025 [details]
crash #3 at a different site
crash starts at 61340.557100
Created attachment 301028 [details]
crash #4 at yet another site
crash starts at around 46195.454833
Chuck Lever:
I looked at the first trace attachment above. The reason for the work stoppage is obvious and unexpected. Up to this point, normal operation, and then:

    nfsd-460 [000] 22324.180946: svc_alloc_arg_err: pages=67

occasionally, then boom:

    nfsd-453 [000] 22324.220113: svc_alloc_arg_err: pages=67
    nfsd-454 [001] 22324.220780: svc_alloc_arg_err: pages=67
    nfsd-455 [000] 22324.220780: svc_alloc_arg_err: pages=67
    nfsd-460 [000] 22324.220782: svc_alloc_arg_err: pages=67
    nfsd-456 [001] 22324.224111: svc_alloc_arg_err: pages=67
    nfsd-457 [001] 22324.224113: svc_alloc_arg_err: pages=67
    nfsd-459 [000] 22324.224120: svc_alloc_arg_err: pages=67
    nfsd-458 [000] 22324.224122: svc_alloc_arg_err: pages=67
    nfsd-453 [001] 22324.224123: svc_alloc_arg_err: pages=67

No nfsd thread is able to make forward progress because alloc_pages_bulk_array() is failing every time it is called. That's a sign of some kind of memory exhaustion. alloc_pages_bulk_array() is supposed to call the normal page allocator if it can't make progress, so even during pool exhaustion it should return at least one free page.

This issue does not seem related to nfsd. It also possibly explains why a system reboot is necessary to clear the condition.

When this deadlocked state occurs again, capture a task dump on that server and attach the kernel log with that information to this bug:

    # echo t > /proc/sysrq-trigger

Ho Ming Shun:
Created attachment 301029 [details]
crash #5
deadlock starts at 14115.775954
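For the task dump Chuck asked for, the usual capture steps look roughly like this (output paths are placeholders):

```sh
# Dump the state of all tasks into the kernel ring buffer.
echo t > /proc/sysrq-trigger

# Save the result. A full task dump is large; if its tail is missing,
# the kernel log buffer may need to be enlarged (log_buf_len= on the
# kernel command line) before reproducing again.
dmesg -T > /var/tmp/sysrq-task-dump.txt
# or, on a journald system:
journalctl -k -b > /var/tmp/sysrq-task-dump.txt
```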
Created attachment 301037 [details]
sysreq dump #1
Ho Ming Shun:
(In reply to Chuck Lever from comment #18)
> No nfsd thread is able to make forward progress because
> alloc_pages_bulk_array() is failing every time it is called. That's a sign
> of some kind of memory exhaustion.

If the kernel is having trouble finding order-0 pages, then isn't it a pretty catastrophic state where many other things should be failing? It shouldn't be happening on a 64-bit system with 2 GB RAM and 1 GB swap.

> This issue does not seem related to nfsd. It also possibly explains why a
> system reboot is necessary to clear the condition.

It seems like I am the only one running into this problem. Such a bug should have manifested on other systems over such a long time (5.13 - 5.17). Could it be due to me using Btrfs? Or my use of link-local IPv6 addresses?

Created attachment 301040 [details]
PCAP of crash on 5.18.0
Updated one site to 5.18.0 to see what happens. Just observed a crash with a different behaviour. Seems like in this case, NFSD does not even SYN-ACK the connection attempt.
Similar output with trace-cmd: a flood of svc_alloc_arg_err: pages=67
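One plausible (unconfirmed here) explanation for the missing SYN-ACKs is that with every nfsd thread stuck, the TCP accept queue on port 2049 fills up and new SYNs get dropped. That is easy to check while the server is wedged:

```sh
# For a listening socket, Recv-Q is the current accept-queue depth and
# Send-Q is its limit (the backlog).
ss -ltn 'sport = :2049'

# Counters that increment when incoming connections are dropped because
# the accept queue is full.
nstat -az TcpExtListenOverflows TcpExtListenDrops
```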
Created attachment 301041 [details]
sysreq dump #2
sysreq dump on 5.17.9
Chuck Lever:
(In reply to Ho Ming Shun from comment #21)
> If the kernel is having trouble finding order-0 pages then isn't it a pretty
> catastrophic state where many other things should be failing?

alloc_pages_bulk_array() is supposed to do an order-0 allocation when it can't find a pool of pages. There might be a bug where it isn't doing that, or isn't able to do it. I'm still investigating.

> It shouldn't be happening on a 64-bit system with 2GB ram, 1GB swap.

2GB is tiny, barely adequate I'd say. Try adding some memory to this system as a workaround.

The task dump was not probative (or I'm not smart enough to get information from it).

Ho Ming Shun:
(In reply to Chuck Lever from comment #24)
> 2GB is tiny, barely adequate I'd say. Try adding some memory to this system
> as a workaround.

Yeap it's tiny, but it works with adequate performance for its workload.

This is the free -m when it locks up. It seems to have plenty of memory left for kernel usage:

                  total        used        free      shared  buff/cache   available
    Mem:           1729         535          90           8        1103         960
    Swap:          1023         236         787

I can add more RAM as a test, but that'll take some time (physical servers in various locations around the country). The easiest workaround is to downgrade to 5.12.

Chuck Lever:
I've examined the GETATTR requests on the wire before the issue begins, and they are not remarkable. There isn't much observability in __alloc_pages_bulk(), so it's difficult to say why it is not making forward progress. You might try this:

    # trace-cmd record -p function_graph -g __alloc_pages_bulk --max-graph-depth=4

with the usual mechanism of capturing just the events adjacent to the hang.

I have tried reproducing here with a 2GB x86 guest. I haven't seen an issue, so we might lean on your patience to help us continue to diagnose the issue.

Chuck Lever:
(In reply to Ho Ming Shun from comment #25)
> This is the free -m when it locks up. Seems to have plenty of memory left
> for kernel usage.

Since you're pushing into swap, the system is memory-constrained IME. In this situation, memory fragmentation can reduce the amount of memory that is usable.

Since this is RPi, I think your page size is 16KB. I'm not sure if that makes memory fragmentation a deeper issue than it is on x86.

> I can add more RAM as a test, but that'll take some time (physical servers
> in various locations around the country). Easiest workaround is to downgrade
> to 5.12.

Noted.
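A read-only way to sanity-check the memory-exhaustion/fragmentation question on the wedged server (these are not commands that were run in the thread):

```sh
# Free pages per order, per zone. svc_alloc_arg() only needs order-0
# pages, so a healthy count in the first column means single pages are
# still available even if the higher orders are fragmented away.
cat /proc/buddyinfo

# Reclaim and allocation-stall activity since boot; sample this twice a
# few seconds apart to see whether reclaim is still making progress.
grep -E 'allocstall|pgscan|pgsteal|compact_stall' /proc/vmstat
```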
Ho Ming Shun:
(In reply to Chuck Lever from comment #27)
> Since you're pushing into swap, the system is memory-constrained IME. In
> this situation, memory fragmentation can reduce the amount of memory that is
> usable.

Hmmm. I am probably horribly misunderstanding this. NFS is trying to get 67 pages here, not 67 contiguous pages? So fragmentation is probably not an issue here, just the amount of memory available?

> Since this is RPi, I think your page size is 16KB. I'm not sure if that
> makes memory fragmentation a deeper issue than it is on x86.

To be clear, the NFS server (which is the side locking up) is x86. Clients are RPis having page size 4096.

Chuck Lever:
(In reply to Ho Ming Shun from comment #28)
> To be clear, the NFS server (which is the side locking up) is x86.
> Clients are RPis having page size 4096.

Then I don't understand why an x86 NFS server would request only 67 order-0 pages from __alloc_pages_bulk(). The svc_rqst::rq_pages array is supposed to contain nearly 260 pages. Is the sv_max_mesg somehow reduced on this server to, say, 256KB?

Ho Ming Shun:
(In reply to Chuck Lever from comment #29)
> Is the sv_max_mesg somehow reduced on this server to, say, 256KB?

I've no idea how to set sv_max_mesg, so it's probably the default. Could this be affected by client-side mount options?

Chuck Lever:
(In reply to Ho Ming Shun from comment #30)
> I've no idea how to set sv_max_mesg so it's probably the default. Could this
> be affected by client side mount options?

It's set based on the network transport. For TCP it should be 1024KB plus a little extra. But, probably not important. I'm trying to make sense of the information available in the bug report.

Give the new trace-cmd a try. That will tell us if __alloc_pages() is being invoked or something else is happening. I will continue trying to reproduce here.

Ho Ming Shun:
(In reply to Chuck Lever from comment #29)
> Is the sv_max_mesg somehow reduced on this server to, say, 256KB?

After some code-digging

Ho Ming Shun:
(In reply to Chuck Lever from comment #31)
> It's set based on the network transport. For TCP it should be 1024KB plus a
> little extra.

Just to be sure:

    # cat /proc/fs/nfsd/max_block_size
    262144

> Give the new trace-cmd a try. That will tell us if __alloc_pages() is being
> invoked or something else is happening. I will continue trying to reproduce
> here.

Already rolled out. Waiting for a crash :)

Mel Gorman:
Created attachment 301045 [details]
Patch to always allocate at least one page
I think this bug might be the same as 216007 in that a partially populated array potentially never makes progress unless something unrelated wakes kswapd. Please test the attached patch against 5.18 if possible.
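Mel's hypothesis is that the retry loop in the nfsd threads can spin indefinitely while kswapd stays asleep, because nothing in the failing path wakes it. A rough way to check for that state during a stall (a sketch, not commands from the thread):

```sh
# Are the nfsd threads sleeping/retrying while kswapd sits idle?
ps -o pid,stat,wchan:30,comm -C nfsd,kswapd0

# Sample kswapd's activity counters twice; if they do not move during a
# stall, kswapd was never woken.
grep -E 'pageoutrun|pgscan_kswapd|pgsteal_kswapd' /proc/vmstat
sleep 10
grep -E 'pageoutrun|pgscan_kswapd|pgsteal_kswapd' /proc/vmstat
```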
Ho Ming Shun:
(In reply to Mel Gorman from comment #33)
> Please test the attached patch against 5.18 if possible.

Rebuilding the kernel with this patch. Will roll out and test tomorrow.

Does this explain the persistence of my issue, though? Once NFSD hangs, it hangs permanently until reboot.

Ho Ming Shun:
Created attachment 301048 [details]
trace-cmd record -p function_graph -g __alloc_pages_bulk --max-graph-depth=4

This was captured without Mel Gorman's patches.

As if in reply to my previous comment (Comment #34), this is the first time I've seen the NFSD lockup resolve by itself!! Luckily, it happened while everyone was asleep :) About 1.5 hours after the lockup, my logs show the NFS clients recovering and operating normally.

I'll proceed to deploy my patched 5.18.0 kernels.

Mel Gorman:
(In reply to Ho Ming Shun from comment #34)
> Does this explain the persistence of my issue though? Once NFSD hangs it
> hangs permanently until reboot.

No, it doesn't really explain why a reboot is necessary. The XFS bug was clear-cut in that the situation that triggers the condition could easily persist. With nfs and sunrpc unloaded and reloaded, I would expect any existing partial arrays would be freed and it would at least partially work for a while, given that the page cache would also be freed. However, I would still like to eliminate a known problem to see if the "shape" of the problem changes.

Ho Ming Shun:
(In reply to Mel Gorman from comment #36)
> No, it doesn't really explain why a reboot is necessary.

As I mentioned in Comment #35, I just realized I am not looking at a permanent deadlock. I observed it resolving itself after 1.5 hours. We were usually too quick to reboot a server without seeing if it would revive itself.

It's been 8 hours since I rolled out your patch. Usually I would have seen something by now. Let's give it another 24 hours.
Ho Ming Shun:
(In reply to Mel Gorman from comment #36)
> However, I would still like to eliminate a known problem to see if the
> "shape" of the problem changes.

I just got another deadlock with your patch applied on top of 5.18.0. It feels like it's taking longer between crashes, but that could just be placebo.

Unfortunately, I fat-fingered a command and overwrote the traces :(

Mel Gorman:
(In reply to Ho Ming Shun from comment #38)
> Unfortunately I fat-fingered a command and overwrote the traces :(

That's unfortunate, because I need to see the trace to determine if svc_alloc_arg_err appears in the trace during the deadlock, reporting multiple attempts to allocate the same number of pages.

Chuck Lever:
(In reply to Mel Gorman from comment #39)
> That's unfortunate because I need to see the trace to determine if
> svc_alloc_arg_err appears in the trace during the deadlock reporting
> multiple attempts to allocate the same number of pages.

Mel, with this issue we have an opportunity to fine-tune the information gathered by the svc_alloc_arg_err trace point. I already have started a patch that adds the returned number of pages, as a point of comparison. Let me know if there is other useful information that might be added.

Ho Ming Shun:
(In reply to Mel Gorman from comment #33)
> Please test the attached patch against 5.18 if possible.

There is definitely a marked improvement.

It's been more than 48 hours since I rolled out the patch to 10 sites, and there hasn't been a single deadlock. In contrast, with unpatched 5.17.9 I was experiencing 2 to 5 deadlocks a day on 6 sites.

There is still that worrying deadlock that happened that I lost the traces for. I will continue running this setup on more sites and report what I find.

Mel Gorman:
(In reply to Chuck Lever from comment #40)
> I already have started a patch that adds the returned number of pages, as a
> point of comparison. Let me know if there is other useful information that
> might be added.

That would be a definite improvement to svc_alloc_arg_err. In terms of debugging, adding vmscan:mm_vmscan_wakeup_kswapd would be helpful to see if kswapd gets woken during a stall, and kmem:mm_page_alloc to guess when bulk allocation fails. Otherwise, the big hole is the lack of tracing in __alloc_pages_bulk. In retrospect, it would have been appropriate to trace when alloc_pages_bulk filled normally and when it fell back to allocating a single page.
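Putting Mel's suggestion together with the capture already in use, the next recording round could look something like this (a sketch; the exact event list is not spelled out in the thread, and kmem:mm_page_alloc is very chatty, so a larger per-CPU buffer is requested):

```sh
# Capture the nfsd/sunrpc events plus the MM events Mel suggested, so a
# stall shows whether kswapd is woken and whether single-page allocations
# still succeed while the bulk allocator keeps failing.
trace-cmd record \
    -b 20480 \
    -e nfsd \
    -e sunrpc:svc_alloc_arg_err \
    -e vmscan:mm_vmscan_wakeup_kswapd \
    -e kmem:mm_page_alloc \
    -o /var/tmp/nfsd-stall.dat

# After the next stall, stop the capture (Ctrl-C) and inspect the tail:
trace-cmd report -i /var/tmp/nfsd-stall.dat | tail -200
```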
Mel Gorman:
(In reply to Ho Ming Shun from comment #41)
> There is still that worrying deadlock that happened that I lost the traces
> for. Will continue running this setup on more sites and report what I find.

Thanks. It is possible the deadlock with the lost traces was for some other reason. While the bulk allocation failure may have been the cause of many failures, it may not be the only reason. If and when it occurs again, hopefully the traces will have a clue.

Ho Ming Shun:
I recorded two NFS livelocks today on the patched 5.18.0 kernel.

Judging from the traces and the symptoms, this is definitely not the bulk allocation issue. Unlike the bulk allocation issue, this seems to resolve itself after about 30s. It's also much rarer.

It's entirely possible that I have been facing this issue even before 5.12, and that it came to light only with better monitoring of the NFS service. Since it tends to recover by itself, it might just have been written off as a transient power problem.

Created attachment 301089 [details]
Livelock #1
Created attachment 301090 [details]
Livelock #2
Chuck Lever:
(In reply to Ho Ming Shun from comment #44)
> I recorded two NFS livelocks today on the patched 5.18.0 kernel.
>
> This is definitely not the bulk allocation issue from the traces and the
> symptoms. Unlike the bulk allocation issue, this seems to resolve itself
> after about 30s. It's also much rarer.

This is indeed not a problem allocating memory. The NFS server cannot connect to a client's NFSv4.0 backchannel service:

    mount.nfs-466093 [001] 177512.296658: xprt_ping:          peer=[127.0.0.1]:2049 status=-110
    mount.nfs-466093 [001] 177512.296659: rpc_call_status:    task:00000001@00000003 status=-110
    mount.nfs-466093 [001] 177512.296660: rpc_timeout_status: task:00000001@00000003 status=0
    mount.nfs-466093 [001] 177512.296661: rpc_call_rpcerror:  task:00000001@00000003 tk_status=-5 rpc_status=-110

It gives up after a few moments (-110 is ETIMEDOUT). You should figure out whether that client has actually gone away (i.e., powered off, or its container was destroyed) or whether you have an actual network issue. The server is trying to connect to that client because the client owns some delegations, and those are preventing other clients from accessing files that client was using.

Meanwhile, housekeeping: I'm marking this bug report FIXED. If the backchannel issue appears to be a software defect, please create a new bug report, since it's a different issue. Thank you for helping us run this one down.
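To identify which client holds the delegations and whether it is still reachable, the server-side NFSv4 client state can be inspected under /proc/fs/nfsd on kernels of this vintage; a rough sketch (the ping target is a placeholder):

```sh
# One directory per active NFSv4 client; "info" shows each client's
# address and related state.
cat /proc/fs/nfsd/clients/*/info

# Per-client open/lock/delegation state; delegation entries here point
# at the files other clients are blocked on.
cat /proc/fs/nfsd/clients/*/states

# Then verify the delegation holder is actually reachable on the network.
ping -c 3 192.0.2.10
```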