Bug 218743
Summary: | NFS-RDMA-Connected Regression Found on Upstream Linux 6.9-rc1 | ||
---|---|---|---|
Product: | File System | Reporter: | Manuel Gomez (manuel.gomez) |
Component: | NFS | Assignee: | Chuck Lever (cel) |
Status: | ASSIGNED --- | ||
Severity: | high | CC: | dennis.dalessandro, jlayton, stephen |
Priority: | P3 | ||
Hardware: | Intel | ||
OS: | Linux | ||
Kernel Version: | 6.9-rc1 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | e084ee673c77cade06ab4c2e36b5624c82608b8c |
Attachments: | NFS-RDMA-Connected Trace |
Description
Manuel Gomez
2024-04-18 00:00:22 UTC
I see a performance regression on one of my NFS server systems with one of its interfaces, but not the other. I have not found a reproducer that works on any other system in my lab.

I suspect, at this early stage, that this issue is related to the capabilities of each card (i.e., MR and QP count limits, queue size limits, and so on).

Can you try one thing for me? On your NFS server, change to a non-NFS directory such as /tmp, and then:

    $ sudo trace-cmd record -e rpcrdma:*err

Run your reproducer, then ^C the "trace-cmd". Attach the resulting trace.dat file to this bug.

Created attachment 306187
NFS-RDMA-Connected Trace
Hello Chuck. Please see my trace file attached. Thank you!
    nfsd-7771 [060] 1758.891809: svcrdma_sq_post_err: cq.id=205 cid=226 sc_sq_avail=13643/851 status=-12

sq_post_err reports ENOMEM (status=-12), and rdma->sc_sq_avail (13643) is much larger than rdma->sc_sq_depth (851). The number of available SQ entries is always supposed to be smaller than the SQ depth. That seems like a Send Queue accounting bug in svcrdma.

I've created a patch to revert e084ee673c77 and applied it to the nfsd-fixes branch at https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git

Can you test it?

I tried reproducing the regression with your nfsd-fixes branch, and I also reverted the faulty commit from the baseline v6.9-rc1 kernel. I was unable to reproduce the issue with e084ee673c77 reverted. Thank you very much.
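
For anyone scanning a longer trace for the same symptom, the check described above (available SQ entries compared against the SQ depth) can be sketched in a few lines of Python. This is only a rough sketch, not part of svcrdma or trace-cmd: the function name check_sq_accounting and the regular expression are hypothetical, the pattern assumes the field layout of the one svcrdma_sq_post_err line quoted in this bug, and it treats avail == depth as acceptable (an idle queue), which is an assumption on my part.

```python
import errno
import re

# Hypothetical pattern, modeled only on the single trace line quoted in this bug.
SQ_POST_ERR = re.compile(
    r"svcrdma_sq_post_err:\s+cq\.id=(?P<cq_id>\d+)\s+cid=(?P<cid>\d+)\s+"
    r"sc_sq_avail=(?P<avail>-?\d+)/(?P<depth>\d+)\s+status=(?P<status>-?\d+)"
)


def check_sq_accounting(line):
    """Parse one svcrdma_sq_post_err line; return None if it does not match."""
    match = SQ_POST_ERR.search(line)
    if match is None:
        return None
    avail = int(match.group("avail"))
    depth = int(match.group("depth"))
    status = int(match.group("status"))
    return {
        "cq_id": int(match.group("cq_id")),
        "cid": int(match.group("cid")),
        "avail": avail,
        "depth": depth,
        # Kernel status codes are negative errno values, e.g. -12 for ENOMEM.
        "errno_name": errno.errorcode.get(-status, "unknown"),
        # Assumption: avail may equal depth on an idle queue, but never exceed it.
        "accounting_ok": avail <= depth,
    }


sample = (
    "nfsd-7771 [060] 1758.891809: svcrdma_sq_post_err: "
    "cq.id=205 cid=226 sc_sq_avail=13643/851 status=-12"
)
result = check_sq_accounting(sample)
print(result["errno_name"], result["avail"], result["depth"], result["accounting_ok"])
```

The text form of the attached trace.dat (for example, from "trace-cmd report") could be fed through this function one line at a time; lines for other events simply return None.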