On the Linux 6.9-rc1 kernel there is a performance regression for NFS file transfers when IPoIB Connected mode is enabled. The fabric is Intel Omni-Path Architecture (OPA). The most recent good commit in my bisection was the v6.8 mainline kernel (e8f897f4afef0031fe618a8e94127a0934896aba). Bisecting from v6.8 to v6.9-rc1 showed that commit e084ee673c77 ("svcrdma: Add Write chunk WRs to the RPC's Send WR chain") is the culprit of the regression.

Here are the steps I ran to reproduce the issue:

1. Enable IPoIB Connected mode on both the client and host nodes:

   echo connected > /sys/class/net/ibs785/mode

2. Start an NFS server on the host node:

   systemctl start opafm
   sleep 10
   systemctl start nfs-server
   modprobe svcrdma
   echo "rdma 20049" > /proc/fs/nfsd/portlist
   mkdir -p /mnt/nfs_test_testrun
   mount -t tmpfs -o size=4096M tmpfs /mnt/nfs_test_testrun
   sudo exportfs -o fsid=0,rw,async,insecure,no_root_squash 192.168.2.0/255.255.255.0:/mnt/nfs_test_testrun/

3. Prepare the client node:

   mkdir -p /mnt/nfs_test_testrun
   mount -o rdma,port=20049 192.168.2.1:/mnt/nfs_test_testrun /mnt/nfs_test_testrun/

4. Run the actual test from the client node:

   #!/bin/bash
   fsize=268435456
   jfile=/dev/shm/run_nfs_test2.junk
   tfile=/dev/shm/run_nfs_test2.tmp
   nfsfile=/mnt/nfs_test_testrun/run_nfs_test2.junk

   rm -r -f /mnt/nfs_test_testrun/
   rm -f ${tfile}
   rm -f ${jfile}

   dd if=/dev/urandom iflag=fullblock of=${jfile} bs=1024 count=$((fsize/1024))

   for i in {1..100}; do
       cp ${jfile} ${nfsfile};   # Bottleneck 1
       cp ${nfsfile} ${tfile};   # Bottleneck 2
       cmp ${jfile} ${tfile};
       rm -f ${tfile};
       echo "DONE with iter ${i}"
   done

   rm -f ${jfile}
   rm -f ${tfile}
   echo "Done"

On v6.8 this test takes about 1m50s to complete all 100 iterations. On v6.9-rc1 it takes 3-7 minutes, and I also see these kernel traces printed continuously in dmesg while the regression is occurring:

[23720.243905] INFO: task kworker/61:1:556 blocked for more than 122 seconds.
[23720.251709]       Not tainted 6.9.0-rc1 #1
[23720.256387] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[23720.265268] task:kworker/61:1    state:D stack:0     pid:556   tgid:556   ppid:2      flags:0x00004000
[23720.275822] Workqueue: events __svc_rdma_free [rpcrdma]
[23720.281803] Call Trace:
[23720.284630]  <TASK>
[23720.287067]  __schedule+0x210/0x660
[23720.291063]  schedule+0x2c/0xb0
[23720.294668]  schedule_timeout+0x146/0x160
[23720.299249]  __wait_for_common+0x92/0x1d0
[23720.303828]  ? __pfx_schedule_timeout+0x10/0x10
[23720.308987]  __ib_drain_sq+0xfa/0x170 [ib_core]
[23720.314190]  ? __pfx_ib_drain_qp_done+0x10/0x10 [ib_core]
[23720.320343]  ib_drain_qp+0x71/0x80 [ib_core]
[23720.325232]  __svc_rdma_free+0x28/0x100 [rpcrdma]
[23720.330604]  process_one_work+0x196/0x3d0
[23720.335185]  worker_thread+0x2fc/0x410
[23720.339470]  ? __pfx_worker_thread+0x10/0x10
[23720.344336]  kthread+0xdf/0x110
[23720.347941]  ? __pfx_kthread+0x10/0x10
[23720.352225]  ret_from_fork+0x30/0x50
[23720.356317]  ? __pfx_kthread+0x10/0x10
[23720.360602]  ret_from_fork_asm+0x1a/0x30
[23720.365083]  </TASK>
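For reference, a quick check along these lines confirms that connected mode took effect and that the client mount is actually using the RDMA transport before I start the timing run (interface name and port are the same ones used in the steps above):

   cat /sys/class/net/ibs785/mode    # should print "connected" on both nodes
   cat /proc/fs/nfsd/portlist        # on the server; should include "rdma 20049"
   nfsstat -m                        # on the client; mount options should show proto=rdma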
I see a performance regression on one of my NFS server systems with one of its interfaces, but not the other. I have not found a reproducer that works on any other system in my lab. I suspect, at this early stage, that this issue is related to the capabilities of each card (i.e., MR and QP count limits, queue size limits, and so on).

Can you try one thing for me? On your NFS server, change to a non-NFS directory such as /tmp, and then:

   $ sudo trace-cmd record -e rpcrdma:*err

Run your reproducer, then ^C the "trace-cmd". Attach the resulting trace.dat file to this bug.
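The whole capture session would look roughly like this; the report step is optional, just a quick way to confirm that events were actually recorded before you attach trace.dat:

   $ cd /tmp
   $ sudo trace-cmd record -e rpcrdma:*err
     (run the reproducer from the client, then ^C here)
   $ sudo trace-cmd report | head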
Created attachment 306187
NFS-RDMA-Connected Trace

Hello Chuck. Please see my trace file attached. Thank you!
nfsd-7771  [060]  1758.891809: svcrdma_sq_post_err: cq.id=205 cid=226 sc_sq_avail=13643/851 status=-12

sq_post_err reports ENOMEM, and the rdma->sc_sq_avail (13643) is >> rdma->sc_sq_depth (851). The number of available SQ entries is always supposed to be smaller than the SQ depth. That seems like a Send Queue accounting bug in svcrdma.

I've created a patch to revert e084ee673c77 and applied it to the nfsd-fixes branch at

   https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git

Can you test it?
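If it helps, fetching that branch for a test build would look something like this (the remote name is arbitrary; build and install as you normally do):

   $ git remote add cel https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git
   $ git fetch cel nfsd-fixes
   $ git checkout -b nfsd-fixes-test cel/nfsd-fixes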
I tried reproducing the regression with your nfsd-fixes branch, and I also tested the baseline v6.9-rc1 kernel with the faulty commit reverted. In both cases I was unable to reproduce the issue with e084ee673c77 reverted. Thank you very much.
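For the record, the baseline test was along these lines (rebuild and install steps omitted):

   git checkout v6.9-rc1
   git revert e084ee673c77    # drop the Write chunk WR chaining commit
     (rebuild, reboot, rerun the test script above)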