Bug 218743 - NFS-RDMA-Connected Regression Found on Upstream Linux 6.9-rc1
Summary: NFS-RDMA-Connected Regression Found on Upstream Linux 6.9-rc1
Status: ASSIGNED
Alias: None
Product: File System
Classification: Unclassified
Component: NFS
Hardware: Intel Linux
Importance: P3 high
Assignee: Chuck Lever
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-04-18 00:00 UTC by Manuel Gomez
Modified: 2024-04-22 18:33 UTC
CC List: 3 users

See Also:
Kernel Version: 6.9-rc1
Subsystem:
Regression: Yes
Bisected commit-id: e084ee673c77cade06ab4c2e36b5624c82608b8c


Attachments
NFS-RDMA-Connected Trace (1.59 MB, application/x-compressed)
2024-04-19 19:52 UTC, Manuel Gomez

Description Manuel Gomez 2024-04-18 00:00:22 UTC
On the Linux 6.9-rc1 kernel there is a performance regression in NFS file transfers when IPoIB Connected mode is enabled. The fabric is Omni-Path Architecture (OPA).

The last known good commit in my bisection was the v6.8 mainline release (e8f897f4afef0031fe618a8e94127a0934896aba). Bisecting from v6.8 to v6.9-rc1 identified "[e084ee673c77cade06ab4c2e36b5624c82608b8c] svcrdma: Add Write chunk WRs to the RPC's Send WR chain" as the commit that introduced the regression.
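
For reference, the bisection followed the standard "git bisect" flow; a rough sketch (each step requires building and booting the candidate kernel and re-running the reproducer below):

git bisect start
git bisect bad v6.9-rc1
git bisect good v6.8
# build, boot, run the test, then mark the result:
git bisect good      # or: git bisect bad
# repeat until git reports the first bad commit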


Here are the steps I ran to reproduce the issue:
1. Establish IPoIB Connected Mode on both client and host nodes:
"echo connected > /sys/class/net/ibs785/mode"


2. Start an NFS server on the host node:
"systemctl start opafm
sleep 10
systemctl start nfs-server
modprobe svcrdma
echo "rdma 20049" > /proc/fs/nfsd/portlist
mkdir -p /mnt/nfs_test_testrun
mount -t tmpfs -o size=4096M tmpfs /mnt/nfs_test_testrun
sudo exportfs -o fsid=0,rw,async,insecure,no_root_squash 192.168.2.0/255.255.255.0:/mnt/nfs_test_testrun/"


3. Ready the client node:
mkdir -p /mnt/nfs_test_testrun
mount -o rdma,port=20049 192.168.2.1:/mnt/nfs_test_testrun /mnt/nfs_test_testrun/"


4. Run the actual test from the client node:
"
#!/bin/bash

fsize=268435456    # 256 MiB
jfile=/dev/shm/run_nfs_test2.junk
tfile=/dev/shm/run_nfs_test2.tmp
nfsfile=/mnt/nfs_test_testrun/run_nfs_test2.junk
rm -r -f /mnt/nfs_test_testrun/
rm -f ${tfile}
rm -f ${jfile}

dd if=/dev/urandom iflag=fullblock of=${jfile} bs=1024 count=$((fsize/1024));

for i in {1..100}; do
        cp ${jfile} ${nfsfile};   # Bottleneck 1
        cp ${nfsfile} ${tfile};   # Bottleneck 2
        cmp ${jfile} ${tfile};
        rm -f ${tfile};
        echo "DONE with iter ${i}"
done;

rm -f ${jfile};
rm -f ${tfile};
echo "Done";
"


On v6.8 this test took about 1m50s to complete for all 10 iterations. On v6.9-rc1 it takes 3-7 minutes, and I also see these kernel traces printed continuously in dmesg while the regression is occurring:

[23720.243905] INFO: task kworker/61:1:556 blocked for more than 122 seconds.
[23720.251709]       Not tainted 6.9.0-rc1 #1
[23720.256387] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[23720.265268] task:kworker/61:1    state:D stack:0     pid:556   tgid:556   ppid:2      flags:0x00004000
[23720.275822] Workqueue: events __svc_rdma_free [rpcrdma]
[23720.281803] Call Trace:
[23720.284630]  <TASK>
[23720.287067]  __schedule+0x210/0x660
[23720.291063]  schedule+0x2c/0xb0
[23720.294668]  schedule_timeout+0x146/0x160
[23720.299249]  __wait_for_common+0x92/0x1d0
[23720.303828]  ? __pfx_schedule_timeout+0x10/0x10
[23720.308987]  __ib_drain_sq+0xfa/0x170 [ib_core]
[23720.314190]  ? __pfx_ib_drain_qp_done+0x10/0x10 [ib_core]
[23720.320343]  ib_drain_qp+0x71/0x80 [ib_core]
[23720.325232]  __svc_rdma_free+0x28/0x100 [rpcrdma]
[23720.330604]  process_one_work+0x196/0x3d0
[23720.335185]  worker_thread+0x2fc/0x410
[23720.339470]  ? __pfx_worker_thread+0x10/0x10
[23720.344336]  kthread+0xdf/0x110
[23720.347941]  ? __pfx_kthread+0x10/0x10
[23720.352225]  ret_from_fork+0x30/0x50
[23720.356317]  ? __pfx_kthread+0x10/0x10
[23720.360602]  ret_from_fork_asm+0x1a/0x30
[23720.365083]  </TASK>
Comment 1 Chuck Lever 2024-04-19 17:11:52 UTC
I see a performance regression on one of my NFS server systems with one of its interfaces, but not the other. I have not found a reproducer that works on any other system in my lab. I suspect, at this early stage, that this issue is related to the capabilities of each card (i.e., MR and QP count limits, queue size limits, and so on).
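
For reference, the device limits in question can be compared across systems with rdma-core's ibv_devinfo, for example:

 $ ibv_devinfo -v | egrep 'hca_id|max_qp:|max_qp_wr|max_sge|max_mr:|max_cqe'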

Can you try one thing for me? On your NFS server, change to a non-NFS directory such as /tmp, and then:

 $ sudo trace-cmd record -e rpcrdma:*err

Run your reproducer, then ^C the "trace-cmd". Attach the resulting trace.dat file to this bug.
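
Before attaching, a quick check that rpcrdma error events were actually captured (run from the same directory, where trace-cmd writes trace.dat):

 $ trace-cmd report | head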
Comment 2 Manuel Gomez 2024-04-19 19:52:05 UTC
Created attachment 306187
NFS-RDMA-Connected Trace

Hello Chuck. Please see my trace file attached. Thank you!
Comment 3 Chuck Lever 2024-04-20 15:07:22 UTC
  nfsd-7771  [060]  1758.891809: svcrdma_sq_post_err:  cq.id=205 cid=226 sc_sq_avail=13643/851 status=-12

sq_post_err reports ENOMEM, and rdma->sc_sq_avail (13643) is >> rdma->sc_sq_depth (851). The number of available SQ entries should never exceed the SQ depth. That looks like a Send Queue accounting bug in svcrdma.
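
For anyone following along, the line above can be extracted from the attached capture with, for example:

 $ trace-cmd report -i trace.dat | grep svcrdma_sq_post_err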

I've created a patch to revert e084ee673c77 and applied it to the nfsd-fixes branch at

  https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git

Can you test it?
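
Either building that branch or reverting the commit on top of v6.9-rc1 should be an equivalent test; a rough sketch of the latter, assuming the revert applies cleanly:

 $ git checkout v6.9-rc1
 $ git revert e084ee673c77
 # then rebuild, reboot, and re-run the reproducer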
Comment 4 Manuel Gomez 2024-04-22 18:31:37 UTC
I tried to reproduce the regression with your nfsd-fixes branch, and I also tested a baseline v6.9-rc1 kernel with the faulty commit reverted. In both cases, with e084ee673c77 reverted, I was unable to reproduce the issue. Thank you very much.
