As reported in: https://marc.info/?l=linux-nfs&m=165115109917372&w=4

There is an apparent regression in the NFS client starting with 5.18 kernels. The testing being done is with NFS/RDMA on OPA HW.

To start an NFS server I am doing the following:

# cat start_nfs_server.sh
#!/bin/bash
systemctl start opafm
sleep 10
systemctl start nfs-server
modprobe svcrdma
echo "rdma 20049" > /proc/fs/nfsd/portlist
mkdir -p /mnt/nfs_test
mount -t tmpfs -o size=4096M tmpfs /mnt/nfs_test
exportfs -o fsid=0,rw,async,insecure,no_root_squash 192.168.254.0/255.255.255.0:/mnt/nfs_test
exportfs -s

Now to start the client:

# cat start_nfs_client.sh
#!/bin/bash
mkdir -p /mnt/nfs_test
mount -o rdma,port=20049 192.168.254.1:/mnt/nfs_test /mnt/nfs_test
ls /mnt/nfs_test
df -h

The test can be distilled down to:

# cat nfs_test.sh
#!/bin/bash

fsize=268435456
jfile=/dev/shm/run_nfs_test.junk
nfsfile=/mnt/nfs_test/run_nfs_test.junk
tfile=/dev/shm/run_nfs_test.tmp
fail=0

echo "Creating $jfile..."
dd if=/dev/urandom iflag=fullblock of=${jfile} bs=1024 count=$((fsize/1024));
echo "Done";

for i in {1..2500}; do
    echo "Test $i FailCount=$fail";

    start=`date +%s`;

    echo "copy $jfile to $nfsfile...";
    time cp ${jfile} ${nfsfile};
    echo "Done";

    echo "copy $nfsfile to $tfile...";
    time cp ${nfsfile} ${tfile};        # <--- can take more than 60s vs .02s
    echo "Done";

    stop=`date +%s`;

    echo "Comparing files...";
    cmp ${jfile} ${tfile};
    if [[ $? -ne 0 ]]; then
        echo "File comparison failed";
        exit 1;
    fi
    echo "Done";

    echo "Remove $tfile...";
    time rm -f ${tfile};
    echo "Done";

    echo "Remove $nfsfile...";
    time rm -f ${nfsfile};
    echo "Done";

    dur=$(($stop-$start));
    if [[ $dur -gt 2 ]]; then
        echo -e "FAIL: Took too long\n";
        fail=$(($fail+1));

        if [[ $fail -gt 10 ]]; then
            echo "Fail limit reached. Bailing."
            exit 1
        fi
    fi
done

echo "Remove $jfile...";
time rm -f ${jfile};
echo "Done";

echo "Total Failures: $fail"
(In reply to Denny Dalessandro from comment #0)

Please provide some information, either through the kernel's perf tool or the nfs4/rpcrdma tracepoints, during the run. It might show where the time is being spent.
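If the trace does come through, one quick way to find the stall is to scan for large gaps between consecutive event timestamps. A minimal sketch, assuming the capture is saved to a file (client.trace here is just a placeholder name):

awk 'match($0, /[0-9]+\.[0-9]+:/) {
        t = substr($0, RSTART, RLENGTH - 1)      # event timestamp, in seconds
        if (prev != "" && t - prev > 1.0)        # flag any gap longer than 1 second
            printf "%.3fs gap before: %s\n", t - prev, $0
        prev = t
}' client.trace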
Since being asked on the mailing list, I have run using TCP. This was done with IPoIB, and I ran 2500 iterations without a single test failure.

Any specific trace points? I mean, I can pick and choose; it looks like there are 100 of them. Perhaps...

svcrdma_post_recv
svcrdma_post_send
svcrdma_sq_full
svcrdma_wc_recv
svcrdma_wc_send

xprtrdma_post_send
xprtrdma_post_recv
xprtrdma_wc_send
(In reply to Denny Dalessandro from comment #2)

Given that we don't know where the pause is, I'd say enable all. You are missing 4 minutes, right? Somewhere you should be able to see that in the tracepoints.
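Something like this is what I have in mind; a minimal sketch, assuming debugfs is mounted at /sys/kernel/debug:

echo 1 > /sys/kernel/debug/tracing/events/rpcrdma/enable    # all xprtrdma (and svcrdma) events
echo 1 > /sys/kernel/debug/tracing/events/nfs4/enable
cat /sys/kernel/debug/tracing/trace_pipe > /tmp/client.trace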
OK, will enable all. It's actually more like 1 minute and change now, though it varies.
Here is the nfs4 and rpcrdma trace. I'm betting bugzilla will garble the format, so I'll also add it as an attachment. Note that NFS is also used to talk to awfm-01 and awfm-02 in the logs below. The issue is with 192.168.254.1, which is the host doing RDMA.

kworker/11:0H-7802 [011] ..... 90443.786164: xprtrdma_post_recvs: peer=[192.168.254.1]:20049 cq.id=60 7 new recvs, 64 active
kworker/11:0H-7802 [011] ..... 90443.786164: xprtrdma_reply: task:000005f4@0000000a xid=0xbb6c7067 credits=64
kworker/11:0H-7802 [011] ..... 90443.786165: xprtrdma_mr_localinv: task:000005f4@0000000a mr.id=5092 nents=32 131072@0xbb6c70679eb0f000:0x13e4e55c (FROM_DEVICE)
kworker/11:0H-7802 [011] ..... 90443.786165: xprtrdma_mr_localinv: task:000005f4@0000000a mr.id=5093 nents=96 393216@0xbb6c70679eaaf000:0x13e5e698 (FROM_DEVICE)
kworker/10:0H-71 [010] ..... 90443.786179: xprtrdma_wc_send: cq.id=59 cid=52 status=SUCCESS (0/0x0)
kworker/11:0H-7802 [011] ..... 90443.786341: xprtrdma_wc_receive: cq.id=60 cid=184 status=SUCCESS (0/0x0) received=180
kworker/11:0H-7802 [011] ..... 90443.786341: xprtrdma_post_recvs: peer=[192.168.254.1]:20049 cq.id=60 0 new recvs, 70 active
kworker/11:0H-7802 [011] ..... 90443.786342: xprtrdma_reply: task:000005f5@0000000a xid=0xbc6c7067 credits=64
kworker/11:0H-7802 [011] ..... 90443.786342: xprtrdma_mr_reminv: task:000005f5@0000000a mr.id=5091 nents=64 262144@0xbc6c70679eb2f000:0x13e3e49e (FROM_DEVICE)
kworker/11:0H-7802 [011] ..... 90443.786342: xprtrdma_mr_unmap: mr.id=5091 nents=64 262144@0xbc6c70679eb2f000:0x13e3e49e (FROM_DEVICE)
kworker/11:0H-7802 [011] ..... 90443.786343: xprtrdma_decode_seg: 262144@0xbc6c70679eb2f000:0x13e3e49e
kworker/u57:2-31533 [000] ..... 90498.612659: nfs4_setup_sequence: session=0xe565b3b4 slot_nr=0 seq_nr=1492 highest_used_slotid=0
kworker/u57:1-31484 [002] ..... 90498.612689: nfs4_setup_sequence: session=0x96a8c42c slot_nr=0 seq_nr=1496 highest_used_slotid=0
kworker/u57:1-31484 [002] ..... 90498.612974: nfs4_sequence_done: error=0 (OK) session=0x96a8c42c slot_nr=0 seq_nr=1496 highest_slotid=29 target_highest_slotid=29 status_flags=0x0 ()
kworker/u57:1-31484 [002] ..... 90498.612977: nfs4_sequence: error=0 (OK) dstaddr=awfm-02
kworker/u57:2-31533 [000] ..... 90498.612987: nfs4_sequence_done: error=0 (OK) session=0xe565b3b4 slot_nr=0 seq_nr=1492 highest_slotid=9 target_highest_slotid=9 status_flags=0x0 ()
kworker/u57:2-31533 [000] ..... 90498.612990: nfs4_sequence: error=0 (OK) dstaddr=awfm-01
kworker/u57:1-31484 [002] ..... 90506.804628: nfs4_setup_sequence: session=0x32fd4c6d slot_nr=0 seq_nr=1800 highest_used_slotid=0
kworker/u57:1-31484 [002] ..... 90506.804681: xprtrdma_disconnect: peer=[192.168.254.1]:20049 rc=0 connection status=1
kworker/11:0H-7802 [011] ..... 90506.804705: xprtrdma_wc_receive: cq.id=60 cid=183 status=WR_FLUSH_ERR (5/0x0) received=0
kworker/11:0H-7802 [011] ..... 90506.804707: xprtrdma_wc_receive: cq.id=60 cid=180 status=WR_FLUSH_ERR (5/0x0) received=0
. . .
kworker/11:0H-7802 [011] ..... 90506.804730: xprtrdma_wc_receive: cq.id=60 cid=187 status=WR_FLUSH_ERR (5/0x0) received=0
kworker/11:0H-7802 [011] ..... 90506.804730: xprtrdma_wc_receive: cq.id=60 cid=141 status=WR_FLUSH_ERR (5/0x0) received=0
kworker/11:0H-7802 [011] ..... 90506.804730: xprtrdma_wc_receive: cq.id=60 cid=186 status=WR_FLUSH_ERR (5/0x0) received=0
kworker/10:2H-7733 [010] ..... 90506.804733: xprtrdma_wc_li: cq.id=59 mr.id=5092 status=WR_FLUSH_ERR (5/0x0)
kworker/10:2H-7733 [010] ..... 90506.804735: xprtrdma_wc_li_done: cq.id=59 mr.id=5093 status=WR_FLUSH_ERR (5/0x0)
kworker/u57:3-31266 [001] ..... 90506.804744: nfs4_sequence_done: error=0 (OK) session=0x32fd4c6d slot_nr=0 seq_nr=1800 highest_slotid=63 target_highest_slotid=63 status_flags=0x0 ()
kworker/u57:3-31266 [001] ..... 90506.804747: nfs4_sequence: error=0 (OK) dstaddr=172.20.10.17
kworker/u57:1-31484 [002] ..... 90506.836670: xprtrdma_mr_unmap: mr.id=5094 nents=1 4@0x000000003bcf432c:0x13e6e701 (FROM_DEVICE)
kworker/u57:1-31484 [002] ..... 90506.842672: xprtrdma_mr_unmap: mr.id=5093 nents=96 393216@0xbb6c70679eaaf000:0x13e5e698 (FROM_DEVICE)
kworker/u57:1-31484 [002] ..... 90506.848641: xprtrdma_mr_unmap: mr.id=5092 nents=32 131072@0xbb6c70679eb0f000:0x13e4e55c (FROM_DEVICE)
kworker/u57:1-31484 [002] ..... 90506.878694: xprtrdma_op_connect: peer=[192.168.254.1]:20049 delay=0
kworker/u57:1-31484 [002] ..... 90506.879801: xprtrdma_post_recv: cq.id=62 cid=1
kworker/u57:1-31484 [002] ..... 90506.879802: xprtrdma_post_recvs: peer=[192.168.254.1]:20049 cq.id=62 1 new recvs, 0 active
kworker/12:1-7741 [012] ..... 90506.880263: xprtrdma_inline_thresh: 192.168.254.2:36234 -> 192.168.254.1:20049 neg send/recv=4096/4096, calc send/recv=3924/3980
kworker/u57:1-31484 [002] ..... 90506.880439: xprtrdma_createmrs: peer=[192.168.254.1]:20049 created 5 MRs
kworker/u57:1-31484 [002] ..... 90506.880441: xprtrdma_mr_map: task:00000000@ffffffff mr.id=5104 nents=1 4@0x00000000d982372c:0x13f0f101 (FROM_DEVICE)
kworker/u57:1-31484 [002] ..... 90506.880441: xprtrdma_mr_fastreg: task:00000000@ffffffff mr.id=5104 nents=1 4@0x00000000d982372c:0x13f0f101 (FROM_DEVICE)
kworker/u57:1-31484 [002] ..... 90506.880444: xprtrdma_connect: peer=[192.168.254.1]:20049 rc=0 connection status=1
kworker/u57:1-31484 [002] ..... 90506.880457: xprtrdma_mr_map: task:000005f4@0000000a mr.id=5103 nents=96 393216@0xbb6c70679eaaf000:0x13eff001 (FROM_DEVICE)
kworker/u57:1-31484 [002] ..... 90506.880458: xprtrdma_chunk_write: task:000005f4@0000000a 393216@0xbb6c70679eaaf000:0x13eff001 (more)
kworker/u57:1-31484 [002] ..... 90506.880459: xprtrdma_mr_map: task:000005f4@0000000a mr.id=5102 nents=32 131072@0xbb6c70679eb0f000:0x13eeef01 (FROM_DEVICE)
kworker/u57:1-31484 [002] ..... 90506.880459: xprtrdma_chunk_write: task:000005f4@0000000a 131072@0xbb6c70679eb0f000:0x13eeef01 (last)
kworker/u57:1-31484 [002] ..... 90506.880460: xprtrdma_marshal: task:000005f4@0000000a xid=0xbb6c7067 hdr=68 xdr=108/0/0 pullup/write list
kworker/u57:1-31484 [002] ..... 90506.880460: xprtrdma_mr_fastreg: task:000005f4@0000000a mr.id=5102 nents=32 131072@0xbb6c70679eb0f000:0x13eeef01 (FROM_DEVICE)
kworker/u57:1-31484 [002] ..... 90506.880460: xprtrdma_mr_fastreg: task:000005f4@0000000a mr.id=5103 nents=96 393216@0xbb6c70679eaaf000:0x13eff001 (FROM_DEVICE)
kworker/u57:1-31484 [002] ..... 90506.880461: xprtrdma_post_send: task:000005f4@0000000a cq.id=61 cid=3 (2 SGEs)
kworker/13:2H-7769 [013] ..... 90506.880864: xprtrdma_wc_receive: cq.id=62 cid=1 status=SUCCESS (0/0x0) received=196
kworker/13:2H-7769 [013] ..... 90506.880868: xprtrdma_post_recv: cq.id=62 cid=132
. . .
kworker/u57:3-31266 [001] ..... 90507.048475: xprtrdma_post_send: task:00000719@0000000a cq.id=61 cid=35 (2 SGEs)
kworker/13:2H-7769 [013] ..... 90507.048692: xprtrdma_wc_receive: cq.id=62 cid=199 status=SUCCESS (0/0x0) received=180
kworker/13:2H-7769 [013] ..... 90507.048694: xprtrdma_post_recvs: peer=[192.168.254.1]:20049 cq.id=62 0 new recvs, 66 active
kworker/13:2H-7769 [013] ..... 90507.048694: xprtrdma_reply: task:00000719@0000000a xid=0xe06d7067 credits=64
kworker/13:2H-7769 [013] ..... 90507.048695: xprtrdma_mr_reminv: task:00000719@0000000a mr.id=5103 nents=64 262144@0xe06d7067db46f000:0x13eff094 (FROM_DEVICE)
kworker/13:2H-7769 [013] ..... 90507.048695: xprtrdma_mr_unmap: mr.id=5103 nents=64 262144@0xe06d7067db46f000:0x13eff094 (FROM_DEVICE)
kworker/13:2H-7769 [013] ..... 90507.048696: xprtrdma_decode_seg: 262144@0xe06d7067db46f000:0x13eff094
date-31541 [008] ..... 90507.050087: nfs4_setup_sequence: session=0x32fd4c6d slot_nr=0 seq_nr=1801 highest_used_slotid=0
date-31541 [008] ..... 90507.050259: nfs4_map_name_to_uid: error=0 (OK) id=100001215 name=100001215
date-31541 [008] ..... 90507.050260: nfs4_map_group_to_gid: error=0 (OK) id=100000514 name=100000514
date-31541 [008] ..... 90507.050263: nfs4_sequence_done: error=0 (OK) session=0x32fd4c6d slot_nr=0 seq_nr=1801 highest_slotid=63 target_highest_slotid=63 status_flags=0x0 ()
date-31541 [008] ..... 90507.050269: nfs4_getattr: error=0 (OK) fileid=00:36:38626666 fhandle=0xbe19d387 valid=TYPE|MODE|NLINK|OWNER|GROUP|RDEV|SIZE|FSID|FILEID|ATIME|MTIME|CTIME|CHANGE|0x400200
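For what it's worth, the stall is visible in the timestamps above: the last successful reply is handled at 90443.786343, nothing happens until the nfs4_setup_sequence at 90498.612659, and the disconnect fires at 90506.804681. That works out to 90506.80 - 90443.79 = 63.01, so roughly 63 seconds lost, which matches the "1 minute and change" I mentioned earlier.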
Created attachment 301255 [details]
Client side trace points

Here is the result of:

echo 1 > /sys/kernel/debug/tracing/events/rpcrdma/enable
echo 1 > /sys/kernel/debug/tracing/events/nfs4/enable

executed just before the cp from NFS and stopped after completion.
Dennis, is your RDMA mount over v3? I don't see any of the nfs4 tracepoints I would expect from a cp command, like an open and a read. If your mount is over v3, please include "nfs" tracepoints instead of "nfs4" (a sketch is at the end of this comment).

What I also see in the tracepoints is an RDMA disconnect. That's not expected in a normal flow of things, so something went wrong. I also see a gap of about 64 seconds from the last successful tracepoints to when the connection starts to get re-established. It would be nice to see the NFS layer tracepoints around this time for more info.

What Linux RDMA driver is being used here? Perhaps we can enable some tracepoints in it as well.

I would also like to update the bugzilla that I have tested the 5.18 code on my RDMA setup, which uses Mellanox CX-5 cards, and I saw no timing problems using your scripts that do the local-to-NFS and back copying.
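For example, following the same tracefs pattern as before (the v3 client events live in the "nfs" group):

echo 0 > /sys/kernel/debug/tracing/events/nfs4/enable
echo 1 > /sys/kernel/debug/tracing/events/nfs/enable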
You know, I assumed it was using v4. I will definitely go double-check that.

The RDMA driver being used is hfi1. I'm the maintainer for it, so I can add tracepoints/debugging anywhere that's useful. We make extensive use of tracepoints already, so chances are I've got something that will help here.
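A rough sketch of how I usually turn them on; the driver spreads its tracepoints across several trace groups, so I list them first (the hfi1_rc group name below is from memory):

grep hfi1 /sys/kernel/debug/tracing/available_events        # discover the hfi1 event groups
echo 'hfi1_rc:*' >> /sys/kernel/debug/tracing/set_event     # then enable a group of interest, e.g. the RC events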
Turns out it was using v3. I have updated the trace points and turned on some of the hfi1 traces. I'll upload the log file shortly. I also collected the same trace points on the server side and enabled the ib_cm trace points.

On the client we still see the xprtrdma_disconnect. There doesn't seem to be any indication of why the connection was torn down.
Created attachment 301257 [details]
Client side trace points update

Trace points for the client side, targeting nfs instead of nfs4 and adding ib_cm.
Created attachment 301258 [details]
Server side trace points

Server side trace points to match the client side.
Any other ideas on what to try? Other trace points that might be helpful?
Looking at client.trace:

kworker/u57:1-32640 [005] ..... 107211.048665: nfs_readpage_done: fileid=00:2d:2510 fhandle=0x3c3ad416 offset=91750400 count=262144 res=262144 status=262144
kworker/u60:0-322 [000] d..1. 107212.234307: hfi1_rc_do_send: [0000:05:00.0] qpn 1 flag 0
kworker/u60:0-322 [000] d..1. 107222.234399: hfi1_rc_do_send: [0000:05:00.0] qpn 1 flag 0
kworker/u60:0-322 [000] d..1. 107232.234520: hfi1_rc_do_send: [0000:05:00.0] qpn 1 flag 0
kworker/u60:0-322 [000] d..1. 107242.234609: hfi1_rc_do_send: [0000:05:00.0] qpn 1 flag 0
kworker/u60:0-322 [000] d..1. 107252.234699: hfi1_rc_do_send: [0000:05:00.0] qpn 1 flag 0
kworker/u60:0-322 [000] d..1. 107262.234851: hfi1_rc_do_send: [0000:05:00.0] qpn 1 flag 0
kworker/u57:1-32640 [005] d..1. 107271.750033: icm_send_dreq: local_id=1503022084 remote_id=629322257 state=ESTABLISHED lap_state=LAP_UNINIT
kworker/u57:1-32640 [005] d..2. 107271.750036: hfi1_rc_do_send: [0000:05:00.0] qpn 1 flag 0
kworker/u57:1-32640 [005] ..... 107271.750042: xprtrdma_disconnect: peer=[192.168.254.1]:20049 rc=0 connection status=1
kworker/8:0H-61 [008] ..... 107271.750050: hfi1_rc_do_send: [0000:05:00.0] qpn 4a flag 1
kworker/3:1H-931 [003] ..... 107271.750068: xprtrdma_wc_receive: cq.id=74 cid=155 status=WR_FLUSH_ERR (5/0x0) received=0
kworker/3:1H-931 [003] ..... 107271.750069: xprtrdma_wc_receive: cq.id=74 cid=154 status=WR_FLUSH_ERR (5/0x0) received=0
kworker/3:1H-931 [003] ..... 107271.750070: xprtrdma_wc_receive: cq.id=74 cid=153 status=WR_FLUSH_ERR (5/0x0) received=0
...
kworker/3:1H-931 [003] ..... 107271.750090: xprtrdma_wc_receive: cq.id=74 cid=191 status=WR_FLUSH_ERR (5/0x0) received=0
kworker/3:1H-931 [003] ..... 107271.750090: xprtrdma_wc_receive: cq.id=74 cid=162 status=WR_FLUSH_ERR (5/0x0) received=0
kworker/3:1H-931 [003] ..... 107271.750090: xprtrdma_wc_receive: cq.id=74 cid=161 status=WR_FLUSH_ERR (5/0x0) received=0
kworker/u57:1-32640 [005] ..... 107271.750096: hfi1_rc_do_send: [0000:05:00.0] qpn 4a flag 0
kworker/u60:0-322 [000] d..1. 107272.235001: hfi1_rc_do_send: [0000:05:00.0] qpn 1 flag 0
kworker/u60:0-322 [000] d..1. 107272.752170: hfi1_rc_do_send: [0000:05:00.0] qpn 0 flag 0
kworker/u60:1-1150 [012] d..1. 107272.752514: hfi1_rc_do_send: [0000:05:00.0] qpn 0 flag 0
kworker/u60:0-322 [000] d..1. 107272.752698: hfi1_rc_do_send: [0000:05:00.0] qpn 0 flag 0
kworker/u60:1-1150 [012] d..1. 107272.753017: hfi1_rc_do_send: [0000:05:00.0] qpn 0 flag 0
kworker/u60:0-322 [000] d..1. 107272.753373: hfi1_rc_do_send: [0000:05:00.0] qpn 0 flag 0

xprtrdma's QP seems to be 0x4a; QP 0x01 is something else? I don't see the Send Queue size in this trace. Could xprtrdma be overrunning its Send Queue?

Enable the rdma_core and rdma_cma trace subsystems to see details of the RDMA resources allocated by RDMA consumers like xprtrdma.
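For example, following the same pattern as the earlier enables:

echo 1 > /sys/kernel/debug/tracing/events/rdma_core/enable
echo 1 > /sys/kernel/debug/tracing/events/rdma_cma/enable

The allocation events in rdma_core should show the CQ and MR resource sizes when the connection is set up.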
(In reply to Chuck Lever from comment #13)

Well, I edited out my initial observation, which is that hfi1 seems to be retransmitting RDMA Sends, if I understand the hfi1_rc_do_send trace events, but it's not getting RC ACKs or completions.
After trying to isolate suspicious patches, I am positive that the patch at https://lore.kernel.org/lkml/20211115165452.835502712@linuxfoundation.org/ caused the issue. Without it, our test, which is able to reproduce the issue, completes.
That patch is almost certainly a red herring. The only effect it can have here is to clear the page cache (which would explain why your second 'cp' now takes 60s instead of 0.02s).

Since you are running the script as root, I suggest adding the line

echo 1 > /proc/sys/vm/drop_caches

between these two lines

> echo "Done";
>
> echo "copy $nfsfile to $tfile...";

...and see what the bisection now shows.
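For clarity, the middle of the test loop would then read:

    echo "copy $jfile to $nfsfile...";
    time cp ${jfile} ${nfsfile};
    echo "Done";

    echo 1 > /proc/sys/vm/drop_caches

    echo "copy $nfsfile to $tfile...";
    time cp ${nfsfile} ${tfile};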
I re-ran the test after adding the command you recommended above, and I am still getting the same error.
That's expected. Do you also see the error when you revert the patch that git bisect pointed to?
1. Plain v5.19: I see the errors.
2. v5.19 + patch reverted: no errors.
3. Plain v5.19 + the command you recommended above in the script: I see the errors.
The only interesting case is:

4. v5.19 + patch reverted + "drop_caches" in the script: ?????

I'll bet that shows up as "I see the errors".
OK, maybe I misunderstood your previous message. I will give that a try.
I gave it a try and I got the result you expected: I did see errors when using v5.19 + patch reverted + "drop_caches" in the script.

I am investigating the relationship between the failure on one hand and some messages coming from the GSI on the other (the GSI is QP 1, which I assume is the "qpn 1" traffic in the earlier trace). I see that they always arrive together, at the same time and in a specific order.
Thanks for the update. Yes, it makes sense that the patch you pointed to was just unmasking an existing problem rather than being a root cause...
So is the going theory that something is causing the caches to get dropped more often now? That would mean our performance problem is caused by the cache drop. OK, is there a way to isolate *why* the cache gets dropped? Something we can instrument, so that we can trace back and see what the cause is.
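If it is memory reclaim that is evicting the pages, maybe the mm vmscan tracepoints would show it. A guess at a starting point:

echo 1 > /sys/kernel/debug/tracing/events/vmscan/enable     # mm_vmscan_* reclaim/kswapd events
cat /sys/kernel/debug/tracing/trace_pipe | grep mm_vmscan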
It looks like dropping the caches makes it more likely to have transactions in flight when another component (the SDMA engine) wakes up to respond to interrupts. What I am observing points to the SDMA engine being the culprit.
I did more testing on this issue, and I can confirm that the SDMA engine, which was suspected to be the culprit before, is not to blame. But I did see two client tasks sharing the queue pair and communicating with the server at the same time. I also see that this causes the server to return opcodes that are not what the client is expecting. This mix-up causes the client to not mark the task complete and eventually time out.

I also see that the test passes for 6.2-rc3. I went through the commits between that and 6.2.0 in RDMA and in the hfi1 driver, but none of them seems to solve the problem on its own when applied to 5.19.0. I do suspect that the issue is in the synchronization between the client threads, but when I look at what might have changed there for 6.2-rc3, I do not see anything new.

One note to add: when I add debugging code that introduces some delays between the client threads, the problem goes away, which makes me think it is some sort of synchronization issue.