Bug 216160 - NFS client regression starting in 5.18
Summary: NFS client regression starting in 5.18
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: NFS
Hardware: All
OS: Linux
Importance: P1 high
Assignee: Trond Myklebust
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-06-21 15:02 UTC by Denny Dalessandro
Modified: 2023-03-28 17:29 UTC
CC List: 7 users

See Also:
Kernel Version: 5.18-rc2
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Client side trace points (1.48 MB, text/plain)
2022-06-22 15:33 UTC, Denny Dalessandro
Client side trace points update (210.54 KB, application/x-gzip)
2022-06-22 21:46 UTC, Denny Dalessandro
Server side trace points (1.01 MB, application/x-gzip)
2022-06-22 21:47 UTC, Denny Dalessandro

Description Denny Dalessandro 2022-06-21 15:02:43 UTC
As reported in:
https://marc.info/?l=linux-nfs&m=165115109917372&w=4

There is an apparent regression in the NFS client starting with 5.18 kernels. The testing being done is with NFS/RDMA on OPA hardware.

To start an NFS server I am doing the following:

# cat start_nfs_server.sh
#!/bin/bash
systemctl start opafm
sleep 10
systemctl start nfs-server
modprobe svcrdma
echo "rdma 20049" > /proc/fs/nfsd/portlist
mkdir -p /mnt/nfs_test
mount -t tmpfs -o size=4096M tmpfs /mnt/nfs_test
exportfs -o fsid=0,rw,async,insecure,no_root_squash 192.168.254.0/255.255.255.0:/mnt/nfs_test
exportfs -s
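
As a sanity check (an optional addition, not part of the original script), the RDMA listener and export can be confirmed on the server before starting the client:

cat /proc/fs/nfsd/portlist    # should include a "rdma 20049" line
exportfs -v                   # should show /mnt/nfs_test with the options above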

Now to start the client:
# cat start_nfs_client.sh
#!/bin/bash
mkdir -p /mnt/nfs_test
mount -o rdma,port=20049 192.168.254.1:/mnt/nfs_test /mnt/nfs_test
ls /mnt/nfs_test
df -h
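
To confirm the mount actually negotiated the RDMA transport (again an optional check, not part of the original script):

nfsstat -m                    # mount options should include proto=rdma,port=20049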

The test can be distilled down to:
# cat nfs_test.sh
#!/bin/bash

fsize=268435456
jfile=/dev/shm/run_nfs_test.junk
nfsfile=/mnt/nfs_test/run_nfs_test.junk
tfile=/dev/shm/run_nfs_test.tmp
fail=0

echo "Creating $jfile..."
dd if=/dev/urandom iflag=fullblock of=${jfile} bs=1024 count=$((fsize/1024));
echo "Done";

for i in {1..2500}; do
        echo "Test $i FailCount=$fail";

        start=`date +%s`;

        echo "copy $jfile to $nfsfile...";
        time cp ${jfile} ${nfsfile};
        echo "Done";

        echo "copy $nfsfile to $tfile...";
        time cp ${nfsfile} ${tfile};       <--- can take more than 60s vs .02s
        echo "Done";

        stop=`date +%s`;

        echo "Comparing files...";
        cmp ${jfile} ${tfile};
        if [[ $? -ne 0 ]]; then
                echo "Filed comparison failed";
                exit 1;
        fi
        echo "Done";

        echo "Remove $tfile...";
        time rm -f ${tfile};
        echo "Done";


        echo "Remove $nfsfile...";
        time rm -f ${nfsfile};
        echo "Done";

        dur=$(($stop-$start));
        if [[ $dur -gt 2 ]]; then
                echo "FAIL: Took too long\n";
                fail=$(($fail+1));

                if [[ $fail -gt 10 ]]; then
                        echo "Fail limit reached. Bailing."
                        exit 1
                fi
        fi
done

echo "Remove $jfile...";
time rm -f ${jfile};
echo "Done";

echo "Total Failures: $fail"
Comment 1 kolga 2022-06-21 17:55:44 UTC
(In reply to Denny Dalessandro from comment #0)
Please provide some information, either through the kernel's perf tool or the nfs4/rpcrdma tracepoints, during the run. It might show where the time is being spent.
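
A rough sketch of the perf option would be a system-wide profile spanning one slow iteration (the 70s duration is an assumption, sized to cover a >60s copy):

perf record -a -g -- sleep 70
perf report --sort comm,dso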
Comment 2 Denny Dalessandro 2022-06-21 18:13:13 UTC
Since being asked on the mailing list, I have run using TCP. This was done with IPoIB, and I ran 2500 iterations without a single test failure.

Any specific trace points? I mean, I can pick and choose; it looks like there are about 100 of them. Perhaps...

svcrdma_post_recv
svcrdma_post_send
svcrdma_sq_full
svcrdma_wc_recv
svcrdma_wc_send

xprtrdma_post_send
xprtrdma_post_recv
xprtrdma_wc_send
Comment 3 kolga 2022-06-21 18:26:00 UTC
(In reply to Denny Dalessandro from comment #2)

Given that we don't know where the pause is, I'd say enable all. You are missing 4 minutes, right? Somewhere you should be able to see that in the tracepoints.
Comment 4 Denny Dalessandro 2022-06-21 18:28:48 UTC
OK, will enable all. It's actually like 1 minute and change now. It varies, though.
Comment 5 Denny Dalessandro 2022-06-22 15:31:13 UTC
Here is the nfs4 and rpcrdma trace. I'm betting bugzilla will garble the format, so I'll also add it as an attachment. Note that NFS is also used to talk to awfm-01 and awfm-02 in the logs below. The issue is with 192.168.254.1, which is the host doing RDMA.

 6299    kworker/11:0H-7802    [011] ..... 90443.786164: xprtrdma_post_recvs: peer=[192.168.254.1]:20049 cq.id=60 7 new recvs, 64 active
 6300    kworker/11:0H-7802    [011] ..... 90443.786164: xprtrdma_reply: task:000005f4@0000000a xid=0xbb6c7067 credits=64
 6301    kworker/11:0H-7802    [011] ..... 90443.786165: xprtrdma_mr_localinv: task:000005f4@0000000a mr.id=5092 nents=32 131072@0xbb6c70679eb0f000:0x13e4e55c (FROM_DEVICE)
 6302    kworker/11:0H-7802    [011] ..... 90443.786165: xprtrdma_mr_localinv: task:000005f4@0000000a mr.id=5093 nents=96 393216@0xbb6c70679eaaf000:0x13e5e698 (FROM_DEVICE)
 6303    kworker/10:0H-71      [010] ..... 90443.786179: xprtrdma_wc_send: cq.id=59 cid=52 status=SUCCESS (0/0x0)
 6304    kworker/11:0H-7802    [011] ..... 90443.786341: xprtrdma_wc_receive: cq.id=60 cid=184 status=SUCCESS (0/0x0) received=180
 6305    kworker/11:0H-7802    [011] ..... 90443.786341: xprtrdma_post_recvs: peer=[192.168.254.1]:20049 cq.id=60 0 new recvs, 70 active
 6306    kworker/11:0H-7802    [011] ..... 90443.786342: xprtrdma_reply: task:000005f5@0000000a xid=0xbc6c7067 credits=64
 6307    kworker/11:0H-7802    [011] ..... 90443.786342: xprtrdma_mr_reminv: task:000005f5@0000000a mr.id=5091 nents=64 262144@0xbc6c70679eb2f000:0x13e3e49e (FROM_DEVICE)
 6308    kworker/11:0H-7802    [011] ..... 90443.786342: xprtrdma_mr_unmap: mr.id=5091 nents=64 262144@0xbc6c70679eb2f000:0x13e3e49e (FROM_DEVICE)
 6309    kworker/11:0H-7802    [011] ..... 90443.786343: xprtrdma_decode_seg: 262144@0xbc6c70679eb2f000:0x13e3e49e
 6310    kworker/u57:2-31533   [000] ..... 90498.612659: nfs4_setup_sequence: session=0xe565b3b4 slot_nr=0 seq_nr=1492 highest_used_slotid=0
 6311    kworker/u57:1-31484   [002] ..... 90498.612689: nfs4_setup_sequence: session=0x96a8c42c slot_nr=0 seq_nr=1496 highest_used_slotid=0
 6312    kworker/u57:1-31484   [002] ..... 90498.612974: nfs4_sequence_done: error=0 (OK) session=0x96a8c42c slot_nr=0 seq_nr=1496 highest_slotid=29 target_highest_slotid=29 status_flags=0x0 ()
 6313    kworker/u57:1-31484   [002] ..... 90498.612977: nfs4_sequence: error=0 (OK) dstaddr=awfm-02
 6314    kworker/u57:2-31533   [000] ..... 90498.612987: nfs4_sequence_done: error=0 (OK) session=0xe565b3b4 slot_nr=0 seq_nr=1492 highest_slotid=9 target_highest_slotid=9 status_flags=0x0 ()
 6315    kworker/u57:2-31533   [000] ..... 90498.612990: nfs4_sequence: error=0 (OK) dstaddr=awfm-01
 6316    kworker/u57:1-31484   [002] ..... 90506.804628: nfs4_setup_sequence: session=0x32fd4c6d slot_nr=0 seq_nr=1800 highest_used_slotid=0
 6317    kworker/u57:1-31484   [002] ..... 90506.804681: xprtrdma_disconnect: peer=[192.168.254.1]:20049 rc=0 connection status=1
 6318    kworker/11:0H-7802    [011] ..... 90506.804705: xprtrdma_wc_receive: cq.id=60 cid=183 status=WR_FLUSH_ERR (5/0x0) received=0
 6319    kworker/11:0H-7802    [011] ..... 90506.804707: xprtrdma_wc_receive: cq.id=60 cid=180 status=WR_FLUSH_ERR (5/0x0) received=0
.
.
.
 6385    kworker/11:0H-7802    [011] ..... 90506.804730: xprtrdma_wc_receive: cq.id=60 cid=187 status=WR_FLUSH_ERR (5/0x0) received=0
 6386    kworker/11:0H-7802    [011] ..... 90506.804730: xprtrdma_wc_receive: cq.id=60 cid=141 status=WR_FLUSH_ERR (5/0x0) received=0
 6387    kworker/11:0H-7802    [011] ..... 90506.804730: xprtrdma_wc_receive: cq.id=60 cid=186 status=WR_FLUSH_ERR (5/0x0) received=0
 6388    kworker/10:2H-7733    [010] ..... 90506.804733: xprtrdma_wc_li: cq.id=59 mr.id=5092 status=WR_FLUSH_ERR (5/0x0)
 6389    kworker/10:2H-7733    [010] ..... 90506.804735: xprtrdma_wc_li_done: cq.id=59 mr.id=5093 status=WR_FLUSH_ERR (5/0x0)
 6390    kworker/u57:3-31266   [001] ..... 90506.804744: nfs4_sequence_done: error=0 (OK) session=0x32fd4c6d slot_nr=0 seq_nr=1800 highest_slotid=63 target_highest_slotid=63 status_flags=0x0 ()
 6391    kworker/u57:3-31266   [001] ..... 90506.804747: nfs4_sequence: error=0 (OK) dstaddr=172.20.10.17
 6392    kworker/u57:1-31484   [002] ..... 90506.836670: xprtrdma_mr_unmap: mr.id=5094 nents=1 4@0x000000003bcf432c:0x13e6e701 (FROM_DEVICE)
 6393    kworker/u57:1-31484   [002] ..... 90506.842672: xprtrdma_mr_unmap: mr.id=5093 nents=96 393216@0xbb6c70679eaaf000:0x13e5e698 (FROM_DEVICE)
 6394    kworker/u57:1-31484   [002] ..... 90506.848641: xprtrdma_mr_unmap: mr.id=5092 nents=32 131072@0xbb6c70679eb0f000:0x13e4e55c (FROM_DEVICE)
 6395    kworker/u57:1-31484   [002] ..... 90506.878694: xprtrdma_op_connect: peer=[192.168.254.1]:20049 delay=0
 6396    kworker/u57:1-31484   [002] ..... 90506.879801: xprtrdma_post_recv: cq.id=62 cid=1
 6397    kworker/u57:1-31484   [002] ..... 90506.879802: xprtrdma_post_recvs: peer=[192.168.254.1]:20049 cq.id=62 1 new recvs, 0 active
 6398     kworker/12:1-7741    [012] ..... 90506.880263: xprtrdma_inline_thresh: 192.168.254.2:36234 -> 192.168.254.1:20049 neg send/recv=4096/4096, calc send/recv=3924/3980
 6399    kworker/u57:1-31484   [002] ..... 90506.880439: xprtrdma_createmrs: peer=[192.168.254.1]:20049 created 5 MRs
 6400    kworker/u57:1-31484   [002] ..... 90506.880441: xprtrdma_mr_map: task:00000000@ffffffff mr.id=5104 nents=1 4@0x00000000d982372c:0x13f0f101 (FROM_DEVICE)
 6401    kworker/u57:1-31484   [002] ..... 90506.880441: xprtrdma_mr_fastreg: task:00000000@ffffffff mr.id=5104 nents=1 4@0x00000000d982372c:0x13f0f101 (FROM_DEVICE)
 6402    kworker/u57:1-31484   [002] ..... 90506.880444: xprtrdma_connect: peer=[192.168.254.1]:20049 rc=0 connection status=1
 6403    kworker/u57:1-31484   [002] ..... 90506.880457: xprtrdma_mr_map: task:000005f4@0000000a mr.id=5103 nents=96 393216@0xbb6c70679eaaf000:0x13eff001 (FROM_DEVICE)
 6404    kworker/u57:1-31484   [002] ..... 90506.880458: xprtrdma_chunk_write: task:000005f4@0000000a 393216@0xbb6c70679eaaf000:0x13eff001 (more)
 6405    kworker/u57:1-31484   [002] ..... 90506.880459: xprtrdma_mr_map: task:000005f4@0000000a mr.id=5102 nents=32 131072@0xbb6c70679eb0f000:0x13eeef01 (FROM_DEVICE)
 6406    kworker/u57:1-31484   [002] ..... 90506.880459: xprtrdma_chunk_write: task:000005f4@0000000a 131072@0xbb6c70679eb0f000:0x13eeef01 (last)
 6407    kworker/u57:1-31484   [002] ..... 90506.880460: xprtrdma_marshal: task:000005f4@0000000a xid=0xbb6c7067 hdr=68 xdr=108/0/0 pullup/write list
 6408    kworker/u57:1-31484   [002] ..... 90506.880460: xprtrdma_mr_fastreg: task:000005f4@0000000a mr.id=5102 nents=32 131072@0xbb6c70679eb0f000:0x13eeef01 (FROM_DEVICE)
 6409    kworker/u57:1-31484   [002] ..... 90506.880460: xprtrdma_mr_fastreg: task:000005f4@0000000a mr.id=5103 nents=96 393216@0xbb6c70679eaaf000:0x13eff001 (FROM_DEVICE)
 6410    kworker/u57:1-31484   [002] ..... 90506.880461: xprtrdma_post_send: task:000005f4@0000000a cq.id=61 cid=3 (2 SGEs)
 6411    kworker/13:2H-7769    [013] ..... 90506.880864: xprtrdma_wc_receive: cq.id=62 cid=1 status=SUCCESS (0/0x0) received=196
 6412    kworker/13:2H-7769    [013] ..... 90506.880868: xprtrdma_post_recv: cq.id=62 cid=132
.
.
.
11599    kworker/u57:3-31266   [001] ..... 90507.048475: xprtrdma_post_send: task:00000719@0000000a cq.id=61 cid=35 (2 SGEs)
11600    kworker/13:2H-7769    [013] ..... 90507.048692: xprtrdma_wc_receive: cq.id=62 cid=199 status=SUCCESS (0/0x0) received=180
11601    kworker/13:2H-7769    [013] ..... 90507.048694: xprtrdma_post_recvs: peer=[192.168.254.1]:20049 cq.id=62 0 new recvs, 66 active
11602    kworker/13:2H-7769    [013] ..... 90507.048694: xprtrdma_reply: task:00000719@0000000a xid=0xe06d7067 credits=64
11603    kworker/13:2H-7769    [013] ..... 90507.048695: xprtrdma_mr_reminv: task:00000719@0000000a mr.id=5103 nents=64 262144@0xe06d7067db46f000:0x13eff094 (FROM_DEVICE)
11604    kworker/13:2H-7769    [013] ..... 90507.048695: xprtrdma_mr_unmap: mr.id=5103 nents=64 262144@0xe06d7067db46f000:0x13eff094 (FROM_DEVICE)
11605    kworker/13:2H-7769    [013] ..... 90507.048696: xprtrdma_decode_seg: 262144@0xe06d7067db46f000:0x13eff094
11606             date-31541   [008] ..... 90507.050087: nfs4_setup_sequence: session=0x32fd4c6d slot_nr=0 seq_nr=1801 highest_used_slotid=0
11607             date-31541   [008] ..... 90507.050259: nfs4_map_name_to_uid: error=0 (OK) id=100001215 name=100001215
11608             date-31541   [008] ..... 90507.050260: nfs4_map_group_to_gid: error=0 (OK) id=100000514 name=100000514
11609             date-31541   [008] ..... 90507.050263: nfs4_sequence_done: error=0 (OK) session=0x32fd4c6d slot_nr=0 seq_nr=1801 highest_slotid=63 target_highest_slotid=63 status_flags=0x0 ()
11610             date-31541   [008] ..... 90507.050269: nfs4_getattr: error=0 (OK) fileid=00:36:38626666 fhandle=0xbe19d387 valid=TYPE|MODE|NLINK|OWNER|GROUP|RDEV|SIZE|FSID|FILEID|ATIME|MTIME|CTIME|CHANGE|0x400200
Comment 6 Denny Dalessandro 2022-06-22 15:33:46 UTC
Created attachment 301255 [details]
Client side trace points

Here is the result of:

        echo 1 > /sys/kernel/debug/tracing/events/rpcrdma/enable
        echo 1 > /sys/kernel/debug/tracing/events/nfs4/enable

executed just before the cp from nfs and stopped after completion.
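
For reference, a sketch of the full capture sequence (filenames are illustrative):

echo > /sys/kernel/debug/tracing/trace    # clear the ring buffer
echo 1 > /sys/kernel/debug/tracing/events/rpcrdma/enable
echo 1 > /sys/kernel/debug/tracing/events/nfs4/enable
time cp /mnt/nfs_test/run_nfs_test.junk /dev/shm/run_nfs_test.tmp
echo 0 > /sys/kernel/debug/tracing/events/rpcrdma/enable
echo 0 > /sys/kernel/debug/tracing/events/nfs4/enable
cat /sys/kernel/debug/tracing/trace > client.trace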
Comment 7 kolga 2022-06-22 16:59:51 UTC
Denny, is your RDMA mount over v3? I don't see any nfs4 tracepoints that I would like to see from a cp command, like an open and a read. If your mount is over v3, please include "nfs" tracepoints instead of "nfs4".
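
One way to check which version the mount negotiated (a suggestion, not from the original report):

nfsstat -m                    # look for vers=3 vs vers=4.x in the mount options
grep nfs /proc/mounts         # same information, without the nfsstat tool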

What I also see in the tracepoints is an RDMA disconnect. That's not expected in a normal flow of things, so something went wrong. I also see about a 64-second gap from the last successful tracepoint to when the connection starts to get re-established. It would be nice to see the NFS layer tracepoints around this time for more info.

What Linux RDMA driver is being used here? Perhaps we can enable some tracepoints in it as well.

I would also like to update the bugzilla that I have tested the 5.18 code on my RDMA setup, which uses Mellanox CX-5 cards, and I saw no timing problems using your scripts that do the local-to-NFS and back copying.
Comment 8 Denny Dalessandro 2022-06-22 17:22:06 UTC
You know I assumed it was using v4. I will definitely go double check that.

The RDMA driver being used is hfi1. I'm the maintainer for it so I can add tracepoints/debugging anywhere that's useful. We make extensive use of tracepoints already so chances are I've got something that will help here.
Comment 9 Denny Dalessandro 2022-06-22 21:45:51 UTC
Turns out it was using v3. I have updated trace points and turned on some of the hfi1 traces. I'll upload the log file shortly. I also collected the same trace points on the server side and enabled ib_cm trace points.

On the client we still see the xprtrdma_disconnect. There doesn't seem to be any indication of why the connection was torn down.
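
A sketch of what that enablement looks like (the hfi1 subsystem names are assumptions based on the driver sources; adjust as needed):

echo 1 > /sys/kernel/debug/tracing/events/ib_cm/enable
echo 1 > /sys/kernel/debug/tracing/events/nfs/enable      # v3 client events, instead of nfs4
echo 'hfi1_rc:*' >> /sys/kernel/debug/tracing/set_event   # hfi1 splits events across hfi1_* subsystems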
Comment 10 Denny Dalessandro 2022-06-22 21:46:43 UTC
Created attachment 301257 [details]
Client side trace points update

Trace points for client side targeting nfs vs nfsv4 and adding ib_cm.
Comment 11 Denny Dalessandro 2022-06-22 21:47:55 UTC
Created attachment 301258 [details]
Server side trace points

Server side trace points to match the client side.
Comment 12 Denny Dalessandro 2022-07-05 16:09:12 UTC
Any other ideas on what to try? Other trace points that might be helpful?
Comment 13 Chuck Lever 2022-07-05 16:35:35 UTC
Looking at client.trace:

   kworker/u57:1-32640   [005] ..... 107211.048665: nfs_readpage_done: fileid=00:2d:2510 fhandle=0x3c3ad416 offset=91750400 count=262144 res=262144 status=262144
   kworker/u60:0-322     [000] d..1. 107212.234307: hfi1_rc_do_send: [0000:05:00.0] qpn 1 flag 0
   kworker/u60:0-322     [000] d..1. 107222.234399: hfi1_rc_do_send: [0000:05:00.0] qpn 1 flag 0
   kworker/u60:0-322     [000] d..1. 107232.234520: hfi1_rc_do_send: [0000:05:00.0] qpn 1 flag 0
   kworker/u60:0-322     [000] d..1. 107242.234609: hfi1_rc_do_send: [0000:05:00.0] qpn 1 flag 0
   kworker/u60:0-322     [000] d..1. 107252.234699: hfi1_rc_do_send: [0000:05:00.0] qpn 1 flag 0
   kworker/u60:0-322     [000] d..1. 107262.234851: hfi1_rc_do_send: [0000:05:00.0] qpn 1 flag 0
   kworker/u57:1-32640   [005] d..1. 107271.750033: icm_send_dreq: local_id=1503022084 remote_id=629322257 state=ESTABLISHED lap_state=LAP_UNINIT
   kworker/u57:1-32640   [005] d..2. 107271.750036: hfi1_rc_do_send: [0000:05:00.0] qpn 1 flag 0
   kworker/u57:1-32640   [005] ..... 107271.750042: xprtrdma_disconnect: peer=[192.168.254.1]:20049 rc=0 connection status=1
    kworker/8:0H-61      [008] ..... 107271.750050: hfi1_rc_do_send: [0000:05:00.0] qpn 4a flag 1
    kworker/3:1H-931     [003] ..... 107271.750068: xprtrdma_wc_receive: cq.id=74 cid=155 status=WR_FLUSH_ERR (5/0x0) received=0
    kworker/3:1H-931     [003] ..... 107271.750069: xprtrdma_wc_receive: cq.id=74 cid=154 status=WR_FLUSH_ERR (5/0x0) received=0
    kworker/3:1H-931     [003] ..... 107271.750070: xprtrdma_wc_receive: cq.id=74 cid=153 status=WR_FLUSH_ERR (5/0x0) received=0

 ...

    kworker/3:1H-931     [003] ..... 107271.750090: xprtrdma_wc_receive: cq.id=74 cid=191 status=WR_FLUSH_ERR (5/0x0) received=0
    kworker/3:1H-931     [003] ..... 107271.750090: xprtrdma_wc_receive: cq.id=74 cid=162 status=WR_FLUSH_ERR (5/0x0) received=0
    kworker/3:1H-931     [003] ..... 107271.750090: xprtrdma_wc_receive: cq.id=74 cid=161 status=WR_FLUSH_ERR (5/0x0) received=0
   kworker/u57:1-32640   [005] ..... 107271.750096: hfi1_rc_do_send: [0000:05:00.0] qpn 4a flag 0
   kworker/u60:0-322     [000] d..1. 107272.235001: hfi1_rc_do_send: [0000:05:00.0] qpn 1 flag 0
   kworker/u60:0-322     [000] d..1. 107272.752170: hfi1_rc_do_send: [0000:05:00.0] qpn 0 flag 0
   kworker/u60:1-1150    [012] d..1. 107272.752514: hfi1_rc_do_send: [0000:05:00.0] qpn 0 flag 0
   kworker/u60:0-322     [000] d..1. 107272.752698: hfi1_rc_do_send: [0000:05:00.0] qpn 0 flag 0
   kworker/u60:1-1150    [012] d..1. 107272.753017: hfi1_rc_do_send: [0000:05:00.0] qpn 0 flag 0
   kworker/u60:0-322     [000] d..1. 107272.753373: hfi1_rc_do_send: [0000:05:00.0] qpn 0 flag 0

xprtrdma's QP seems to be 0x4a; QP 0x01 is something else?

I don't see the Send Queue size in this trace. Could xprtrdma be overrunning its Send Queue? Enable the rdma_core and rdma_cma trace subsystems to see details of the RDMA resources allocated by RDMA consumers like xprtrdma.
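
Assuming the usual ftrace event layout, enabling those two subsystems would look like:

echo 1 > /sys/kernel/debug/tracing/events/rdma_core/enable
echo 1 > /sys/kernel/debug/tracing/events/rdma_cma/enable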
Comment 14 Chuck Lever 2022-07-05 16:38:02 UTC
(In reply to Chuck Lever from comment #13)

Well I edited out my initial observation, which is that hfi1 seems to be retransmitting RDMA Sends, if I understand the hfi1_rc_do_send trace events, but it's not getting RC ACKs or completions.
Comment 15 ehab.ababneh 2023-01-19 15:06:49 UTC
After trying to isolate suspicious patches, I am positive that the patch at https://lore.kernel.org/lkml/20211115165452.835502712@linuxfoundation.org/ caused the issue. Without it, our test, which is able to reproduce the issue, completes.
Comment 16 Trond Myklebust 2023-01-19 16:58:58 UTC
That patch is almost certainly a red herring. The only effect it can have here is to clear the page cache (which would explain why your second 'cp' now takes 60s instead of 0.02s).

Since you are running the script as root, I suggest adding a line

echo 1 > /proc/sys/vm/drop_caches

between these two lines

>         echo "Done";
> 
>         echo "copy $nfsfile to $tfile...";

...and see what the bisection now shows.
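
That is, the loop body would then read (echo 1 to drop_caches frees the page cache only):

        echo "copy $jfile to $nfsfile...";
        time cp ${jfile} ${nfsfile};
        echo "Done";

        echo 1 > /proc/sys/vm/drop_caches

        echo "copy $nfsfile to $tfile...";
        time cp ${nfsfile} ${tfile};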
Comment 17 ehab.ababneh 2023-01-23 02:53:15 UTC
I re-ran the test after adding the command you recommended above, and I am still getting the same error.
Comment 18 Trond Myklebust 2023-01-23 04:53:22 UTC
That's expected. Does it also see the error when you revert the patch that git bisect pointed to?
Comment 19 ehab.ababneh 2023-01-23 14:59:39 UTC
1. Plain v5.19 : I see the errors.
2. v5.19 + patch reverted : no errors.
3. Plain v5.19 + command you recommended above in the script: I see the errors.
Comment 20 Trond Myklebust 2023-01-23 16:13:41 UTC
The only interesting case is

4. v5.19 + patch reverted + "drop_caches" in the script: ?????

I'll bet that shows up as "I see the errors".
Comment 21 ehab.ababneh 2023-01-23 17:37:46 UTC
OK, maybe I misunderstood your previous message. I will give that a try.
Comment 22 ehab.ababneh 2023-01-30 16:56:53 UTC
I gave it a try and got the result you expected: I did see errors when using v5.19 + patch reverted + "drop_caches" in the script.

I am now investigating the relationship between the failure on one hand and some messages coming from the GSI on the other. I see that they always arrive together, at the same time and in a specific order.
Comment 23 Trond Myklebust 2023-01-30 18:13:27 UTC
Thanks for the update. Yes, it makes sense that the patch you pointed to was just unmasking an existing problem rather than being a root cause...
Comment 24 Denny Dalessandro 2023-01-30 19:36:06 UTC
So is the going theory that something is causing the caches to get dropped more often now? That would mean our performance problem is because of the cache drop.

OK, is there a way to isolate *why* the cache gets dropped? Something we can instrument, so we can trace back and see what the cause is.
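
One possible starting point, assuming the client is NFSv3 and the nfs trace subsystem is available (the event names are taken from the fs/nfs tracepoints and should be verified against the running kernel):

echo 1 > /sys/kernel/debug/tracing/events/nfs/nfs_invalidate_mapping_enter/enable
echo 1 > /sys/kernel/debug/tracing/events/nfs/nfs_invalidate_mapping_exit/enable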
Comment 25 ehab.ababneh 2023-02-03 15:15:45 UTC
It looks like dropping the caches makes it more likely to have transactions in flight when another component (the SDMA engine) wakes up to respond to interrupts. What I am observing points to the SDMA engine being the culprit.
Comment 26 ehab.ababneh 2023-03-28 17:29:12 UTC
I did more testing on this issue and I can confirm that the SDMA engine, which was suspected to be the culprit before, is not to blame. But I saw two client tasks sharing the queue pair and communicating with the server at the same time. I also see that this causes the server to return opcodes which are not what the client is expecting. This mix-up causes the client to not mark the task complete and eventually time out.

I also see that the test passes on 6.2-rc3. I went through the commits between that and 6.2.0 in RDMA and in the hfi1 driver, but none of them solves the problem on its own when applied to 5.19.0.
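
For reference, that kind of sweep can be scoped with something like (tag names and paths are assumptions about where the relevant code lives):

git log --oneline v6.2-rc3..v6.2 -- drivers/infiniband/core drivers/infiniband/hw/hfi1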

I do suspect that the issue is in the synchronization between the client threads, but when I look at what might have changed there in 6.2-rc3, I do not see anything new.

One note to add: when I add debugging code that introduces some delays between the client threads, the problem goes away, which makes me think it is some sort of synchronization issue.
