Bug 198053

Summary: NFSoRDMAv4 very unstable/slow. invalid request error/local length error
Product: File System
Reporter: Andreas Schrägle (bugs)
Component: NFS
Assignee: Chuck Lever (chucklever)
Status: RESOLVED CODE_FIX
Severity: normal
CC: anna, beck, c.kuntz, chucklever, ivaylo.markov, jziegler, mikko.tanner, trondmy
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.14.2
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: rpcdebug trace
NFS server trace
trace-cmd record -e sunrpc -e rpcrdma
rpc debug on client side (different run)
patch that adds trace points around Send/DMA map/unmap
attachment-18182-0.html
Proposed fix for bug 198053
Patched kernel 5.5 traces

Description Andreas Schrägle 2017-11-30 22:38:40 UTC
When trying to use NFSoRDMA with NFSv4.x, the share mounts fine and can be browsed, but trying to read files is incredibly slow and results in errors like this on the server:

[512890.665096] svcrdma: Send: invalid request error (9/0x8a)
[512900.904527] svcrdma: Send: invalid request error (9/0x8a)
[512921.382826] svcrdma: Send: invalid request error (9/0x8a)

and like this on the client:
[ 2059.784875] rpcrdma: Recv: local length error (1/0xd7)
[ 2064.899103] rpcrdma: Recv: local length error (1/0xd7)
[ 2075.139555] rpcrdma: Recv: local length error (1/0xd7)
[…]
[ 3394.073774] rpcrdma: localinv: memory management operation error (6/0xc9)
[ 3397.657259] rpcrdma: Recv: local length error (1/0xd7)
[ 3407.897604] rpcrdma: Recv: local length error (1/0xd7)
[ 3424.281591] rpcrdma: localinv: memory management operation error (6/0xc9)
[ 3428.378880] rpcrdma: Recv: local length error (1/0xd7)
[ 3454.489003] rpcrdma: localinv: memory management operation error (6/0xc9)
[ 3458.585606] rpcrdma: Recv: local length error (1/0xd7)
[ 3458.590797] rpcrdma: Recv: local length error (1/0xd7)
[ 3463.706362] rpcrdma: Recv: local length error (1/0xd7)
[ 3473.946508] rpcrdma: Recv: local length error (1/0xd7)
[ 3484.697233] rpcrdma: localinv: memory management operation error (6/0xc9)
[ 3494.426769] rpcrdma: Recv: local length error (1/0xd7)
[ 3514.905317] rpcrdma: localinv: memory management operation error (6/0xc9)

This occurs together with zfs on linux, but has also been reproduced with other filesystems like btrfs and ext4.

Some other people have also hit this bug and reported it at https://github.com/zfsonlinux/zfs/issues/6795, so there may be additional information there that I forgot or that did not seem relevant to me.

Since there is some work happening on the xprtrdma code, I wanted to try out 4.15, but have not gotten around to it yet, as ZFS on Linux does not support that kernel so far.
Comment 1 Chuck Lever 2017-12-01 15:14:03 UTC
I haven't seen anything like this. I test regularly with NFSv3, NFSv4.0, and NFSv4.1 using Mellanox CX-3 Pro, CX-4, and CX-5 cards in IB and RoCE mode. Red Hat has even broader hardware coverage.

NFS/RDMA is supposed to work in both mainline and in CentOS/RHEL 7. Thus, it's a new bug. I suggest the following next steps:

- Add more information to this bug about your platform (client and server hardware and software, fabric configuration, and so on)

- Would it be possible to bisect the mainline kernel to figure out where this behavior was introduced?
Comment 2 Chuck Lever 2017-12-03 18:06:26 UTC
Further details needed:

- Server O/S version is important to know

- Is there a simple reproducer? Can it be shared?

- Which NFS operations do you believe are triggering the issue?
Comment 3 Andreas Schrägle 2017-12-03 23:43:22 UTC
Thanks for your response and I apologise in advance for the following wall of text.

> - Add more information to this bug about your platform (client and server
> hardware and software, fabric configuration, and so on)

Right now, the server is running Arch Linux with kernel 4.14.3-1-ARCH. For testing purposes it could be switched to Ubuntu or CentOS, though. The people on github seem to be running CentOS 7.4.1708 (Linux 3.10.0-693.5.2.el7.x86_64) and Ubuntu 17.10.

As for hardware configuration, my setup consists of two ConnectX-2 cards. The people on github seem to have various hardware, including mlx4 and mlx5 cards.

Does this tell you what you need to know about the fabric configuration?
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        […]
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

> - Would it be possible to bisect the mainline kernel to figure out where
> this behavior was introduced?

Bisecting mainline would be possible, although it will take quite some time and would require knowing a working configuration first, right?

> - Is there a simple reproducer? Can it be shared?

Not yet, sorry. I will try my best to come up with a better reproducer than "maybe use zfs on linux or btrfs or something and try to set up a NFSoRDMA server", soon.

> - Which NFS operations do you believe are triggering the issue?

If I had to guess, I would say the READ NFS operation triggers this, whereas things like READDIR, ACCESS and LOOKUP seem fine. That is more or less pure guessing, though.

When turning on debugging on mlx4_core, the error messages above change to look like this:
client:
[   60.634572] mlx4_core 0000:07:00.0: Mapped 1 chunks/256 KB at 108040000 for ICM
[   60.657878] mlx4_core 0000:07:00.0: Registering MAC: 0x2c953895c for port 1 without duplicate
[   72.283214] rpcrdma: Recv: local length error (1/0xd7)
[   72.285567] mlx4_core 0000:07:00.0: Have more references for index 0, no need to modify mac table
[   72.287370] mlx4_core 0000:07:00.0: Registering MAC: 0x2c953895c for port 1 without duplicate
[   72.289511] rpcrdma: Recv: local length error (1/0xd7)
[   77.494830] mlx4_core 0000:07:00.0: Have more references for index 0, no need to modify mac table
[   77.496597] mlx4_core 0000:07:00.0: Registering MAC: 0x2c953895c for port 1 without duplicate
[   77.498924] rpcrdma: Recv: local length error (1/0xd7)
[   87.521517] mlx4_core 0000:07:00.0: Have more references for index 0, no need to modify mac table
[   87.523846] mlx4_core 0000:07:00.0: Registering MAC: 0x2c953895c for port 1 without duplicate
[…]
server:
[ 1047.640089] mlx4_core 0000:01:00.0: Registering MAC: 0x2c953deec for port 1 without duplicate
[ 1210.438539] mlx4_core 0000:01:00.0: Registering MAC: 0x2c953deec for port 1 without duplicate
[ 1216.780935] svcrdma: Send: invalid request error (9/0x8a)                                        
[ 1216.783404] mlx4_core 0000:01:00.0: Have more references for index 0, no need to modify mac table
[ 1216.783840] mlx4_core 0000:01:00.0: Registering MAC: 0x2c953deec for port 1 without duplicate
[ 1216.786421] svcrdma: Send: invalid request error (9/0x8a)                                        
[ 1221.902710] mlx4_core 0000:01:00.0: Have more references for index 0, no need to modify mac table
[ 1221.903990] mlx4_core 0000:01:00.0: Registering MAC: 0x2c953deec for port 1 without duplicate
[…]
Comment 4 Chuck Lever 2017-12-04 15:48:19 UTC
(In reply to Andreas Schrägle from comment #3)
> Does this tell you what you need to know about the fabric configuration?
> hca_id: mlx4_0
>         transport:                      InfiniBand (0)
>         […]
>         phys_port_cnt:                  1
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                4096 (5)
>                         active_mtu:             4096 (5)
>                         sm_lid:                 0
>                         port_lid:               0
>                         port_lmc:               0x00
>                         link_layer:             Ethernet

I see that link_layer is Ethernet. That implies RoCE mode. Can you confirm that? Is it hardware RoCE, or soft RoCE?


> > - Is there a simple reproducer? Can it be shared?
> 
> Not yet, sorry. I will try my best to come up with a better reproducer than
> "maybe use zfs on linux or btrfs or something and try to set up a NFSoRDMA
> server", soon.

Bisecting quickly is going to require a simple reproducer, so let's start with finding a reproducer.


> > - Which NFS operations do you believe are triggering the issue?
> 
> If I had to guess, I would say the READ NFS operation triggers this, whereas
> things like READDIR, ACCESS and LOOKUP seem fine. That is more or less pure
> guessing, though.
> 
> When turning on debugging on mlx4_core, the error messages above change to
> look like this:
> [client and server log output snipped; see comment 3]

These mlx4 debugging messages do not seem relevant to the problem. Can you try this on your client, once you have a simple way to reproduce the issue:

  # rpcdebug -m rpc -s call trans

This will dump a lot of debugging into /var/log/messages, and will cause NFS to run slowly. Please disable mlx4 driver debugging to reduce log traffic during this test.

The benefit is we should be able to see exactly which NFS operation is causing the issue, what parameters it is given, and how the transport is using RDMA to convey the operation and its parameters.

To turn RPC debug messages off when you are done:

  # rpcdebug -m rpc -c

Once you have found the line with "rpcrdma: Recv: local length error (1/0xd7)" copy the previous few dozen lines into a comment in this bug.

On CentOS / RHEL 7 systems, there is a log message rate limiter. This makes it imperative that the reproducer be simple and quick so that all of the relevant log messages will be saved in the log.
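
For reference, the client-side steps above amount to a short root session along these lines (a sketch only; the server address, export, and mount point below are placeholders, not values from this report):

  # Client-side capture, combined into one root session.
  SERVER=192.0.2.1
  EXPORT=/export
  MNT=/mnt/test
  rpcdebug -m rpc -s call trans        # enable RPC call/transport debugging
  mount -t nfs4 -o proto=rdma,port=20049 ${SERVER}:${EXPORT} ${MNT}
  cat ${MNT}/test.file > /dev/null     # trigger an NFS READ
  umount ${MNT}
  rpcdebug -m rpc -c                   # turn debugging back off
  # The debug output lands in the kernel log / /var/log/messages.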
Comment 5 Andreas Schrägle 2017-12-10 22:13:44 UTC
Created attachment 261089 [details]
rpcdebug trace
Comment 6 Andreas Schrägle 2017-12-10 22:17:00 UTC
(In reply to Chuck Lever from comment #4)
> (In reply to Andreas Schrägle from comment #3)
> > Does this tell you what you need to know about the fabric configuration?
> > hca_id: mlx4_0
> >         transport:                      InfiniBand (0)
> >         […]
> >         phys_port_cnt:                  1
> >                 port:   1
> >                         state:                  PORT_ACTIVE (4)
> >                         max_mtu:                4096 (5)
> >                         active_mtu:             4096 (5)
> >                         sm_lid:                 0
> >                         port_lid:               0
> >                         port_lmc:               0x00
> >                         link_layer:             Ethernet
> 
> I see that link_layer is Ethernet. That implies RoCE mode. Can you confirm
> that? Is it hardware RoCE, or soft RoCE?

Yes, this is hardware RoCE.

> > [quoted reproducer discussion and log output snipped; see comments 3 and 4]
> 
> These mlx4 debugging messages do not seem relevant to the problem. Can you
> try this on your client, once you have a simple way to reproduce the issue:
> 
>   # rpcdebug -m rpc -s call trans
> 
> This will dump a lot of debugging into /var/log/messages, and will cause NFS
> to run slowly. Please disable mlx4 driver debugging to reduce log traffic
> during this test.
> 
> The benefit is we should be able to see exactly which NFS operation is
> causing the issue, what parameters it is given, and how the transport is
> using RDMA to convey the operation and its parameters.
> 
> To turn RPC debug messages off when you are done:
> 
>   # rpcdebug -m rpc -c
> 
> Once you have found the line with "rpcrdma: Recv: local length error
> (1/0xd7)" copy the previous few dozen lines into a comment in this bug.

I attached the whole log to the report. It was created with:

  rpcdebug -m rpc -s call trans
  mount -v -t nfs4 -o port=20049,vers=4.0,proto=rdma 192.168.127.1:/test /test
  timeout --signal=KILL 30 cat /test/test.file > /dev/null
  umount /test
  rpcdebug -m rpc -c

> On CentOS / RHEL 7 systems, there is a log message rate limiter. This makes
> it imperative that the reproducer be simple and quick so that all of the
> relevant log messages will be saved in the log.

While I was quite sure I had observed this with filesystems other than ZFS (on Linux), I could not reproduce it in my most recent tests with them.

So the way to reproduce this would be to create a zpool and export a dataset from it, then execute something like the above on a client.
Comment 7 Chuck Lever 2017-12-11 16:30:15 UTC
The mount under test:
 - NFSv4.0 mount (SETCLIENTID / SETCLIENTID_CONFIRM)
 - with sec=sys (Couldn't create auth handle (flavor 390004))
 - rsize=wsize=1MB (because I see 1MB READ operations)
 - The inline threshold is 4096 (rpcrdma_update_connect_private: max send 4096, max recv 4096)

The first failing request appears to be task ID 121, which is a 1MB NFSv4 READ. It is the first NFS READ operation in the capture.

kernel: RPC:   121 call_start nfs4 proc READ (async)

The client believes the non-payload part of the server's Reply will be smaller than 4KB.

kernel: RPC:   121 xprt_rdma_allocate: send size = 536, recv size = 244, req = ffff93ce37fce000

The client provides a Write list with one 1MB Write chunk, and no Reply chunk.

kernel: RPC:       frwr_op_map: Using frmr ffff93ce7c0dae68 to map 256 segments (1048576 bytes)
kernel: RPC:   122 rpcrdma_encode_write_list: 1048576@0x016435745000:0x00004f02 (last)
kernel: RPC:   122 rpcrdma_marshal_req: inline/write list: hdrlen 52 rpclen

kernel: rpcrdma: Recv: local length error (1/0xd7)

A Local length error occurs on a Receive when the remote has posted a Send larger than the local Receive buffers. The client's Receive buffer length is 4096 bytes. If we take the error at face value, the server has posted a Send larger than 4096 bytes while sending the READ reply, which it should never do.

A possible workaround is to boost the size of the client's Receive buffers, which would also tell us whether the server's Reply is just over the inline limit or something else is going on. Try this before you mount with proto=rdma:

 $ sudo echo 8192 > /proc/sys/sunrpc/rdma_max_inline_read
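
One practical note: under sudo the shell redirection runs as the invoking user, so writing the sysctl via tee (or from a root shell) is the safer form. A minimal sketch, run before the proto=rdma mount:

  # Raise the client's inline read threshold before mounting with proto=rdma.
  # ("sudo echo 8192 > file" performs the redirection as the invoking user,
  # so use tee or a root shell for the write itself.)
  echo 8192 | sudo tee /proc/sys/sunrpc/rdma_max_inline_read
  cat /proc/sys/sunrpc/rdma_max_inline_read    # confirm the new value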

Meanwhile, it might be interesting to try the same reproducer with proto=tcp, and capture the network traffic using tcpdump. This would tell me more about the contents of the server's NFSv4 READ responses. Is it sending more GETATTR data than the client expects? Is it returning a large ACL? Are the filehandles unexpectedly large?

On the client, before running the reproducer (with proto=tcp):

 # tcpdump -s0 -w /tmp/sniffer.pcap

After the reproducer completes, gzip /tmp/sniffer.pcap, and attach it to this bug report. I need to see just the first few NFSv4 READ replies. You may need to specify the Ethernet device explicitly on the tcpdump command line if your client has more than one.
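
Put together, the proto=tcp comparison run might look like the sketch below; the interface, server address, export, mount point, and file name are placeholders, and the port 2049 filter just keeps the capture small:

  # Capture the NFSv4 traffic over TCP for comparison (all names below are
  # placeholders).
  IFACE=eth0
  SERVER=192.0.2.1
  EXPORT=/export
  MNT=/mnt/test
  tcpdump -s0 -i ${IFACE} -w /tmp/sniffer.pcap port 2049 &
  TCPDUMP_PID=$!
  mount -t nfs4 -o proto=tcp ${SERVER}:${EXPORT} ${MNT}
  cat ${MNT}/test.file > /dev/null     # same reproducer, TCP transport
  umount ${MNT}
  kill ${TCPDUMP_PID}
  gzip /tmp/sniffer.pcap               # attach /tmp/sniffer.pcap.gz here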
Comment 8 Joakim Ziegler 2019-05-28 22:08:53 UTC
I am the person who reported the original bug on the ZFS on Linux bugtracker. I'm wondering if there's been any progress on this. I'm about to start some new NFS testing myself, and I'd be willing to test this again if that will help.
Comment 9 Chuck Lever 2019-05-29 15:19:57 UTC
Joakim - Because ZFS is an out-of-kernel file system, I don't have ready access to a system with it. I am relying on the Reporters to reproduce the issue, and since late 2017 no-one has responded to me. Thus no further progress has been made.

If you have some time to help, that would be appreciated.
Comment 10 Joakim Ziegler 2019-05-29 17:34:43 UTC
The original bug report states "This occurs together with zfs on linux, but has also been reproduced with other filesystems like btrfs and ext4", so although I originally discovered it on ZFS, I don't believe it's ZFS specific.

Anyway, I'm willing to help out with this as needed. I do not program much, but I'm pretty decent at testing and reporting. I will do some testing on a new setup fairly soon, and I'll see if I run into this out of the box, or if it will require some further prodding to show up again (or maybe it's been drive-by fixed by changes since then, who knows?).
Comment 11 Chuck Lever 2019-05-29 17:53:33 UTC
Comment 6 above states:

> While I was quite sure I observed this with other filesystems than ZFS
> (on Linux), I could not reproduce this in my most recent tests with them.

Red Hat tests NFS/RDMA with the set of filesystems they support in RHEL, and they have not reported a similar problem.

Thanks for any help!
Comment 12 Joakim Ziegler 2019-05-29 22:55:29 UTC
Understood. I will probably be testing this between this week and next. One question, I'm on CentOS 7.5, and I'm planning on testing this with kernel 3.10.0.957.12.2.el7. Is that sufficiently updated that my test data will be useful?
Comment 13 Chuck Lever 2019-05-30 14:27:46 UTC
(In reply to Joakim Ziegler from comment #12)
> Understood. I will probably be testing this between this week and next. One
> question, I'm on CentOS 7.5, and I'm planning on testing this with kernel
> 3.10.0.957.12.2.el7. Is that sufficiently updated that my test data will be
> useful?

Red Hat keeps the NFS/RDMA bits close to upstream, so this should be an adequate starting place. However, I am not strictly speaking familiar with exactly what they have in the code base. Let's see what your test results are.
Comment 14 Mikko Tanner 2019-07-30 00:38:42 UTC
I am experiencing this very same bug. I believe it has been around at least since the Ubuntu 16.04 days (kernel 4.4 or 4.10): that is when I built the Infiniband cluster I am still using as a lab setup, and my notes from back then show that NFSv4 with RDMA was not working while v3 worked fine.

Back to today, with Ubuntu 18.04 and hwe-edge kernel 5.0.0.16.72 (5.0.8 mainline) on both client and server, this bug still exists. ZFS has been in use for the whole time as well, from 0.6.x to 0.8.1 now.

Infiniband hardware is as follows:
- HCA: Mellanox MCX354A-FCBT (ConnectX3) on server and client
- Switch: Voltaire 4036 QDR
- OpenSM 3.3.20 (overrides the switch)

I will try to do some RPC debugging in the next few days as per Chuck's suggestions. In the meantime, I set up a quick test with an exported ramdisk (tmpfs) and nfsv4 mount, which failed to reproduce the issue, pointing squarely at ZFS and RDMA interaction on the server side.

mount options for ramdisk:
10.10.10.6:/ramdisk on /mnt/tmp type nfs4 (rw,noatime,vers=4.2,rsize=524288,wsize=524288,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,clientaddr=10.10.10.120,local_lock=none,addr=10.10.10.6,_netdev)

RDMA transfer speed hovers around 500MB/s both to/from the mounted tmpfs (v4) and to/from a ZFS-based share (v3).
Comment 15 Mikko Tanner 2019-07-30 00:55:04 UTC
Additionally, a reproducer is trivial:
- create a ZFS pool and filesystem
- copy random data to the filesystem
- modprobe svcrdma && echo rdma 20049 > /proc/fs/nfsd/portlist
- export said filesystem with kernel NFS server
- mount the filesystem on a client with -o vers=4.x,proto=rdma
- try to read/copy data from said filesystem

The above assumes an existing Infiniband network, obviously.
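
Spelled out as a shell session, under the assumption of a working RDMA fabric (the device, dataset names, addresses, and export options below are placeholders):

  # Server side: create the pool, put some data in it, and export it over RDMA.
  zpool create tank /dev/sdX           # /dev/sdX is a placeholder device
  zfs create tank/test
  dd if=/dev/urandom of=/tank/test/test.file bs=1M count=1
  modprobe svcrdma
  echo rdma 20049 > /proc/fs/nfsd/portlist
  echo "/tank/test 10.10.10.0/24(rw,no_root_squash)" >> /etc/exports
  exportfs -ra

  # Client side: mount with proto=rdma and read the file back.
  mount -t nfs4 -o vers=4.1,proto=rdma,port=20049 10.10.10.6:/tank/test /mnt
  cat /mnt/test.file > /dev/null       # expect "rpcrdma: Recv: local length error"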
Comment 16 Chuck Lever 2019-07-30 17:45:19 UTC
(In reply to Mikko Tanner from comment #14)
> I will try to do some RPC debugging in the next few days as per Chuck's
> suggestions. In the meantime, I set up a quick test with an exported ramdisk
> (tmpfs) and nfsv4 mount, which failed to reproduce the issue, pointing
> squarely at ZFS and RDMA interaction on the server side.

This is my conclusion as well. NFSv4 on RDMA has been working for other local filesystem types for a long while.

> RDMA transfer speed hovers around 500MB/s both to/from the mounted tmpfs
> (v4) and to/from a ZFS-based share (v3).

Your transfer speed result is about what I would expect.

Thanks for capturing more information. On newer upstream kernels, there is now a repertoire of trace points in the NFS and RPC code that might help you drill into this, if you are comfortable with that.
Comment 17 Joakim Ziegler 2019-08-02 00:34:42 UTC
Mikko, I'm curious, does the workaround mentioned in comment 7:

$ sudo echo 8192 > /proc/sys/sunrpc/rdma_max_inline_read

before mounting, fix the issue for you?
Comment 18 Joakim Ziegler 2019-08-04 02:48:15 UTC
So, I tested this on my own setup just now, and while NFS v3 works fine over RDMA, switching to v4 fills my dmesg on the client up with:

rpcrdma: Recv: local length error (1/0xd7)


While the server gets a bunch of:

[Sat Aug  3 21:36:11 2019] svcrdma: Could not send reply, err=-107. Closing transport.
[Sat Aug  3 21:36:12 2019] mlx5_0:dump_cqe:278:(pid 17909): dump error cqe
[Sat Aug  3 21:36:12 2019] 00000000 00000000 00000000 00000000
[Sat Aug  3 21:36:12 2019] 00000000 00000000 00000000 00000000
[Sat Aug  3 21:36:12 2019] 00000000 00000000 00000000 00000000
[Sat Aug  3 21:36:12 2019] 00000000 00008a12 01000162 096e4dd0
[Sat Aug  3 21:36:12 2019] svcrdma: Send: invalid request error (9/0x8a)


My mtab has this share as:
 rw,relatime,vers=4.1,rsize=262144,wsize=262144,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,clientaddr=10.20.0.6,local_lock=none,addr=10.20.0.1

The workaround mentioned earlier, echo 8192 > /proc/sys/sunrpc/rdma_max_inline_read, does not seem to make any difference.

This is on CentOS 7.6.1810, kernel 3.10.0-957.27.2.el7.x86_64 both on the server and client.


For what it's worth, I'm getting steady reads of just under 2GB/sec from this server with NFS v3. This is a FDR network with Mellanox adapters and a Mellanox SX6036 switch.
Comment 19 Chuck Lever 2019-08-09 19:24:05 UTC
(In reply to Mikko Tanner from comment #14)
> I will try to do some RPC debugging in the next few days as per Chuck's
> suggestions. In the meantime, I set up a quick test with an exported ramdisk
> (tmpfs) and nfsv4 mount, which failed to reproduce the issue, pointing
> squarely at ZFS and RDMA interaction on the server side.

Any update here? Even knowing which NFS request is the first failing operation could be helpful.
Comment 20 Joakim Ziegler 2019-08-09 19:25:11 UTC
(In reply to Chuck Lever from comment #19)
> (In reply to Mikko Tanner from comment #14)
> > I will try to do some RPC debugging in the next few days as per Chuck's
> > suggestions. In the meantime, I set up a quick test with an exported
> ramdisk
> > (tmpfs) and nfsv4 mount, which failed to reproduce the issue, pointing
> > squarely at ZFS and RDMA interaction on the server side.
> 
> Any update here? Even knowing which NFS request is the first failing
> operation could be helpful.

I'm fairly technical, but I've never done RPC debugging before. If you wouldn't mind kind of walking me through it, I could make time for this.
Comment 21 Chuck Lever 2019-08-09 20:27:34 UTC
(In reply to Joakim Ziegler from comment #20)
> I'm fairly technical, but I've never done RPC debugging before. If you
> wouldn't mind kind of walking me through it, I could make time for this.

Let's try the following, hoping CentOS 7.5 is recent enough to contain appropriate trace events in the server's RPC/RDMA transport.

- Become root on your NFS server.

- Install the trace-cmd tool on your NFS server (should be available in common CentOS repositories).

- On your NFS server, enter the command "trace-cmd record -e sunrpc -e rpcrdma" and hit return. Wait for the "Hit ^C to stop" message, then proceed to the next step.

- Reproduce the problem from a convenient client.

- Once the reproducer has failed, on your NFS server, stop the recording with "^C".

- Attach the result "trace.dat" file to this bugzilla report.

If you are curious, you can examine the recording using "trace-cmd report | less".
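
If an unattended run is easier, the recording can be bounded with a timeout instead of a manual ^C; a sketch (the 120-second window is arbitrary):

  # First check whether this kernel exposes server-side trace points at all
  # (on older CentOS/RHEL kernels this may print 0):
  trace-cmd list | grep -c svcrdma

  # Record while the client reproduces the failure; SIGINT after 120 seconds
  # has the same effect as hitting ^C.
  timeout --signal=INT 120 trace-cmd record -e sunrpc -e rpcrdma
  ls -lh trace.dat                     # attach this file to the bug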
Comment 22 Joakim Ziegler 2019-08-10 00:16:08 UTC
(In reply to Chuck Lever from comment #21)
> (In reply to Joakim Ziegler from comment #20)
> > I'm fairly technical, but I've never done RPC debugging before. If you
> > wouldn't mind kind of walking me through it, I could make time for this.

> Let's try the following, hoping CentOS 7.5 is recent enough to contain
> appropriate trace events in the server's RPC/RDMA transport.

> [Thorough explanation snipped]

This looks very manageable. I'll try to get this trace done this weekend, since I need to "break" NFS for one of the clients. Also, the NFS server is doing a lot of unrelated stuff during the week, which I suspect would generate very large trace files with a lot of irrelevant info cluttering everything up; on the weekend it's pretty much idle. I'll get back to you ASAP.
Comment 23 Chuck Lever 2019-08-10 19:24:29 UTC
(In reply to Joakim Ziegler from comment #22)
> the NFS server is doing a lot of unrelated stuff during the week, which I
> suspect will generate very large trace files with a lot of irrelevant info
> cluttering everything up.

Indeed, that is true.
Comment 24 Joakim Ziegler 2019-08-11 03:05:20 UTC
Created attachment 284317 [details]
NFS server trace

Here's a trace from the NFS server. Stuff done while this trace was running:

* Mounted shares on the client with protocol version 4.1

* On the client, did a cat of a fairly large binary file from one of the shares to /dev/null

* Waited a bit while observing this in messages on the client:

[Sat Aug 10 21:17:36 2019] rpcrdma: Recv: local length error (1/0xd7)
[Sat Aug 10 21:17:51 2019] rpcrdma: Recv: local length error (1/0xd7)
[Sat Aug 10 21:17:51 2019] rpcrdma: Recv: local length error (1/0xd7)
[Sat Aug 10 21:18:06 2019] rpcrdma: Recv: local length error (1/0xd7)
[Sat Aug 10 21:18:21 2019] rpcrdma: Recv: local length error (1/0xd7)
[Sat Aug 10 21:18:36 2019] rpcrdma: Recv: local length error (1/0xd7)
[Sat Aug 10 21:18:51 2019] rpcrdma: Recv: local length error (1/0xd7)
[Sat Aug 10 21:19:06 2019] rpcrdma: Recv: local length error (1/0xd7)
[Sat Aug 10 21:19:21 2019] rpcrdma: Recv: local length error (1/0xd7)
[Sat Aug 10 21:19:21 2019] rpcrdma: Recv: local length error (1/0xd7)
[Sat Aug 10 21:19:22 2019] rpcrdma: Recv: local length error (1/0xd7)
[Sat Aug 10 21:19:37 2019] rpcrdma: Recv: local length error (1/0xd7)
[Sat Aug 10 21:19:37 2019] rpcrdma: Recv: local length error (1/0xd7)
[Sat Aug 10 21:19:37 2019] rpcrdma: Recv: local length error (1/0xd7)
[Sat Aug 10 21:19:37 2019] rpcrdma: Recv: local length error (1/0xd7)
[Sat Aug 10 21:19:37 2019] rpcrdma: Recv: local length error (1/0xd7)
[Sat Aug 10 21:19:52 2019] rpcrdma: Recv: local length error (1/0xd7)
[Sat Aug 10 21:19:52 2019] rpcrdma: Recv: local length error (1/0xd7)
[Sat Aug 10 21:19:52 2019] rpcrdma: Recv: local length error (1/0xd7)
[Sat Aug 10 21:19:52 2019] rpcrdma: Recv: local length error (1/0xd7)

* Stopped the trace, bzipped it (too big for the attachment limit otherwise), and attached it

Hope this is helpful. Please let me know if there's anything more I can do.
Comment 25 Chuck Lever 2019-08-12 16:26:52 UTC
Downloaded and reviewed 190810-nfs_rdma_error-server-trace.dat. It appears that there are no svc_rdma trace points in this kernel, so I can't see the operation of the RDMA transport layer. Would it be difficult to update your server to CentOS 7.6 or later, and try again?
Comment 26 Joakim Ziegler 2019-08-12 16:28:08 UTC
This is in fact CentOS 7.6 with the kernel 3.10.0-957.27.2.el7.x86_64. I'm open to installing something else, if you'd tell me what (and maybe how, if it's complicated).
Comment 27 Chuck Lever 2019-08-12 16:36:18 UTC
CentOS 7.4 is mentioned above, sorry about that. Try this:

  # trace-cmd list | grep svcrdma

And I will ask my RH buddies what's up with server-side tracing.
Comment 28 Joakim Ziegler 2019-08-12 16:37:18 UTC
That command returns nothing.
Comment 29 Joakim Ziegler 2019-08-12 21:06:53 UTC
Since that command returned nothing, I thought it might provide some point of reference to run just:

# trace-cmd list | grep rdma

Which gives:

rpcrdma:xprtrdma_conn_upcall
rpcrdma:xprtrdma_disconnect
rpcrdma:xprtrdma_conn_start
rpcrdma:xprtrdma_conn_tout
rpcrdma:xprtrdma_create
rpcrdma:xprtrdma_destroy
rpcrdma:xprtrdma_remove
rpcrdma:xprtrdma_reinsert
rpcrdma:xprtrdma_reconnect
rpcrdma:xprtrdma_inject_dsc
rpcrdma:xprtrdma_qp_error
rpcrdma:xprtrdma_createmrs
rpcrdma:xprtrdma_nomrs
rpcrdma:xprtrdma_read_chunk
rpcrdma:xprtrdma_write_chunk
rpcrdma:xprtrdma_reply_chunk
rpcrdma:xprtrdma_marshal
rpcrdma:xprtrdma_post_send
rpcrdma:xprtrdma_post_recv
rpcrdma:xprtrdma_wc_send
rpcrdma:xprtrdma_wc_receive
rpcrdma:xprtrdma_wc_fastreg
rpcrdma:xprtrdma_wc_li
rpcrdma:xprtrdma_wc_li_wake
rpcrdma:xprtrdma_localinv
rpcrdma:xprtrdma_dma_unmap
rpcrdma:xprtrdma_remoteinv
rpcrdma:xprtrdma_recover_mr
rpcrdma:xprtrdma_reply
rpcrdma:xprtrdma_defer_cmp
rpcrdma:xprtrdma_reply_vers
rpcrdma:xprtrdma_reply_rqst
rpcrdma:xprtrdma_reply_short
rpcrdma:xprtrdma_reply_hdr
rpcrdma:xprtrdma_fixup
rpcrdma:xprtrdma_fixup_pg
rpcrdma:xprtrdma_decode_seg
rpcrdma:xprtrdma_allocate
rpcrdma:xprtrdma_rpc_done
rpcrdma:xprtrdma_noreps
rpcrdma:xprtrdma_cb_setup
rpcrdma:xprtrdma_cb_call
rpcrdma:xprtrdma_cb_reply
Comment 30 Chuck Lever 2019-08-12 21:35:09 UTC
Right, those results mean that there are trace points only in the client-side transport code, and none in the server-side code.
Comment 31 Joakim Ziegler 2019-08-12 23:06:53 UTC
By the way, I'm not sure if this is related or not, but when running this with NFS v3, the server log gets quite a few of these when under load:

[Mon Aug 12 17:50:25 2019] ib0: packet len 4082 (> 2048) too long to send, dropping
[Mon Aug 12 17:50:25 2019] ib0: packet len 3602 (> 2048) too long to send, dropping
[Mon Aug 12 17:59:27 2019] ib0: packet len 9004 (> 2048) too long to send, dropping
[Mon Aug 12 17:59:27 2019] ib0: packet len 9004 (> 2048) too long to send, dropping
[Mon Aug 12 17:59:27 2019] ib0: packet len 9004 (> 2048) too long to send, dropping
[Mon Aug 12 17:59:27 2019] ib0: packet len 9004 (> 2048) too long to send, dropping
[Mon Aug 12 17:59:27 2019] ib0: packet len 9004 (> 2048) too long to send, dropping
[Mon Aug 12 17:59:27 2019] ib0: packet len 9004 (> 2048) too long to send, dropping
[Mon Aug 12 17:59:27 2019] ib0: packet len 9004 (> 2048) too long to send, dropping
[Mon Aug 12 18:00:25 2019] ib0: packet len 4082 (> 2048) too long to send, dropping
[Mon Aug 12 18:00:25 2019] ib0: packet len 4082 (> 2048) too long to send, dropping
[Mon Aug 12 18:00:25 2019] ib0: packet len 4082 (> 2048) too long to send, dropping
[Mon Aug 12 18:00:25 2019] ib0: packet len 4082 (> 2048) too long to send, dropping
[Mon Aug 12 18:00:25 2019] ib0: packet len 4082 (> 2048) too long to send, dropping
[Mon Aug 12 18:00:25 2019] ib0: packet len 4082 (> 2048) too long to send, dropping
[Mon Aug 12 18:00:25 2019] ib0: packet len 4082 (> 2048) too long to send, dropping
[Mon Aug 12 18:00:25 2019] ib0: packet len 4082 (> 2048) too long to send, dropping
[Mon Aug 12 18:00:25 2019] ib0: packet len 4082 (> 2048) too long to send, dropping
[Mon Aug 12 18:00:25 2019] ib0: packet len 4082 (> 2048) too long to send, dropping
[Mon Aug 12 18:00:25 2019] ib0: packet len 4082 (> 2048) too long to send, dropping
[Mon Aug 12 18:00:25 2019] ib0: packet len 4082 (> 2048) too long to send, dropping
[Mon Aug 12 18:00:25 2019] ib0: packet len 4082 (> 2048) too long to send, dropping
[Mon Aug 12 18:00:25 2019] ib0: packet len 4082 (> 2048) too long to send, dropping
[Mon Aug 12 18:00:25 2019] ib0: packet len 4082 (> 2048) too long to send, dropping
[Mon Aug 12 18:00:25 2019] ib0: packet len 3602 (> 2048) too long to send, dropping
Comment 32 Chuck Lever 2019-08-13 16:18:42 UTC
I've confirmed that RHEL7.6-based kernels do not have the svcrdma trace points, but 7.7-based kernels should have some or most of them. Joakim, is there a CentOS 7.7 available for you to try?
Comment 33 Joakim Ziegler 2019-08-13 18:12:26 UTC
RHEL 7.7 just came out less than a week ago, and CentOS 7.7 is not out yet. Normally CentOS takes between a month and a month and a half to get an equivalent release out after a minor RHEL release, so I expect it'll take a bit longer.

I can wait, or if there's some other way to get a newer kernel with the required tracepoints, I'm open to suggestions.
Comment 34 Chuck Lever 2019-08-13 18:53:43 UTC
(In reply to Joakim Ziegler from comment #33)
> RHEL 7.7 just came out less than a week ago, and CentOS 7.7 is not out yet.
> Normally CentOS takes between a month and a month and a half to get an
> equivalent release out after a minor RHEL release, so I expect it'll take a
> bit longer.

That's what I thought.

> I can wait, or if there's some other way to get a newer kernel with the
> required tracepoints, I'm open to suggestions.

rpcdebug works on the server side too. Maybe try "sudo rpcdebug -m rpc -s all"?
Comment 35 Joakim Ziegler 2019-08-13 18:54:47 UTC
Will test that this weekend, maybe something useful will come out of it. Would you like me to run it both on the server and client side at the same time, or is server side sufficient?
Comment 36 Chuck Lever 2019-08-13 19:02:32 UTC
(In reply to Joakim Ziegler from comment #35)
> Will test that this weekend, maybe something useful will come out of it.
> Would you like me to run it both on the server and client side at the same
> time, or is server side sufficient?

Server-side only should be sufficient, and please add "sudo rpcdebug -m nfsd -s all". Thanks for testing!
Comment 37 Joakim Ziegler 2019-08-13 20:52:21 UTC
Thanks, just to clarify, should I run those two commands at the same time in two different shells, and each will generate its own output file?
Comment 38 Chuck Lever 2019-08-13 21:50:39 UTC
(In reply to Joakim Ziegler from comment #37)
> Thanks, just to clarify, should I run those two commands at the same time in
> two different shells, and each will generate its own output file?

Before starting the reproducer, run both of those commands on the server. All debugging output will then appear in /var/log/messages. When the reproducer is done, use "sudo rpcdebug -m rpc -c" and "sudo rpcdebug -m nfsd -c" to disable the messages and return the server to normal operation. Copy the debugging messages into another file, and attach it to this bug report.
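
As one server-side session, that sequence looks roughly like the sketch below (the log path is distro-dependent; journalctl -k works too on systemd systems):

  # Enable RPC and NFSD debugging, run the reproducer from the client, then
  # turn the debugging back off and save the messages for the bug report.
  rpcdebug -m rpc -s all
  rpcdebug -m nfsd -s all
  # ... reproduce the failure from the client here ...
  rpcdebug -m rpc -c
  rpcdebug -m nfsd -c
  tail -n 5000 /var/log/messages > /tmp/nfsd-rpcdebug.log   # attach this file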
Comment 39 Joakim Ziegler 2019-08-13 22:17:14 UTC
Understood. Will run this weekend, thanks.
Comment 40 Mikko Tanner 2019-08-17 11:34:00 UTC
Created attachment 284473 [details]
trace-cmd record -e sunrpc -e rpcrdma

Hello Sirs. Sorry for the delay; it took longer than I expected to get the debugging done. The following sequence of commands was used to gather the RPC trace on the server side:

[server]
# trace-cmd record -e sunrpc -e rpcrdma
^CCPU0 data recorded at offset=0x631000
    1241088 bytes in size
CPU1 data recorded at offset=0x760000
    106496 bytes in size
CPU2 data recorded at offset=0x77a000
    118784 bytes in size
CPU3 data recorded at offset=0x797000
    36864 bytes in size
CPU4 data recorded at offset=0x7a0000
    73728 bytes in size
CPU5 data recorded at offset=0x7b2000
    12288 bytes in size
CPU6 data recorded at offset=0x7b5000
    20480 bytes in size
CPU7 data recorded at offset=0x7ba000
    69632 bytes in size
CPU8 data recorded at offset=0x7cb000
    0 bytes in size
CPU9 data recorded at offset=0x7cb000
    4096 bytes in size
CPU10 data recorded at offset=0x7cc000
    884736 bytes in size
CPU11 data recorded at offset=0x8a4000
    4096 bytes in size
CPU12 data recorded at offset=0x8a5000
    172032 bytes in size
CPU13 data recorded at offset=0x8cf000
    12288 bytes in size
CPU14 data recorded at offset=0x8d2000
    28672 bytes in size
CPU15 data recorded at offset=0x8d9000
    4096 bytes in size

[client]
# mount /files/temp/
# file /files/temp/test.file
^C
# umount -fl /files/temp

Trying to read a ~700kB test file results, as expected, in errors in the client and server syslogs.

[server] svcrdma: Send: invalid request error (9/0x8a)
[client] rpcrdma: Recv: local length error (1/0xd7)

The server kernel has the following trace points:

# trace-cmd list | grep svcrdma
rpcrdma:svcrdma_xprt_accept
rpcrdma:svcrdma_xprt_fail
rpcrdma:svcrdma_xprt_free
rpcrdma:svcrdma_decode_rqst
rpcrdma:svcrdma_decode_short
rpcrdma:svcrdma_decode_badvers
rpcrdma:svcrdma_decode_drop
rpcrdma:svcrdma_decode_badproc
rpcrdma:svcrdma_decode_parse
rpcrdma:svcrdma_encode_rseg
rpcrdma:svcrdma_encode_wseg
rpcrdma:svcrdma_encode_pzr
rpcrdma:svcrdma_encode_write
rpcrdma:svcrdma_encode_reply
rpcrdma:svcrdma_encode_read
rpcrdma:svcrdma_err_vers
rpcrdma:svcrdma_err_chunk
rpcrdma:svcrdma_dma_map_page
rpcrdma:svcrdma_dma_map_rwctx
rpcrdma:svcrdma_send_failed
rpcrdma:svcrdma_post_send
rpcrdma:svcrdma_wc_send
rpcrdma:svcrdma_post_recv
rpcrdma:svcrdma_wc_receive
rpcrdma:svcrdma_post_rw
rpcrdma:svcrdma_wc_read
rpcrdma:svcrdma_wc_write
rpcrdma:svcrdma_cm_event
rpcrdma:svcrdma_qp_error
rpcrdma:svcrdma_sq_full
rpcrdma:svcrdma_sq_retry
Comment 41 Mikko Tanner 2019-08-17 11:38:57 UTC
Created attachment 284475 [details]
rpc debug on client side (different run)

Additionally, here is a client-side debug log from a previous test run gathered with the following sequence of commands:

# rpcdebug -m rpc -s call trans
rpc        call trans
# mount /files/temp/
# file /files/temp/test.file
^C
# umount -fl /files/temp
# rpcdebug -m rpc -c
rpc       <no flags set>
Comment 42 Chuck Lever 2019-08-19 23:25:26 UTC
Reviewing the trace.dat in attachment 284473 [details]:

            nfsd-10679 [012] 1629872.049831: svc_xprt_dequeue:     xprt=0xffff8b0d0415b800 addr=10.10.10.120:35006 flags=XPT_BUSY|XPT_DATA|XPT_TEMP|XPT_CONG_CTRL wakeup-us=52
            nfsd-10679 [012] 1629872.049834: svcrdma_decode_rqst:  xid=0xfe7765d9 vers=1 credits=128 proc=RDMA_MSG hdrlen=116
            nfsd-10679 [012] 1629872.049847: svc_xprt_do_enqueue:  xprt=0xffff8b0d0415b800 addr=10.10.10.120:35006 pid=10678 flags=XPT_BUSY|XPT_DATA|XPT_TEMP|XPT_CONG_CTRL
            nfsd-10679 [012] 1629872.049848: svc_handle_xprt:      xprt=0xffff8b0d0415b800 addr=10.10.10.120:35006 len=180 flags=XPT_BUSY|XPT_DATA|XPT_TEMP|XPT_CONG_CTRL
            nfsd-10679 [012] 1629872.049848: svc_recv:             addr=10.10.10.120:35006 xid=0xfe7765d9 len=180 flags=RQ_SECURE|RQ_USEDEFERRAL|RQ_SPLICE_OK|RQ_BUSY|0x80
            nfsd-10679 [012] 1629872.049854: svc_process:          addr=10.10.10.120:35006 xid=0xfe7765d9 service=nfsd vers=4 proc=1

            nfsd-10679 [012] 1629872.050054: svcrdma_encode_wseg:  122880@0xfe7765d94b848000:0x98010928
            nfsd-10679 [012] 1629872.050056: svcrdma_encode_wseg:  122880@0xfe7765d9ee5c1000:0x98010828
            nfsd-10679 [012] 1629872.050058: svcrdma_encode_wseg:  122880@0xfe7765d9508c3000:0x98010728
            nfsd-10679 [012] 1629872.050059: svcrdma_encode_wseg:  122880@0xfe7765d9effb7000:0x98010428
            nfsd-10679 [012] 1629872.050061: svcrdma_encode_wseg:  28776@0xfe7765d92be1e000:0x98010328
            nfsd-10679 [012] 1629872.050064: svcrdma_post_rw:      cqe=0xffff8b14fc9bc630 sqecount=13 status=0
            nfsd-10679 [012] 1629872.050065: svcrdma_encode_write: length=520296
            nfsd-10679 [012] 1629872.050068: svcrdma_post_send:    cqe=0xffff8b14c4c51038 num_sge=2 inv_rkey=0x00000000 status=0
            nfsd-10679 [012] 1629872.050069: svc_send:             addr=10.10.10.120:35006 xid=0xfe7765d9 status=0 flags=RQ_SECURE|RQ_USEDEFERRAL|RQ_SPLICE_OK|RQ_BUSY|0x80
    kworker/0:1H-3998  [000] 1629872.050187: svcrdma_wc_send:      cqe=0xffff8b14cb5ce038 status=REM_INV_REQ_ERR (9/0x8a)

IB_WC_REM_INV_REQ_ERR means that the server received a NAK from the client because the server sent it an invalid request. Not sure if we can rely on the Send completion to tell us the invalid request was indeed a Send WR. RDMA Writes are not ACK'd.

The failing NFS request looks like an NFSv4 READ operation of just under 512KB. Reducing the rsize to 262144 might work around this bug.

In the meantime, the trace points are not telling us some important details about what the server is doing wrong. I will see if there's a way to get more information (via a kernel patch, as I'm not yet skilled with eBPF and kprobes).
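
For anyone testing that workaround, the remount would look roughly like this (server address and paths are placeholders; comment 49 below reports it did not help):

  # Remount the share with a smaller rsize/wsize to test the suggested
  # workaround. Server address and paths are placeholders.
  umount /mnt
  mount -t nfs4 -o vers=4.1,proto=rdma,port=20049,rsize=262144,wsize=262144 \
        10.10.10.6:/tank/test /mnt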
Comment 43 Joakim Ziegler 2019-08-19 23:26:34 UTC
Just checking in, I didn't have time to do the rpcdebug stuff this weekend, would it still be useful for me to do it in light of Mikko's attachments?
Comment 44 Chuck Lever 2019-08-20 00:29:11 UTC
Created attachment 284523 [details]
patch that adds trace points around Send/DMA map/unmap

Please apply the attached patch to the server and try the test again.
Comment 45 Chuck Lever 2019-08-20 00:32:34 UTC
(In reply to Joakim Ziegler from comment #43)
> Just checking in, I didn't have time to do the rpcdebug stuff this weekend,
> would it still be useful for me to do it in light of Mikko's attachments?

Trace data is better than rpcdebug (even though I need Mikko to do it again :-)
Comment 46 Chuck Lever 2019-08-20 00:41:03 UTC
(In reply to Chuck Lever from comment #44)
> Created attachment 284523 [details]
> patch that adds trace points around Send/DMA map/unmap
> 
> Please apply the attached patch to the server and try the test again.

Actually, let's add NFSD tracing too. So after applying the patch, please use:

# trace-cmd record -e nfsd -e sunrpc -e rpcrdma

Thanks.
Comment 47 Joakim Ziegler 2019-08-20 02:31:59 UTC
It's been a while since I've built my own kernel (those were the days), but this recipe for CentOS seems to be up to date and not too daunting: https://wiki.centos.org/HowTos/Custom_Kernel

It'll probably be until next weekend before I can do this, but I'll try to get it done ASAP.
Comment 48 Mikko Tanner 2019-08-20 02:56:31 UTC
Unfortunately, it will take me a few days to find enough time to rebuild the kernel and reboot the server.
Comment 49 Mikko Tanner 2019-08-20 08:55:42 UTC
Setting rsize=262144,wsize=262144 as mount options does not work as a workaround.
Comment 50 Joakim Ziegler 2020-01-02 23:32:54 UTC
Hey guys, after something of a delay, I'm coming back to wanting to look at this again. My server is now running CentOS 7.7 (Kernel 3.10.0-1062.9.1.el7.x86_64). Would I still need to build a custom kernel to debug this? Would the above patch still apply?
Comment 51 Christian Kuntz 2020-01-08 01:13:33 UTC
Hello! 
I'm also running into this problem on Buster (Debian 10, Kernel ver. 4.19.67-2+deb10u2).
I'm fairly well equipped to roll custom kernels and all that jazz, so let me know what I can do to help debug this and I'll get it done.
Comment 52 Chuck Lever 2020-01-08 14:27:19 UTC
(In reply to Christian Kuntz from comment #51)
> Hello! 
> I'm also running into this problem on Buster (Debian 10, Kernel ver.
> 4.19.67-2+deb10u2)
> I'm fairly well equipped to roll custom kernels and all that jazz, so let me
> know what I can do to help debug this and I'll get it done.

I recently introduced additional tracepoints to the NFS server to help track down this issue.

1. Grab the latest upstream kernel and install it on your server.

2. Enable the relevant tracepoints with this command:

# trace-cmd record -e nfsd -e sunrpc -e rpcrdma

3. Reproduce the problem. I assume that the problem happens quickly so that the amount of generated trace data will be manageable.

4. Interrupt the trace-cmd command.

5. Attach the trace.dat file to this bug.
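
Combined into a single unattended run on the server (a sketch; the 300-second window is arbitrary):

  # Record nfsd, sunrpc, and rpcrdma events while the client runs the
  # reproducer; SIGINT after 300 seconds stops the recording like ^C.
  timeout --signal=INT 300 trace-cmd record -e nfsd -e sunrpc -e rpcrdma

  # In the report, look for svcrdma_wc_send entries with a non-success status
  # (for example REM_INV_REQ_ERR, as in comment 42 above).
  trace-cmd report | grep svcrdma_wc_send | tail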
Comment 53 Christian Kuntz 2020-01-08 21:47:51 UTC
Okay, giving it a shot with 5.4.8, let me know if I need a later version.
Comment 54 Christian Kuntz 2020-01-10 22:17:17 UTC
Created attachment 286745 [details]
attachment-18182-0.html

Here's the trace from the following command:

  fio --rw=randread --bs=64k --numjobs=1 --runtime=30 --time_based \
      --ioengine=libaio --direct=1 --name test1 \
      --filename=/tank/od-speedtest.4.0 --size=100000

Ran all the way up to 100%, then plopped down to 40% and stayed there.

trace.dat: https://drive.google.com/a/opendrives.com/file/d/1KDbNQPdIgXv8AgJG2ir72ANL-S2BLb9j/view?usp=drive_web

Comment 55 Chuck Lever 2020-01-11 03:51:06 UTC
Thanks Christian, looking at this now.
Comment 56 Chuck Lever 2020-01-12 02:51:05 UTC
I was able to force the problem to occur by wiring my server to use the vectored read path instead of splice_read. Apparently ZFS does not have a fops->splice_read method.

The result is that svc_send is called with a screwed up reply XDR buf:

            nfsd-1808  [001]   401.548935: svcrdma_reply_msg:    hdrlen=52 head=[0xffff888fe5dde000,4096] page_len=61544 tail=[(nil),0] len=65640 buflen=1049480

Only 61544 bytes out of 65536 of the NFS READ payload are in the XDR buf's page vector, and the head+tail length is 4096 bytes: exactly the inline threshold, but dodgy nonetheless.

Here's what it looks like with splice reads:

            nfsd-1805  [003]  5413.155536: svcrdma_reply_msg:    hdrlen=52 head=[0xffff888efa7ed000,104] page_len=65536 tail=[0xffff888efa7ed068,0] len=65640 buflen=69632

The NFS READ payload is entirely in the XDR buf's page vector in this case, and the head+tail length is just 104 bytes. That's exactly how the reply should look.

Since I can now reproduce this at will in my lab, I should be able to chase this down in a few days.
Comment 57 Chuck Lever 2020-01-14 16:43:39 UTC
It appears that

commit b042098063849794d69b5322fcc6cf9fb5f2586e
Author:     J. Bruce Fields <bfields@redhat.com>
AuthorDate: Tue Mar 18 17:44:10 2014 -0400
Commit:     J. Bruce Fields <bfields@redhat.com>
CommitDate: Fri May 30 17:32:12 2014 -0400

    nfsd4: allow exotic read compounds

is the offending commit. I'm working on a fix.
Comment 58 Chuck Lever 2020-01-18 17:57:49 UTC
Progress update. There are two issues:

- nfsd4_encode_readv() is not placing the payload of the first READ operation in the xdr_buf's page vector. It assumes the transport is always TCP, and simply uses up the remaining amount of head[0]. I've got a working fix for this, but it gets NFSv4/RDMA splice-less READ working only for payload lengths that are a multiple of 4 bytes.

- nfsd4_encode_readv() always adds XDR padding to the READ payload, again assuming the transport is always going to be TCP. RPC/RDMA forbids the addition of an XDR pad in some common cases, and there isn't an easy way for the transport to discover the pad and strip it off. I'm experimenting with a small set of patches that should help address this.

When I've got a full set of fixes, I will attach them here and post them on linux-nfs@vger.kernel.org .


There's a secondary problem. b04209806384 attempts to support multiple NFSv4 READ operations per COMPOUND, but I'm not sure it's entirely successful. It certainly won't work for NFS/RDMA as it is currently implemented, though there are very few instances where a client will generate such a COMPOUND.

I'm not going to try to address this problem with my fixes, but I will try not to make things worse.
Comment 59 Chuck Lever 2020-02-02 17:00:59 UTC
Created attachment 287079 [details]
Proposed fix for bug 198053

Please try this out and let me know whether this patch (against v5.5) properly addresses the issue, or more work is necessary.
Comment 60 Mikko Tanner 2020-02-02 17:10:26 UTC
(In reply to Chuck Lever from comment #59)
> Please try this out and let me know whether this patch (against v5.5)
> properly addresses the issue, or more work is necessary.

Is this for server, client or both?
Comment 61 Chuck Lever 2020-02-02 20:37:56 UTC
(In reply to Mikko Tanner from comment #60)
> (In reply to Chuck Lever from comment #59)
> > Please try this out and let me know whether this patch (against v5.5)
> > properly addresses the issue, or more work is necessary.
> 
> Is this for server, client or both?

The fix patches fs/nfsd/nfs4xdr.c, which is the server-side XDR routines. The client implementation does not require a code change.
Comment 62 Ivaylo Markov 2020-04-29 17:11:21 UTC
Created attachment 288821 [details]
Patched kernel 5.5 traces

Hello,

I am having the same issue on reads. My setup is Mellanox ConnectX-3 in Ethernet mode,
a Cisco Nexus 3K switch, and Ubuntu 20.04 LTS with the 5.4.0-28-generic kernel from
the official repositories. The file system I am exporting is ext4 on top of mdadm RAID6.

I applied the patch from comment #59 to mainline tag v5.5.
When trying to do a write test from the client[0], I get a read speed of zero, and
"server X.X.X.X not responding, still trying" messages from the client kernel every 2-3 seconds.
Running a read test[1] also gets me zero transfer speed, but there are no kernel messages.

See the attached ZIP for traces.

[0] fio --name=write --ioengine=libaio --direct=1 --sync=0 --time_based --rw=write --bs=1M --iodepth=64 --runtime=1m --filename=/test-storage/sp03.fio --size=1G
[1] fio --name=read  --ioengine=libaio --direct=1 --sync=0 --time_based --rw=read  --bs=1M --iodepth=64 --runtime=1m --filename=/test-storage/sp03.fio --size=1G
Comment 63 Chuck Lever 2020-04-29 19:23:55 UTC
This does not appear to be the same problem.

trace-read.dat shows the transport's Send Queue getting backed up, then the connection is lost. The subsequent WC flushes do not show any particular errors (like LOCAL_LEN_ERR). Therefore, I don't believe that the server's RPC RDMA transport is malfunctioning.

ConnectX-3 in Ethernet mode, I assume, is going to use RoCE for RDMA operations. Are you certain your cards and switch are configured correctly to use Priority Flow Control?
Comment 64 Ivaylo Markov 2020-04-30 09:12:15 UTC
It seems like that is indeed the problem. Thanks and apologies for wasting your time!
Comment 65 Chuck Lever 2020-04-30 14:13:46 UTC
Commit 412055398b9e ("nfsd: Fix NFSv4 READ on RDMA when using readv") appears in the upstream v5.6 kernel.
Comment 66 Joakim Ziegler 2020-05-08 17:35:39 UTC
Chuck, thank you very much for your effort on this.

Slightly off-topic, is there any way to know/track when this will make its way into distro kernels? CentOS and possibly Ubuntu are most relevant for me.
Comment 67 Chuck Lever 2020-05-08 17:56:27 UTC
(In reply to Joakim Ziegler from comment #66)
> Slightly off-topic, is there any way to know/track when this will make its
> way into distro kernels? CentOS and possibly Ubuntu are most relevant for me.

Hi Joakim, the way I am most familiar with is to open a bug report with each distribution that requests the inclusion of the specific upstream commit ID with the fix. They usually respond with "we'll get on it", "already done", or "no way".
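
A quick way to check whether a given kernel already carries the fix (commit 412055398b9e from comment 65): in an upstream git tree, list the tags that contain it; on RPM-based distros the package changelog usually mentions the backport, though the exact wording may differ.

  # In a Linux kernel git checkout:
  git tag --contains 412055398b9e | head

  # On a RHEL/CentOS system (the changelog wording may differ):
  rpm -q --changelog kernel | grep -i "NFSv4 READ on RDMA"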
Comment 68 Joakim Ziegler 2020-05-08 18:11:27 UTC
Thanks for the tip, I'll see if I can do that.