Bug 209439 - knfsd + i40e performance regression since v5.7+
Summary: knfsd + i40e performance regression since v5.7+
Status: RESOLVED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: NFS
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: Chuck Lever
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-10-01 08:00 UTC by Daire Byrne
Modified: 2022-02-21 16:00 UTC
CC List: 2 users

See Also:
Kernel Version: 5.7
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
SUNRPC: Use zero-copy to perform socket send operations (2.49 KB, patch)
2020-11-02 18:57 UTC, Chuck Lever
Details | Diff
SUNRPC: Handle TCP socket sends with kernel_sendpage() again (4.32 KB, patch)
2020-12-11 22:29 UTC, Chuck Lever
Details | Diff

Description Daire Byrne 2020-10-01 08:00:09 UTC
Hi,

We have an NFS server equipped with 2 x i40e network cards that is capable of serving files from memory at close to 70Gbit to many hundreds of 1gbit clients simultaneously.

Since upgrading to v5.7+ kernels, there seems to be a performance regression such that we can no longer achieve much more than 30Gbit for the same workload.

So v5.6.19 performs well, but v5.7.1 and every kernel up to v5.8.10 do not.

Looking at the perf report, it seems we are spending a lot of time in memcpy_from_page from v5.7 onwards, whereas on 5.6 and below this call path does not show up as significant.

 - 65.89% ret_from_fork
    - 65.89% kthread
       - 62.86% nfsd
          - 59.96% svc_process
             - 54.97% svc_send
                - 53.79% svc_tcp_sendto
                   - 53.75% xprt_sock_sendmsg
                      - 53.65% sock_sendmsg
                         - 53.64% inet_sendmsg
                            - 53.64% tcp_sendmsg
                               - 52.97% tcp_sendmsg_locked
                                  - 43.42% memcpy_from_page

My suspicion is that it is an interaction with the i40e driver rather than nfsd that is causing this, but I'm no expert. It's just that knfsd was a convenient (and real world) way for us to test and reproduce this.

It could also be that there are i40e module option changes required between kernel versions to maintain performance and I missed the memo....

Regards,

Daire
Comment 1 Daire Byrne 2020-10-26 09:33:43 UTC
I have done a bunch of iperf tests using 10G clients and I cannot reproduce the same slowness (less than 70Gbit) on newer kernels. So maybe this is a specific interaction with knfsd after all?

Again, the summary is that all kernels up to 5.6 perform well as an NFSv3 server, delivering ~70Gbit with a dual-port i40e card. But from v5.7 onwards the performance more than halves, and the perf top trace shows lots of time in memcpy_from_page.

I will need someone from the knfsd team to verify...

Daire
Comment 2 Daire Byrne 2020-10-26 21:15:11 UTC
I bisected this regression to:

da1661b93bf489cdbc8bcea919b165d31b4810bf is the first bad commit
commit da1661b93bf489cdbc8bcea919b165d31b4810bf
Author: Chuck Lever <chuck.lever@oracle.com>
Date:   Mon Mar 2 15:20:33 2020 -0500

    SUNRPC: Teach server to use xprt_sock_sendmsg for socket sends
    
    xprt_sock_sendmsg uses the more efficient iov_iter-enabled kernel
    socket API, and is a pre-requisite for server send-side support for
    TLS.
    
    Note that svc_process no longer needs to reserve a word for the
    stream record marker, since the TCP transport now provides the
    record marker automatically in a separate buffer.
    
    The dprintk() in svc_send_common is also removed. It didn't seem
    crucial for field troubleshooting. If more is needed there, a trace
    point could be added in xprt_sock_sendmsg().
    
    Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

:040000 040000 8e02fc93a3f72b2e76fde45887029d6d095b3eb6 a81d612dda76b2bd737387828f77501477eec231 M      net

This seems to be really hurting performance on my servers when using >40gbit networking.

Is anyone able to reassign this ticket to NFS triage?
Comment 3 Daire Byrne 2020-10-27 08:59:38 UTC
Trond,

I couldn't get this bugzilla reassigned so added you to the CC to get it on your radar.
Comment 4 Trond Myklebust 2020-10-27 13:48:31 UTC
Chuck, I saw you added yourself to the Cc list and assumed you were trying to take the bug, so I've assigned it to you.
Comment 5 Chuck Lever 2020-10-27 17:13:42 UTC
Commit da1661b93bf4 replaced kernel_sendpage() calls with the use of bvecs and iov_iter. The initial diagnosis is that the latter path performs a full-page memcpy from the RPC send buffer.

Still investigating.
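
For reference, here is a minimal kernel-style sketch of the difference between the two send paths (my illustration only, not the actual SUNRPC code; the helper names are made up):

#include <linux/net.h>
#include <linux/socket.h>
#include <linux/uio.h>
#include <linux/bvec.h>

/* Old path: hand each page to TCP by reference; no payload copy. */
static int send_page_by_reference(struct socket *sock, struct page *page,
                                  int offset, size_t len)
{
        return kernel_sendpage(sock, page, offset, len, MSG_MORE);
}

/* New path: describe the pages with a bio_vec array and call
 * sock_sendmsg(). tcp_sendmsg_locked() then copies the payload into
 * its own skb pages, which is the memcpy_from_page hot spot in the
 * reported call graph.
 */
static int send_pages_by_iterator(struct socket *sock, struct bio_vec *bvec,
                                  unsigned int nr_segs, size_t len)
{
        struct msghdr msg = { .msg_flags = MSG_MORE };

        iov_iter_bvec(&msg.msg_iter, WRITE, bvec, nr_segs, len);
        return sock_sendmsg(sock, &msg);
}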
Comment 6 Chuck Lever 2020-10-30 14:53:04 UTC
I've found a way to force tcp_sendmsg_locked() to use its zero-copy mechanism for xdr_buf->bvec. I'm testing the change now to ensure there are no functional regressions on either the client or server -- the client and server code both use the xprt_sock_sendmsg() path. Once that is done I will attach the patch here to confirm that it addresses the performance regression.
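
For reference, tcp_sendmsg_locked() only takes its zero-copy branch when the caller passes MSG_ZEROCOPY and the socket has SOCK_ZEROCOPY set. A rough sketch of that combination follows (illustrative only; the attached patch may differ in detail):

#include <linux/net.h>
#include <linux/socket.h>
#include <linux/uio.h>
#include <net/sock.h>

/* Illustrative helper: request the zero-copy path for a bvec send. */
static int send_bvec_zerocopy(struct socket *sock, struct bio_vec *bvec,
                              unsigned int nr_segs, size_t len)
{
        struct msghdr msg = { .msg_flags = MSG_ZEROCOPY | MSG_MORE };

        /* MSG_ZEROCOPY is honoured only if the socket opts in. */
        sock_set_flag(sock->sk, SOCK_ZEROCOPY);

        iov_iter_bvec(&msg.msg_iter, WRITE, bvec, nr_segs, len);

        /* The caller must keep the pages stable until the send completes. */
        return sock_sendmsg(sock, &msg);
}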
Comment 7 Chuck Lever 2020-11-02 18:57:55 UTC
Created attachment 293397 [details]
SUNRPC: Use zero-copy to perform socket send operations

I have a test suite that builds the git tool and then runs its regression suite using twelve threads. With this patch, the build and regression suite passes with sys, krb5, krb5i, and krb5p on TCP on NFSv2, NFSv3, NFSv4.0, NFSv4.1, and NFSv4.2 mount points.

The next step is for the reporter to see if this fix alleviates the observed performance regression.
Comment 8 Chuck Lever 2020-11-02 18:58:40 UTC
(And btw, the patch was applied on both client and server under test).
Comment 9 Daire Byrne 2020-11-02 20:44:44 UTC
Chuck,

I patched v5.9.1 and ran that on the server. All clients were 1G RHEL 7.6 based, and with ~100 clients I see much better performance. The perf report has also changed: it is no longer spending all that time in memcpy_from_page; instead it is mostly memcpy_erms, which I think we would expect.

However, the performance is still not quite at the expected 70gbit level (with v5.6) and has only improved from ~30gbit -> 50gbit.

I will run a few more tests and compare with v5.7 again to make sure something else hasn't changed in my benchmark setup and test.
Comment 10 Daire Byrne 2020-11-02 23:02:17 UTC
I ran a few more tests and compared to v5.6. The benchmark is still reproducible and I get ~70gbit with v5.6, ~30gbit with stock v5.9.1 and ~50gbit with your patched version.

It looks like with the patch we are now spending lots of time in ksoftirqd. This perf call graph looks to be a new consequence of the patch:

- 87.02%    87.02%    86.64%     0.38%  [other]
   - 47.19% ret_from_fork
      - 47.19% kthread
         - 37.49% smpboot_thread_fn
            - 36.06% run_ksoftirqd
               - 34.64% __softirqentry_text_start
                  - 33.81% net_rx_action
                     - 33.74% i40e_napi_poll
                        - 21.28% i40e_clean_rx_irq
                           - 19.61% napi_gro_receive
                              - 19.42% gro_normal_one
                                 - 19.40% gro_normal_list.part.0
                                    - 19.38% netif_receive_skb_list_internal
                                       - 19.32% __netif_receive_skb_list_core
                                       ...
                                       ...
                                           - 12.04% memcpy_erms

So it seems the patch has increased the interrupt load (compared to v5.6) and that is what is now limiting the throughput. I can't say if that is particular to the i40e driver or not.
Comment 11 Chuck Lever 2020-11-03 14:09:54 UTC
I'm not immediately coming up with a specific scenario that might result in a higher interrupt rate. Note that ksoftirqd can be used for tasklets and other CPU-only activity. The presence of i40e_napi_poll in your call graph suggests that actual interrupts are being properly mitigated.

Meanwhile, my next step for the patch was to post it to netdev for comments. It's certainly possible that the fix is not quite appropriate or is missing other necessary changes. Shall I proceed with that, or would you like to take your report to netdev, since you have a reproducer and test environment?
Comment 12 Daire Byrne 2020-11-04 20:14:34 UTC
I spent a little time running iperf and NFS tests and all I can really say at this point is that the v5.6.19 kernel can sustain around 70gbit (bonded dual 40gbit) with around 8% softirq CPU usage but v5.9.1 with your patch settles down to around 50gbit with up to 35% softirq CPU usage. It does actually start at 70gbit for the first few seconds but then the softirq rises and the performance drops. 

This could well be due to some other unrelated change between v5.6 and v5.9.

I might try taking your patch and testing with v5.7 and v5.8 to see if there are any obvious differences. Perhaps I could also try removing the xprt_sock_sendmsg commit too?

As for putting it to netdev, I think you are probably better informed than I to have that discussion. But I'm more than happy to re-test any changes to the patch based on their feedback. My test setup is not going anywhere for a while.

Many thanks by the way for looking into this. We are grateful to be able to restore some performance to our 40gbit+ NFS servers on newer kernels.
Comment 13 Chuck Lever 2020-11-04 21:29:46 UTC
(In reply to Daire Byrne from comment #12)
> This could well be due to some other unrelated change between v5.6 and v5.9.

Indeed.
 
> I might try take your patch and test with v5.7 and v5.8 to see if there any
> obvious differences.

You should be able to apply it immediately after da1661b93bf4.

> Perhaps I could also try removing the xprt_sock_sendmsg commit too?

That would be a workaround for your site, and naturally other commits that depend on that one would need to be reverted as well for that approach to succeed.

We plan to support RPC over TLS eventually. The kernel TLS API, I'm told, depends on the socket consumer using the sendmsg API instead of sendpage. Therefore we need to understand why sendmsg impacts aggregate throughput, and address that issue.
Comment 14 Chuck Lever 2020-11-05 15:30:54 UTC
(In reply to Daire Byrne from comment #12)
> I spent a little time running iperf and NFS tests and all I can really say
> at this point is that the v5.6.19 kernel can sustain around 70gbit (bonded
> dual 40gbit) with around 8% softirq CPU usage but v5.9.1 with your patch
> settles down to around 50gbit with up to 35% softirq CPU usage. It does
> actually start at 70gbit for the first few seconds but then the softirq
> rises and the performance drops. 

I want to better understand your reported results... are you saying that iperf also shows the performance regression, or is that the baseline that determines how fast the network _can_ go?

If iperf shows the aberrant softirq behavior, that would be interesting. If not, that points back to something happening in the NFSD network paths.
Comment 15 Daire Byrne 2020-11-05 22:33:07 UTC
Yea sorry, I should have been clearer. The iperf results look okay and are consistent across all kernel versions with or without the suggested patch. The softirq remains low (~3%) and performance is at the expected level.

However, the iperf test is slightly different in that I am using ~20 x 10GigE clients rather than the 100 x 1GigE clients that I used for the NFS tests. This was just because it was easier to set up the iperf tests that way. I was just measuring total throughput, but there may be an interrupt difference due to the number of flows and 10G vs 1G ethernet. I'll try to set up a more comparative iperf test with the 100 clients to rule that out.

I also tried adding your zero copy patch to v5.7.6 (because I had it ready to go) and it's a similar story; the throughput improves but so does the softirq until it limits the overall throughput so that it's still lower than v5.6.

I can only think that either something else went into v5.7 that affects softirq usage when doing NFS with ~100 GigE clients, or there is another aspect to either the xprt_sock_sendmsg patch or this new patch that results in higher (than v5.6) softirq utilisation (with i40e anyway).
Comment 16 Daire Byrne 2020-11-05 22:51:15 UTC
Actually, I just very quickly ran an NFS test with the 20 x 10G clients (because it was easy) with your zero-copy patch, and that does show less server softirq (~10%) and better aggregate performance (~70gbit). 100 x 1GigE clients, by comparison, result in 50gbit and 35% softirq.

So now I just need to script something to do the 100 x 1 GigE client iperf test so we can definitively say whether this is specific to NFS or not.
Comment 17 Daire Byrne 2020-11-09 16:51:52 UTC
Sorry, I haven't really been able to use iperf to do a good 100 x 1GigE client -> server test. It seems like iperf was not really made for running so many processes and is being limited to around 30gbit.

It's similar for nuttcp; for both it and iperf, lots of time is spent in copy_user_enhanced_fast_string.

So I'm not really able to push the server hard enough to make it comparable to 100 NFS clients. The softirq remains low but that may simply be because I can't drive it hard enough.
Comment 18 Chuck Lever 2020-11-16 17:52:06 UTC
Daire, I'm not able to generate any appreciable ksoftirqd activity during testing on my server with either 1GbE or IPoIB via FDR InfiniBand. Can you characterize the workload one of your clients presents to your server? Such as:

- NFS READ-to-WRITE ratio
- Distribution of I/O sizes
- I/O-to-metadata ratio
- Frequency of small metadata modification operations (SETATTR, CREATE, etc)
Comment 19 Daire Byrne 2020-11-16 21:12:42 UTC
The test is super simple but perhaps a bit pathological.

100 clients running "dd if=/nfs/file of=/dev/null bs=1M". They are all reading the same file so it is completely cached in the nfs server's page cache (no disk IO).

The 100 clients are distributed across many different switches with 2 x 10G uplinks such that there is no contention. And so I can generate ~100gbit worth of aggregate client bandwidth.

I could try taking v5.6 and just applying the original xprt_sock_sendmsg commit plus the zero-copy patch to see if that's really all that's required to get into this high ksoftirqd situation. I'm not sure how cleanly those will apply without everything else that went into v5.7 though....

It could also be that I need to do some tuning of the i40e driver that was not required prior to the NFS changes.
Comment 20 Chuck Lever 2020-11-25 21:12:33 UTC
I set up my 100GbE rig today. Unfortunately, I have only one client, but both systems have CX-5 Ethernet cards. After applying the zero-copy patch to the NFS server, I ran cp/md5sum tests with large files over NFSv3 on TCP. I was not able to generate significant soft IRQ activity.

I've heard no additional comments from the netdev community after posting this patch for review.

Perhaps it's time to call it and apply the zero-copy patch for v5.11.
Comment 21 Daire Byrne 2020-11-25 22:20:27 UTC
Yea sorry, I haven't had time to do much more testing on this. I also couldn't reproduce it with 20 x 10G clients so it seems like it is related to high numbers (100) of 1G clients (and whatever ethernet buffering the switch has to do).

All I can say is that the maximum performance for that workload was better in 5.6 and wasn't limited by high softirq usage.

But I'm more than happy with the improvement this patch gives and maybe someone else will be able to contribute more useful information once it is more widely used (although nobody noticed this regression until now...).

It could also be that this pathological workload is not one that will come up that often (we do it all the time!).
Comment 22 Chuck Lever 2020-12-11 22:29:56 UTC
Created attachment 294103 [details]
SUNRPC: Handle TCP socket sends with kernel_sendpage() again

Without a reproducer I can run in my own lab, I've not been able to convince myself that the zero-copy approach is 100% correct. Thus I've pulled that patch out of the changeset for v5.11.

After some research, I've found that TLS sockets do indeed have a .sendpage method. Thus the in-kernel server can use kernel_sendpage() in its socket sendto path and still support TLS. I've come up with a patch to do this. I'd appreciate some test coverage if you can manage it, Daire.
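
The core of the approach is simply walking the reply's page array and handing each page to the socket by reference; a rough sketch follows (illustrative only -- the real code in the attachment also has to deal with the TCP record marker, the head/tail kvecs, and partial sends):

#include <linux/kernel.h>
#include <linux/net.h>
#include <linux/socket.h>
#include <linux/mm.h>

/* Illustrative helper (not lifted from the patch): send a run of pages
 * on a TCP socket by reference, one kernel_sendpage() call per chunk.
 */
static int send_xdr_pages(struct socket *sock, struct page **pages,
                          unsigned int base, size_t remaining)
{
        size_t len;
        int ret;

        while (remaining) {
                len = min_t(size_t, remaining, PAGE_SIZE - base);
                ret = kernel_sendpage(sock, *pages, base, len,
                                      remaining > len ? MSG_MORE : 0);
                if (ret <= 0)
                        return ret;
                remaining -= ret;
                base += ret;
                if (base == PAGE_SIZE) {        /* page fully consumed */
                        base = 0;
                        pages++;
                }
        }
        return 0;
}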
Comment 23 Chuck Lever 2020-12-12 19:15:21 UTC
Looks like this patch needs a minor fix:

-   xdr_alloc_bvec(xdr);
+   xdr_alloc_bvec(xdr, GFP_KERNEL);
Comment 24 Daire Byrne 2020-12-14 12:04:12 UTC
Okay, great. So I did the same (read from ram) tests using ~100 x 1Gig clients and a 2 x 40 gbit server running this latest patch on v5.9.14.

The performance reached around 55gbit and there was no sign of the high softirq this time (~8%). The CPU usage in general was pretty low and I couldn't really see anything out of the ordinary in the perf top output.

However, I should note for the record that running the same test again on v5.6.19 results in ~72gbit worth of read performance (from ram).

So my assumption that the previous result of ~50gbit with the zero-copy patch was being limited by the high softirq might have been wrong.

Perhaps something else in the gap from v5.6 to v5.9 is also contributing to a reduction in performance.

I tried to apply this new patch to v5.7 to narrow the version gap but it's not applying cleanly. I'll see if I can munge it in.
Comment 25 Chuck Lever 2020-12-14 15:15:29 UTC
Thank you for testing, Daire. An interesting, if disappointing, result. I am open to further input from you and your team.

It would help if you could discover a simpler performance loss reproducer that I can try here with a single client. For now, I am shooting in the dark!
Comment 26 Daire Byrne 2020-12-14 22:59:02 UTC
Apologies, but please ignore my last results. It looks like I typo'd the list of 100 client hosts to use, which meant that I was being limited by edge switch uplinks rather than the server.

I am confident that all my other results were correct, it's just this last set of benchmarks with the new patch that I messed up.

So, the good news is that this new patch (applied to v5.9.14) does indeed resolve the performance issues and I am now seeing ~70gbit reads with the same low 8% softirq. In other words, this is now practically the same as the v5.6 kernel (72gbit). Any minor differences now will probably average out with enough benchmark runs.

Sorry for the false report but the good news is that this patch is a winner! :)

Many thanks for sticking with it.

Oh, and in terms of a reproducer, all I can suggest is many clients and a very fast server. I could not reproduce this with 20 x 10gbit clients or, conversely, a 20gbit server. I only noticed it once we moved to a 2 x 40gbit server to increase the available bandwidth and used many (50+) clients. I just don't have more than 20 10gbit clients, otherwise that might also do it.
Comment 27 Chuck Lever 2022-02-21 16:00:04 UTC
Commit 4a85a6a3320b ("SUNRPC: Handle TCP socket sends with kernel_sendpage() again") addresses this issue, and was merged in v5.11.
