Bug 190951 - SoftRoCE throughput is too low
Summary: SoftRoCE throughput is too low
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Infiniband/RDMA
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: drivers_infiniband-rdma
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-12-23 03:59 UTC by Weijia Song
Modified: 2017-01-11 04:21 UTC
CC List: 1 user

See Also:
Kernel Version: 4.9
Subsystem:
Regression: No
Bisected commit-id:


Attachments
SoftRoCE Performance with 10G ethernet (6.77 KB, application/pdf)
2016-12-23 03:59 UTC, Weijia Song
ibv_rc_pingpong's perf record dump for event mode (51.68 KB, application/octet-stream)
2017-01-10 03:35 UTC, Weijia Song
ibv_rc_pingpong's perf record dump for polling mode (1.14 MB, application/octet-stream)
2017-01-10 03:36 UTC, Weijia Song
perf record -ags dump during ibv_rc_pingpong (event mode) (660.24 KB, application/zip)
2017-01-10 05:13 UTC, Weijia Song
"perf record -ags" for ibv_rc_pingpong, then "perf report --header > perf.txt" (48.44 KB, application/zip)
2017-01-11 04:18 UTC, Weijia Song

Description Weijia Song 2016-12-23 03:59:25 UTC
Created attachment 248401 [details]
SoftRoCE Performance with 10G ethernet

I found that SoftRoCE throughput is much lower than TCP or UDP. I used two high-end servers, each with a Myricom dual-port 10G NIC, and ran a CentOS-7 virtual machine on each of them. I upgraded the virtual machine kernel to the latest 4.9 (2016-12-11) version:
--------------------------------------------------------------------------
[weijia@srvm1 ~]$ uname -a
Linux srvm1 4.9.0 #1 SMP Fri Dec 16 16:35:46 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
--------------------------------------------------------------------------
The two virtual machines use the virtio NIC driver, so the network I/O overhead is very low. The iperf tool shows ~9 Gbps peak throughput with both TCP and UDP:
--------------------------------------------------------------------------
[weijia@srvm1 ~]$ iperf3 -c 192.168.30.10
Connecting to host 192.168.30.10, port 5201
[  4] local 192.168.29.10 port 59986 connected to 192.168.30.10 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  1.06 GBytes  9.12 Gbits/sec    3   1.28 MBytes
[  4]   1.00-2.00   sec  1.09 GBytes  9.39 Gbits/sec    1   1.81 MBytes
[  4]   2.00-3.00   sec  1.06 GBytes  9.14 Gbits/sec    0   2.21 MBytes
[  4]   3.00-4.00   sec  1.09 GBytes  9.36 Gbits/sec    0   2.56 MBytes
[  4]   4.00-5.00   sec  1.07 GBytes  9.15 Gbits/sec    0   2.85 MBytes
[  4]   5.00-6.00   sec  1.09 GBytes  9.39 Gbits/sec    0   3.00 MBytes
[  4]   6.00-7.00   sec  1.07 GBytes  9.21 Gbits/sec    0   3.00 MBytes
[  4]   7.00-8.00   sec  1.09 GBytes  9.39 Gbits/sec    0   3.00 MBytes
[  4]   8.00-9.00   sec  1.09 GBytes  9.39 Gbits/sec    0   3.00 MBytes
[  4]   9.00-10.00  sec  1.09 GBytes  9.38 Gbits/sec    0   3.00 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  10.8 GBytes  9.29 Gbits/sec    4             sender
[  4]   0.00-10.00  sec  10.8 GBytes  9.29 Gbits/sec                  receiver

iperf Done.

[weijia@srvm1 ~]$ iperf3 -c 192.168.30.10 -u -b 15000m
Connecting to host 192.168.30.10, port 5201
[  4] local 192.168.29.10 port 50826 connected to 192.168.30.10 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-1.00   sec   976 MBytes  8.19 Gbits/sec  124931
[  4]   1.00-2.00   sec  1.00 GBytes  8.63 Gbits/sec  131657
[  4]   2.00-3.00   sec  1.02 GBytes  8.75 Gbits/sec  133452
[  4]   3.00-4.00   sec  1.05 GBytes  9.02 Gbits/sec  137581
[  4]   4.00-5.00   sec  1.05 GBytes  9.02 Gbits/sec  137567
[  4]   5.00-6.00   sec  1.02 GBytes  8.72 Gbits/sec  133102
[  4]   6.00-7.00   sec  1.00 GBytes  8.61 Gbits/sec  131386
[  4]   7.00-8.00   sec   994 MBytes  8.34 Gbits/sec  127229
[  4]   8.00-9.00   sec  1.04 GBytes  8.94 Gbits/sec  136484
[  4]   9.00-10.00  sec   839 MBytes  7.04 Gbits/sec  107376
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec  9.92 GBytes  8.52 Gbits/sec  0.005 ms  323914/1300764 (25%)
[  4] Sent 1300764 datagrams

iperf Done.
--------------------------------------------------------------------------

Then I used ibv_rc_pingpong to test the bandwidth between the two virtual machines. The result is extremely low:
--------------------------------------------------------------------------
[weijia@srvm1 ~]$ ibv_rc_pingpong -s 4096 -g 1 -n 1000000 192.168.30.10
  local address:  LID 0x0000, QPN 0x000011, PSN 0x3072e0, GID ::ffff:192.168.29.10
  remote address: LID 0x0000, QPN 0x000011, PSN 0xa54a62, GID ::ffff:192.168.30.10
8192000000 bytes in 220.23 seconds = 297.58 Mbit/sec
1000000 iters in 220.23 seconds = 220.23 usec/iter
[weijia@srvm1 ~]$ ibv_uc_pingpong -s 4096 -g 1 -n 10000 192.168.30.10
  local address:  LID 0x0000, QPN 0x000011, PSN 0x7daab0, GID ::ffff:192.168.29.10
  remote address: LID 0x0000, QPN 0x000011, PSN 0xdd96cf, GID ::ffff:192.168.30.10
81920000 bytes in 67.86 seconds = 9.66 Mbit/sec
10000 iters in 67.86 seconds = 6786.20 usec/iter

--------------------------------------------------------------------------

Then I repeated the ibv_rc_pingpong experiments with different message sizes and tried both polling and event modes. I also measured the CPU utilization of the ibv_rc_pingpong process. The results are shown in the attached figure: 'poll' means polling mode, where ibv_rc_pingpong is run without the '-e' option, while 'int' (interrupt mode) represents event mode with '-e' enabled. It seems the CPU saturates once SoftRoCE throughput reaches ~2 Gbit/s. This does not make sense, since UDP and TCP do much better. Could the SoftRoCE implementation be optimized?
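
For reference, the two modes differ only in how completions are harvested from the completion queue. The sketch below (a hypothetical cq_modes.c; it creates no QP and moves no traffic, so it only illustrates the shape of the two loops) shows the trade-off in libibverbs terms: event mode sleeps in the kernel and pays a read() plus a re-arm syscall per wakeup, while polling mode spins in user space and pins a core.
--------------------------------------------------------------------------
/* cq_modes.c - contrast ibv_rc_pingpong's two completion modes.
 * A minimal sketch, assuming libibverbs is installed and an RDMA device
 * (e.g. rxe0) is present; no QP is created and no traffic flows, so the
 * point is only the shape of the two loops.
 * Build: gcc -O2 cq_modes.c -o cq_modes -libverbs
 */
#include <stdio.h>
#include <infiniband/verbs.h>

/* Event ("int") mode, selected by "-e": sleep in the kernel until a
 * completion event fires. Every wakeup costs a read() on the channel fd
 * (ibv_get_cq_event) plus a re-arm syscall (ibv_req_notify_cq). */
static void wait_one_event(struct ibv_comp_channel *ch)
{
    struct ibv_cq *ev_cq;
    void *ev_ctx;
    struct ibv_wc wc;

    if (ibv_get_cq_event(ch, &ev_cq, &ev_ctx))  /* blocking read() */
        return;
    ibv_ack_cq_events(ev_cq, 1);
    ibv_req_notify_cq(ev_cq, 0);                /* re-arm before draining */
    while (ibv_poll_cq(ev_cq, 1, &wc) > 0)
        ;                                       /* consume completions */
}

int main(int argc, char **argv)
{
    struct ibv_device **devs;
    struct ibv_context *ctx;
    struct ibv_comp_channel *ch;
    struct ibv_cq *cq;
    struct ibv_wc wc;
    int n;

    devs = ibv_get_device_list(&n);
    if (!devs || n == 0) {
        fprintf(stderr, "no RDMA device found\n");
        return 1;
    }
    ctx = ibv_open_device(devs[0]);
    ch = ctx ? ibv_create_comp_channel(ctx) : NULL;
    cq = ch ? ibv_create_cq(ctx, 16, NULL, ch, 0) : NULL;
    if (!cq) {
        fprintf(stderr, "verbs setup failed\n");
        return 1;
    }

    ibv_req_notify_cq(cq, 0);           /* arm the CQ for event delivery */

    if (argc > 1)                       /* any argument: event mode (blocks) */
        wait_one_event(ch);
    else                                /* default: polling mode -- spins in
                                         * user space with no syscalls, but
                                         * pins one core at 100% */
        while (ibv_poll_cq(cq, 1, &wc) > 0)
            ;

    ibv_destroy_cq(cq);
    ibv_destroy_comp_channel(ch);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
--------------------------------------------------------------------------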

ibv_devinfo information:
--------------------------------------------------------------------------
[weijia@srvm1 ~]$ ibv_devinfo
hca_id: rxe0
        transport:                      InfiniBand (0)
        fw_ver:                         0.0.0
        node_guid:                      5054:00ff:fe4b:d859
        sys_image_guid:                 0000:0000:0000:0000
        vendor_id:                      0x0000
        vendor_part_id:                 0
        hw_ver:                         0x0
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

--------------------------------------------------------------------------
Comment 1 Bart Van Assche 2017-01-02 08:10:06 UTC
Can you check with perf record / perf report what code is saturating the CPU when running SoftRoCE traffic? I'm wondering whether it's the CRC calculation code that saturates the CPU.
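
A quick user-space sanity check of this hypothesis is to measure raw single-core CRC32 throughput. The sketch below (a hypothetical crc_bench.c) uses zlib's crc32() as a stand-in for the ICRC that the rxe driver computes per packet; the polynomial is the same CRC-32, but rxe's seeding and masked header fields differ, so read the number as a ballpark upper bound only.
--------------------------------------------------------------------------
/* crc_bench.c - rough upper bound on single-core software CRC32
 * throughput, to sanity-check the CRC hypothesis.
 * Build: gcc -O2 crc_bench.c -o crc_bench -lz
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <zlib.h>

#define PKT   1024        /* matches the active_mtu reported by ibv_devinfo */
#define ITERS (1 << 20)

int main(void)
{
    unsigned char *buf = malloc(PKT);
    struct timespec t0, t1;
    uLong crc = 0;
    double sec, gbps;
    int i;

    for (i = 0; i < PKT; i++)
        buf[i] = (unsigned char)i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < ITERS; i++)
        crc = crc32(crc, buf, PKT);   /* one CRC pass per "packet" */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    sec  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    gbps = (double)ITERS * PKT * 8 / sec / 1e9;
    /* printing crc also keeps the loop from being optimized away */
    printf("crc=%lx: %d x %d-byte packets in %.2f s = %.2f Gbit/s\n",
           crc, ITERS, PKT, sec, gbps);
    free(buf);
    return 0;
}
--------------------------------------------------------------------------
If the printed rate is far above the ~2 Gbit/s plateau reported here, raw CRC arithmetic alone is unlikely to be the whole bottleneck; if it lands in the same range, the CRC path is a prime suspect.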
Comment 2 Weijia Song 2017-01-10 03:33:32 UTC
(In reply to Bart Van Assche from comment #1)
> Can you check with perf record / perf report what code is saturating the CPU
> when running SoftRoCE traffic? I'm wondering whether it's the CRC
> calculation code that saturates the CPU.

Sure.

So I did the ibv_rc_pingpong tests as follows, running the following command on the server side:
================================================================
[weijia@srvm1 ~]$ ibv_rc_pingpong -g 0 -s 65536 -n 100000 -e
  local address:  LID 0x0000, QPN 0x000011, PSN 0x897f42, GID fe80::5054:ff:fe4b:d859
  remote address: LID 0x0000, QPN 0x000011, PSN 0xeaf6ae, GID fe80::5054:ff:fe4b:d860
13107200000 bytes in 45.01 seconds = 2329.75 Mbit/sec
100000 iters in 45.01 seconds = 450.08 usec/iter
================================================================
while on the client side:
================================================================
[weijia@srvm2 ~]$ perf record ibv_rc_pingpong -g 0 192.168.29.10 -s 65536 -n 100000 -e                                                                                            
  local address:  LID 0x0000, QPN 0x000011, PSN 0xeaf6ae, GID fe80::5054:ff:fe4b:d860
  remote address: LID 0x0000, QPN 0x000011, PSN 0x897f42, GID fe80::5054:ff:fe4b:d859
13107200000 bytes in 45.01 seconds = 2329.75 Mbit/sec
100000 iters in 45.01 seconds = 450.08 usec/iter
================================================================
perf report shows the following:
================================================================
Overhead  Command          Shared Object        Symbol
  21.53%  ibv_rc_pingpong  libibverbs.so.1.0.0  [.] ibv_get_cq_event
  13.13%  ibv_rc_pingpong  libibverbs.so.1.0.0  [.] ibv_cmd_req_notify_cq
  11.66%  ibv_rc_pingpong  libpthread-2.17.so   [.] pthread_spin_lock
   5.67%  ibv_rc_pingpong  librxe-rdmav2.so     [.] rxe_poll_cq
   5.36%  ibv_rc_pingpong  libpthread-2.17.so   [.] __read_nocancel
   5.04%  ibv_rc_pingpong  libpthread-2.17.so   [.] __write_nocancel
   4.62%  ibv_rc_pingpong  libpthread-2.17.so   [.] __libc_write
   3.99%  ibv_rc_pingpong  librxe-rdmav2.so     [.] rxe_post_send
   3.15%  ibv_rc_pingpong  libc-2.17.so         [.] __memcpy_ssse3_back
   2.84%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002764
   2.73%  ibv_rc_pingpong  libpthread-2.17.so   [.] __libc_read
   2.73%  ibv_rc_pingpong  librxe-rdmav2.so     [.] convert_send_wr
   1.58%  ibv_rc_pingpong  librxe-rdmav2.so     [.] init_send_wqe
   1.47%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002749
   1.26%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002fe0
   1.16%  ibv_rc_pingpong  librxe-rdmav2.so     [.] post_send_db
   0.74%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000003003
   0.74%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x000000000000306a
   0.74%  ibv_rc_pingpong  librxe-rdmav2.so     [.] pthread_spin_lock@plt
   0.63%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000003020
   0.63%  ibv_rc_pingpong  libpthread-2.17.so   [.] pthread_spin_unlock
   0.53%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002485
   0.53%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002768
   0.53%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x000000000000298d
   0.53%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002bc7
   0.42%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002476
   0.32%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002453
   0.32%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x000000000000247a
   0.32%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002482
   0.32%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x000000000000251c
   0.32%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x00000000000025a0
   0.32%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002745
   0.32%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002995
   0.32%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002bca
   0.32%  ibv_rc_pingpong  librxe-rdmav2.so     [.] rxe_post_one_recv
   0.21%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002525
   0.21%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002990
   0.21%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002bd4
   0.21%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002bdb
   0.21%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000003065
   0.21%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000003073
   0.21%  ibv_rc_pingpong  librxe-rdmav2.so     [.] pthread_spin_unlock@plt
   0.21%  ibv_rc_pingpong  librxe-rdmav2.so     [.] rxe_post_recv
   0.11%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x000000000000247d
   0.11%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002500
================================================================

I attached the perf dump as perf_evt.data in this discussion thread (see "Attachments" at the top). "perf_poll.data" is the perf dump for polling mode (ibv_rc_pingpong without '-e'). Thank you for looking at this problem, and sorry for my delay; I was busy with another project last week.
Comment 3 Weijia Song 2017-01-10 03:35:59 UTC
Created attachment 251041 [details]
ibv_rc_pingpong's perf record dump for event mode
Comment 4 Weijia Song 2017-01-10 03:36:34 UTC
Created attachment 251051 [details]
ibv_rc_pingpong's perf record dump for polling mode
Comment 5 Bart Van Assche 2017-01-10 03:48:30 UTC
Sorry that I wasn't clearer, but what's needed is a system-wide trace (e.g. ibv_rc_pingpong -g 0 192.168.29.10 -s 65536 -n 100000 -e & perf record -ags sleep 10 & wait) instead of a trace of only ibv_rc_pingpong.
Comment 6 Weijia Song 2017-01-10 05:13:08 UTC
Created attachment 251061 [details]
perf record -ags dump during ibv_rc_pingpong (event mode)
Comment 7 Weijia Song 2017-01-10 05:21:21 UTC
Oh, thank you, Bart! So I started a long ibv_rc_pingpong session and ran "perf record -ags sleep 10" in another shell. Now the dump has more details; I just uploaded it. Hope this helps.

(In reply to Bart Van Assche from comment #5)
> Sorry that I wasn't more clear but what's needed is a system-wide trace
> (e.g. ibv_rc_pingpong -g 0 192.168.29.10 -s 65536 -n 100000 -e & perf record
> -ags sleep 10 & wait) instead of a trace of only ibv_rc_pingpong.
Comment 8 Bart Van Assche 2017-01-11 01:51:51 UTC
Can you also provide the perf data in ASCII format (perf report --header >perf.txt)? Unfortunately the binary perf data can only be analyzed on the system on which it has been captured.
Comment 9 Weijia Song 2017-01-11 04:18:41 UTC
Created attachment 251171 [details]
"perf record -ags" for ibv_rc_pingpong, then "perf report --header > perf.txt"
Comment 10 Weijia Song 2017-01-11 04:21:40 UTC
(In reply to Bart Van Assche from comment #8)
> Can you also provide the perf data in ASCII format (perf report --header
> >perf.txt)? Unfortunately the binary perf data can only be analyzed on the
> system on which it has been captured.

Sure, I just uploaded the dump. Please note that the following message is reported when I run "perf report --header":
corrupted callchain. skipping...

I can also share the access to my experimental environment if you want.
