Created attachment 248401 [details]
SoftRoCE Performance with 10G ethernet

I found that SoftRoCE throughput is much lower than TCP or UDP throughput. I used two high-end servers with Myricom 10G dual-port NICs and ran a CentOS 7 virtual machine on each of them. I upgraded the virtual machines' kernel to the latest 4.9 (2016-12-11) version:

--------------------------------------------------------------------------
[weijia@srvm1 ~]$ uname -a
Linux srvm1 4.9.0 #1 SMP Fri Dec 16 16:35:46 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
--------------------------------------------------------------------------

The two virtual machines use the virtio NIC driver, so the network I/O overhead is very low. iperf3 shows a ~9 Gbit/s peak throughput with both TCP and UDP:

--------------------------------------------------------------------------
[weijia@srvm1 ~]$ iperf3 -c 192.168.30.10
Connecting to host 192.168.30.10, port 5201
[  4] local 192.168.29.10 port 59986 connected to 192.168.30.10 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  1.06 GBytes  9.12 Gbits/sec    3   1.28 MBytes
[  4]   1.00-2.00   sec  1.09 GBytes  9.39 Gbits/sec    1   1.81 MBytes
[  4]   2.00-3.00   sec  1.06 GBytes  9.14 Gbits/sec    0   2.21 MBytes
[  4]   3.00-4.00   sec  1.09 GBytes  9.36 Gbits/sec    0   2.56 MBytes
[  4]   4.00-5.00   sec  1.07 GBytes  9.15 Gbits/sec    0   2.85 MBytes
[  4]   5.00-6.00   sec  1.09 GBytes  9.39 Gbits/sec    0   3.00 MBytes
[  4]   6.00-7.00   sec  1.07 GBytes  9.21 Gbits/sec    0   3.00 MBytes
[  4]   7.00-8.00   sec  1.09 GBytes  9.39 Gbits/sec    0   3.00 MBytes
[  4]   8.00-9.00   sec  1.09 GBytes  9.39 Gbits/sec    0   3.00 MBytes
[  4]   9.00-10.00  sec  1.09 GBytes  9.38 Gbits/sec    0   3.00 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  10.8 GBytes  9.29 Gbits/sec    4             sender
[  4]   0.00-10.00  sec  10.8 GBytes  9.29 Gbits/sec                  receiver

iperf Done.

[weijia@srvm1 ~]$ iperf3 -c 192.168.30.10 -u -b 15000m
Connecting to host 192.168.30.10, port 5201
[  4] local 192.168.29.10 port 50826 connected to 192.168.30.10 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-1.00   sec   976 MBytes  8.19 Gbits/sec  124931
[  4]   1.00-2.00   sec  1.00 GBytes  8.63 Gbits/sec  131657
[  4]   2.00-3.00   sec  1.02 GBytes  8.75 Gbits/sec  133452
[  4]   3.00-4.00   sec  1.05 GBytes  9.02 Gbits/sec  137581
[  4]   4.00-5.00   sec  1.05 GBytes  9.02 Gbits/sec  137567
[  4]   5.00-6.00   sec  1.02 GBytes  8.72 Gbits/sec  133102
[  4]   6.00-7.00   sec  1.00 GBytes  8.61 Gbits/sec  131386
[  4]   7.00-8.00   sec   994 MBytes  8.34 Gbits/sec  127229
[  4]   8.00-9.00   sec  1.04 GBytes  8.94 Gbits/sec  136484
[  4]   9.00-10.00  sec   839 MBytes  7.04 Gbits/sec  107376
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec  9.92 GBytes  8.52 Gbits/sec  0.005 ms  323914/1300764 (25%)
[  4] Sent 1300764 datagrams

iperf Done.
--------------------------------------------------------------------------

Then I used ibv_rc_pingpong to test the bandwidth between the two virtual machines.
The result is extremely low:

--------------------------------------------------------------------------
[weijia@srvm1 ~]$ ibv_rc_pingpong -s 4096 -g 1 -n 1000000 192.168.30.10
  local address:  LID 0x0000, QPN 0x000011, PSN 0x3072e0, GID ::ffff:192.168.29.10
  remote address: LID 0x0000, QPN 0x000011, PSN 0xa54a62, GID ::ffff:192.168.30.10
8192000000 bytes in 220.23 seconds = 297.58 Mbit/sec
1000000 iters in 220.23 seconds = 220.23 usec/iter

[weijia@srvm1 ~]$ ibv_uc_pingpong -s 4096 -g 1 -n 10000 192.168.30.10
  local address:  LID 0x0000, QPN 0x000011, PSN 0x7daab0, GID ::ffff:192.168.29.10
  remote address: LID 0x0000, QPN 0x000011, PSN 0xdd96cf, GID ::ffff:192.168.30.10
81920000 bytes in 67.86 seconds = 9.66 Mbit/sec
10000 iters in 67.86 seconds = 6786.20 usec/iter
--------------------------------------------------------------------------

I then repeated the ibv_rc_pingpong experiments with different message sizes, in both polling and event mode, and also measured the CPU utilization of the ibv_rc_pingpong process (a sketch of the sweep is at the end of this comment). The results are shown in the attached figure: 'poll' means polling mode, where ibv_rc_pingpong is run without the '-e' option, while 'int' (interrupt) means event mode, with '-e' enabled. It appears the CPU saturates once SoftRoCE throughput reaches ~2 Gbit/s. This does not make sense, since TCP and UDP do much better on the same link. Could the SoftRoCE implementation be optimized?

ibv_devinfo output:

--------------------------------------------------------------------------
[weijia@srvm1 ~]$ ibv_devinfo
hca_id: rxe0
        transport:                      InfiniBand (0)
        fw_ver:                         0.0.0
        node_guid:                      5054:00ff:fe4b:d859
        sys_image_guid:                 0000:0000:0000:0000
        vendor_id:                      0x0000
        vendor_part_id:                 0
        hw_ver:                         0x0
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
--------------------------------------------------------------------------
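For reproducibility, here is a minimal sketch of the sweep behind the attached figure. The server address, size list, and the use of pidstat for CPU sampling are my choices, not the only way to do this; a matching server-side pingpong must be restarted before each client run:

--------------------------------------------------------------------------
#!/bin/bash
# Hypothetical sweep over message sizes in both completion modes.
# Assumes "ibv_rc_pingpong -g 1 -s <size> [-e]" has already been
# started on the remote host for each iteration.
SERVER=192.168.30.10
for size in 1024 4096 16384 65536 262144; do
    for mode in poll int; do
        opt=""
        [ "$mode" = int ] && opt="-e"
        ibv_rc_pingpong -g 1 -s "$size" -n 100000 $opt "$SERVER" &
        pp_pid=$!
        # Sample the client process's CPU utilization once per second.
        pidstat -p "$pp_pid" 1 > "cpu_${size}_${mode}.log" &
        stat_pid=$!
        wait "$pp_pid"
        kill "$stat_pid" 2>/dev/null
    done
done
--------------------------------------------------------------------------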
Can you check with perf record / perf report which code is saturating the CPU when running SoftRoCE traffic? I'm wondering whether it's the CRC calculation code.
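A quick, admittedly crude way to bound that hypothesis from userspace is to time a software CRC over a few gigabytes on a single core. Note that cksum uses a different CRC-32 polynomial than the RoCE ICRC and a plain byte-at-a-time table, so this only bounds a naive implementation:

--------------------------------------------------------------------------
# Time a single-core software CRC over 4 GiB of zeroes; 4 GiB divided
# by the elapsed time gives a rough upper bound on byte-at-a-time
# CRC32 throughput (pipe and dd overhead included).
time dd if=/dev/zero bs=1M count=4096 2>/dev/null | cksum
--------------------------------------------------------------------------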
(In reply to Bart Van Assche from comment #1)
> Can you check with perf record / perf report which code is saturating the
> CPU when running SoftRoCE traffic? I'm wondering whether it's the CRC
> calculation code.

Sure. I ran the ibv_rc_pingpong tests as follows. On the server side:

================================================================
[weijia@srvm1 ~]$ ibv_rc_pingpong -g 0 -s 65536 -n 100000 -e
  local address:  LID 0x0000, QPN 0x000011, PSN 0x897f42, GID fe80::5054:ff:fe4b:d859
  remote address: LID 0x0000, QPN 0x000011, PSN 0xeaf6ae, GID fe80::5054:ff:fe4b:d860
13107200000 bytes in 45.01 seconds = 2329.75 Mbit/sec
100000 iters in 45.01 seconds = 450.08 usec/iter
================================================================

and on the client side:

================================================================
[weijia@srvm2 ~]$ perf record ibv_rc_pingpong -g 0 192.168.29.10 -s 65536 -n 100000 -e
  local address:  LID 0x0000, QPN 0x000011, PSN 0xeaf6ae, GID fe80::5054:ff:fe4b:d860
  remote address: LID 0x0000, QPN 0x000011, PSN 0x897f42, GID fe80::5054:ff:fe4b:d859
13107200000 bytes in 45.01 seconds = 2329.75 Mbit/sec
100000 iters in 45.01 seconds = 450.08 usec/iter
================================================================

perf report shows the following:

================================================================
Overhead  Command          Shared Object        Symbol
  21.53%  ibv_rc_pingpong  libibverbs.so.1.0.0  [.] ibv_get_cq_event
  13.13%  ibv_rc_pingpong  libibverbs.so.1.0.0  [.] ibv_cmd_req_notify_cq
  11.66%  ibv_rc_pingpong  libpthread-2.17.so   [.] pthread_spin_lock
   5.67%  ibv_rc_pingpong  librxe-rdmav2.so     [.] rxe_poll_cq
   5.36%  ibv_rc_pingpong  libpthread-2.17.so   [.] __read_nocancel
   5.04%  ibv_rc_pingpong  libpthread-2.17.so   [.] __write_nocancel
   4.62%  ibv_rc_pingpong  libpthread-2.17.so   [.] __libc_write
   3.99%  ibv_rc_pingpong  librxe-rdmav2.so     [.] rxe_post_send
   3.15%  ibv_rc_pingpong  libc-2.17.so         [.] __memcpy_ssse3_back
   2.84%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002764
   2.73%  ibv_rc_pingpong  libpthread-2.17.so   [.] __libc_read
   2.73%  ibv_rc_pingpong  librxe-rdmav2.so     [.] convert_send_wr
   1.58%  ibv_rc_pingpong  librxe-rdmav2.so     [.] init_send_wqe
   1.47%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002749
   1.26%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002fe0
   1.16%  ibv_rc_pingpong  librxe-rdmav2.so     [.] post_send_db
   0.74%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000003003
   0.74%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x000000000000306a
   0.74%  ibv_rc_pingpong  librxe-rdmav2.so     [.] pthread_spin_lock@plt
   0.63%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000003020
   0.63%  ibv_rc_pingpong  libpthread-2.17.so   [.] pthread_spin_unlock
   0.53%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002485
   0.53%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002768
   0.53%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x000000000000298d
   0.53%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002bc7
   0.42%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002476
   0.32%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002453
   0.32%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x000000000000247a
   0.32%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002482
   0.32%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x000000000000251c
   0.32%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x00000000000025a0
   0.32%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002745
   0.32%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002995
   0.32%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002bca
   0.32%  ibv_rc_pingpong  librxe-rdmav2.so     [.] rxe_post_one_recv
   0.21%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002525
   0.21%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002990
   0.21%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002bd4
   0.21%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002bdb
   0.21%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000003065
   0.21%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000003073
   0.21%  ibv_rc_pingpong  librxe-rdmav2.so     [.] pthread_spin_unlock@plt
   0.21%  ibv_rc_pingpong  librxe-rdmav2.so     [.] rxe_post_recv
   0.11%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x000000000000247d
   0.11%  ibv_rc_pingpong  ibv_rc_pingpong      [.] 0x0000000000002500
================================================================

I attached the perf dump as perf_evt.data in this discussion thread (see "Attachments" at the top). "perf_poll.data" is the perf dump for polling mode (ibv_rc_pingpong without '-e').

Thank you for looking at this problem, and sorry for the delay; I was busy on another project last week.
Created attachment 251041 [details] ibv_rc_pingpong's perf record dump for event mode
Created attachment 251051 [details] ibv_rc_pingpong's perf record dump for polling mode
Sorry that I wasn't clearer, but what's needed is a system-wide trace (e.g. ibv_rc_pingpong -g 0 192.168.29.10 -s 65536 -n 100000 -e & perf record -ags sleep 10 & wait) instead of a trace of only ibv_rc_pingpong.
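Spelled out as a script, that capture would look roughly like this (the 10-second window is arbitrary; it only needs to fall inside the pingpong run):

--------------------------------------------------------------------------
# Client side; run while the server-side pingpong is already listening.
ibv_rc_pingpong -g 0 192.168.29.10 -s 65536 -n 100000 -e &
# -a: all CPUs, -g: call graphs, -s: per-thread event counts; samples
# system-wide for the 10 seconds that "sleep 10" runs.
perf record -a -g -s sleep 10 &
wait
--------------------------------------------------------------------------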
Created attachment 251061 [details] perf record -ags dump during ibv_rc_pingpong (event mode)
(In reply to Bart Van Assche from comment #5)
> Sorry that I wasn't clearer, but what's needed is a system-wide trace (e.g.
> ibv_rc_pingpong -g 0 192.168.29.10 -s 65536 -n 100000 -e & perf record -ags
> sleep 10 & wait) instead of a trace of only ibv_rc_pingpong.

Oh, thank you, Bart! I started a long ibv_rc_pingpong session and ran "perf record -ags sleep 10" in another shell. The dump now has more details; I just uploaded it. I hope this helps.
Can you also provide the perf data in ASCII format (perf report --header > perf.txt)? Unfortunately, the binary perf data can only be analyzed on the system on which it was captured.
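For reference, something like the following on the capture host produces the text report; --stdio forces the non-interactive UI so the output can be redirected:

--------------------------------------------------------------------------
perf report --header --stdio > perf.txt
--------------------------------------------------------------------------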
Created attachment 251171 [details] "perf record -ags" for ibv_rc_pingpong, then "perf report --header > perf.txt"
(In reply to Bart Van Assche from comment #8)
> Can you also provide the perf data in ASCII format (perf report --header >
> perf.txt)? Unfortunately, the binary perf data can only be analyzed on the
> system on which it was captured.

Sure, I just uploaded the dump. Please note that the following message is reported when I run "perf report --header":

corrupted callchain. skipping...

I can also share access to my experimental environment if you want.