When multiple requests for VF information arrive at the PF, we're seeing these syslog messages:

[Thu Dec 3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 49 aq_err 12
[Thu Dec 3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 52 aq_err 12
[Thu Dec 3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 54 aq_err 12

Expanded with admin queue debug logging enabled:

root@s56-t37-sut1:/sys/kernel/debug/i40e/0000:05:00.1# echo 'msg_enable -1' > command

[1483319.788799] i40e 05:00.1 AQTX: desc and buffer writeback:
[1483319.788800] i40e 05:00.1 AQ CMD: opcode 0x0207, flags 0x2003, datalen 0x0000, retval 0x0000
[1483319.788801] i40e 05:00.1 cookie (h,l) 0x00000000 0x00000000
[1483319.788803] i40e 05:00.1 param (0,1) 0x00000000 0x002029E8
[1483319.788804] i40e 05:00.1 addr (h,l) 0x00000000 0x07FF07FF
[1483319.788807] i40e 05:00.1 ntc 234 head 235.
[1483319.788809] i40e 05:00.1 AQTX: desc and buffer:
[1483319.788810] i40e 05:00.1 AQ CMD: opcode 0x0207, flags 0x2000, datalen 0x0000, retval 0x0000
[1483319.788811] i40e 05:00.1 cookie (h,l) 0x00000000 0x00000000
[1483319.788812] i40e 05:00.1 param (0,1) 0x00000000 0x002031E8
[1483319.788813] i40e 05:00.1 addr (h,l) 0x00000000 0x07FF07FF
[1483319.788867] i40e 05:00.1 AQTX: desc and buffer writeback:
[1483319.788868] i40e 05:00.1 AQ CMD: opcode 0x0207, flags 0x2003, datalen 0x0000, retval 0x0000
[1483319.788869] i40e 05:00.1 cookie (h,l) 0x00000000 0x00000000
[1483319.788871] i40e 05:00.1 param (0,1) 0x00000000 0x002031E8
[1483319.788872] i40e 05:00.1 addr (h,l) 0x00000000 0x07FF07FF
[1483319.802767] vfio-pci 0000:05:10.1: enabling device (0000 -> 0002)
[1483319.826114] i40e 05:00.1 AQRX: desc and buffer:
[1483319.826115] i40e 05:00.1 AQ CMD: opcode 0x0801, flags 0x1009, datalen 0x0008, retval 0x0070
[1483319.826116] i40e 05:00.1 cookie (h,l) 0x00000001 0x00000000
[1483319.826117] i40e 05:00.1 param (0,1) 0x00000000 0x00000000
[1483319.826118] i40e 05:00.1 addr (h,l) 0x00000000 0xFFBBB000
[1483319.826119] i40e 05:00.1 AQ CMD Buffer:
[1483319.826121] i40e 05:00.1: 0x00000000: 01 00 00 00 01 00 00 00
[1483319.826126] i40e 05:00.1 ntc 235 head 236.
[1483319.826127] i40e 05:00.1 AQTX: desc and buffer:
[1483319.826129] i40e 05:00.1 AQ CMD: opcode 0x0802, flags 0x3400, datalen 0x0008, retval 0x0000
[1483319.826130] i40e 05:00.1 cookie (h,l) 0x00000001 0x00000000
[1483319.826131] i40e 05:00.1 param (0,1) 0x00000070 0x00000000
[1483319.826132] i40e 05:00.1 addr (h,l) 0x00000000 0xFFCFA000
[1483319.826132] i40e 05:00.1 AQ CMD Buffer:
[1483319.826134] i40e 05:00.1: 0x00000000: 01 00 00 00 01 00 00 00
[1483319.826187] i40e 05:00.1 AQTX: Command completed with error 0xC.
[1483319.826188] i40e 05:00.1 AQTX: desc and buffer writeback:
[1483319.826189] i40e 05:00.1 AQ CMD: opcode 0x0802, flags 0x3407, datalen 0x0008, retval 0x000C
[1483319.826190] i40e 05:00.1 cookie (h,l) 0x00000001 0x00000000
[1483319.826191] i40e 05:00.1 param (0,1) 0x00000070 0x00000000
[1483319.826192] i40e 05:00.1 addr (h,l) 0x00000000 0xFFCFA000
[1483319.826193] i40e 05:00.1 AQ CMD Buffer:
[1483319.826194] i40e 05:00.1: 0x00000000: 01 00 00 00 01 00 00 00
[1483319.826198] i40e 0000:05:00.1: Unable to send the message to VF 48 aq_err 12

Applications using this code (such as DPDK when it tries to verify the PF/VF version) fail when this error happens - an application not receiving data suggests a real error, when it is in fact a race condition. Note that aq_err 12 appears to correspond to I40E_AQ_RC_EBUSY in the admin queue error codes, which fits a race. A solution could be retrying or a locking mechanism.
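The retry idea could also be applied as a workaround at the application/CI level, by retrying the launch of the DPDK instance when VF init fails. A minimal sketch of such a wrapper follows; the backoff interval and the commented-out testpmd invocation are assumptions for illustration, not part of the driver or DPDK:

```shell
#!/bin/sh
# Hypothetical workaround sketch: retry a command a few times with a short
# backoff, to ride out transient "aq_err 12" failures during VF init.
# The attempt limit and 1-second backoff are arbitrary assumptions.
retry() {
    max=$1; shift
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$max" ]; then
            echo "giving up after $n attempts" >&2
            return 1
        fi
        n=$((n + 1))
        sleep 1   # small backoff before retrying the VF init
    done
    return 0
}

# Example usage (PCI addresses taken from this report):
# retry 5 /dpdk/build/app/dpdk-testpmd -n 4 --in-memory \
#     -a 0000:86:0a.6 -a 0000:86:0a.7
```

This only papers over the race from the outside; a proper fix would still need retry or locking on the PF side of the mailbox exchange.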
Related DPDK bug: https://bugs.dpdk.org/show_bug.cgi?id=578

Tested with kernel 4.15.0-55-generic.

NIC: Intel Ethernet Converged Network Adapter XL710-Q2

cat /sys/class/net/enp5s0f0/device/device
0x1583
cat /sys/class/net/enp5s0f0/device/vendor
0x8086
cat /sys/class/net/enp5s0f0/device/subsystem_device
0x0001
cat /sys/class/net/enp5s0f0/device/subsystem_vendor
0x8086

Firmware version:
driver: i40e
version: 2.1.14-k
firmware-version: 6.01 0x800035da 1.1747.0
expansion-rom-version:
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

Also tested with the latest i40e driver (2.13.10), where the issue is still reproducible. Tested on x86_64 and aarch64 hardware; reproducible on both.
Retested with the following versions:

VPP 2009: 20.09.0-62~g4f88004bc, which uses DPDK 20.08
driver: i40e
version: 2.12.6
firmware-version: 7.30 0x80008387 1.2684.0
kernel 4.15.0-55-generic

The versions were chosen according to https://doc.dpdk.org/guides/nics/i40e.html. The issue can be reproduced exactly the same way; no logs are different.
Retested again with newer driver/FW/DPDK versions:

VPP version: 21.06-rc0~693-gc44402f7b
DPDK version that VPP uses: 21.02

FW/i40e version:

sudo ethtool -i enp134s0f0
driver: i40e
version: 2.14.13
firmware-version: 8.15 0x800095e3 1.2829.0
expansion-rom-version:
bus-info: 0000:86:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

modinfo i40e
filename: /lib/modules/4.15.0-23-generic/updates/drivers/net/ethernet/intel/i40e/i40e.ko
version: 2.14.13
license: GPL
description: Intel(R) 40-10 Gigabit Ethernet Connection Network Driver
author: Intel Corporation, <e1000-devel@lists.sourceforge.net>
srcversion: 1A821F3D488396B967F338E
alias: pci:v00008086d0000158Bsv*sd*bc*sc*i*
alias: pci:v00008086d0000158Asv*sd*bc*sc*i*
alias: pci:v00008086d000037D3sv*sd*bc*sc*i*
alias: pci:v00008086d000037D2sv*sd*bc*sc*i*
alias: pci:v00008086d000037D1sv*sd*bc*sc*i*
alias: pci:v00008086d000037D0sv*sd*bc*sc*i*
alias: pci:v00008086d000037CFsv*sd*bc*sc*i*
alias: pci:v00008086d000037CEsv*sd*bc*sc*i*
alias: pci:v00008086d00000D58sv*sd*bc*sc*i*
alias: pci:v00008086d00000CF8sv*sd*bc*sc*i*
alias: pci:v00008086d00001588sv*sd*bc*sc*i*
alias: pci:v00008086d00001587sv*sd*bc*sc*i*
alias: pci:v00008086d0000104Fsv*sd*bc*sc*i*
alias: pci:v00008086d0000104Esv*sd*bc*sc*i*
alias: pci:v00008086d000015FFsv*sd*bc*sc*i*
alias: pci:v00008086d00001589sv*sd*bc*sc*i*
alias: pci:v00008086d00001586sv*sd*bc*sc*i*
alias: pci:v00008086d0000101Fsv*sd*bc*sc*i*
alias: pci:v00008086d00001585sv*sd*bc*sc*i*
alias: pci:v00008086d00001584sv*sd*bc*sc*i*
alias: pci:v00008086d00001583sv*sd*bc*sc*i*
alias: pci:v00008086d00001581sv*sd*bc*sc*i*
alias: pci:v00008086d00001580sv*sd*bc*sc*i*
alias: pci:v00008086d00001574sv*sd*bc*sc*i*
alias: pci:v00008086d00001572sv*sd*bc*sc*i*
depends: ptp
retpoline: Y
name: i40e
vermagic: 4.15.0-23-generic SMP mod_unload
parm: debug:Debug level (0=none,...,16=all) (int)
parm: l4mode:L4 cloud filter mode: 0=UDP,1=TCP,2=Both,-1=Disabled(default) (int)

Kernel version:
uname -r
4.15.0-23-generic

Even with these newer versions, we're still seeing the same issues:

[ 1032.054512] i40e 0000:86:00.1: Unable to send the message to VF 4 aq_err 12
[ 1432.992673] i40e 0000:86:00.0: Unable to send the message to VF 33 aq_err 12
[ 1439.008590] i40e 0000:86:00.0: Unable to send the message to VF 48 aq_err 12
[ 1440.300862] i40e 0000:86:00.0: Unable to send the message to VF 51 aq_err 12
[ 1452.288202] i40e 0000:86:00.1: Unable to send the message to VF 10 aq_err 12
[ 1455.877058] i40e 0000:86:00.1: Unable to send the message to VF 19 aq_err 12
[ 1463.479826] i40e 0000:86:00.1: Unable to send the message to VF 38 aq_err 12
[ 1467.035731] i40e 0000:86:00.1: Unable to send the message to VF 47 aq_err 12
[ 1479.123392] i40e 0000:af:00.0: Unable to send the message to VF 6 aq_err 12
[ 1488.019159] i40e 0000:af:00.0: Unable to send the message to VF 28 aq_err 12
[ 1490.690969] i40e 0000:af:00.0: Unable to send the message to VF 35 aq_err 12
[ 1494.127675] i40e 0000:af:00.0: Unable to send the message to VF 43 aq_err 12
[ 1530.005928] i40e 0000:af:00.1: Unable to send the message to VF 61 aq_err 12
[ 1700.702161] i40e 0000:86:00.1: Unable to send the message to VF 0 aq_err 12
[ 1700.702220] i40e 0000:86:00.1: Unable to send the message to VF 2 aq_err 12
[ 1701.594627] i40e 0000:86:00.1: Unable to send the message to VF 1 aq_err 12
[ 1701.847322] i40e 0000:86:00.1: Unable to send the message to VF 3 aq_err 12
[ 1701.847378] i40e 0000:86:00.1: Unable to send the message to VF 4 aq_err 12
[ 1702.891186] i40e 0000:86:00.1: Unable to send the message to VF 5 aq_err 12

These manifest in VPP/DPDK this way:

2021/05/13 11:12:54:128 notice dpdk EAL init args: -c 200 -n 4 --log-level debug --in-memory --no-telemetry --file-prefix vpp -a 0000:86:0a.6 -a 0000:86:0a.7 --main-lcore 9
2021/05/13 11:12:57:094 warn dpdk/cryptodev dpdk_cryptodev_init: Failed to configure cryptodev
2021/05/13 11:12:57:101 notice dpdk EAL: Detected 112 lcore(s)
2021/05/13 11:12:57:101 notice dpdk EAL: Detected 2 NUMA nodes
2021/05/13 11:12:57:101 notice dpdk EAL: Detected static linkage of DPDK
2021/05/13 11:12:57:101 notice dpdk EAL: Selected IOVA mode 'PA'
2021/05/13 11:12:57:101 notice dpdk EAL: No available 1048576 kB hugepages reported
2021/05/13 11:12:57:101 notice dpdk EAL: No free 1048576 kB hugepages reported on node 0
2021/05/13 11:12:57:101 notice dpdk EAL: No free 1048576 kB hugepages reported on node 1
2021/05/13 11:12:57:101 notice dpdk EAL: No available 1048576 kB hugepages reported
2021/05/13 11:12:57:101 notice dpdk EAL: Probing VFIO support...
2021/05/13 11:12:57:101 notice dpdk EAL: VFIO support initialized
2021/05/13 11:12:57:101 notice dpdk EAL: Probe PCI driver: net_i40e_vf (8086:154c) device: 0000:86:0a.6 (socket 1)
2021/05/13 11:12:57:101 notice dpdk i40evf_dev_init(): Init vf failed
2021/05/13 11:12:57:101 notice dpdk EAL: Releasing pci mapped resource for 0000:86:0a.6
2021/05/13 11:12:57:101 notice dpdk EAL: Calling pci_unmap_resource for 0000:86:0a.6 at 0x7f0400000000
2021/05/13 11:12:57:101 notice dpdk EAL: Calling pci_unmap_resource for 0000:86:0a.6 at 0x7f0400010000
2021/05/13 11:12:57:101 notice dpdk EAL: Requested device 0000:86:0a.6 cannot be used
2021/05/13 11:12:57:101 notice dpdk EAL: using IOMMU type 8 (No-IOMMU)
2021/05/13 11:12:57:101 notice dpdk EAL: Probe PCI driver: net_i40e_vf (8086:154c) device: 0000:86:0a.7 (socket 1)

Here, DPDK can't use 0000:86:0a.6 and fails in i40evf_dev_init, but it can use 0000:86:0a.7 right after that.
Another log:

2021/05/13 11:12:56:933 notice dpdk EAL: Detected 112 lcore(s)
2021/05/13 11:12:56:933 notice dpdk EAL: Detected 2 NUMA nodes
2021/05/13 11:12:56:933 notice dpdk EAL: Detected static linkage of DPDK
2021/05/13 11:12:56:933 notice dpdk EAL: Selected IOVA mode 'PA'
2021/05/13 11:12:56:933 notice dpdk EAL: No available 1048576 kB hugepages reported
2021/05/13 11:12:56:933 notice dpdk EAL: No free 1048576 kB hugepages reported on node 0
2021/05/13 11:12:56:933 notice dpdk EAL: No free 1048576 kB hugepages reported on node 1
2021/05/13 11:12:56:933 notice dpdk EAL: No available 1048576 kB hugepages reported
2021/05/13 11:12:56:933 notice dpdk EAL: Probing VFIO support...
2021/05/13 11:12:56:933 notice dpdk EAL: VFIO support initialized
2021/05/13 11:12:56:933 notice dpdk EAL: using IOMMU type 8 (No-IOMMU)
2021/05/13 11:12:56:933 notice dpdk EAL: Probe PCI driver: net_i40e_vf (8086:154c) device: 0000:86:0a.4 (socket 1)
2021/05/13 11:12:56:933 notice dpdk i40evf_check_api_version(): PF/VF API version mismatch:(0.0)-(1.1)
2021/05/13 11:12:56:933 notice dpdk i40evf_init_vf(): check_api version failed
2021/05/13 11:12:56:933 notice dpdk i40evf_dev_init(): Init vf failed
2021/05/13 11:12:56:933 notice dpdk EAL: Releasing pci mapped resource for 0000:86:0a.4
2021/05/13 11:12:56:933 notice dpdk EAL: Calling pci_unmap_resource for 0000:86:0a.4 at 0x7f5380000000
2021/05/13 11:12:56:933 notice dpdk EAL: Calling pci_unmap_resource for 0000:86:0a.4 at 0x7f5380010000
2021/05/13 11:12:56:933 notice dpdk EAL: Requested device 0000:86:0a.4 cannot be used
2021/05/13 11:12:56:933 notice dpdk EAL: using IOMMU type 8 (No-IOMMU)
2021/05/13 11:12:56:933 notice dpdk EAL: Probe PCI driver: net_i40e_vf (8086:154c) device: 0000:86:0a.5 (socket 1)
2021/05/13 11:12:56:933 notice dpdk i40evf_check_api_version(): PF/VF API version mismatch:(0.0)-(1.1)
2021/05/13 11:12:56:933 notice dpdk i40evf_init_vf(): check_api version failed
2021/05/13 11:12:56:933 notice dpdk i40evf_dev_init(): Init vf failed
2021/05/13 11:12:56:933 notice dpdk EAL: Releasing pci mapped resource for 0000:86:0a.5
2021/05/13 11:12:56:933 notice dpdk EAL: Calling pci_unmap_resource for 0000:86:0a.5 at 0x7f5380014000
2021/05/13 11:12:56:933 notice dpdk EAL: Calling pci_unmap_resource for 0000:86:0a.5 at 0x7f5380024000
2021/05/13 11:12:56:933 notice dpdk --- message(s) throttled ---

Here VPP/DPDK can't use either 0000:86:0a.4 or 0000:86:0a.5, and fails with a slightly different log message, suggesting it hit the race condition in a different place.
Hi Juraj, could you please share the minimal reproduction steps?
- version of DPDK used;
- how many VFs are spawned;
- do I have to unbind all the VFs from the iavf driver and bind them to vfio-pci;
- are the VFs passed through to a VM, or is reproduction possible on the host only;
- which script should I run from DPDK, along with the parameters that trigger the unwanted messages;
- any other useful info.

Thank you,
Slawomir Laba
Hi Slawomir,

We are using Docker containers to run multiple instances of VPP/DPDK, because we hit this issue in CI, which uses that setup. We haven't tried this with VMs.

The latest DPDK version we were able to reproduce this with is 21.02.

I'm not sure what you mean by spawned; we're creating the VFs via sriov_numvfs (our testing NIC is an XXV710, so that is 64 VFs per PF port). Each VPP/DPDK instance uses two VFs, and we need at least two separate instances to reproduce this.

We're mainly reproducing this with VPP (which uses DPDK), but I've also been able to reproduce it with just DPDK:

docker exec -d ${CONTAINER} sh -c "/dpdk/build/app/dpdk-testpmd -n 4 --log-level debug --in-memory --no-telemetry --file-prefix vpp -a \"\$(cut -d ' ' -f 1 nic_pcis.txt)\" -a \"\$(cut -d ' ' -f 2 nic_pcis.txt)\" --main-lcore ${DPDK_CORE} > ${DPDK_LOG_FILE}"

When using plain DPDK, the VFs need to be bound to vfio-pci; when using VPP, it binds them to vfio-pci automatically.

Make sure to start multiple separate instances of VPP/DPDK at the same time so that you can hit the race condition. One instance with two VFs never reproduces this (and more VFs in a single instance won't help IMO, since the initialization appears to be sequential); there must be simultaneous init attempts from separate instances.

Juraj
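The setup described above can be sketched as follows. This is a dry-run sketch (it only prints the commands, since they require real hardware and root); the PF interface name, container names, and example VF PCI addresses are assumptions taken from elsewhere in this report:

```shell
#!/bin/sh
# Sketch of the reproduction setup (dry run: prints commands instead of
# running them). PF interface, VF count, container names and PCI addresses
# are assumptions based on this report.
PF_IF=enp134s0f0
NUM_VFS=64          # XXV710: 64 VFs per PF port

run() { echo "$@"; }  # dry run; change the body to "$@" to actually execute

# 1. Create the VFs on the PF.
run sh -c "echo $NUM_VFS > /sys/class/net/$PF_IF/device/sriov_numvfs"

# 2. Bind two VFs per instance to vfio-pci (addresses are examples).
run dpdk-devbind.py -b vfio-pci 0000:86:0a.4 0000:86:0a.5
run dpdk-devbind.py -b vfio-pci 0000:86:0a.6 0000:86:0a.7

# 3. Start at least two DPDK instances simultaneously so their VF init
#    requests race on the PF admin queue.
run docker exec -d c1 /dpdk/build/app/dpdk-testpmd -n 4 --log-level debug \
    --in-memory --no-telemetry -a 0000:86:0a.4 -a 0000:86:0a.5
run docker exec -d c2 /dpdk/build/app/dpdk-testpmd -n 4 --log-level debug \
    --in-memory --no-telemetry -a 0000:86:0a.6 -a 0000:86:0a.7
```

The key point is step 3: the instances must start concurrently, since a single sequentially-initializing instance never triggers the race.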
The i40evf PMD used by DPDK is old and hasn't been updated much. In its stead, the iavf PMD has been developed and is actively maintained. The issue is not reproducible with the iavf PMD. Starting with 21.05, DPDK switched the default PMD for these VFs to iavf, thus resolving the issue.