Bug 210627 - [i40e] XL710 PF Unable to send the message to VF with simultaneous AQ CMD (aq_err 12)
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: drivers_network@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-12-11 08:57 UTC by Juraj Linkeš
Modified: 2021-06-04 08:32 UTC
CC List: 1 user

See Also:
Kernel Version: 4.15.0-55-generic
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Juraj Linkeš 2020-12-11 08:57:26 UTC
When multiple requests for VF information arrive at the PF simultaneously, we're seeing these syslog messages:
[Thu Dec  3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 49 aq_err 12
[Thu Dec  3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 52 aq_err 12
[Thu Dec  3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 54 aq_err 12

Expanded output after enabling all driver debug messages via:
root@s56-t37-sut1:/sys/kernel/debug/i40e/0000:05:00.1# echo 'msg_enable -1' >command
[1483319.788799] i40e 05:00.1 AQTX: desc and buffer writeback:
[1483319.788800] i40e 05:00.1 AQ CMD: opcode 0x0207, flags 0x2003, datalen 0x0000, retval 0x0000
[1483319.788801] i40e 05:00.1 cookie (h,l) 0x00000000 0x00000000
[1483319.788803] i40e 05:00.1 param (0,1) 0x00000000 0x002029E8
[1483319.788804] i40e 05:00.1 addr (h,l) 0x00000000 0x07FF07FF
[1483319.788807] i40e 05:00.1 ntc 234 head 235.
[1483319.788809] i40e 05:00.1 AQTX: desc and buffer:
[1483319.788810] i40e 05:00.1 AQ CMD: opcode 0x0207, flags 0x2000, datalen 0x0000, retval 0x0000
[1483319.788811] i40e 05:00.1 cookie (h,l) 0x00000000 0x00000000
[1483319.788812] i40e 05:00.1 param (0,1) 0x00000000 0x002031E8
[1483319.788813] i40e 05:00.1 addr (h,l) 0x00000000 0x07FF07FF
[1483319.788867] i40e 05:00.1 AQTX: desc and buffer writeback:
[1483319.788868] i40e 05:00.1 AQ CMD: opcode 0x0207, flags 0x2003, datalen 0x0000, retval 0x0000
[1483319.788869] i40e 05:00.1 cookie (h,l) 0x00000000 0x00000000
[1483319.788871] i40e 05:00.1 param (0,1) 0x00000000 0x002031E8
[1483319.788872] i40e 05:00.1 addr (h,l) 0x00000000 0x07FF07FF
[1483319.802767] vfio-pci 0000:05:10.1: enabling device (0000 -> 0002)
[1483319.826114] i40e 05:00.1 AQRX: desc and buffer:
[1483319.826115] i40e 05:00.1 AQ CMD: opcode 0x0801, flags 0x1009, datalen 0x0008, retval 0x0070
[1483319.826116] i40e 05:00.1 cookie (h,l) 0x00000001 0x00000000
[1483319.826117] i40e 05:00.1 param (0,1) 0x00000000 0x00000000
[1483319.826118] i40e 05:00.1 addr (h,l) 0x00000000 0xFFBBB000
[1483319.826119] i40e 05:00.1 AQ CMD Buffer:
[1483319.826121] i40e 05:00.1: 0x00000000: 01 00 00 00 01 00 00 00
[1483319.826126] i40e 05:00.1 ntc 235 head 236.
[1483319.826127] i40e 05:00.1 AQTX: desc and buffer:
[1483319.826129] i40e 05:00.1 AQ CMD: opcode 0x0802, flags 0x3400, datalen 0x0008, retval 0x0000
[1483319.826130] i40e 05:00.1 cookie (h,l) 0x00000001 0x00000000
[1483319.826131] i40e 05:00.1 param (0,1) 0x00000070 0x00000000
[1483319.826132] i40e 05:00.1 addr (h,l) 0x00000000 0xFFCFA000
[1483319.826132] i40e 05:00.1 AQ CMD Buffer:
[1483319.826134] i40e 05:00.1: 0x00000000: 01 00 00 00 01 00 00 00
[1483319.826187] i40e 05:00.1 AQTX: Command completed with error 0xC.
[1483319.826188] i40e 05:00.1 AQTX: desc and buffer writeback:
[1483319.826189] i40e 05:00.1 AQ CMD: opcode 0x0802, flags 0x3407, datalen 0x0008, retval 0x000C
[1483319.826190] i40e 05:00.1 cookie (h,l) 0x00000001 0x00000000
[1483319.826191] i40e 05:00.1 param (0,1) 0x00000070 0x00000000
[1483319.826192] i40e 05:00.1 addr (h,l) 0x00000000 0xFFCFA000
[1483319.826193] i40e 05:00.1 AQ CMD Buffer:
[1483319.826194] i40e 05:00.1: 0x00000000: 01 00 00 00 01 00 00 00
[1483319.826198] i40e 0000:05:00.1: Unable to send the message to VF 48 aq_err 12

Applications using this code path (such as DPDK when it tries to verify the PF/VF API version) fail when this error happens; to the application, not receiving data suggests a real error, when it is in fact a race condition.

A solution could be to retry the command or to use a locking mechanism.

Related DPDK bug: https://bugs.dpdk.org/show_bug.cgi?id=578

Tested with kernel 4.15.0-55-generic

NIC:
Intel Ethernet Converged Network Adapter XL710-Q2
cat /sys/class/net/enp5s0f0/device/device 
0x1583
cat /sys/class/net/enp5s0f0/device/vendor 
0x8086
cat /sys/class/net/enp5s0f0/device/subsystem_device 
0x0001
cat /sys/class/net/enp5s0f0/device/subsystem_vendor 
0x8086

Firmware version:
driver: i40e
version: 2.1.14-k
firmware-version: 6.01 0x800035da 1.1747.0
expansion-rom-version: 
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

Also tested with latest i40e driver (2.13.10) where the issue is still reproducible.

Tested on x86_64 and aarch64 hardware, both reproducible.
Comment 1 Juraj Linkeš 2021-03-18 13:27:08 UTC
Retested with the following versions:
VPP 2009: 20.09.0-62~g4f88004bc, uses DPDK 20.08
driver: i40e
version: 2.12.6
firmware-version: 7.30 0x80008387 1.2684.0
kernel 4.15.0-55-generic

The versions were chosen according to https://doc.dpdk.org/guides/nics/i40e.html.

The issue reproduces in exactly the same way; the logs are identical.
Comment 2 Juraj Linkeš 2021-05-13 13:10:57 UTC
Retested again with newer driver/FW/dpdk versions:
VPP version: 21.06-rc0~693-gc44402f7b
DPDK version that VPP uses: 21.02
FW/i40e version:
sudo ethtool -i enp134s0f0
driver: i40e
version: 2.14.13
firmware-version: 8.15 0x800095e3 1.2829.0
expansion-rom-version: 
bus-info: 0000:86:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

modinfo i40e
filename:       /lib/modules/4.15.0-23-generic/updates/drivers/net/ethernet/intel/i40e/i40e.ko
version:        2.14.13
license:        GPL
description:    Intel(R) 40-10 Gigabit Ethernet Connection Network Driver
author:         Intel Corporation, <e1000-devel@lists.sourceforge.net>
srcversion:     1A821F3D488396B967F338E
alias:          pci:v00008086d0000158Bsv*sd*bc*sc*i*
alias:          pci:v00008086d0000158Asv*sd*bc*sc*i*
alias:          pci:v00008086d000037D3sv*sd*bc*sc*i*
alias:          pci:v00008086d000037D2sv*sd*bc*sc*i*
alias:          pci:v00008086d000037D1sv*sd*bc*sc*i*
alias:          pci:v00008086d000037D0sv*sd*bc*sc*i*
alias:          pci:v00008086d000037CFsv*sd*bc*sc*i*
alias:          pci:v00008086d000037CEsv*sd*bc*sc*i*
alias:          pci:v00008086d00000D58sv*sd*bc*sc*i*
alias:          pci:v00008086d00000CF8sv*sd*bc*sc*i*
alias:          pci:v00008086d00001588sv*sd*bc*sc*i*
alias:          pci:v00008086d00001587sv*sd*bc*sc*i*
alias:          pci:v00008086d0000104Fsv*sd*bc*sc*i*
alias:          pci:v00008086d0000104Esv*sd*bc*sc*i*
alias:          pci:v00008086d000015FFsv*sd*bc*sc*i*
alias:          pci:v00008086d00001589sv*sd*bc*sc*i*
alias:          pci:v00008086d00001586sv*sd*bc*sc*i*
alias:          pci:v00008086d0000101Fsv*sd*bc*sc*i*
alias:          pci:v00008086d00001585sv*sd*bc*sc*i*
alias:          pci:v00008086d00001584sv*sd*bc*sc*i*
alias:          pci:v00008086d00001583sv*sd*bc*sc*i*
alias:          pci:v00008086d00001581sv*sd*bc*sc*i*
alias:          pci:v00008086d00001580sv*sd*bc*sc*i*
alias:          pci:v00008086d00001574sv*sd*bc*sc*i*
alias:          pci:v00008086d00001572sv*sd*bc*sc*i*
depends:        ptp
retpoline:      Y
name:           i40e
vermagic:       4.15.0-23-generic SMP mod_unload 
parm:           debug:Debug level (0=none,...,16=all) (int)
parm:           l4mode:L4 cloud filter mode: 0=UDP,1=TCP,2=Both,-1=Disabled(default) (int)

Kernel version:
uname -r
4.15.0-23-generic

Even with these newer versions, we're still seeing the same issues:
[ 1032.054512] i40e 0000:86:00.1: Unable to send the message to VF 4 aq_err 12
[ 1432.992673] i40e 0000:86:00.0: Unable to send the message to VF 33 aq_err 12
[ 1439.008590] i40e 0000:86:00.0: Unable to send the message to VF 48 aq_err 12
[ 1440.300862] i40e 0000:86:00.0: Unable to send the message to VF 51 aq_err 12
[ 1452.288202] i40e 0000:86:00.1: Unable to send the message to VF 10 aq_err 12
[ 1455.877058] i40e 0000:86:00.1: Unable to send the message to VF 19 aq_err 12
[ 1463.479826] i40e 0000:86:00.1: Unable to send the message to VF 38 aq_err 12
[ 1467.035731] i40e 0000:86:00.1: Unable to send the message to VF 47 aq_err 12
[ 1479.123392] i40e 0000:af:00.0: Unable to send the message to VF 6 aq_err 12
[ 1488.019159] i40e 0000:af:00.0: Unable to send the message to VF 28 aq_err 12
[ 1490.690969] i40e 0000:af:00.0: Unable to send the message to VF 35 aq_err 12
[ 1494.127675] i40e 0000:af:00.0: Unable to send the message to VF 43 aq_err 12
[ 1530.005928] i40e 0000:af:00.1: Unable to send the message to VF 61 aq_err 12
[ 1700.702161] i40e 0000:86:00.1: Unable to send the message to VF 0 aq_err 12
[ 1700.702220] i40e 0000:86:00.1: Unable to send the message to VF 2 aq_err 12
[ 1701.594627] i40e 0000:86:00.1: Unable to send the message to VF 1 aq_err 12
[ 1701.847322] i40e 0000:86:00.1: Unable to send the message to VF 3 aq_err 12
[ 1701.847378] i40e 0000:86:00.1: Unable to send the message to VF 4 aq_err 12
[ 1702.891186] i40e 0000:86:00.1: Unable to send the message to VF 5 aq_err 12

These manifest in VPP/DPDK this way:
2021/05/13 11:12:54:128 notice     dpdk           EAL init args: -c 200 -n 4 --log-level debug --in-memory --no-telemetry --file-prefix vpp -a 0000:86:0a.6 -a 0000:86:0a.7 --main-lcore 9 
2021/05/13 11:12:57:094 warn       dpdk/cryptodev dpdk_cryptodev_init: Failed to configure cryptodev
2021/05/13 11:12:57:101 notice     dpdk           EAL: Detected 112 lcore(s)
2021/05/13 11:12:57:101 notice     dpdk           EAL: Detected 2 NUMA nodes
2021/05/13 11:12:57:101 notice     dpdk           EAL: Detected static linkage of DPDK
2021/05/13 11:12:57:101 notice     dpdk           EAL: Selected IOVA mode 'PA'
2021/05/13 11:12:57:101 notice     dpdk           EAL: No available 1048576 kB hugepages reported
2021/05/13 11:12:57:101 notice     dpdk           EAL: No free 1048576 kB hugepages reported on node 0
2021/05/13 11:12:57:101 notice     dpdk           EAL: No free 1048576 kB hugepages reported on node 1
2021/05/13 11:12:57:101 notice     dpdk           EAL: No available 1048576 kB hugepages reported
2021/05/13 11:12:57:101 notice     dpdk           EAL: Probing VFIO support...
2021/05/13 11:12:57:101 notice     dpdk           EAL: VFIO support initialized
2021/05/13 11:12:57:101 notice     dpdk           EAL: Probe PCI driver: net_i40e_vf (8086:154c) device: 0000:86:0a.6 (socket 1)
2021/05/13 11:12:57:101 notice     dpdk           i40evf_dev_init(): Init vf failed
2021/05/13 11:12:57:101 notice     dpdk           EAL: Releasing pci mapped resource for 0000:86:0a.6
2021/05/13 11:12:57:101 notice     dpdk           EAL: Calling pci_unmap_resource for 0000:86:0a.6 at 0x7f0400000000
2021/05/13 11:12:57:101 notice     dpdk           EAL: Calling pci_unmap_resource for 0000:86:0a.6 at 0x7f0400010000
2021/05/13 11:12:57:101 notice     dpdk           EAL: Requested device 0000:86:0a.6 cannot be used
2021/05/13 11:12:57:101 notice     dpdk           EAL:   using IOMMU type 8 (No-IOMMU)
2021/05/13 11:12:57:101 notice     dpdk           EAL: Probe PCI driver: net_i40e_vf (8086:154c) device: 0000:86:0a.7 (socket 1)

Here, DPDK can't use 0000:86:0a.6 and fails in i40evf_dev_init, but it can use 0000:86:0a.7 right after that.

Another log:
2021/05/13 11:12:56:933 notice     dpdk           EAL: Detected 112 lcore(s)
2021/05/13 11:12:56:933 notice     dpdk           EAL: Detected 2 NUMA nodes
2021/05/13 11:12:56:933 notice     dpdk           EAL: Detected static linkage of DPDK
2021/05/13 11:12:56:933 notice     dpdk           EAL: Selected IOVA mode 'PA'
2021/05/13 11:12:56:933 notice     dpdk           EAL: No available 1048576 kB hugepages reported
2021/05/13 11:12:56:933 notice     dpdk           EAL: No free 1048576 kB hugepages reported on node 0
2021/05/13 11:12:56:933 notice     dpdk           EAL: No free 1048576 kB hugepages reported on node 1
2021/05/13 11:12:56:933 notice     dpdk           EAL: No available 1048576 kB hugepages reported
2021/05/13 11:12:56:933 notice     dpdk           EAL: Probing VFIO support...
2021/05/13 11:12:56:933 notice     dpdk           EAL: VFIO support initialized
2021/05/13 11:12:56:933 notice     dpdk           EAL:   using IOMMU type 8 (No-IOMMU)
2021/05/13 11:12:56:933 notice     dpdk           EAL: Probe PCI driver: net_i40e_vf (8086:154c) device: 0000:86:0a.4 (socket 1)
2021/05/13 11:12:56:933 notice     dpdk           i40evf_check_api_version(): PF/VF API version mismatch:(0.0)-(1.1)
2021/05/13 11:12:56:933 notice     dpdk           i40evf_init_vf(): check_api version failed
2021/05/13 11:12:56:933 notice     dpdk           i40evf_dev_init(): Init vf failed
2021/05/13 11:12:56:933 notice     dpdk           EAL: Releasing pci mapped resource for 0000:86:0a.4
2021/05/13 11:12:56:933 notice     dpdk           EAL: Calling pci_unmap_resource for 0000:86:0a.4 at 0x7f5380000000
2021/05/13 11:12:56:933 notice     dpdk           EAL: Calling pci_unmap_resource for 0000:86:0a.4 at 0x7f5380010000
2021/05/13 11:12:56:933 notice     dpdk           EAL: Requested device 0000:86:0a.4 cannot be used
2021/05/13 11:12:56:933 notice     dpdk           EAL:   using IOMMU type 8 (No-IOMMU)
2021/05/13 11:12:56:933 notice     dpdk           EAL: Probe PCI driver: net_i40e_vf (8086:154c) device: 0000:86:0a.5 (socket 1)
2021/05/13 11:12:56:933 notice     dpdk           i40evf_check_api_version(): PF/VF API version mismatch:(0.0)-(1.1)
2021/05/13 11:12:56:933 notice     dpdk           i40evf_init_vf(): check_api version failed
2021/05/13 11:12:56:933 notice     dpdk           i40evf_dev_init(): Init vf failed
2021/05/13 11:12:56:933 notice     dpdk           EAL: Releasing pci mapped resource for 0000:86:0a.5
2021/05/13 11:12:56:933 notice     dpdk           EAL: Calling pci_unmap_resource for 0000:86:0a.5 at 0x7f5380014000
2021/05/13 11:12:56:933 notice     dpdk           EAL: Calling pci_unmap_resource for 0000:86:0a.5 at 0x7f5380024000
2021/05/13 11:12:56:933 notice     dpdk           --- message(s) throttled ---

Here VPP/DPDK can use neither 0000:86:0a.4 nor 0000:86:0a.5 and fails with a slightly different log message, hinting that it hit the race condition in a different place.
Comment 3 Slawomir Laba 2021-05-17 12:48:08 UTC
Hi Juraj,

could you please share the minimal reproduction steps?

- version of DPDK used
- how many VFs are spawned
- do I have to unbind all the VFs from the iavf driver and bind them to vfio-pci?
- are the VFs passed through to a VM, or is reproduction possible on the host only?
- which script should I run from DPDK, along with the parameters that trigger the unwanted messages?
- any other useful info

Thank you,
Slawomir Laba
Comment 4 Juraj Linkeš 2021-05-18 07:05:29 UTC
Hi Slawomir,

We are using Docker containers to run multiple instances of VPP/DPDK, since that is the setup used by our CI, which is where we hit this issue. We haven't tried this with VMs.

The latest DPDK version we were able to reproduce this with is 21.02.
I'm not sure what you mean by spawned; we're creating VFs via sriov_numvfs (our testing NIC is an XXV710, so that would be 64 VFs per PF port). Each VPP/DPDK instance uses two VFs, and we need at least two separate instances to reproduce this.

We're mainly reproducing this with VPP (which is using DPDK), but I've also been able to reproduce this with just DPDK:
docker exec -d ${CONTAINER} sh -c "/dpdk/build/app/dpdk-testpmd -n 4 --log-level debug --in-memory --no-telemetry --file-prefix vpp -a \"\$(cut -d ' ' -f 1 nic_pcis.txt)\" -a \"\$(cut -d ' ' -f 2 nic_pcis.txt)\" --main-lcore ${DPDK_CORE} > ${DPDK_LOG_FILE}"

When using DPDK, the VFs need to be bound to vfio-pci; when using VPP, it binds them to vfio-pci automatically.

Make sure to use multiple separate instances of VPP/DPDK that are started at the same time so that you are able to hit the race condition. One instance with two VFs never reproduces this (and more VFs in a single instance won't help IMO, since it seems the initialization is sequential); there must be simultaneous init attempts from separate instances.

Juraj
Comment 5 Juraj Linkeš 2021-06-04 08:32:58 UTC
The i40evf PMD used by DPDK is old and hasn't been updated much. In its stead, the iavf PMD has been developed and is actively maintained. The issue is not reproducible with the iavf PMD. Starting with 21.05, DPDK switched the default PMD for these VFs to iavf, thus resolving the issue.
