Bug 219453
Summary: | BCM57416 error in bnxt_re | | |
---|---|---|---|
Product: | Drivers | Reporter: | Pengyu Ma (mapengyu) |
Component: | Infiniband/RDMA | Assignee: | drivers_infiniband-rdma |
Status: | NEW | | |
Severity: | normal | CC: | kalesh-anakkur.purayil, leon, linux-rdma, mapengyu, selvin.xavier, selvin.xavier |
Priority: | P3 | | |
Hardware: | All | | |
OS: | Linux | | |
Kernel Version: | | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | bnxt_re cmd error; 6.11 kernel CallTrace | | |
Created attachment 307114 [details]
6.11 kernel CallTrace
Nov 01 00:07:41 ubuntu kernel: bnxt_en 0000:82:00.0: QPLIB: bnxt_re_is_fw_stalled: FW STALL Detected. cmdq[0xe]=0x3 waited (41254 > 40000) msec a>
Nov 01 00:07:41 ubuntu kernel: bnxt_en 0000:82:00.0 bnxt_re0: Failed to modify HW QP
Nov 01 00:07:41 ubuntu kernel: infiniband bnxt_re0: Couldn't change QP1 state to INIT: -110
Nov 01 00:07:41 ubuntu kernel: infiniband bnxt_re0: Couldn't start port
Nov 01 00:07:41 ubuntu kernel: bnxt_en 0000:82:00.0 bnxt_re0: Failed to destroy HW QP
Nov 01 00:07:41 ubuntu kernel: ------------[ cut here ]------------
Nov 01 00:07:41 ubuntu kernel: WARNING: CPU: 36 PID: 2191 at drivers/infiniband/core/cq.c:322 ib_free_cq+0x10b/0x160 [ib_core]
Nov 01 00:07:41 ubuntu kernel: Modules linked in: rfcomm cmac algif_hash algif_skcipher af_alg nvme_fabrics bnep intel_rapl_msr snd_hda_codec_rea>
Nov 01 00:07:41 ubuntu kernel: usb_storage hid amdxcp drm_exec gpu_sched drm_buddy video i2c_algo_bit drm_suballoc_helper drm_ttm_helper ttm ixg>
Nov 01 00:07:41 ubuntu kernel: CPU: 36 UID: 0 PID: 2191 Comm: systemd-udevd Not tainted 6.11.0-061100-generic #202409151536
Nov 01 00:07:41 ubuntu kernel: Hardware name: LENOVO ThinkStation P8/105E, BIOS S0GKT21A 07/26/2024
Nov 01 00:07:41 ubuntu kernel: RIP: 0010:ib_free_cq+0x10b/0x160 [ib_core]
Nov 01 00:07:41 ubuntu kernel: Code: ba 02 00 65 ff 0d ed 7c ee 3d 0f 85 70 ff ff ff 0f 1f 44 00 00 e9 66 ff ff ff 83 f8 03 0f 84 3a ff ff ff 0f >
Nov 01 00:07:41 ubuntu kernel: RSP: 0018:ff4eb0d48905b848 EFLAGS: 00010202
Nov 01 00:07:41 ubuntu kernel: RAX: 0000000000000002 RBX: ff303a857a600000 RCX: 0000000000000000
Nov 01 00:07:41 ubuntu kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff303a84e4f2f000
Nov 01 00:07:41 ubuntu kernel: RBP: ff4eb0d48905b8a8 R08: 0000000000000000 R09: 0000000000000000
Nov 01 00:07:41 ubuntu kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffff92
Nov 01 00:07:41 ubuntu kernel: R13: 00000000000000b0 R14: ff303a84f48be000 R15: ff303a84f48be8f8
Nov 01 00:07:41 ubuntu kernel: FS: 000077dc7425d8c0(0000) GS:ff303a87e9200000(0000) knlGS:0000000000000000
Nov 01 00:07:41 ubuntu kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 01 00:07:41 ubuntu kernel: CR2: 0000738e6caeadd0 CR3: 00000001264da004 CR4: 0000000000f71ef0
Nov 01 00:07:41 ubuntu kernel: PKRU: 55555554
Nov 01 00:07:41 ubuntu kernel: Call Trace:
Nov 01 00:07:41 ubuntu kernel: <TASK>
Nov 01 00:07:41 ubuntu kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Nov 01 00:07:41 ubuntu kernel: ? show_trace_log_lvl+0x273/0x310
Nov 01 00:07:41 ubuntu kernel: ? show_trace_log_lvl+0x273/0x310
Nov 01 00:07:41 ubuntu kernel: ? ib_mad_init_device+0x5c/0xd0 [ib_core]
Nov 01 00:07:41 ubuntu kernel: ? show_regs.part.0+0x22/0x30
Nov 01 00:07:41 ubuntu kernel: ? show_regs.cold+0x8/0x10
Nov 01 00:07:41 ubuntu kernel: ? ib_free_cq+0x10b/0x160 [ib_core]
Nov 01 00:07:41 ubuntu kernel: ? __warn.cold+0xa7/0x101
Nov 01 00:07:41 ubuntu kernel: ? ib_free_cq+0x10b/0x160 [ib_core]
Nov 01 00:07:41 ubuntu kernel: ? report_bug+0x114/0x160
Nov 01 00:07:41 ubuntu kernel: ? handle_bug+0x51/0xa0
Nov 01 00:07:41 ubuntu kernel: ? exc_invalid_op+0x18/0x80
Nov 01 00:07:41 ubuntu kernel: ? asm_exc_invalid_op+0x1b/0x20
Nov 01 00:07:41 ubuntu kernel: ? ib_free_cq+0x10b/0x160 [ib_core]
Nov 01 00:07:41 ubuntu kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Nov 01 00:07:41 ubuntu kernel: ? ib_mad_port_open+0x393/0x450 [ib_core]
Nov 01 00:07:41 ubuntu kernel: ib_mad_init_device+0x5c/0xd0 [ib_core]
Nov 01 00:07:41 ubuntu kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Nov 01 00:07:41 ubuntu kernel: add_client_context+0x115/0x1c0 [ib_core]
Nov 01 00:07:41 ubuntu kernel: enable_device_and_get+0xe8/0x1e0 [ib_core]
Nov 01 00:07:41 ubuntu kernel: ib_register_device+0xf4/0x180 [ib_core]
Nov 01 00:07:41 ubuntu kernel: bnxt_re_ib_init+0x141/0x160 [bnxt_re]
Nov 01 00:07:41 ubuntu kernel: bnxt_re_probe+0x14c/0x1b0 [bnxt_re]
Nov 01 00:07:41 ubuntu kernel: ? __pfx_bnxt_re_probe+0x10/0x10 [bnxt_re]
Nov 01 00:07:41 ubuntu kernel: auxiliary_bus_probe+0x49/0x90
Nov 01 00:07:41 ubuntu kernel: ? driver_sysfs_add+0x66/0xd0
Nov 01 00:07:41 ubuntu kernel: really_probe+0xf6/0x370
Nov 01 00:07:41 ubuntu kernel: ? pm_runtime_barrier+0x55/0xa0
Nov 01 00:07:41 ubuntu kernel: __driver_probe_device+0x8c/0x140
Nov 01 00:07:41 ubuntu kernel: driver_probe_device+0x24/0xd0
Nov 01 00:07:41 ubuntu kernel: __driver_attach+0xe4/0x210
Nov 01 00:07:41 ubuntu kernel: ? __pfx___driver_attach+0x10/0x10
Nov 01 00:07:41 ubuntu kernel: bus_for_each_dev+0x8c/0xf0
Nov 01 00:07:41 ubuntu kernel: driver_attach+0x1e/0x30
Nov 01 00:07:41 ubuntu kernel: bus_add_driver+0x14e/0x240
Nov 01 00:07:41 ubuntu kernel: driver_register+0x73/0xf0
Nov 01 00:07:41 ubuntu kernel: __auxiliary_driver_register+0x73/0xf0
Nov 01 00:07:41 ubuntu kernel: ? __pfx_bnxt_re_mod_init+0x10/0x10 [bnxt_re]
Nov 01 00:07:41 ubuntu kernel: bnxt_re_mod_init+0x3e/0xff0 [bnxt_re]
Nov 01 00:07:41 ubuntu kernel: ? __pfx_bnxt_re_mod_init+0x10/0x10 [bnxt_re]
Nov 01 00:07:41 ubuntu kernel: do_one_initcall+0x5b/0x330
Nov 01 00:07:41 ubuntu kernel: do_init_module+0x97/0x280
Nov 01 00:07:41 ubuntu kernel: load_module+0x64d/0x750
Nov 01 00:07:41 ubuntu kernel: __do_sys_init_module+0x19e/0x1d0
Nov 01 00:07:41 ubuntu kernel: __x64_sys_init_module+0x1a/0x30
Nov 01 00:07:41 ubuntu kernel: x64_sys_call+0x1586/0x22b0
Nov 01 00:07:41 ubuntu kernel: do_syscall_64+0x7e/0x170
Nov 01 00:07:41 ubuntu kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Nov 01 00:07:41 ubuntu kernel: ? vfs_read+0x2a0/0x380
Nov 01 00:07:41 ubuntu kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Nov 01 00:07:41 ubuntu kernel: ? ksys_read+0x71/0x100
Nov 01 00:07:41 ubuntu kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Nov 01 00:07:41 ubuntu kernel: ? syscall_exit_to_user_mode+0x4e/0x250
Nov 01 00:07:41 ubuntu kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Nov 01 00:07:41 ubuntu kernel: ? do_syscall_64+0x8a/0x170
Nov 01 00:07:41 ubuntu kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Nov 01 00:07:41 ubuntu kernel: ? exc_page_fault+0x96/0x1c0
Nov 01 00:07:41 ubuntu kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
Nov 01 00:07:41 ubuntu kernel: RIP: 0033:0x77dc74126bde
Nov 01 00:07:41 ubuntu kernel: Code: 48 8b 0d 55 32 0f 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 >
Nov 01 00:07:41 ubuntu kernel: RSP: 002b:00007ffc48e754d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
Nov 01 00:07:41 ubuntu kernel: RAX: ffffffffffffffda RBX: 0000576973b18670 RCX: 000077dc74126bde
Nov 01 00:07:41 ubuntu kernel: RDX: 000077dc74482441 RSI: 0000000000064439 RDI: 0000576973bac3f0
Nov 01 00:07:41 ubuntu kernel: RBP: 0000576973bac3f0 R08: 27d4eb2f165667c5 R09: 85ebca77c2b2ae63
Nov 01 00:07:41 ubuntu kernel: R10: 0000000000002e81 R11: 0000000000000246 R12: 000077dc74482441
Nov 01 00:07:41 ubuntu kernel: R13: 0000576973b18e90 R14: 0000576973b27380 R15: 0000576973b06300
Nov 01 00:07:41 ubuntu kernel: </TASK>
Nov 01 00:07:41 ubuntu kernel: ---[ end trace 0000000000000000 ]---
Nov 01 00:07:41 ubuntu kernel: bnxt_en 0000:82:00.0 bnxt_re0: Free MW failed: 0xffffff92
Nov 01 00:07:41 ubuntu kernel: infiniband bnxt_re0: Couldn't open port 1
Nov 01 00:07:41 ubuntu kernel: infiniband bnxt_re0: Device registered with IB successfully
Nov 01 00:08:22 ubuntu kernel: bnxt_en 0000:82:00.1: QPLIB: bnxt_re_is_fw_stalled: FW STALL Detected. cmdq[0xe]=0x3 waited (40842 > 40000) msec a>
Nov 01 00:08:22 ubuntu kernel: bnxt_en 0000:82:00.1 bnxt_re1: Failed to modify HW QP
Nov 01 00:08:22 ubuntu kernel: infiniband bnxt_re1: Couldn't change QP1 state to INIT: -110
Nov 01 00:08:22 ubuntu kernel: infiniband bnxt_re1: Couldn't start port
Nov 01 00:08:22 ubuntu kernel: bnxt_en 0000:82:00.1 bnxt_re1: Failed to destroy HW QP
Nov 01 00:08:22 ubuntu kernel: bnxt_en 0000:82:00.1 bnxt_re1: Free MW failed: 0xffffff92
Nov 01 00:08:22 ubuntu kernel: infiniband bnxt_re1: Couldn't open port 1
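The log above reports the failure both as `-110` ("Couldn't change QP1 state to INIT") and as `0xffffff92` ("Free MW failed"). These look like the same value printed as a signed integer and as an unsigned 32-bit hex number, and 110 is ETIMEDOUT on Linux, which would be consistent with the firmware stall timeout. A minimal sketch that checks this arithmetic; the helper names and the small errno table are illustrative and not taken from the driver:

```python
# Linux errno numbers for a few common codes. A literal table is used
# (rather than the errno module) so the result does not depend on the
# platform running this script.
LINUX_ERRNO = {5: "EIO", 12: "ENOMEM", 16: "EBUSY", 22: "EINVAL", 110: "ETIMEDOUT"}


def to_signed32(value: int) -> int:
    """Interpret an integer as a two's-complement signed 32-bit value."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def describe_errno(value: int) -> str:
    """Return the signed value and, if known, the matching errno name."""
    signed = to_signed32(value)
    name = LINUX_ERRNO.get(-signed, "unknown")
    return f"{signed} ({name})"


print(describe_errno(-110))        # value from "Couldn't change QP1 state to INIT: -110"
print(describe_errno(0xFFFFFF92))  # value from "Free MW failed: 0xffffff92"
```

Both calls print `-110 (ETIMEDOUT)`, so the two messages appear to report the same timeout error.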
Hi Pengyu Ma,

Thank you for the report. We will take a look at this.

Could you provide the following outputs:
1. lspci -s 0000:82:00.0 -vvv
2. lspci -s 0000:82:00.0 -xxx
3. ethtool -i ethX : to know the firmware version you are using.

Regards,
Kalesh

$ sudo lspci -s 0000:82:00.0 -vvv
82:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)
    Subsystem: Lenovo BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 86
    IOMMU group: 24
    Region 0: Memory at 4266c010000 (64-bit, prefetchable) [size=64K]
    Region 2: Memory at 4266bf00000 (64-bit, prefetchable) [size=1M]
    Region 4: Memory at 4266c022000 (64-bit, prefetchable) [size=8K]
    Expansion ROM at d0e80000 [disabled] [size=512K]
    Capabilities: [48] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
    Capabilities: [50] Vital Product Data
        Product Name: Broadcom NX-E PCIe 10Gb 2-Port Base-T Ethernet Adapter
        Read-only fields:
            [PN] Part number: SN30L27797
            [SN] Serial number: L0FG27N01VM
            [FN] FRU: 00YK535
            [V0] Vendor specific: 214.0.286.17
            [RV] Reserved: checksum good, 83 byte(s) reserved
        End
    Capabilities: [58] MSI: Enable- Count=1/8 Maskable- 64bit+
        Address: 0000000000000000 Data: 0000
    Capabilities: [a0] MSI-X: Enable+ Count=74 Masked-
        Vector table: BAR=4 offset=00000000
        PBA: BAR=4 offset=000004a0
    Capabilities: [ac] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
        DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr+ NoSnoop+ FLReset-
            MaxPayload 512 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported
            ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 8GT/s (ok), Width x8 (ok)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
            10BitTagComp- 10BitTagReq- OBFF Via WAKE#, ExtFmt- EETLPPrefix-
            EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
            FRS- TPHComp- ExtTPHComp-
            AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Via WAKE#,
            AtomicOpsCtl: ReqEn+
        LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
        LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
            EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
            Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [100 v1] Advanced Error Reporting
        UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
        UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO+ CmpltAbrt- UnxCmplt+ RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
        CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
        AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn+ ECRCChkCap+ ECRCChkEn+
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 04008001 8000220f 82020000 00000000
    Capabilities: [13c v1] Device Serial Number 00-62-0b-ff-fe-26-11-70
    Capabilities: [150 v1] Power Budgeting <?>
    Capabilities: [160 v1] Virtual Channel
        Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
        Arb: Fixed- WRR32- WRR64- WRR128-
        Ctrl: ArbSelect=Fixed
        Status: InProgress-
        VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
            Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
            Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
            Status: NegoPending- InProgress-
    Capabilities: [180 v1] Vendor Specific Information: ID=0000 Rev=0 Len=020 <?>
    Capabilities: [1b0 v1] Latency Tolerance Reporting
        Max snoop latency: 0ns
        Max no snoop latency: 0ns
    Capabilities: [1b8 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap: MFVC- ACS-, Next Function: 1
        ARICtl: MFVC- ACS-, Function Group: 0
    Capabilities: [230 v1] Transaction Processing Hints
        Interrupt vector mode supported
        Device specific mode supported
        Steering table in MSI-X table
    Capabilities: [300 v1] Secondary PCI Express
        LnkCtl3: LnkEquIntrruptEn- PerformEqu-
        LaneErrStat: 0
    Capabilities: [200 v1] Precision Time Measurement
        PTMCap: Requester:+ Responder:- Root:-
        PTMClockGranularity: Unimplemented
        PTMControl: Enabled:- RootSelected:-
        PTMEffectiveGranularity: Unknown
    Kernel driver in use: bnxt_en
    Kernel modules: bnxt_en

$ sudo lspci -s 0000:82:00.0 -xxx
82:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)
00: e4 14 d8 16 06 04 10 00 01 00 00 02 00 00 80 00
10: 0c 00 01 6c 26 04 00 00 0c 00 f0 6b 26 04 00 00
20: 0c 20 02 6c 26 04 00 00 00 00 00 00 aa 17 60 41
30: 00 00 e8 d0 48 00 00 00 00 00 00 00 ff 01 00 00
40: 00 00 00 00 00 00 00 00 01 50 03 c8 08 20 00 64
50: 03 58 c4 80 00 00 00 78 05 a0 86 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 11 ac 49 80 04 00 00 00 a4 04 00 00 10 00 02 00
b0: a2 8d 2c 11 57 2d 19 00 83 c0 44 00 40 00 83 10
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 1f 08 08 00 40 60 00 00 0e 00 00 00 01 00 1f 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

$ sudo ethtool -i enp130s0f0np0
driver: bnxt_en
version: 6.11.0-061100-generic
firmware-version: 214.4.9.9/pkg 214.0.286.17
expansion-rom-version:
bus-info: 0000:82:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

Hi Pengyu Ma,

Thank you for providing the additional information.

This is a BCM57416 device. I see that the FW version is quite old, and this is an expected failure after commit a9a457f338e7 ("RDMA/bnxt_re: Update HW interface headers"). The Broadcom RoCE solution was not GAed for this FW version; to use RoCE in production, you will have to update to the latest FW.

Are you okay to update the FW to the latest version? You can find the firmware here:
https://www.broadcom.com/support/download-search?pg=Ethernet+Connectivity,+Switching,+and+PHYs&pf=Ethernet+Connectivity,+Switching,+and+PHYs&pn=&pa=&po=&dk=&pl=&l=false

Regards,
Kalesh

@Kalesh A P,
Thanks for your great support.
I will try to do it.
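Because the diagnosis above hinges on the firmware version reported by `ethtool -i`, the following is a rough sketch of how that output could be parsed before and after a firmware update. The function names are hypothetical, and the assumption that the `firmware-version` field has the form `<fw>/pkg <package>` is based only on the single sample in this report; no minimum supported version is stated in the thread, so none is encoded here.

```python
def parse_ethtool_info(text: str) -> dict:
    """Parse `ethtool -i` output into a dict of field name -> value."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as the PCI
        # address "0000:82:00.0" are kept intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def split_firmware_version(field: str):
    """Split "<fw>/pkg <package>" into (fw, package); package may be None."""
    fw, _, pkg = field.partition("/pkg ")
    return fw.strip(), (pkg.strip() or None)


def version_tuple(version: str) -> tuple:
    """Turn a dotted version such as "214.0.286.17" into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


# Sample taken from the ethtool output in this report.
SAMPLE = """driver: bnxt_en
version: 6.11.0-061100-generic
firmware-version: 214.4.9.9/pkg 214.0.286.17
expansion-rom-version:
bus-info: 0000:82:00.0
"""

info = parse_ethtool_info(SAMPLE)
fw, pkg = split_firmware_version(info["firmware-version"])
print(fw, pkg, version_tuple(pkg))
```

Tuples compare element by element, so the parsed package version can be compared against whatever minimum version the vendor documentation specifies.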
Created attachment 307113 [details]
bnxt_re cmd error

Since the 6.3 kernel included the bnxt_re driver for bnxt_en, the kernel shows a call trace at boot.

Tested on 6.12-rc5 today: some errors remain, but the call trace is gone.

[ 5.531490] bnxt_en 0000:82:00.0: QPLIB: cmdq[0x8]=0xf status 0x3
[ 5.531497] bnxt_en 0000:82:00.0 bnxt_re0: Failed to register fence-MR
[ 5.531615] bnxt_en 0000:82:00.0 bnxt_re0: Failed to create Fence-MR
[ 5.531694] bnxt_en 0000:82:00.0: QPLIB: cmdq[0xa]=0x9 status 0x3
[ 5.531702] bnxt_en 0000:82:00.0 bnxt_re0: Failed to create HW CQ
[ 5.531706] infiniband bnxt_re0: Couldn't create ib_mad CQ
[ 5.531709] infiniband bnxt_re0: Couldn't open port 1
[ 5.531872] infiniband bnxt_re0: Device registered with IB successfully

It could be related to firmware. Log attached.
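For comparing logs across kernel versions, the `QPLIB: cmdq[...]` lines in this report can be pulled out mechanically. The sketch below is illustrative only: the field names `cookie` and `opcode`, and the opcode descriptions, are assumptions inferred from the message that follows each cmdq line in this report (for example, `=0x9` is followed by "Failed to create HW CQ"), not taken from the driver's headers.

```python
import re

# Matches "cmdq[0x8]=0xf status 0x3" and also the FW STALL form
# "cmdq[0xe]=0x3 waited (...)", where no status is printed.
CMDQ_RE = re.compile(
    r"cmdq\[(0x[0-9a-fA-F]+)\]=(0x[0-9a-fA-F]+)(?: status (0x[0-9a-fA-F]+))?"
)

# ASSUMPTION: meanings guessed from the neighbouring log messages only.
OPCODE_GUESS = {0x3: "modify QP", 0x9: "create CQ", 0xF: "register MR"}


def parse_cmdq(line: str):
    """Return a dict for a cmdq log line, or None if the line has none."""
    match = CMDQ_RE.search(line)
    if match is None:
        return None
    cookie, opcode, status = match.groups()
    opcode_value = int(opcode, 16)
    return {
        "cookie": int(cookie, 16),
        "opcode": opcode_value,
        "status": int(status, 16) if status else None,
        "guess": OPCODE_GUESS.get(opcode_value, "unknown"),
    }


LINES = [
    "bnxt_en 0000:82:00.0: QPLIB: cmdq[0x8]=0xf status 0x3",
    "bnxt_en 0000:82:00.0 bnxt_re0: Failed to register fence-MR",
    "bnxt_en 0000:82:00.0: QPLIB: cmdq[0xa]=0x9 status 0x3",
]

for line in LINES:
    parsed = parse_cmdq(line)
    if parsed:
        print(parsed)
```

Lines without a cmdq entry are skipped, so a full boot log can be passed through the same loop.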