Bug 187781 - mlx4 driver fails to load
Summary: mlx4 driver fails to load
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-11-15 12:46 UTC by Johannes Thumshirn
Modified: 2017-01-25 22:59 UTC

See Also:
Kernel Version: 4.4
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
screenshot of crash (206.65 KB, image/jpeg)
2016-11-15 12:46 UTC, Johannes Thumshirn
Don't set RCB bit in LNKCTL if the upstream bridge hasn't (4.32 KB, patch)
2016-11-15 12:51 UTC, Johannes Thumshirn
dmesg from boot with pci=realloc (144.12 KB, text/plain)
2016-11-15 12:54 UTC, Johannes Thumshirn
acpidump from affected system (1.80 MB, text/plain)
2016-11-15 15:18 UTC, Johannes Thumshirn
lspci -vvv output from affected system (353.69 KB, text/plain)
2016-11-15 15:21 UTC, Johannes Thumshirn
/proc/iomem (7.70 KB, text/plain)
2016-11-23 01:34 UTC, Bjorn Helgaas

Description Johannes Thumshirn 2016-11-15 12:46:17 UTC
The mlx4_core driver fails to load with the following messages in dmesg:

> [    8.648872] mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
> [    8.648889] mlx4_core: Initializing 0000:41:00.0
> [   10.068642] mlx4_core 0000:41:00.0: command 0xfff failed: fw status = 0x1
> [   10.068645] mlx4_core 0000:41:00.0: MAP_FA command failed, aborting
> [   10.068659] mlx4_core 0000:41:00.0: Failed to start FW, aborting
> [   10.068661] mlx4_core 0000:41:00.0: Failed to init fw, aborting.
> [   11.071536] mlx4_core: probe of 0000:41:00.0 failed with error -5
Comment 1 Johannes Thumshirn 2016-11-15 12:46:40 UTC
Created attachment 244611 [details]
screenshot of crash
Comment 2 Johannes Thumshirn 2016-11-15 12:48:30 UTC
Kernel version 3.12 works; 4.4 does not.

Git bisect points to:
commit 7a1562d4f2d01721ad07c3a326db7512077ceea9
Author: Yinghai Lu <yinghai@kernel.org>
Date:   Tue Nov 11 12:09:46 2014 -0800

    PCI: Apply _HPX Link Control settings to all devices with a link

    Previously we applied _HPX type 2 record Link Control register settings
    only to bridges with a subordinate bus.  But it's better to apply them to
    all devices with a link because if the subordinate bus has not been
    allocated yet, we won't apply settings to the device.
    
    Use pcie_cap_has_lnkctl() to determine whether the device has a Link
    Control register instead of looking at dev->subordinate.
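
For readers who want the change in the commit message above spelled out, here is a minimal userspace sketch. It is not the kernel code: the struct and function names are hypothetical and only model the rule the message describes (program devices that have a Link Control register rather than only bridges with a subordinate bus), plus the usual AND-mask-then-OR-mask form of a _HPX type 2 setting.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a PCI device; NOT the kernel's struct pci_dev. */
struct model_dev {
    bool has_subordinate_bus; /* bridge with an allocated secondary bus */
    bool has_link_control;    /* PCIe function with a Link Control register */
    uint16_t lnkctl;          /* current Link Control register value */
};

/* Hypothetical holder for the _HPX type 2 Link Control masks. */
struct model_hpx_lnkctl {
    uint16_t and_mask; /* bits that are 0 here get cleared */
    uint16_t or_mask;  /* bits that are 1 here get set */
};

/* Before the commit: only bridges with a subordinate bus were programmed. */
static bool old_should_program(const struct model_dev *dev)
{
    return dev->has_subordinate_bus;
}

/* After the commit: every device with a Link Control register is programmed. */
static bool new_should_program(const struct model_dev *dev)
{
    return dev->has_link_control;
}

/* Apply the masks: AND first, then OR. */
static uint16_t apply_hpx_lnkctl(uint16_t lnkctl,
                                 const struct model_hpx_lnkctl *hpx)
{
    return (uint16_t)((lnkctl & hpx->and_mask) | hpx->or_mask);
}
```

Under this model an endpoint (a link but no subordinate bus) was skipped before the commit and is programmed after it, which is how an endpoint's Link Control bits can change between 3.12 and 4.4.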
Comment 3 Johannes Thumshirn 2016-11-15 12:51:19 UTC
Created attachment 244621 [details]
Don't set RCB bit in LNKCTL if the upstream bridge hasn't

Patch to fix crash
Comment 4 Johannes Thumshirn 2016-11-15 12:54:42 UTC
Created attachment 244631 [details]
dmesg from boot with pci=realloc
Comment 5 Johannes Thumshirn 2016-11-15 12:56:21 UTC
The patch from Comment 3 fixes the regression.
Comment 6 Bjorn Helgaas 2016-11-15 15:14:56 UTC
Thanks, Johannes.  Can you please attach "lspci -vv" output, /proc/iomem contents, and an ACPI dump?
Comment 7 Johannes Thumshirn 2016-11-15 15:17:12 UTC
I don't have /proc/iomem, but acpidump and lspci are doable. Do you want lspci from the working kernel, the broken kernel, or both?
Comment 8 Johannes Thumshirn 2016-11-15 15:18:22 UTC
Created attachment 244671 [details]
acpidump from affected system
Comment 9 Johannes Thumshirn 2016-11-15 15:21:42 UTC
Created attachment 244681 [details]
lspci -vvv output from affected system
Comment 10 Bjorn Helgaas 2016-11-23 01:34:12 UTC
Created attachment 245771 [details]
/proc/iomem

From Johannes (http://lkml.kernel.org/r/20161122105647.dyt5nsst2sqbdf4y@linux-x5ow.site)
Comment 11 Bjorn Helgaas 2017-01-25 22:59:48 UTC
Resolved by e42010d8207f ("PCI: Set Read Completion Boundary to 128 iff Root Port supports it (_HPX)"), which appeared in v4.9.

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=e42010d8207f
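
The rule in that commit title ("Set Read Completion Boundary to 128 iff Root Port supports it") can be sketched in plain C as below. This is an illustration under stated assumptions, not the code from e42010d8207f: the helper name and its parameters are hypothetical, and the only value borrowed from Linux is the RCB bit position (0x0008, as PCI_EXP_LNKCTL_RCB in pci_regs.h).

```c
#include <stdint.h>

/*
 * Read Completion Boundary bit in the PCIe Link Control register.
 * Set means 128 bytes, clear means 64 bytes.
 */
#define LNKCTL_RCB 0x0008u

/*
 * Hypothetical helper: compute a device's new Link Control value from the
 * _HPX AND/OR masks, but make the RCB bit follow the Root Port instead of
 * whatever _HPX requests.
 */
static uint16_t lnkctl_with_root_rcb(uint16_t dev_lnkctl,
                                     uint16_t hpx_and, uint16_t hpx_or,
                                     uint16_t root_lnkctl)
{
    /* Apply the _HPX masks: AND first, then OR. */
    uint16_t val = (uint16_t)((dev_lnkctl & hpx_and) | hpx_or);

    /* Discard the RCB result from _HPX. */
    val &= (uint16_t)~LNKCTL_RCB;

    /* RCB = 128 only if the Root Port has the bit set. */
    if (root_lnkctl & LNKCTL_RCB)
        val |= LNKCTL_RCB;

    return val;
}
```

With a Root Port whose RCB bit is clear, an _HPX OR mask of 0x0008 no longer sets RCB on the device, while other Link Control bits are still applied as requested.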
