Bug 218491 - ixgbe probe failure in Proxmox8
Summary: ixgbe probe failure in Proxmox8
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: All
OS: Linux
Importance: P3 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL: https://forum.proxmox.com/threads/pro...
Keywords:
Depends on:
Blocks:
 
Reported: 2024-02-13 23:37 UTC by Bjorn Helgaas
Modified: 2024-02-20 23:57 UTC
CC List: 3 users

See Also:
Kernel Version: v6.1 (Proxmox)
Subsystem:
Regression: No
Bisected commit-id:


Attachments
v6.1.10-1-pve dmesg log (142.99 KB, text/plain)
2024-02-13 23:37 UTC, Bjorn Helgaas
Details
List of PCIe devices on Qotom before and after loading the ixgbe module (9.79 KB, application/x-xz)
2024-02-14 21:07 UTC, Yohan Charbi
Details
dmesg output after a modprobe ixgbe (804 bytes, application/x-xz)
2024-02-14 22:33 UTC, Yohan Charbi
Details
before lspci (356.05 KB, text/plain)
2024-02-16 01:26 UTC, Jesse Brandeburg
Details
after lspci (356.05 KB, text/plain)
2024-02-16 01:26 UTC, Jesse Brandeburg
Details
dmesg after (149.37 KB, text/plain)
2024-02-16 01:27 UTC, Jesse Brandeburg
Details
dmesg with pci=noaer (124.79 KB, text/plain)
2024-02-17 18:19 UTC, Bjorn Helgaas
Details

Description Bjorn Helgaas 2024-02-13 23:37:25 UTC
Created attachment 305867 [details]
v6.1.10-1-pve dmesg log

Tim reported that in Proxmox 7 ixgbe works fine, but in Proxmox 8, ixgbe probe fails with -5 (EIO): https://forum.proxmox.com/threads/proxmox-8-kernel-6-2-16-4-pve-ixgbe-driver-fails-to-load-due-to-pci-device-probing-failure.131203/post-633851

I'm attaching the complete dmesg log from 6.1.10-1-pve here to preserve it in case it's useful in the future.

I don't think this is an ECAM/MCFG problem because that all looks fine; the ECAM area is reserved correctly and doesn't overlap any ixgbe resources:

  PCI: MMCONFIG for domain 0000 [bus 00-7f] at [mem 0xf0000000-0xf7ffffff] (base 0xf0000000)
  PCI: MMCONFIG at [mem 0xf0000000-0xf7ffffff] reserved in ACPI motherboard resources

The ixgbe-related things:

  ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-1f])
  acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI EDR HPX-Type3]
  acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug SHPCHotplug PME AER PCIeCapability LTR DPC]
  pci_bus 0000:00: root bus resource [mem 0xea000000-0xefffffff window]
  pci 0000:00:03.1: PCI bridge to [bus 05-06]
  pci 0000:00:03.1:   bridge window [mem 0xee000000-0xeedfffff]
  pci 0000:05:00.0: [8086:1563] type 00 class 0x020000
  pci 0000:05:00.0: reg 0x10: [mem 0xee400000-0xee7fffff]
  pci 0000:05:00.0: reg 0x1c: [mem 0xeed04000-0xeed07fff]
  pci 0000:05:00.0: reg 0x30: [mem 0xee880000-0xee8fffff pref]
  pci 0000:05:00.0: reg 0x184: [mem 0xeec00000-0xeec03fff 64bit]
  pci 0000:05:00.0: VF(n) BAR0 space: [mem 0xeec00000-0xeecfffff 64bit] (contains BAR0 for 64 VFs)
  pci 0000:05:00.0: reg 0x190: [mem 0xeeb00000-0xeeb03fff 64bit]
  pci 0000:05:00.0: VF(n) BAR3 space: [mem 0xeeb00000-0xeebfffff 64bit] (contains BAR3 for 64 VFs)

  pcieport 0000:00:03.1: AER: Corrected error received: 0000:05:00.0
  pci 0000:05:00.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
  pci 0000:05:00.0:   device [8086:1563] error status/mask=00002000/00000000
  pci 0000:05:00.0:    [13] NonFatalErr

  ixgbe 0000:05:00.0: enabling device (0000 -> 0002)
  ixgbe 0000:05:00.0: Adapter removed

  pcieport 0000:00:03.1: AER: Corrected error received: 0000:05:00.0
  ixgbe 0000:05:00.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
  ixgbe 0000:05:00.0:   device [8086:1563] error status/mask=00002000/00000000
  ixgbe 0000:05:00.0:    [13] NonFatalErr
  pcieport 0000:00:03.1: AER: Corrected error received: 0000:05:00.0

  ixgbe: probe of 0000:05:00.0 failed with error -5

The "Adapter removed" appears to be from ixgbe_remove_adapter(), which happens when a config or MMIO read from the adapter returns ~0, which typically indicates a failed PCIe transaction.

This might be related to the Advisory Non-Fatal Error (0x2000).  This was logged by the ixgbe device at 05:00.0, which suggests it might have signaled an uncorrectable error for a Non-Posted Request (i.e., a read).  PCIe r6.0, sec 6.2.3.2.4.1, says ACS violations could cause this, which seems more likely than a link integrity or protocol error, and the Completer (05:00.0) would send a Completion with UR or CA status.
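
(For reference, the "error status/mask=00002000" decode corresponds to bit 13 of the AER Correctable Error Status register, PCI_ERR_COR_ADV_NFAT in the kernel's pci_regs.h. A minimal kernel-style illustration of reading that register back, with a hypothetical helper name, would be:)

  /* Illustration only: bit 13 of the AER Correctable Error Status register
   * is the Advisory Non-Fatal Error bit (PCI_ERR_COR_ADV_NFAT == 0x00002000). */
  #include <linux/pci.h>

  static void dump_advisory_nonfatal(struct pci_dev *dev)
  {
          int aer = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
          u32 cor_status, cor_mask;

          if (!aer)
                  return;                 /* device has no AER capability */

          pci_read_config_dword(dev, aer + PCI_ERR_COR_STATUS, &cor_status);
          pci_read_config_dword(dev, aer + PCI_ERR_COR_MASK, &cor_mask);

          if (cor_status & PCI_ERR_COR_ADV_NFAT)
                  pci_info(dev, "Advisory Non-Fatal logged (status %#010x, mask %#010x)\n",
                           cor_status, cor_mask);
  }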

Per sec 6.2.3.2.5, the Root Port at 00:03.1 might handle this UR/CA Completion by returning ~0 to software, which would land us in ixgbe_remove_adapter().

This is a lot of speculation.  As an experiment, you might try booting with "amd_iommu=off", which should disable all ACS.  If ixgbe works then, ACS would be a place to look.
Comment 1 Yohan Charbi 2024-02-14 17:06:50 UTC
Hello,

(Please note that I don't speak English; sorry if the translation isn't faithful to your language.)

I'm adding my experience here in the hope of contributing to the resolution of this problem, which also affects me under GNU/Linux Debian 12 (kernel 6.1.76) and Sid (kernel 6.6.13), so it's not specific to Proxmox.
I should point out that under GNU/Linux Debian 11 (kernel 5.10), the network card (X553 via ixgbe) works without problems, so this is a relatively recent bug.

Other users have encountered this problem (see comments):
https://www.servethehome.com/the-everything-fanless-home-server-firewall-router-and-nas-appliance-qotom-qnap-teamgroup/
https://www.servethehome.com/intel-x553-networking-and-proxmox-ve-8-1-3/?unapproved=518173&moderation-hash=e57a05288058d3ff253ceb42e9ada905#comment-518173

For my part, here's my test environment:
- 1 Qotom Q20332G9-S10 (I used a 16GB Intel Optane M10 M.2 SSD with a fresh GNU/Linux Debian 12)
- 1 Cisco DAC cable (tested with a 1 m and a 3 m cable)
- 1 PC with Mellanox Connectx-3 2x SFP+ network card (running GNU/Linux Debian SID installed several years ago)
- 1 Cisco 3560CX-12PD-S switch (2 SFP+ ports) with IOS 15.2(7)E2

Connecting the Qotom Q20332G9-S10 (X553) to the Mellanox Connectx-3 works without a hitch and without any special handling (the linux-image-6.1.0-17-amd64 ixgbe driver works in this configuration): full 10 Gbps between the two with iperf.

At this stage, I've ruled out a hardware incompatibility (OSI layer 1), since the DAC works with the X553. So there's no need for compatibility workarounds such as the "allow_unsupported_sfp=1" parameter suggested in the linked comments; it made no difference in the tests below (I checked).

Where it gets tricky is when the Qotom is connected to the Cisco switch.
Before an "ip link set eno1 up", the Cisco brings the link up on its side, but Debian does not (link DOWN). After the "ip link set eno1 up", the link drops and never comes back. So there does seem to be a driver problem in recent kernels (GNU/Linux Debian stable and Sid).

After manually compiling Intel's out-of-tree driver (ixgbe-5.19.9, from https://downloadmirror.intel.com/812532/ixgbe-5.19.9.tar.gz) following the documentation already shared by others (https://www.xmodulo.com/download-install-ixgbe-driver-ubuntu-debian.html), it works with the Cisco (after a "shut"/"no shut" of the switch's 10 GbE port).

So we end up with a working machine (I even configured and used SR-IOV successfully right afterwards).

For the moment, the Qotom machine is dedicated to testing, so I'm available to run any tests you'd like in order to move this forward. Don't hesitate to ask!

Best regards.
Comment 2 Bjorn Helgaas 2024-02-14 20:45:36 UTC
(In reply to Yohan Charbi from comment #1)

Thanks for your report!  Unfortunately I don't know anything about the specifics of the NIC and there's not any information that would show a possible PCI issue, so I don't think I can help with this.
Comment 3 Yohan Charbi 2024-02-14 21:07:39 UTC
Created attachment 305876 [details]
List of PCIe devices on Qotom before and after loading the ixgbe module

I've taken the liberty of attaching the output of the command you asked the other reporter to run in your forum post https://forum.proxmox.com/threads/proxmox-8-kernel-6-2-16-4-pve-ixgbe-driver-fails-to-load-due-to-pci-device-probing-failure.131203/post-634424.

There are probably some similarities with his problem.
Comment 4 Bjorn Helgaas 2024-02-14 22:13:31 UTC
(In reply to Yohan Charbi from comment #3)
> I've taken the liberty of attaching the output of the command you asked
> the other reporter to run in your forum post...

Thanks.  Would you mind also attaching the complete dmesg log after loading the ixgbe driver?  If it's a similar problem, you should see the "Adapter removed" message.
Comment 5 Yohan Charbi 2024-02-14 22:33:23 UTC
Created attachment 305877 [details]
dmesg output after a modprobe ixgbe

Here's a dmesg after a modprobe ixgbe (a rmmod ixgbe was done just before). I truncated the output to keep only what the command produced.
I see no trace of "Adapter removed", and dmesg | grep -i "Adapter removed" returns nothing.
Comment 6 Bjorn Helgaas 2024-02-14 23:23:34 UTC
(In reply to Yohan Charbi from comment #5)
> Here's a dmesg after a modprobe ixgbe (a rmmod ixgbe was done just before).

OK.  I think this is a different problem, so if you want to pursue this, I suggest opening a separate bugzilla or just emailing the ixgbe maintainers:

  Jesse Brandeburg <jesse.brandeburg@intel.com> (supporter:INTEL ETHERNET DRIVERS)
  Tony Nguyen <anthony.l.nguyen@intel.com> (supporter:INTEL ETHERNET DRIVERS)
  intel-wired-lan@lists.osuosl.org (moderated list:INTEL ETHERNET DRIVERS)
  netdev@vger.kernel.org (open list:NETWORKING DRIVERS)
Comment 7 Yohan Charbi 2024-02-14 23:46:28 UTC
Thank you very much for your time. I'm going to write to these people to see how best to follow up on this.
Good luck with the rest.
Comment 8 Jesse Brandeburg 2024-02-16 01:17:29 UTC
Hi Bjorn, I looked over the originally reported log, and I noticed that the BIOS still (or always) seems to be operating in 32-bit BAR mode, with many kernel messages about being unable to reserve resources.

The reason the ixgbe driver fails to load is that the device BAR mapping either didn't work or is being ignored after the AER error, so all reads return 0xFFFFFFFF, which is also the behavior if ASPM is enabled and the link doesn't come back.

I checked the latest logs Tim added; ASPM is not enabled. Even comparing the before and after, the ixgbe device is not enabled and hasn't changed state as far as I can see.

Have a look at the 00:03.1 upstream bridge port, which is the parent port for 05:00.0/1.

It's showing an AER error for 0501, which I assume is 05:00.1.

The device is definitely configured for 32-bit BARs, not 64-bit.
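
(For reference, the 32- vs 64-bit distinction is encoded in bits 2:1 of the memory BAR itself; a minimal kernel-style check, with a hypothetical helper name, would be:)

  /* Illustration only: a memory BAR advertises 64-bit addressing via
   * bits 2:1 (PCI_BASE_ADDRESS_MEM_TYPE_64 == 0x04). */
  #include <linux/pci.h>

  static bool bar0_is_64bit(struct pci_dev *dev)
  {
          u32 bar;

          pci_read_config_dword(dev, PCI_BASE_ADDRESS_0, &bar);
          return (bar & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_MEMORY &&
                 (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) == PCI_BASE_ADDRESS_MEM_TYPE_64;
  }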

What if we just turn off AER? Boot with pci=noaer?
Comment 9 Jesse Brandeburg 2024-02-16 01:26:20 UTC
Created attachment 305879 [details]
before lspci
Comment 10 Jesse Brandeburg 2024-02-16 01:26:56 UTC
Created attachment 305880 [details]
after lspci
Comment 11 Jesse Brandeburg 2024-02-16 01:27:29 UTC
Created attachment 305881 [details]
dmesg after
Comment 13 Bjorn Helgaas 2024-02-20 23:57:33 UTC
From attachment 305887 [details] (comment #12), this looks wrong:

  Command line: ... pci=noaer
  acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI EDR HPX-Type3]
  acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug SHPCHotplug PME PCIeCapability LTR DPC]

Linux apparently requested and was granted DPC control *without* requesting AER control, but if the OS requests DPC control, it is required to also request AER control (PCI Firmware r3.3, sec 4.5.1).

This looks like a Linux bug here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/acpi/pci_root.c?id=v6.7#n530.  I don't know if this has any connection to this problem, but I posted a patch for it: https://lore.kernel.org/linux-pci/20240220235520.1514548-1-helgaas@kernel.org/T/#u
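
For illustration (this is not the posted patch, just a sketch of the spec's constraint against the existing OSC_PCI_EXPRESS_* control bits; the function name is hypothetical), the rule amounts to requesting DPC control only when AER control is also being requested:

  /* Sketch of the PCI Firmware r3.3, sec 4.5.1 constraint: request DPC
   * control only if AER control is also in the request. */
  #include <linux/acpi.h>
  #include <linux/pci.h>

  static u32 build_osc_control_request(void)
  {
          u32 control = OSC_PCI_EXPRESS_CAPABILITY_CONTROL |
                        OSC_PCI_EXPRESS_PME_CONTROL;

          if (pci_aer_available())
                  control |= OSC_PCI_EXPRESS_AER_CONTROL;

          /* The firmware spec ties DPC to AER, so don't ask for DPC
           * unless AER is in the request too. */
          if (IS_ENABLED(CONFIG_PCIE_DPC) &&
              (control & OSC_PCI_EXPRESS_AER_CONTROL))
                  control |= OSC_PCI_EXPRESS_DPC_CONTROL;

          return control;
  }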
