Bug 213481
| Summary: | e1000e hardware failure due to PCI patch in all kernels 5.10.36+ with Intel I219-V | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | Michael (phyre) |
| Component: | Network | Assignee: | drivers_network (drivers_network) |
| Status: | NEW --- | | |
| Severity: | high | CC: | admin, avi.shalev, bjorn.helgaas, carnil, jbrandeb, mika.westerberg, phyre, rjw, rui.zhang, sasha.neftin, todd.fujinaka, vitaly.lifshits, yu.c.chen |
| Priority: | P1 | | |
| Hardware: | Intel | | |
| OS: | Linux | | |
| Kernel Version: | 5.10.36 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
Description
Michael
2021-06-17 16:18:55 UTC
This is actually far more confusing after further investigation.

I have tried a variety of Linux kernels, about 40 of them, to try to spot the change that breaks functionality. Here are a few highlights, each from a kernel compiled with the same options and the same process:

5.3.0 - works perfectly
5.9.0 - works (boots fine, detects network, and works flawlessly)
5.10.35 (and under) - working
5.10.36+ - broken (tested 5.10.36, 5.10.37, 5.10.40, 5.10.44)
5.11.0 - broken
5.13.0 - broken

Booting any of the broken versions results in the hardware failure and zero network connectivity on this system (the next message from the e1000e driver is a 'hardware failure' notice on shutdown). I have no reason to suspect actual hardware failure.

So something broke between the 5.10.35 and 5.10.36 release tags (and in every later release). More concerning, I don't see any actual changes to the e1000e driver between those releases. There were some changes to the PCI interfaces and power states, which makes me think the e1000e driver ends up in some invalid state that it needs to catch.

The device is:

    # lspci -nn | grep Ether
    00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (6) I219-V [8086:15be] (rev 30)

If I revert commit 4514d991d99211f225d83b7e640285f29f0755d0 ( https://github.com/torvalds/linux/commit/4514d991d99211f225d83b7e640285f29f0755d0 ) from the 5.10.36 kernel, I do NOT get this hardware failure message. This commit relates to PCI power state. It would appear that the e1000e driver is not handling something about this change properly on this system, causing the card to not work. As mentioned, this affects current kernel versions (5.13, 5.11, and 5.10.36+).

Given that this is a regression, a breaking change for the usability of these cards, and we know exactly what causes it (so hopefully an easy fix), I've upped the severity.

Comment 3
Sasha Neftin
I do not see any problem with our TGL platforms. I also do not understand how this relates to the ULP. It looks like your system loses PHY access somehow.

(In reply to Sasha Neftin from comment #3)
> I do not see any problem with our TGL platforms. I also do not understand
> how this relates to the ULP.
> It looks like your system loses PHY access somehow.

Yes, but it loses it consistently, on every kernel version, only when this patch is included, and reverting the patch fixes the problem with the NIC.

There's an e-mail discussion with a number of people (including Rafael J. Wysocki and Bjorn Helgaas) who I think are looking to revert the PCI commit [4514d991d992 ("PCI: PM: Do not read power state in pci_enable_device_flags()")] noted in the comment. Maybe it's something specific to this Lenovo system, but regardless, that patch appears not to be without consequence on some systems.

Thread about the revert: https://lore.kernel.org/linux-pci/CAJZ5v0j9GS2y0tpnzaGu8n9=kbHD9QkBUDguANcJz01u+PX08g@mail.gmail.com/T/#u
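
For anyone checking whether another machine matches this report, the following is a minimal Python sketch, not part of the original report. It parses a line of `lspci -nn` output for the controller ID quoted in the description (8086:15be) and compares a kernel release string against 5.10.36, the first release the reporter found broken. The helper names (`parse_lspci_line`, `kernel_tuple`, `matches_report`) are hypothetical, and the version threshold only encodes the reporter's observations; it is not a verified list of which releases carry the PCI commit.

```python
import re

# PCI vendor:device ID of the controller in this report (from the lspci line).
REPORTED_ID = ("8086", "15be")
# First release the reporter observed as broken. This is an assumption drawn
# from the report's test results, not a verified list of affected releases.
FIRST_BROKEN = (5, 10, 36)

# Matches one line of `lspci -nn` output, e.g.
# "00:1f.6 Ethernet controller [0200]: Intel ... I219-V [8086:15be] (rev 30)"
LSPCI_RE = re.compile(
    r"^(?P<slot>\S+)\s+(?P<cls>.+?)\s+\[(?P<class_id>[0-9a-f]{4})\]:\s+"
    r"(?P<name>.+?)\s+\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
    r"(?:\s+\(rev (?P<rev>[0-9a-f]+)\))?$"
)


def parse_lspci_line(line):
    """Return the fields of one `lspci -nn` line as a dict, or None if it does not parse."""
    match = LSPCI_RE.match(line.strip())
    return match.groupdict() if match else None


def kernel_tuple(release):
    """Turn a release string such as '5.10.36-amd64' into (5, 10, 36)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def matches_report(lspci_line, release):
    """True if the device ID and kernel release both match what the reporter saw failing."""
    fields = parse_lspci_line(lspci_line)
    if fields is None:
        return False
    same_device = (fields["vendor"], fields["device"]) == REPORTED_ID
    return same_device and kernel_tuple(release) >= FIRST_BROKEN
```

In practice the two arguments would come from `lspci -nn | grep Ether` and `uname -r`. A match only says the configuration resembles the one reported here; whether the NIC actually fails still has to be confirmed on the machine itself.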