Created attachment 306419 [details] otcpl-z170x-ud5-freeze.html Starting in 6.10.0-rc2 4 of our lab machines have stopped being able to enter S2idle or S3 suspend as a result of an e1000e device suspend error. I've bisected the issue to this commit: commit bfd546a552e140b0a4c8a21527c39d6d21addb28 (refs/bisect/bad) Author: Hui Wang <hui.wang@canonical.com> Date: Tue May 28 15:06:04 2024 -0700 e1000e: move force SMBUS near the end of enable_ulp function The commit 861e8086029e ("e1000e: move force SMBUS from enable ulp function to avoid PHY loss issue") introduces a regression on The 4 machines affected are our otcpl-rkl-s-prod (ASRock Z590-C), otcpl-cfl-u (test board), otcpl-z170x-ud5 (GigaByte Z170X0UD5 board), and otcpl-mtl-s-2 (test board). I'm attaching the sleepgraph timelines for the otcpl-rkl-s-prod and otcpl-z170x-ud5.
Created attachment 306420 [details] otcpl-rkl-s-prod_freeze.html
click the red "ERROR" text in the timelines to jump to the exact line in dmesg where the failure occurs. [ 355.003472] e1000e 0000:00:1f.6: PM: pci_pm_suspend(): e1000e_pm_suspend [e1000e] returns -2 [ 355.003504] e1000e 0000:00:1f.6: PM: dpm_run_callback(): pci_pm_suspend returns -2 [ 355.003508] e1000e 0000:00:1f.6: PM: pci_pm_suspend returned -2 after 68795 usecs [ 355.003511] e1000e 0000:00:1f.6: PM: failed to suspend async: error -2 [ 355.053814] nvme 0000:0f:00.0: PM: pci_pm_suspend returned 0 after 119174 usecs [ 355.053946] async_continuing @ 4122 [echo] after 116305 usec [ 355.053966] PM: suspend of devices aborted after 120.782 msecs
could you please upload the pciid of the affected ethernet card? thx.
The "log" button in the upper right corner of the html timelines shows all the system data including lspci. If there's a more detailed command you want me to run on each machine let me know (I can also add it to sleepgraph). Here's sudo lspci -v. otcpl-z170x-ud5: 00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V (rev 31) Subsystem: Gigabyte Technology Co., Ltd Ethernet Connection (2) I219-V Flags: bus master, fast devsel, latency 0, IRQ 171, IOMMU group 16 Memory at ef400000 (32-bit, non-prefetchable) [size=128K] Capabilities: [c8] Power Management version 3 Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+ Capabilities: [e0] PCI Advanced Features Kernel driver in use: e1000e Kernel modules: e1000e otcpl-rkl-s-prod: 00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (14) I219-V (rev 11) Subsystem: ASRock Incorporation Ethernet Connection (14) I219-V Flags: bus master, fast devsel, latency 0, IRQ 142, IOMMU group 8 Memory at a0600000 (32-bit, non-prefetchable) [size=128K] Capabilities: [c8] Power Management version 3 Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+ Kernel driver in use: e1000e Kernel modules: e1000e otcpl-cfl-u: 00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (6) I219-V (rev 30) Subsystem: Intel Corporation Ethernet Connection (6) I219-V Flags: bus master, fast devsel, latency 0, IRQ 142, IOMMU group 12 Memory at a1200000 (32-bit, non-prefetchable) [size=128K] Capabilities: [c8] Power Management version 3 Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+ Kernel driver in use: e1000e Kernel modules: e1000e
The logs also contain the outputs of ethtool, only the otcpl-z170x-ud5 had wake on lan set to g, the rest were set to d.
I found lots of log from the attached files, but couldn't find the pciid (sth like 8086:550a). I doubt the "e1000e_disable_phy_retry(hw);" introduced the regression. Could you help test the patch (0001-testing-e1000e-drop-disable_phy_retry-before-forcing.patch), or I appiled the testing patch to a kernel and built a kernel debian package, you could get it from https://people.canonical.com/~hwang4/e1000eregression/ (i915 driver probably has some problem on some platforms with this testing kernel).
Created attachment 306434 [details] testing patch
I've just tested the patch on my laptop. lspci -nn -s 00:1f.6 00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (7) I219-V [8086:15bc] (rev 10) Unfortunately the behaviour is still the same. The system suspends fine as long as the LAN cable is connected. Any attempt to suspend w/o a LAN connection failes. [ 146.954374] e1000e: EEE TX LPI TIMER: 00000011 [ 147.307629] e1000e 0000:00:1f.6: PM: pci_pm_suspend(): e1000e_pm_suspend [e1000e] returns -2 [ 147.307640] e1000e 0000:00:1f.6: PM: dpm_run_callback(): pci_pm_suspend returns -2 [ 147.307648] e1000e 0000:00:1f.6: PM: failed to suspend async: error -2 [ 147.590790] PM: Some devices failed to suspend, or early wake event detected
Hi Vitaly, Looks like we have to revert my patch from mainline kernel, and most likely have to revert yours and mine together. What is your opinion? Looks the change impacts too many platforms, we couldn't cover all all situation. Thanks.
TWIMC: this and Bug 218936 look like duplicates to me
Hi Dieter, If you revert commit bfd546a552e140b0a4c8a21527c39d6d21addb28 and commit 861e8086029e003305750b4126ecd6617465f5c7, does it fix the problem?
Hi Hui Wang, after reverting 1. commit bfd546a552e140b0a4c8a21527c39d6d21addb28 and 2. commit 861e8086029e003305750b4126ecd6617465f5c7 my system now suspends fine with LAN cable connected AND disconnected. looks good! thanks!
I can revert just bfd546a552e140b0a4c8a21527c39d6d21addb28 and it works fine on the tip of master. So it's definitely this commit.
sorry about missing the PCI ID, here's the command Dieter ran but on the otcpl-z170x-ud5: lspci -nn -s 00:1f.6 00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (2) I219-V [8086:15b8] (rev 31)
Hui, I just tried the testing patch and it made no difference, the failed S2idle issue still happens, so what you changed is not the culprit.
@Todd, I know the commit bfd546a552e140b0a4c8a21527c39d6d21addb28 is the cause of this issue, but it is for fixing the regression of 861e8086029e003305750b4126ecd6617465f5c7, that is why we test to revert 2 commits.
@Hui, any chance we can revert those two commits for the next -rc as we are still in an relatively early -rc phase? As I said before reverting both commits fixes the suspend issue on my machine without introducing any regressions. Hopefully Todd will be able to give us some feedback soon.
Created attachment 306456 [details] attachment-6309-0.html Thank you for your message. Please note that I am currently out of the office and will not be able to respond until my return on June 16. Best regards, Amir Avivi
(In reply to Dieter Mummenschanz from comment #17) > @Hui, > > any chance we can revert those two commits for the next -rc as we are still > in an relatively early -rc phase? As I said before reverting both commits > fixes the suspend issue on my machine without introducing any regressions. > Hopefully Todd will be able to give us some feedback soon. I couldn't agree more, I was actually hoping it would be removed by rc4 but I guess that's no longer a possibility. It affects 4 of our machines ability to run S2idle and S3 at all, so it's a pretty significant block on our testing.
FYI, I just built another kernel from the tip of master with these two commits reverted and it worked fine: 1. commit bfd546a552e140b0a4c8a21527c39d6d21addb28 2. commit 861e8086029e003305750b4126ecd6617465f5c7
I agree that reverting 2 commits is a better solution, I already sent a patch to revert my patch (bfd546a552e140b0a4c8a21527c39d6d21addb28).
@Hui, I cannot find the revert anywhere in netdev/net.git or torvalds.git. Any chance it will find its way into -rc5 or is it gonna take longer?
@Dieter, The reverting patch will be dropped, Vitaly will use a new patch to fix this regression: [Intel-wired-lan] [PATCH iwl-net v2 1/1] e1000e: fix force smbus during suspend flow And you are on the CC list of that patch.
Folks, what's the strory here? I don't mean to push but we're getting awful close to the 6.10 release. @Todd: Could you please verify as Vitaly asked so that we may get this patch finally pushed? Thanks.
FWIW, my plan was to raise attention to this issue today or tomorrow.
Yes, this is very frustrating. We have dozens of Lenovo notebooks in our company, and the suspend and shutdown functions do not work unless we press the power button to forcefully turn them off. This should be fixed during RC
(In reply to Dieter Mummenschanz from comment #24) > Folks, what's the strory here? I don't mean to push but we're getting awful > close to the 6.10 release. @Todd: Could you please verify as Vitaly asked so > that we may get this patch finally pushed? > > Thanks. I tested his fix patch and it worked on 3 our of 4 of our machines, only our internal meteor lake SDV still shows an issue so we're handling it internally. So I've already confirmed with Vitaly that his patch works as a fix.
I would remark though that a complete reversion of this commit would be more desirable until the precise cause of these malfunctions can be determined.
Created attachment 306522 [details] vitaly-e1000e.patch This is the patch Vitaly created which fixes 3 of 4 of our machines.
I attached the patch so you guys can test it to see if it works for you. It fixes all the 3 production machines I mentioned. It only fails for me on our internal Meteor Lake SDV boards (we're testing this separately), so it seems to fix things for production hardware.
rc7 got out today with no fix included. So what's holding it back?
(In reply to Dieter Mummenschanz from comment #31) > rc7 got out today with no fix included. So what's holding it back? Not sure, maybe it's in internal testing at intel. But it seems it's not considered a "blocker" anyway, see the quoted part in: https://lore.kernel.org/all/0b96edcc-6b5f-447f-8023-440427a9fff2@leemhuis.info/
in 6.10.0-2-cachyos suspend works as expected again
Has worked in our machines since 6.10, thanks for the fix.