Bug 218940 - e1000e patch in 6.10.0-rc2 breaks S2idle and S3 suspend on 4 of our machines
Summary: e1000e patch in 6.10.0-rc2 breaks S2idle and S3 suspend on 4 of our machines
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network (show other bugs)
Hardware: All Linux
: P3 high
Assignee: drivers_network@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks: 178231
  Show dependency tree
 
Reported: 2024-06-05 22:07 UTC by Todd Brandt
Modified: 2024-08-25 06:14 UTC (History)
8 users (show)

See Also:
Kernel Version: 6.10.0-rc2
Subsystem:
Regression: Yes
Bisected commit-id: bfd546a552e140b0a4c8a21527c39d6d21addb28


Attachments
otcpl-z170x-ud5-freeze.html (398.72 KB, text/html)
2024-06-05 22:07 UTC, Todd Brandt
Details
otcpl-rkl-s-prod_freeze.html (363.10 KB, text/html)
2024-06-05 22:07 UTC, Todd Brandt
Details
testing patch (1.71 KB, application/mbox)
2024-06-07 02:20 UTC, Hui Wang
Details
attachment-6309-0.html (2.00 KB, text/html)
2024-06-13 19:37 UTC, amir.avivi
Details
vitaly-e1000e.patch (4.32 KB, text/plain)
2024-07-02 06:34 UTC, Todd Brandt
Details

Description Todd Brandt 2024-06-05 22:07:03 UTC
Created attachment 306419 [details]
otcpl-z170x-ud5-freeze.html

Starting in 6.10.0-rc2 4 of our lab machines have stopped being able to enter S2idle or S3 suspend as a result of an e1000e device suspend error. I've bisected the issue to this commit:

commit bfd546a552e140b0a4c8a21527c39d6d21addb28 (refs/bisect/bad)
Author: Hui Wang <hui.wang@canonical.com>
Date:   Tue May 28 15:06:04 2024 -0700

    e1000e: move force SMBUS near the end of enable_ulp function
    
    The commit 861e8086029e ("e1000e: move force SMBUS from enable ulp
    function to avoid PHY loss issue") introduces a regression on


The 4 machines affected are our otcpl-rkl-s-prod (ASRock Z590-C), otcpl-cfl-u (test board), otcpl-z170x-ud5 (GigaByte Z170X0UD5 board), and otcpl-mtl-s-2 (test board). I'm attaching the sleepgraph timelines for the otcpl-rkl-s-prod and otcpl-z170x-ud5.
Comment 1 Todd Brandt 2024-06-05 22:07:41 UTC
Created attachment 306420 [details]
otcpl-rkl-s-prod_freeze.html
Comment 2 Todd Brandt 2024-06-05 22:09:53 UTC
click the red "ERROR" text in the timelines to jump to the exact line in dmesg where the failure occurs.

[  355.003472] e1000e 0000:00:1f.6: PM: pci_pm_suspend(): e1000e_pm_suspend [e1000e] returns -2
[  355.003504] e1000e 0000:00:1f.6: PM: dpm_run_callback(): pci_pm_suspend returns -2
[  355.003508] e1000e 0000:00:1f.6: PM: pci_pm_suspend returned -2 after 68795 usecs
[  355.003511] e1000e 0000:00:1f.6: PM: failed to suspend async: error -2
[  355.053814] nvme 0000:0f:00.0: PM: pci_pm_suspend returned 0 after 119174 usecs
[  355.053946] async_continuing @ 4122 [echo] after 116305 usec
[  355.053966] PM: suspend of devices aborted after 120.782 msecs
Comment 3 Hui Wang 2024-06-06 00:35:36 UTC
could you please upload the pciid of the affected ethernet card?

thx.
Comment 4 Todd Brandt 2024-06-06 15:43:09 UTC
The "log" button in the upper right corner of the html timelines shows all the system data including lspci. If there's a more detailed command you want me to run on each machine let me know (I can also add it to sleepgraph). Here's sudo lspci -v.

otcpl-z170x-ud5:
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V (rev 31)
	Subsystem: Gigabyte Technology Co., Ltd Ethernet Connection (2) I219-V
	Flags: bus master, fast devsel, latency 0, IRQ 171, IOMMU group 16
	Memory at ef400000 (32-bit, non-prefetchable) [size=128K]
	Capabilities: [c8] Power Management version 3
	Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [e0] PCI Advanced Features
	Kernel driver in use: e1000e
	Kernel modules: e1000e

otcpl-rkl-s-prod:
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (14) I219-V (rev 11)
	Subsystem: ASRock Incorporation Ethernet Connection (14) I219-V
	Flags: bus master, fast devsel, latency 0, IRQ 142, IOMMU group 8
	Memory at a0600000 (32-bit, non-prefetchable) [size=128K]
	Capabilities: [c8] Power Management version 3
	Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Kernel driver in use: e1000e
	Kernel modules: e1000e

otcpl-cfl-u:
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (6) I219-V (rev 30)
	Subsystem: Intel Corporation Ethernet Connection (6) I219-V
	Flags: bus master, fast devsel, latency 0, IRQ 142, IOMMU group 12
	Memory at a1200000 (32-bit, non-prefetchable) [size=128K]
	Capabilities: [c8] Power Management version 3
	Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Kernel driver in use: e1000e
	Kernel modules: e1000e
Comment 5 Todd Brandt 2024-06-06 15:48:58 UTC
The logs also contain the outputs of ethtool, only the otcpl-z170x-ud5 had wake on lan set to g, the rest were set to d.
Comment 6 Hui Wang 2024-06-07 02:17:46 UTC
I found lots of log from the attached files, but couldn't find the pciid (sth like 8086:550a).

I doubt the "e1000e_disable_phy_retry(hw);" introduced the regression. Could you help test the patch (0001-testing-e1000e-drop-disable_phy_retry-before-forcing.patch), or I appiled the testing patch to a kernel and built a kernel debian package, you could get it from https://people.canonical.com/~hwang4/e1000eregression/ (i915 driver probably has some problem on some platforms with this testing kernel).
Comment 7 Hui Wang 2024-06-07 02:20:23 UTC
Created attachment 306434 [details]
testing patch
Comment 8 Dieter Mummenschanz 2024-06-07 05:55:10 UTC
I've just tested the patch on my laptop.

lspci -nn -s 00:1f.6
00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (7) I219-V [8086:15bc] (rev 10)

Unfortunately the behaviour is still the same. The system suspends fine as long as the LAN cable is connected. Any attempt to suspend w/o a LAN connection failes.

[  146.954374] e1000e: EEE TX LPI TIMER: 00000011
[  147.307629] e1000e 0000:00:1f.6: PM: pci_pm_suspend(): e1000e_pm_suspend [e1000e] returns -2
[  147.307640] e1000e 0000:00:1f.6: PM: dpm_run_callback(): pci_pm_suspend returns -2
[  147.307648] e1000e 0000:00:1f.6: PM: failed to suspend async: error -2
[  147.590790] PM: Some devices failed to suspend, or early wake event detected
Comment 9 Hui Wang 2024-06-07 06:13:17 UTC
Hi Vitaly,

Looks like we have to revert my patch from mainline kernel, and most likely have to revert yours and mine together. What is your opinion? Looks the change impacts too many platforms, we couldn't cover all all situation.

Thanks.
Comment 10 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-06-07 11:47:35 UTC
TWIMC: this and Bug 218936 look like duplicates to me
Comment 11 Hui Wang 2024-06-07 12:38:03 UTC
Hi Dieter,

If you revert commit bfd546a552e140b0a4c8a21527c39d6d21addb28 and commit 861e8086029e003305750b4126ecd6617465f5c7, does it fix the problem?
Comment 12 Dieter Mummenschanz 2024-06-07 13:18:10 UTC
Hi Hui Wang,

after reverting

1. commit bfd546a552e140b0a4c8a21527c39d6d21addb28
and 
2. commit 861e8086029e003305750b4126ecd6617465f5c7

my system now suspends fine with LAN cable connected AND disconnected. looks good! thanks!
Comment 13 Todd Brandt 2024-06-07 17:21:11 UTC
I can revert just bfd546a552e140b0a4c8a21527c39d6d21addb28 and it works fine on the tip of master. So it's definitely this commit.
Comment 14 Todd Brandt 2024-06-07 17:39:34 UTC
sorry about missing the PCI ID, here's the command Dieter ran but on the otcpl-z170x-ud5:

lspci -nn -s 00:1f.6
00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (2) I219-V [8086:15b8] (rev 31)
Comment 15 Todd Brandt 2024-06-07 18:15:10 UTC
Hui, I just tried the testing patch and it made no difference, the failed S2idle issue still happens, so what you changed is not the culprit.
Comment 16 Hui Wang 2024-06-08 00:49:29 UTC
@Todd,

I know the commit bfd546a552e140b0a4c8a21527c39d6d21addb28 is the cause of this issue, but it is for fixing the regression of 861e8086029e003305750b4126ecd6617465f5c7, that is why we test to revert 2 commits.
Comment 17 Dieter Mummenschanz 2024-06-13 19:37:22 UTC
@Hui,

any chance we can revert those two commits for the next -rc as we are still in an relatively early -rc phase? As I said before reverting both commits fixes the suspend issue on my machine without introducing any regressions. Hopefully Todd will be able to give us some feedback soon.
Comment 18 amir.avivi 2024-06-13 19:37:48 UTC
Created attachment 306456 [details]
attachment-6309-0.html

Thank you for your message. Please note that I am currently out of the office and will not be able to respond until my return on June 16.
Best regards, Amir Avivi
Comment 19 Todd Brandt 2024-06-13 19:41:28 UTC
(In reply to Dieter Mummenschanz from comment #17)
> @Hui,
> 
> any chance we can revert those two commits for the next -rc as we are still
> in an relatively early -rc phase? As I said before reverting both commits
> fixes the suspend issue on my machine without introducing any regressions.
> Hopefully Todd will be able to give us some feedback soon.

I couldn't agree more, I was actually hoping it would be removed by rc4 but I guess that's no longer a possibility. It affects 4 of our machines ability to run S2idle and S3 at all, so it's a pretty significant block on our testing.
Comment 20 Todd Brandt 2024-06-14 00:19:54 UTC
FYI, I just built another kernel from the tip of master with these two commits reverted and it worked fine:

1. commit bfd546a552e140b0a4c8a21527c39d6d21addb28
2. commit 861e8086029e003305750b4126ecd6617465f5c7
Comment 21 Hui Wang 2024-06-14 01:57:38 UTC
I agree that reverting 2 commits is a better solution, I already sent a patch to revert my patch (bfd546a552e140b0a4c8a21527c39d6d21addb28).
Comment 22 Dieter Mummenschanz 2024-06-20 19:57:19 UTC
@Hui, I cannot find the revert anywhere in netdev/net.git or torvalds.git. Any chance it will find its way into -rc5 or is it gonna take longer?
Comment 23 Hui Wang 2024-06-21 02:46:20 UTC
@Dieter,

The reverting patch will be dropped, Vitaly will use a new patch to fix this regression:

[Intel-wired-lan] [PATCH iwl-net v2 1/1] e1000e: fix force smbus during suspend flow

And you are on the CC list of that patch.
Comment 24 Dieter Mummenschanz 2024-07-01 05:19:18 UTC
Folks, what's the strory here? I don't mean to push but we're getting awful close to the 6.10 release. @Todd: Could you please verify as Vitaly asked so that we may get this patch finally pushed?

Thanks.
Comment 25 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-07-01 05:27:54 UTC
FWIW, my plan was to raise attention to this issue today or tomorrow.
Comment 26 roman 2024-07-01 11:29:21 UTC
Yes, this is very frustrating. We have dozens of Lenovo notebooks in our company, and the suspend and shutdown functions do not work unless we press the power button to forcefully turn them off.

This should be fixed during RC
Comment 27 Todd Brandt 2024-07-02 02:29:03 UTC
(In reply to Dieter Mummenschanz from comment #24)
> Folks, what's the strory here? I don't mean to push but we're getting awful
> close to the 6.10 release. @Todd: Could you please verify as Vitaly asked so
> that we may get this patch finally pushed?
> 
> Thanks.

I tested his fix patch and it worked on 3 our of 4 of our machines, only our internal meteor lake SDV still shows an issue so we're handling it internally. So I've already confirmed with Vitaly that his patch works as a fix.
Comment 28 Todd Brandt 2024-07-02 03:33:30 UTC
I would remark though that a complete reversion of this commit would be more desirable until the precise cause of these malfunctions can be determined.
Comment 29 Todd Brandt 2024-07-02 06:34:28 UTC
Created attachment 306522 [details]
vitaly-e1000e.patch

This is the patch Vitaly created which fixes 3 of 4 of our machines.
Comment 30 Todd Brandt 2024-07-02 06:37:12 UTC
I attached the patch so you guys can test it to see if it works for you. It fixes all the 3 production machines I mentioned. It only fails for me on our internal Meteor Lake SDV boards (we're testing this separately), so it seems to fix things for production hardware.
Comment 31 Dieter Mummenschanz 2024-07-08 07:07:01 UTC
rc7 got out today with no fix included. So what's holding it back?
Comment 32 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-07-08 08:44:51 UTC
(In reply to Dieter Mummenschanz from comment #31)
> rc7 got out today with no fix included. So what's holding it back?

Not sure, maybe it's in internal testing at intel. But it seems it's not considered a "blocker" anyway, see the quoted part in:
https://lore.kernel.org/all/0b96edcc-6b5f-447f-8023-440427a9fff2@leemhuis.info/
Comment 33 roman 2024-07-16 07:31:44 UTC
in 6.10.0-2-cachyos suspend works as expected again
Comment 34 Todd Brandt 2024-07-30 21:55:58 UTC
Has worked in our machines since 6.10, thanks for the fix.

Note You need to log in before you can comment on or make changes to this bug.