Bug 205015

Summary: e1000e: Hardware Error - kills subsequent suspends
Product: Drivers Reporter: Todd Brandt (todd.e.brandt)
Component: NetworkAssignee: Chen Yu (yu.c.chen)
Status: ASSIGNED ---    
Severity: normal CC: jeffrey.t.kirsher, lenb, oleksandr, rui.zhang, sasha.neftin, venkateswara.rao, vitaly.lifshits, yu.c.chen
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 5.3.0, 5.4.0 Subsystem:
Regression: No Bisected commit-id:
Bug Depends on:    
Bug Blocks: 178231    
Attachments: issue.def
whl-u_mem_#69_ERR.html
whl-u_mem_#70_POSTERR.html
Reproduced on Kaby Lake Intel NUC
Reproduced on Gigabyte desktop board
Reproduced on Lenovo T61 laptop
Lenovo T16 details on 9,000 ms delay
e1000e patch 1 [e1000e-save-the-return-value-of-e1000e_reset]
e1000e patch 2 [PM-sleep-export-device_pm_remove-for-driver-use]
e1000e patch 3 [e1000e-Introduce-workqueue-to-disable-the-power-mana]
e1000e patch 4 [e1000e-Disable-the-power-management-if-hardware-erro]
A hacky patch to ignore the return error when setting WOL feature during suspend.

Description Todd Brandt 2019-09-27 01:16:22 UTC
Created attachment 285203 [details]
issue.def

e1000e hardware error during resume causes all subsequent suspends to fail. Appears to only happen in S3 mem suspend.
Comment 1 Todd Brandt 2019-10-04 23:53:43 UTC
Created attachment 285345 [details]
whl-u_mem_#69_ERR.html
Comment 2 Todd Brandt 2019-10-04 23:54:17 UTC
Created attachment 285347 [details]
whl-u_mem_#70_POSTERR.html

All subsequent runs look like this til reboot
Comment 3 Len Brown 2019-11-26 02:05:25 UTC
Created attachment 286049 [details]
Reproduced on Kaby Lake Intel NUC

This is common on the Intel NUC (Kaby Lake).
In this example, the e1000e died on the 9th suspend, and refused to suspend on all subsequent (> 3,500) suspends.
Comment 4 Len Brown 2019-11-26 02:12:12 UTC
Created attachment 286051 [details]
Reproduced on Gigabyte desktop board

e1000e failed on test 1477 on this motherboard,
and all of the remaining suspend tests (> 2,100) failed
because e1000e failed to suspend.

[11395.312240] ata6: SATA link down (SStatus 4 SControl 300)
[11395.328675] ata1.00: configured for UDMA/133
[11396.296312] e1000e 0000:00:1f.6 enp0s31f6: PHY Wakeup cause - Unicast Packet
[11397.165686] e1000e 0000:00:1f.6 enp0s31f6: Hardware Error
[11397.661768] PM: resume of devices complete after 2723.421 msecs
[11397.662630] acpi LNXPOWER:16: Turning OFF
[11397.662686] acpi LNXPOWER:15: Turning OFF
Comment 5 Len Brown 2019-11-26 02:28:27 UTC
Created attachment 286053 [details]
Reproduced on Lenovo T61 laptop

On this laptop, the e1000e has no cable connected, as WIFI was in use.

The e1000e hardware error occurred on the 1st resume.

[  388.380449] ata3.00: configured for UDMA/133
[  388.424246] usb 3-2: reset full-speed USB device number 2 using uhci_hcd
[  388.472636] e1000e 0000:00:19.0 enp0s25: Hardware Error
[  388.708283] firewire_core 0000:15:00.1: rediscovered device fw0

All (2,000) subsequent suspends "worked", but the e1000e took over 9,000 ms to suspend!

All subsequent resumes saw the same hardware error:

[  414.812541] ata3.00: configured for UDMA/133
[  414.848253] usb 3-2: reset full-speed USB device number 2 using uhci_hcd
[  414.908007] e1000e 0000:00:19.0 enp0s25: Hardware Error
[  415.144303] firewire_core 0000:15:00.1: rediscovered device fw0
Comment 6 Len Brown 2019-11-26 03:08:44 UTC
Created attachment 286055 [details]
Lenovo T16 details on 9,000 ms delay

On the Lenovo T61, the 9,000 ms delay comes from e1000_acquire_swflag_ich8lan().

First it does 1,000 udelay(1), followed by 8,000 udelay(1).

This is currently happening if the links is present or not, and the e1000e is getting the error on boot -- so maybe the NIC had died and it is not failing gradefully?

rmmod e1000e
results in suspend taking 400 ms, instead of 9,000 ms.
Comment 7 Len Brown 2019-11-26 18:21:50 UTC
Re: T61

The T61 e1000e doesn't *always* fail.

Rebooting the same 5.4.0 kernel as above on the same T61...
It completed the requested 2,002 suspend/resume cycles with the
cable connected the whole time, ethernet working before, during and after.

So given that it sometimes functions properly, what about the performance?

The maximum e1000e suspend time: 187 ms
The maximum e1000e resume time: 386 ms

These numbers are too high, and under some conditions, the e1000e
would be the slowest device in the system, by a large margin.  But
this performance issue is not the subject of this bug report.
Comment 8 Chen Yu 2020-06-18 04:08:51 UTC
Found similar issue during suspend to mem on a BDW-NUC platform:

[ 1663.694828] e1000e 0000:00:19.0 eth0: Failed to disable ULP
[ 1664.731040] asix 2-3:1.0 eth1: link up, 100Mbps, full-duplex, lpa 0xC1E1
[ 1665.093513] e1000e 0000:00:19.0 eth0: Hardware Error
[ 1665.596760] e1000e 0000:00:19.0: pci_pm_resume+0x0/0x80 returned 0 after 2975399 usecs
Comment 9 Zhang Rui 2020-07-01 13:37:45 UTC
The problem seems to be more serious.

I scheduled more than 10 rounds of tests for each commit, and each test run 10 suspend-to-mem iterations, and the problem is reproduced
v5.5 0/10
v5.6 0/10
v5.7 0/28
v5.8-rc1 16/28
v5.8-rc3 15/20

So previously, the problem is hard to reproduce, but it seems that it is really easy to trigger this in latest -rc kernel.
Comment 10 Zhang Rui 2020-07-02 12:09:59 UTC
This should be introduced by the following commit which is shipped in 5.8-rc1.
commit 0c80cdbf33207c7aea1aec9027ad8011bd6178ad
Author:     Kai-Heng Feng <kai.heng.feng@canonical.com>
AuthorDate: Thu May 7 14:25:45 2020 +0800
Commit:     Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CommitDate: Fri May 22 18:21:52 2020 -0700

    e1000e: Warn if disabling ULP failed
    
    The hardware may stop working if driver failed to disable ULP mode.
    
    Take the return value of e1000_disable_ulp_lpt_lp() into account, and
    pass up the error if it fails.
    
    Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
    Tested-by: Aaron Brown <aaron.f.brown@intel.com>
    Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Actually, this regression has been detected/bisected and reported by 0Day before in upstream.
According to Kai-Heng, "the patch does catch issues previously ignored."
And now the problems is that, we used to run 2000 suspend-resume iterations, but now it fails at around 10 suspend-resume cycles, and suspend-to-mem will never work again until a reboot.
And on my NUC machine, I even don't have network cable plugged.
Comment 11 Zhang Rui 2020-10-29 02:57:46 UTC
The regression introduced by 0c80cdbf33207c7aea1aec9027ad8011bd6178ad has been fixed. But still, we have the original problem.

Now, there are three symptons we've observed.
1. during suspend we can see
[ 1663.694828] e1000e 0000:00:19.0 eth0: Failed to disable ULP
[ 1664.731040] asix 2-3:1.0 eth1: link up, 100Mbps, full-duplex, lpa 0xC1E1
[ 1665.093513] e1000e 0000:00:19.0 eth0: Hardware Error
[ 1665.596760] e1000e 0000:00:19.0: pci_pm_resume+0x0/0x80 returned 0 after 2975399 usecs
and then the system never recovers from it, and all the following suspend failed due to this
[22909.393854] PM: pci_pm_suspend(): e1000e_pm_suspend+0x0/0x760 [e1000e] returns -2
[22909.393858] PM: dpm_run_callback(): pci_pm_suspend+0x0/0x160 returns -2
[22909.393861] PM: Device 0000:00:1f.6 failed to suspend async: error -2

2. it is very similar to symptom 1, but the only difference is that, e1000e seems to recover after the first suspend failure, and all the subsequential suspend continues to work.

3. on some other platforms, we can only see 
[  239.783549] e1000e 0000:00:19.0 eth0: Failed to disable ULP
[  239.861701] e1000e 0000:00:19.0: pci_pm_resume+0x0/0x80 returned 0 after 325716 usecs
during resume, but the next suspend seems to work just fine.
And this is 100%, 60%, and 1% reproducible during the whole stress test (around 2000 suspend cycles), on three different platforms.
Comment 12 Chen Yu 2020-11-04 07:52:46 UTC
Proposed a patchset to disable power management of e1000e if the hardware error was detected during resume. So that we don't have to suspend the broken NIC in next system suspend.
Comment 13 Sasha Neftin 2020-11-04 07:52:59 UTC
Created attachment 293431 [details]
attachment-3360-0.html

Out of office. Expected delayed response
Comment 14 Chen Yu 2020-11-04 07:56:17 UTC
Created attachment 293433 [details]
e1000e patch 1   [e1000e-save-the-return-value-of-e1000e_reset]
Comment 15 Chen Yu 2020-11-04 08:02:54 UTC
Created attachment 293435 [details]
e1000e patch 2   [PM-sleep-export-device_pm_remove-for-driver-use]
Comment 16 Chen Yu 2020-11-04 08:03:48 UTC
Created attachment 293437 [details]
e1000e patch 3   [e1000e-Introduce-workqueue-to-disable-the-power-mana]
Comment 17 Chen Yu 2020-11-04 08:04:34 UTC
Created attachment 293439 [details]
e1000e patch 4  [e1000e-Disable-the-power-management-if-hardware-erro]
Comment 18 Venkat Tadikonda 2023-02-22 07:37:19 UTC
any fix for this issue in upstream kernel? can you please kernel version with fix for this?
Comment 19 Chen Yu 2023-02-22 08:22:14 UTC
Created attachment 303770 [details]
A hacky patch to ignore the return error when setting WOL feature during suspend.