Created attachment 285203 [details] issue.def e1000e hardware error during resume causes all subsequent suspends to fail. Appears to only happen in S3 mem suspend.
Created attachment 285345 [details] whl-u_mem_#69_ERR.html
Created attachment 285347 [details] whl-u_mem_#70_POSTERR.html All subsequent runs look like this til reboot
Created attachment 286049 [details] Reproduced on Kaby Lake Intel NUC This is common on the Intel NUC (Kaby Lake). In this example, the e1000e died on the 9th suspend, and refused to suspend on all subsequent (> 3,500) suspends.
Created attachment 286051 [details] Reproduced on Gigabyte desktop board e1000e failed on test 1477 on this motherboard, and all of the remaining suspend tests (> 2,100) failed because e1000e failed to suspend. [11395.312240] ata6: SATA link down (SStatus 4 SControl 300) [11395.328675] ata1.00: configured for UDMA/133 [11396.296312] e1000e 0000:00:1f.6 enp0s31f6: PHY Wakeup cause - Unicast Packet [11397.165686] e1000e 0000:00:1f.6 enp0s31f6: Hardware Error [11397.661768] PM: resume of devices complete after 2723.421 msecs [11397.662630] acpi LNXPOWER:16: Turning OFF [11397.662686] acpi LNXPOWER:15: Turning OFF
Created attachment 286053 [details] Reproduced on Lenovo T61 laptop On this laptop, the e1000e has no cable connected, as WIFI was in use. The e1000e hardware error occurred on the 1st resume. [ 388.380449] ata3.00: configured for UDMA/133 [ 388.424246] usb 3-2: reset full-speed USB device number 2 using uhci_hcd [ 388.472636] e1000e 0000:00:19.0 enp0s25: Hardware Error [ 388.708283] firewire_core 0000:15:00.1: rediscovered device fw0 All (2,000) subsequent suspends "worked", but the e1000e took over 9,000 ms to suspend! All subsequent resumes saw the same hardware error: [ 414.812541] ata3.00: configured for UDMA/133 [ 414.848253] usb 3-2: reset full-speed USB device number 2 using uhci_hcd [ 414.908007] e1000e 0000:00:19.0 enp0s25: Hardware Error [ 415.144303] firewire_core 0000:15:00.1: rediscovered device fw0
Created attachment 286055 [details] Lenovo T16 details on 9,000 ms delay On the Lenovo T61, the 9,000 ms delay comes from e1000_acquire_swflag_ich8lan(). First it does 1,000 udelay(1), followed by 8,000 udelay(1). This is currently happening if the links is present or not, and the e1000e is getting the error on boot -- so maybe the NIC had died and it is not failing gradefully? rmmod e1000e results in suspend taking 400 ms, instead of 9,000 ms.
Re: T61 The T61 e1000e doesn't *always* fail. Rebooting the same 5.4.0 kernel as above on the same T61... It completed the requested 2,002 suspend/resume cycles with the cable connected the whole time, ethernet working before, during and after. So given that it sometimes functions properly, what about the performance? The maximum e1000e suspend time: 187 ms The maximum e1000e resume time: 386 ms These numbers are too high, and under some conditions, the e1000e would be the slowest device in the system, by a large margin. But this performance issue is not the subject of this bug report.
Found similar issue during suspend to mem on a BDW-NUC platform: [ 1663.694828] e1000e 0000:00:19.0 eth0: Failed to disable ULP [ 1664.731040] asix 2-3:1.0 eth1: link up, 100Mbps, full-duplex, lpa 0xC1E1 [ 1665.093513] e1000e 0000:00:19.0 eth0: Hardware Error [ 1665.596760] e1000e 0000:00:19.0: pci_pm_resume+0x0/0x80 returned 0 after 2975399 usecs
The problem seems to be more serious. I scheduled more than 10 rounds of tests for each commit, and each test run 10 suspend-to-mem iterations, and the problem is reproduced v5.5 0/10 v5.6 0/10 v5.7 0/28 v5.8-rc1 16/28 v5.8-rc3 15/20 So previously, the problem is hard to reproduce, but it seems that it is really easy to trigger this in latest -rc kernel.
This should be introduced by the following commit which is shipped in 5.8-rc1. commit 0c80cdbf33207c7aea1aec9027ad8011bd6178ad Author: Kai-Heng Feng <kai.heng.feng@canonical.com> AuthorDate: Thu May 7 14:25:45 2020 +0800 Commit: Jeff Kirsher <jeffrey.t.kirsher@intel.com> CommitDate: Fri May 22 18:21:52 2020 -0700 e1000e: Warn if disabling ULP failed The hardware may stop working if driver failed to disable ULP mode. Take the return value of e1000_disable_ulp_lpt_lp() into account, and pass up the error if it fails. Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Actually, this regression has been detected/bisected and reported by 0Day before in upstream. According to Kai-Heng, "the patch does catch issues previously ignored." And now the problems is that, we used to run 2000 suspend-resume iterations, but now it fails at around 10 suspend-resume cycles, and suspend-to-mem will never work again until a reboot. And on my NUC machine, I even don't have network cable plugged.
The regression introduced by 0c80cdbf33207c7aea1aec9027ad8011bd6178ad has been fixed. But still, we have the original problem. Now, there are three symptons we've observed. 1. during suspend we can see [ 1663.694828] e1000e 0000:00:19.0 eth0: Failed to disable ULP [ 1664.731040] asix 2-3:1.0 eth1: link up, 100Mbps, full-duplex, lpa 0xC1E1 [ 1665.093513] e1000e 0000:00:19.0 eth0: Hardware Error [ 1665.596760] e1000e 0000:00:19.0: pci_pm_resume+0x0/0x80 returned 0 after 2975399 usecs and then the system never recovers from it, and all the following suspend failed due to this [22909.393854] PM: pci_pm_suspend(): e1000e_pm_suspend+0x0/0x760 [e1000e] returns -2 [22909.393858] PM: dpm_run_callback(): pci_pm_suspend+0x0/0x160 returns -2 [22909.393861] PM: Device 0000:00:1f.6 failed to suspend async: error -2 2. it is very similar to symptom 1, but the only difference is that, e1000e seems to recover after the first suspend failure, and all the subsequential suspend continues to work. 3. on some other platforms, we can only see [ 239.783549] e1000e 0000:00:19.0 eth0: Failed to disable ULP [ 239.861701] e1000e 0000:00:19.0: pci_pm_resume+0x0/0x80 returned 0 after 325716 usecs during resume, but the next suspend seems to work just fine. And this is 100%, 60%, and 1% reproducible during the whole stress test (around 2000 suspend cycles), on three different platforms.
Proposed a patchset to disable power management of e1000e if the hardware error was detected during resume. So that we don't have to suspend the broken NIC in next system suspend.
Created attachment 293431 [details] attachment-3360-0.html Out of office. Expected delayed response
Created attachment 293433 [details] e1000e patch 1 [e1000e-save-the-return-value-of-e1000e_reset]
Created attachment 293435 [details] e1000e patch 2 [PM-sleep-export-device_pm_remove-for-driver-use]
Created attachment 293437 [details] e1000e patch 3 [e1000e-Introduce-workqueue-to-disable-the-power-mana]
Created attachment 293439 [details] e1000e patch 4 [e1000e-Disable-the-power-management-if-hardware-erro]
any fix for this issue in upstream kernel? can you please kernel version with fix for this?
Created attachment 303770 [details] A hacky patch to ignore the return error when setting WOL feature during suspend.