Bug 214821 - ADL e1000e "Detected Hardware Unit Hang" after system resume
Summary: ADL e1000e "Detected Hardware Unit Hang" after system resume
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_network@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-10-26 06:32 UTC by Kai-Heng Feng
Modified: 2022-04-26 07:55 UTC (History)
CC List: 12 users

See Also:
Kernel Version: mainline, linux-next
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg (82.74 KB, text/plain)
2021-10-26 06:33 UTC, Kai-Heng Feng
lspci (28.07 KB, text/plain)
2021-10-26 06:34 UTC, Kai-Heng Feng
attachment-13473-0.html (1.78 KB, text/html)
2021-10-26 08:40 UTC, amir.avivi
enforce the CSME settings (1.26 KB, application/mbox)
2021-11-02 06:55 UTC, Sasha Neftin
debug prints for CSME none provisioned (2.30 KB, application/mbox)
2021-11-02 17:09 UTC, Sasha Neftin
dmesg with debug prints (81.70 KB, text/plain)
2021-11-03 06:53 UTC, Kai-Heng Feng
debug prints for CSME none provisioned 2 (2.75 KB, application/mbox)
2021-11-03 10:37 UTC, Sasha Neftin
dmesg with debug prints v2 (84.75 KB, text/plain)
2021-11-03 11:02 UTC, Kai-Heng Feng
attachment-5844-0.html (2.64 KB, text/html)
2022-01-03 02:44 UTC, Peter Zhang
v1-0001-e1000e-Fix-possible-HW-unit-hang-after-an-s0ix-ex.patch (4.20 KB, application/mbox)
2022-01-03 07:39 UTC, Sasha Neftin
Enable the GPT clock (1.30 KB, application/mbox)
2022-04-26 07:50 UTC, Sasha Neftin

Description Kai-Heng Feng 2021-10-26 06:32:57 UTC
On ADL systems, Ethernet may stop working after system resume, and a "Detected Hardware Unit Hang" message appears in the kernel log.

This happens because DPG_EXIT_DONE is always flagged, so the polling logic exits immediately.
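
The failure mode described above can be illustrated with a small simulation. This is only a sketch, not the driver code: the bit position used for DPG_EXIT_DONE, the `read_exfwsm` callable, and the poll budget are assumptions chosen for illustration. It shows why a completion bit that is already set lets the poll return before the firmware has done any work.

```python
import itertools

# Assumed bit position, for illustration only; check the driver headers.
DPG_EXIT_DONE = 0x1


def poll_dpg_exit_done(read_exfwsm, max_polls=250):
    """Poll a status register until DPG_EXIT_DONE is set.

    read_exfwsm is a hypothetical callable returning the register value.
    Returns the number of reads that saw the bit clear, or None on timeout.
    """
    for polls in range(max_polls):
        if read_exfwsm() & DPG_EXIT_DONE:
            return polls
        # A real driver would sleep between reads; omitted in this sketch.
    return None


# Healthy case: the bit is clear at first, then firmware sets it.
healthy = itertools.chain([0x0, 0x0, 0x0], itertools.repeat(DPG_EXIT_DONE))
healthy_polls = poll_dpg_exit_done(lambda: next(healthy))

# Reported case: the bit is already set, so the loop exits at once and
# the caller proceeds before firmware has finished.
stale_polls = poll_dpg_exit_done(lambda: DPG_EXIT_DONE)

print(healthy_polls, stale_polls)
```

In the stale case the loop reports zero waiting reads, which matches the "exits immediately" behaviour in the description.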
Comment 1 Kai-Heng Feng 2021-10-26 06:33:38 UTC
Created attachment 299317 [details]
dmesg
Comment 2 Kai-Heng Feng 2021-10-26 06:34:02 UTC
Created attachment 299319 [details]
lspci
Comment 3 amir.avivi 2021-10-26 08:40:04 UTC
Created attachment 299321 [details]
attachment-13473-0.html

Thanks for your email, I will be OOO until November 6th.
You can contact me by Teams, but response might take some time.
For urgent issue, please talk to my manager shmuel.ben-nisan@intel.com.

Thanks,
Amir
Comment 4 Sasha Neftin 2021-10-26 12:44:18 UTC
Please provide the output of dmidecode -t 0 and cat /sys/class/mei/mei0/fw_ver.
Comment 5 Sasha Neftin 2021-10-26 12:51:30 UTC
Please provide the output of:
dmidecode -t 0
cat /sys/class/mei/mei0/fw_ver
Comment 6 Sasha Neftin 2021-11-02 06:55:52 UTC
Created attachment 299399 [details]
enforce the CSME settings
Comment 7 Sasha Neftin 2021-11-02 06:57:42 UTC
1. Use the private flags to disable the s0ix flow on the problematic system (power consumption will increase):
ethtool --set-priv-flags enp0s31f6 s0ix-enabled off
ethtool --show-priv-flags enp0s31f6
Private flags for enp0s31f6:
s0ix-enabled: off
2. Add a delay, as you suggested (less preferable, I think).
3. I would like to suggest enforcing the CSME settings in case DPG_EXIT_DONE is not cleared.
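
To confirm that the first workaround took effect from a script, the output of ethtool --show-priv-flags can be parsed. The sketch below assumes the output has the same shape as the sample quoted in this comment (a header line followed by "name: on|off" lines); the function name is hypothetical.

```python
def parse_priv_flags(output):
    """Parse `ethtool --show-priv-flags` text into {flag_name: bool}."""
    flags = {}
    for line in output.splitlines():
        name, sep, state = line.rpartition(":")
        state = state.strip()
        # Skip the "Private flags for <iface>:" header and blank lines.
        if not sep or state not in ("on", "off"):
            continue
        flags[name.strip()] = (state == "on")
    return flags


# Sample text copied from the comment above.
sample = """Private flags for enp0s31f6:
s0ix-enabled: off
"""
flags = parse_priv_flags(sample)
print(flags)
```

On a real system the text would come from running ethtool itself, for example through subprocess.run with capture_output=True and text=True.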
Comment 8 Sasha Neftin 2021-11-02 07:01:22 UTC
Kai-Heng,
could you please check: when DPG_EXIT_DONE has not been cleared on your system, what are the values of:

Extended Device Control Register 0x00018 bit 11
SMBus Control Register PHY Address 01, Page 769, Register 23

This will tell us whether the CSME tried to transition to SMB mode.
Comment 9 Kai-Heng Feng 2021-11-02 12:05:09 UTC
(In reply to Sasha Neftin from comment #6)
> Created attachment 299399 [details]
> enforce the CSME settings

This patch doesn't fix the issue on affected ADL and TGL systems.
Comment 10 Kai-Heng Feng 2021-11-02 12:05:59 UTC
(In reply to Sasha Neftin from comment #8)
> Kai-Heng,
> could you check to please: when DGP_EXIT_DONE has not cleared on your
> system, what is values in:
> 
> Extended Device Control Register 0x00018 bit 11
> SMBus Control Register PHY Address 01, Page 769, Register 23
> 
> it will tell us if the CSME tried transit to SMB mode or not.

Any quick way like lspci to read these values?
Comment 11 Sasha Neftin 2021-11-02 17:09:26 UTC
Created attachment 299407 [details]
debug prints for CSME none provisioned
Comment 12 Sasha Neftin 2021-11-02 17:12:58 UTC
(In reply to Kai-Heng Feng from comment #10)
> (In reply to Sasha Neftin from comment #8)
> > Kai-Heng,
> > could you check to please: when DGP_EXIT_DONE has not cleared on your
> > system, what is values in:
> > 
> > Extended Device Control Register 0x00018 bit 11
> > SMBus Control Register PHY Address 01, Page 769, Register 23
> > 
> > it will tell us if the CSME tried transit to SMB mode or not.
> 
> Any quick way like lspci to read these values?
I've attached an example (debug prints). Let's read these registers before sending the H2ME request to unconfigure s0ix.
Comment 13 Sasha Neftin 2021-11-03 05:53:15 UTC
Let's proceed as follows:
1. Take NPK logs in the failing case
2. Take NPK logs with a "delay" added after sending the unconfigure request to the CSME

We need to understand which CSME state we are hitting. (I believe the CSME does not complete some flow, but we need to understand which one.)
I will ask our PAE engineer to contact you and assist.
Comment 14 Kai-Heng Feng 2021-11-03 06:53:01 UTC
Created attachment 299415 [details]
dmesg with debug prints
Comment 15 Sasha Neftin 2021-11-03 10:36:33 UTC
[  324.696750] e1000e: CV_SMB_CTRL register: 0x00008000
[  324.696750] e1000e: CTRL_EXT register: 0x815a102e
=> CSME did not perform the transition to SMB mode.
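
For readers following along, the two values can be decoded by hand. The sketch below checks bit 11 of CTRL_EXT, as requested in comment 8. Treating bit 0 of CV_SMB_CTRL as the force-SMBus flag is an assumption on my part and should be checked against the driver headers; the helper names are hypothetical.

```python
# Bit 11 of CTRL_EXT is the bit asked about in comment 8.
CTRL_EXT_BIT = 11
# Assumption: the force-SMBus flag in CV_SMB_CTRL is bit 0.
CV_SMB_CTRL_BIT = 0


def bit_is_set(value, bit):
    """Return True if the given bit position is set in value."""
    return bool((value >> bit) & 1)


def set_bits(value):
    """Return the positions of all set bits, lowest first."""
    return [b for b in range(value.bit_length()) if (value >> b) & 1]


ctrl_ext = 0x815A102E     # CTRL_EXT value from the dmesg lines
cv_smb_ctrl = 0x00008000  # CV_SMB_CTRL value from the dmesg lines

print("CTRL_EXT bit 11 set:", bit_is_set(ctrl_ext, CTRL_EXT_BIT))
print("CV_SMB_CTRL bit 0 set:", bit_is_set(cv_smb_ctrl, CV_SMB_CTRL_BIT))
print("CTRL_EXT set bits:", set_bits(ctrl_ext))
```

Both checked bits are clear in these values, which is consistent with the conclusion above.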

Let's check what the MAC-PHY interface mode is a bit later. I've added examples with the same prints at the end of the 'e1000e_s0ix_exit' method.

In any case, we will need NPK logs (for the CSME) from this system for further analysis. (I will ask our customer support engineer to assist.)
Comment 16 Sasha Neftin 2021-11-03 10:37:20 UTC
Created attachment 299433 [details]
debug prints for CSME none provisioned 2
Comment 17 Kai-Heng Feng 2021-11-03 11:02:10 UTC
Created attachment 299435 [details]
dmesg with debug prints v2
Comment 18 Loïc Yhuel 2022-01-03 02:43:43 UTC
I have a similar issue on a Dell Latitude 5521 (TGL-H, I219-LM).
BIOS 1.7.1 (latest)
ME version : 15.0.30.1902

After resuming, the link goes down/up several times (with "PHY Wakeup cause - Link Status Change" message too), and I sometimes get "Detected Hardware Unit Hang", but not always.
Even if it stops with the link being up, it doesn't work.
The interface only works after unplugging the Ethernet cable, waiting a few seconds, then plugging it back.

With "ethtool --set-priv-flags enp0s31f6 s0ix-enabled off", I only have one link down/up on resume, and it seems to work.
But sometimes it's at the wrong speed : "NIC Link is Up 10 Mbps Half Duplex, Flow Control: None" instead of "NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx".

With "ethtool -s enp0s31f6 wol d" (i.e. the link being down on suspend), I get fewer down/up messages on resume, but most of the time I still need to unplug the cable.
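
A small log check can make the wrong-speed case easier to spot across many suspend/resume cycles. This is a sketch that assumes the kernel log lines contain the "NIC Link is Up ..." text exactly as quoted above; the function names and the 1000 Mbps expectation are illustrative.

```python
import re

# Pattern built from the two messages quoted in this comment.
LINK_RE = re.compile(
    r"NIC Link is Up (?P<speed>\d+) Mbps (?P<duplex>Full|Half) Duplex, "
    r"Flow Control: (?P<flow>.+)$"
)


def parse_link_up(line):
    """Return (speed_mbps, duplex, flow_control) or None if no match."""
    m = LINK_RE.search(line)
    if m is None:
        return None
    return int(m.group("speed")), m.group("duplex"), m.group("flow").strip()


def is_degraded(line, expected_speed=1000):
    """True if the line reports a link slower than expected or half duplex."""
    link = parse_link_up(line)
    return link is not None and (link[0] < expected_speed or link[1] != "Full")


bad = "NIC Link is Up 10 Mbps Half Duplex, Flow Control: None"
good = "NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx"
print(parse_link_up(bad), is_degraded(bad))
print(parse_link_up(good), is_degraded(good))
```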
Comment 19 Peter Zhang 2022-01-03 02:44:03 UTC
Created attachment 300207 [details]
attachment-5844-0.html

Out of office from Dec 31 to Jan 3rd for New Year holiday.
Please call me if it's urgent  : (+86) 181-1611-8005.

Thanks,
Peter
Comment 20 Sasha Neftin 2022-01-03 07:39:49 UTC
Created attachment 300209 [details]
v1-0001-e1000e-Fix-possible-HW-unit-hang-after-an-s0ix-ex.patch

v1-0001-e1000e-Fix-possible-HW-unit-hang-after-an-s0ix-ex.patch
Engineering version
Comment 21 Sasha Neftin 2022-01-03 07:41:18 UTC
(In reply to Loïc Yhuel from comment #18)
> I have a similar issue on a Dell Latitude 5521 (TGL-H, I219-LM).
> BIOS 1.7.1 (latest)
> ME version : 15.0.30.1902
> 
> After resuming, the link goes down/up several times (with "PHY Wakeup cause
> - Link Status Change" message too), and I sometimes get "Detected Hardware
> Unit Hang", but not always.
> Even if it stops with the link being up, it doesn't work.
> The interface only works after unplugging the Ethernet cable, waiting a few
> seconds, then plugging it back.
> 
> With "ethtool --set-priv-flags enp0s31f6 s0ix-enabled off", I only have one
> link down/up on resume, and it seems to work.
> But sometimes it's at the wrong speed : "NIC Link is Up 10 Mbps Half Duplex,
> Flow Control: None" instead of "NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx".
> 
> With "ethtool -s enp0s31f6 wol d" (ie the link being down on suspend), I get
> less down/up messages on resume, but most of the time I still need to unplug
> the cable.
Unfortunately, TGL with AMT/CSME is not supported. Please contact your PC vendor and get a consumer IFWI/BIOS (no AMT/CSME).
Comment 22 Sasha Neftin 2022-01-03 07:44:31 UTC
(In reply to Kai-Heng Feng from comment #17)
> Created attachment 299435 [details]
> dmesg with debug prints v2

I've attached a candidate fix for the problem on ADL (still under testing in our labs).
This patch should be applied on top of:
e1000e: Separate ADP board type from TGP
e1000e: Handshake with CSME starts from ADL platforms

(from https://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue.git/log/drivers/net/ethernet/intel/e1000e?h=dev-queue)
Comment 23 Loïc Yhuel 2022-01-03 16:57:11 UTC
(In reply to Sasha Neftin from comment #21)
> unfortunately, TGL with AMT/CSME not supported. Please, contact your PC
> vendor and get IFWI/BIOS consumer (no AMT/CSME)

What do you mean by not supported ?
Is it only on Linux, or does Dell enable AMT on unsupported platforms ?
I can disable AMT in the BIOS setup, but it doesn't change anything.

What would happen if we modify e1000e_s0ix_entry_flow / e1000e_s0ix_exit_flow to follow the driver path on TGL, instead of the CSME path (like reverting "e1000e: Add handshake with the CSME to support S0ix") ?
Even if this breaks AMT, IMHO it's better than breaking Linux.
Comment 24 Sasha Neftin 2022-01-04 06:13:53 UTC
(In reply to Loïc Yhuel from comment #23)
> (In reply to Sasha Neftin from comment #21)
> > unfortunately, TGL with AMT/CSME not supported. Please, contact your PC
> > vendor and get IFWI/BIOS consumer (no AMT/CSME)
> 
> What do you mean by not supported ?
> Is it only on Linux, or does Dell enable AMT on unsupported platforms ?
> I can disable AMT in the BIOS setup, but it doesn't change anything.
> 
> What would happen if we modify e1000e_s0ix_entry_flow /
> e1000e_s0ix_exit_flow to follow the driver path on TGL, instead of the CSME
> path (like reverting "e1000e: Add handshake with the CSME to support S0ix") ?
We have submitted two patches specifically for this (legacy path for TGL):
e1000e: Separate ADP board type from TGP
e1000e: Handshake with CSME starts from ADL platforms
> Even if this breaks AMT, IMHO it's better than breaking Linux.
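
As a reading aid, the intent of the two patches named above might be sketched as follows. This is not the driver code: the enum values, the `csme_fw_valid` parameter, and the function name are hypothetical, and the condition is only my reading of the patch titles (CSME handshake from ADL onward, legacy driver path for TGL).

```python
from enum import IntEnum


class MacType(IntEnum):
    # Hypothetical values; only the relative order matters here.
    PCH_TGP = 1  # Tiger Lake (TGL)
    PCH_ADP = 2  # Alder Lake (ADL)


def s0ix_path(mac_type, csme_fw_valid):
    """Pick the S0ix flow: 'csme' handshake or the legacy 'driver' path."""
    # Assumption: the handshake is used only when CSME firmware is
    # present and the platform is ADL or newer.
    if csme_fw_valid and mac_type >= MacType.PCH_ADP:
        return "csme"
    return "driver"


print(s0ix_path(MacType.PCH_TGP, True))
print(s0ix_path(MacType.PCH_ADP, True))
```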
Comment 25 Loïc Yhuel 2022-01-15 22:51:51 UTC
(In reply to Sasha Neftin from comment #24)
> we have submitted two patches specially for this (legacy path for TGL):
> e1000e: Separate ADP board type from TGP
> e1000e: Handshake with CSME starts from ADL platforms

With these patches on top of 5.15.13, ethernet works on resume.

But S0ix seems to have issues :
 - after reloading the module, with the cable unplugged : GBE_PG_STS 1 (there are still Int_Timer_SS_Wake*_Pol_STS preventing S0ix in substate_requirements, and others in traces, but that's probably not related to the driver)
 - with the cable plugged : GBE_PG_STS is 0
 - with the cable unplugged, after having plugged it at least once since the driver reload : no PC10
Comment 26 Sasha Neftin 2022-04-26 07:50:50 UTC
Created attachment 300804 [details]
Enable the GPT clock
Comment 27 Sasha Neftin 2022-04-26 07:55:49 UTC
Attached 'v1-0001-e1000e-Enable-the-GPT-clock-before-sending-messag.patch'.
I believe this one works even without 'e1000e: Fix possible HW unit hang after an s0ix exit' (1866aa0d0d6492bc2f8d22d0df49abaccf50cddd).
I am continuing my investigation and am considering reverting or reworking (moving to the s0ix flow only) the 1866aa0d0d64 patch.
Target: ADL platforms only, not TGL.
