Bug 214821
Summary: | ADL e1000e "Detected Hardware Unit Hang" after system resume | ||
---|---|---|---|
Product: | Drivers | Reporter: | Kai-Heng Feng (kai.heng.feng) |
Component: | Network | Assignee: | drivers_network (drivers_network) |
Status: | NEW --- | ||
Severity: | normal | CC: | alexander.usyskin, amir.avivi, avi.shalev, claudio.glickman, devora.fuxbrumer, dima.ruinskiy, loic.yhuel, michael.edri, nechamax.kraus, nir.efrati, sasha.neftin, zhangfp1 |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | mainline, linux-next | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
dmesg
lspci
attachment-13473-0.html
enforce the CSME settings
debug prints for CSME none provisioned
dmesg with debug prints
debug prints for CSME none provisioned 2
dmesg with debug prints v2
attachment-5844-0.html
v1-0001-e1000e-Fix-possible-HW-unit-hang-after-an-s0ix-ex.patch
Enable the GPT clock |
Description
Kai-Heng Feng
2021-10-26 06:32:57 UTC
Created attachment 299317 [details]
dmesg
Created attachment 299319 [details]
lspci
Created attachment 299321 [details]
attachment-13473-0.html

Please provide us with the output of dmidecode -t 0 and cat /sys/class/mei/mei0/fw_ver.

Please provide the output of:
dmidecode -t 0
cat /sys/class/mei/mei0/fw_ver

Created attachment 299399 [details]
enforce the CSME settings
1. Use the private flag to disable the s0ix flow on the problematic system (power consumption will increase):
ethtool --set-priv-flags enp0s31f6 s0ix-enabled off
ethtool --show-priv-flags enp0s31f6
Private flags for enp0s31f6:
s0ix-enabled: off
2. Add a delay, as you suggested. I think this is less preferable.
3. I would like to suggest enforcing the CSME settings in case DPG_EXIT_DONE is not cleared.

Kai-Heng,
could you please check: when DPG_EXIT_DONE has not cleared on your system, what are the values of:

Extended Device Control Register 0x00018, bit 11
SMBus Control Register, PHY Address 01, Page 769, Register 23

They will tell us whether the CSME tried to transition to SMB mode or not.

(In reply to Sasha Neftin from comment #6)
> Created attachment 299399 [details]
> enforce the CSME settings

This patch doesn't fix the issue on affected ADL and TGL systems.

(In reply to Sasha Neftin from comment #8)
> Kai-Heng,
> could you please check: when DPG_EXIT_DONE has not cleared on your
> system, what are the values of:
>
> Extended Device Control Register 0x00018, bit 11
> SMBus Control Register, PHY Address 01, Page 769, Register 23
>
> They will tell us whether the CSME tried to transition to SMB mode or not.

Is there any quick way, like lspci, to read these values?

Created attachment 299407 [details]
debug prints for CSME none provisioned
(In reply to Kai-Heng Feng from comment #10)
> (In reply to Sasha Neftin from comment #8)
> > Kai-Heng,
> > could you please check: when DPG_EXIT_DONE has not cleared on your
> > system, what are the values of:
> >
> > Extended Device Control Register 0x00018, bit 11
> > SMBus Control Register, PHY Address 01, Page 769, Register 23
> >
> > They will tell us whether the CSME tried to transition to SMB mode or not.
>
> Is there any quick way, like lspci, to read these values?

I've added an example. Let's do that before sending the H2ME request to unconfigure s0ix.

Let's proceed as follows:
1. Take NPK logs in the failing case.
2. Take NPK logs with the "delay" after sending the unconfigure request to the CSME.
We need to understand which CSME state we hit. (I believe the CSME does not complete some flow, but we need to understand which one.) I will ask our PAE engineer to contact you and assist.

Created attachment 299415 [details]
dmesg with debug prints
[ 324.696750] e1000e: CV_SMB_CTRL register: 0x00008000
[ 324.696750] e1000e: CTRL_EXT register: 0x815a102e

=> The CSME did not perform the transition to SMB mode. Let's check what the MAC-PHY interface mode is a bit later. I've added examples with the same prints at the end of the 'e1000e_s0ix_exit' method.

Anyway, we will need NPK logs (for the CSME) from this system for further analysis. (I will ask our customer support engineer to assist.)

Created attachment 299433 [details]
debug prints for CSME none provisioned 2
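For anyone who wants to repeat the check on their own debug prints, here is a minimal sketch of how the two register values quoted above can be decoded. The CTRL_EXT bit position (bit 11) comes from this thread. The CV_SMB_CTRL bit (bit 0) and the page/register encoding are assumptions recalled from the e1000e driver headers, so please verify them against your kernel tree before relying on the result.

```python
# Decode the "force SMBus" indications from the two debug-print values.
# Bit 11 of CTRL_EXT is taken from this bug report; everything marked
# "assumed" is recalled from the e1000e headers and should be verified.

CTRL_EXT_FORCE_SMBUS = 1 << 11      # Extended Device Control Register, bit 11
CV_SMB_CTRL_FORCE_SMBUS = 0x0001    # assumed: bit 0 of the PHY SMBus Control Register

PHY_PAGE_SHIFT = 5                  # assumed: driver packs (page, register) as page << 5 | reg
MAX_PHY_REG_ADDRESS = 0x1F          # assumed: register index mask


def phy_reg(page: int, reg: int) -> int:
    """Pack a PHY page and register index the way the driver is assumed to."""
    return (page << PHY_PAGE_SHIFT) | (reg & MAX_PHY_REG_ADDRESS)


def smbus_state(ctrl_ext: int, cv_smb_ctrl: int) -> dict:
    """Report whether the MAC and the PHY are forced into SMBus mode."""
    return {
        "mac_force_smbus": bool(ctrl_ext & CTRL_EXT_FORCE_SMBUS),
        "phy_force_smbus": bool(cv_smb_ctrl & CV_SMB_CTRL_FORCE_SMBUS),
    }


if __name__ == "__main__":
    # Values copied from the dmesg lines in this report.
    print(f"CV_SMB_CTRL packed address: {phy_reg(769, 23):#06x}")
    print(smbus_state(ctrl_ext=0x815A102E, cv_smb_ctrl=0x00008000))
```

With the values from this report both indications come out false, which matches the conclusion that the CSME did not perform the transition to SMB mode.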
Created attachment 299435 [details]
dmesg with debug prints v2
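As a side note on workaround 1 from earlier in this report (turning the s0ix-enabled private flag off), the following is a small sketch for checking the flag from a script, for example in a suspend/resume test loop. The helper names parse_priv_flags and s0ix_enabled are made up for illustration. The only external command used is "ethtool --show-priv-flags", and the parser assumes the output layout shown in this report.

```python
import subprocess
from typing import Dict, Optional


def parse_priv_flags(output: str) -> Dict[str, bool]:
    """Parse 'ethtool --show-priv-flags' output into {flag name: enabled}."""
    flags = {}
    for line in output.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        # The header line ("Private flags for <if>:") has an empty value
        # after the last colon, so it is skipped here.
        if sep and value in ("on", "off"):
            flags[name.strip()] = value == "on"
    return flags


def s0ix_enabled(ifname: str) -> Optional[bool]:
    """Return the s0ix-enabled flag state, or None if the driver lacks the flag."""
    result = subprocess.run(
        ["ethtool", "--show-priv-flags", ifname],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_priv_flags(result.stdout).get("s0ix-enabled")
```

Calling s0ix_enabled("enp0s31f6") after applying the workaround should return False. A None result would mean the flag was not listed at all, which is a different situation from the flag being off.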
I have a similar issue on a Dell Latitude 5521 (TGL-H, I219-LM).
BIOS 1.7.1 (latest)
ME version: 15.0.30.1902

After resuming, the link goes down/up several times (with a "PHY Wakeup cause - Link Status Change" message too), and I sometimes get "Detected Hardware Unit Hang", but not always.
Even if it stops with the link being up, it doesn't work.
The interface only works after unplugging the Ethernet cable, waiting a few seconds, then plugging it back in.

With "ethtool --set-priv-flags enp0s31f6 s0ix-enabled off", I only have one link down/up on resume, and it seems to work.
But sometimes it's at the wrong speed: "NIC Link is Up 10 Mbps Half Duplex, Flow Control: None" instead of "NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx".

With "ethtool -s enp0s31f6 wol d" (i.e. the link being down on suspend), I get fewer down/up messages on resume, but most of the time I still need to unplug the cable.

Created attachment 300207 [details]
attachment-5844-0.html
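To compare resume attempts more objectively, the kernel messages quoted in the TGL-H report above can be counted with a small script. This is only a sketch: the function name summarize_resume_log is made up, the patterns are taken from the message texts quoted in this report, and the 1000 Mbps expectation is an assumption about the link partner.

```python
import re

# Patterns taken from the messages quoted in this bug report.
LINK_UP_RE = re.compile(
    r"NIC Link is Up (\d+) Mbps (Full|Half) Duplex, Flow Control: (\S+)"
)
HANG_MARKER = "Detected Hardware Unit Hang"


def summarize_resume_log(log_text, expected_mbps=1000):
    """Count hang reports and collect link-up events from kernel log text."""
    hangs = 0
    link_ups = []
    for line in log_text.splitlines():
        if HANG_MARKER in line:
            hangs += 1
        match = LINK_UP_RE.search(line)
        if match:
            link_ups.append((int(match.group(1)), match.group(2), match.group(3)))
    # A link-up below the expected speed or at half duplex is flagged as degraded.
    degraded = [up for up in link_ups if up[0] < expected_mbps or up[1] != "Full"]
    return {"hangs": hangs, "link_ups": link_ups, "degraded": degraded}
```

Feeding it the text of dmesg captured after a resume gives the number of hang reports and every negotiated speed, so a "10 Mbps Half Duplex" result shows up in the degraded list instead of being noticed by eye.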
Created attachment 300209 [details]
v1-0001-e1000e-Fix-possible-HW-unit-hang-after-an-s0ix-ex.patch
Engineering version
(In reply to Loïc Yhuel from comment #18)
> I have a similar issue on a Dell Latitude 5521 (TGL-H, I219-LM).
> BIOS 1.7.1 (latest)
> ME version : 15.0.30.1902
>
> After resuming, the link goes down/up several times (with "PHY Wakeup cause
> - Link Status Change" message too), and I sometimes get "Detected Hardware
> Unit Hang", but not always.
> Even if it stops with the link being up, it doesn't work.
> The interface only works after unplugging the Ethernet cable, waiting a few
> seconds, then plugging it back.
>
> With "ethtool --set-priv-flags enp0s31f6 s0ix-enabled off", I only have one
> link down/up on resume, and it seems to work.
> But sometimes it's at the wrong speed : "NIC Link is Up 10 Mbps Half Duplex,
> Flow Control: None" instead of "NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx".
>
> With "ethtool -s enp0s31f6 wol d" (ie the link being down on suspend), I get
> less down/up messages on resume, but most of the time I still need to unplug
> the cable.

Unfortunately, TGL with AMT/CSME is not supported. Please contact your PC vendor and get a consumer IFWI/BIOS (no AMT/CSME).

(In reply to Kai-Heng Feng from comment #17)
> Created attachment 299435 [details]
> dmesg with debug prints v2

I've attached a candidate fix for the problem on ADL (still under testing in our labs). This patch should be applied on top of:
e1000e: Separate ADP board type from TGP
e1000e: Handshake with CSME starts from ADL platforms
(from https://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue.git/log/drivers/net/ethernet/intel/e1000e?h=dev-queue)

(In reply to Sasha Neftin from comment #21)
> unfortunately, TGL with AMT/CSME not supported. Please, contact your PC
> vendor and get IFWI/BIOS consumer (no AMT/CSME)

What do you mean by not supported?
Is it only on Linux, or does Dell enable AMT on unsupported platforms?
I can disable AMT in the BIOS setup, but it doesn't change anything.
What would happen if we modify e1000e_s0ix_entry_flow / e1000e_s0ix_exit_flow to follow the driver path on TGL, instead of the CSME path (like reverting "e1000e: Add handshake with the CSME to support S0ix")?
Even if this breaks AMT, IMHO it's better than breaking Linux.

(In reply to Loïc Yhuel from comment #23)
> (In reply to Sasha Neftin from comment #21)
> > unfortunately, TGL with AMT/CSME not supported. Please, contact your PC
> > vendor and get IFWI/BIOS consumer (no AMT/CSME)
>
> What do you mean by not supported ?
> Is it only on Linux, or does Dell enable AMT on unsupported platforms ?
> I can disable AMT in the BIOS setup, but it doesn't change anything.
>
> What would happen if we modify e1000e_s0ix_entry_flow /
> e1000e_s0ix_exit_flow to follow the driver path on TGL, instead of the CSME
> path (like reverting "e1000e: Add handshake with the CSME to support S0ix") ?

We have submitted two patches specifically for this (legacy path for TGL):
e1000e: Separate ADP board type from TGP
e1000e: Handshake with CSME starts from ADL platforms

> Even if this breaks AMT, IMHO it's better than breaking Linux.

(In reply to Sasha Neftin from comment #24)
> we have submitted two patches specially for this (legacy path for TGL):
> e1000e: Separate ADP board type from TGP
> e1000e: Handshake with CSME starts from ADL platforms

With these patches on top of 5.15.13, Ethernet works on resume.
But S0ix seems to have issues:
- after reloading the module, with the cable unplugged: GBE_PG_STS is 1 (there are still Int_Timer_SS_Wake*_Pol_STS entries preventing S0ix in substate_requirements, and others in traces, but that's probably not related to the driver)
- with the cable plugged: GBE_PG_STS is 0
- with the cable unplugged, after having plugged it at least once since the driver reload: no PC10

Created attachment 300804 [details]
Enable the GPT clock
Attached 'v1-0001-e1000e-Enable-the-GPT-clock-before-sending-messag.patch'.
I believe this one works even without 'e1000e: Fix possible HW unit hang after an s0ix exit' (1866aa0d0d6492bc2f8d22d0df49abaccf50cddd).
I am continuing my investigation and am considering reverting or reworking (moving to the s0ix flow only) the 1866aa0d0d64 patch.
Target: ADL platforms. No TGL.