Bug 213667

Summary: e1000e leads to lockup while accessing its nvm
Product: Drivers
Component: Network
Reporter: AceLan Kao (acelan)
Assignee: drivers_network (drivers_network)
Status: RESOLVED CODE_FIX
Severity: normal
Priority: P1
CC: amir.avivi, avi.shalev, dima.ruinskiy, gianluca.pindinelli, marcinropa, martin.hamant, rex.tsai, sasha.neftin, vitaly.lifshits
Hardware: All
OS: Linux
Kernel Version: 5.10 & 5.13
Tree: Mainline
Regression: No
Attachments: dmesg.log
             v1-0001-e1000e-Do-not-take-care-about-NVM-checksum
             v1-0001-e1000e-Do-not-take-care-about-NVM-checksumBACK
             attachment-32179-0.html

Description AceLan Kao 2021-07-07 06:08:58 UTC
Created attachment 297775 [details]
dmesg.log

On a Dell Precision 7760, it takes around 2 minutes to boot to the desktop.
It looks like e1000e locks up the system while accessing its NVM.
Blacklisting e1000e and loading it after booting to the desktop does not reproduce the issue.

[   28.145341] u-Precision-7760 kernel: watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [systemd-udevd:235]
[   28.145342] u-Precision-7760 kernel: Modules linked in: hid_sensor_custom hid_sensor_hub intel_ishtp_loader intel_ishtp_hid nvidia_drm(POE) nvidia_modeset(POE) hid_generic nvidia(POE) i915(+) nvme nvme_core rtsx_pci_sdmmc i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec crc32_pclmul rc_core intel_lpss_pci psmouse e1000e(+) intel_lpss rtsx_pci xhci_pci idma64 i2c_hid vmd virt_dma drm thunderbolt i2c_i801 intel_ish_ipc i2c_smbus intel_ishtp wmi xhci_pci_renesas fjes(-) hid video pinctrl_tigerlake
[   28.145357] u-Precision-7760 kernel: CPU: 6 PID: 235 Comm: systemd-udevd Tainted: P        W  OE     5.10.0-1035-oem #36
[   28.145357] u-Precision-7760 kernel: Hardware name: Dell Inc. Precision 7760/, BIOS 1.0.0 05/07/2021
[   28.145364] u-Precision-7760 kernel: RIP: 0010:e1000_flash_cycle_ich8lan.constprop.0+0x5c/0x90 [e1000e]
[   28.145365] u-Precision-7760 kernel: Code: 00 00 0b 76 42 c1 e0 10 89 42 04 bb 81 96 98 00 eb 0f bf c7 10 00 00 e8 b2 1a b5 f0 83 eb 01 74 10 49 8b 44 24 10 66 8b 40 04 <41> 89 c5 a8 01 74 e1 41 83 e5 03 31 c0 5b 41 5c 41 80 fd 01 41 5d
[   28.145366] u-Precision-7760 kernel: RSP: 0018:ffff9fdf80513920 EFLAGS: 00000202
[   28.145367] u-Precision-7760 kernel: RAX: ffff9fdf809a4028 RBX: 0000000000349c5e RCX: 0000000000000006
[   28.145367] u-Precision-7760 kernel: RDX: 0000000000000c86 RSI: 0000000000000006 RDI: 0000000000000c5a
[   28.145368] u-Precision-7760 kernel: RBP: ffff9fdf80513938 R08: 0000001c2f5d11d1 R09: 0000000000000006
[   28.145368] u-Precision-7760 kernel: R10: ffff922857a21030 R11: 0000000000000000 R12: ffff922857a20f38
[   28.145369] u-Precision-7760 kernel: R13: 00000000809a4028 R14: 0000000000000000 R15: ffff922857a20f38
[   28.145369] u-Precision-7760 kernel: FS:  00007f098f44c880(0000) GS:ffff922fafd80000(0000) knlGS:0000000000000000
[   28.145370] u-Precision-7760 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   28.145370] u-Precision-7760 kernel: CR2: 0000563528b67010 CR3: 000000011607a006 CR4: 0000000000770ee0
[   28.145370] u-Precision-7760 kernel: PKRU: 55555554
[   28.145371] u-Precision-7760 kernel: Call Trace:
[   28.145375] u-Precision-7760 kernel:  e1000_erase_flash_bank_ich8lan+0xa0/0x1a0 [e1000e]
[   28.145380] u-Precision-7760 kernel:  e1000_update_nvm_checksum_spt+0x1f5/0x340 [e1000e]
[   28.145383] u-Precision-7760 kernel:  e1000_validate_nvm_checksum_ich8lan+0xa1/0xd0 [e1000e]
[   28.145390] u-Precision-7760 kernel:  e1000_probe+0x65f/0xc90 [e1000e]
[   28.145399] u-Precision-7760 kernel:  local_pci_probe+0x48/0x80
[   28.145400] u-Precision-7760 kernel:  pci_device_probe+0x10f/0x1c0
[   28.145402] u-Precision-7760 kernel:  really_probe+0xfb/0x420
[   28.145403] u-Precision-7760 kernel:  driver_probe_device+0xe9/0x160
[   28.145404] u-Precision-7760 kernel:  device_driver_attach+0x5d/0x70
[   28.145405] u-Precision-7760 kernel:  __driver_attach+0x8f/0x150
[   28.145406] u-Precision-7760 kernel:  ? device_driver_attach+0x70/0x70
[   28.145407] u-Precision-7760 kernel:  bus_for_each_dev+0x7e/0xc0
[   28.145408] u-Precision-7760 kernel:  driver_attach+0x1e/0x20
[   28.145408] u-Precision-7760 kernel:  bus_add_driver+0x152/0x1f0
[   28.145409] u-Precision-7760 kernel:  driver_register+0x74/0xd0
[   28.145410] u-Precision-7760 kernel:  ? 0xffffffffc04ee000
[   28.145411] u-Precision-7760 kernel:  __pci_register_driver+0x54/0x60
[   28.145415] u-Precision-7760 kernel:  e1000_init_module+0x3b/0x1000 [e1000e]
[   28.145416] u-Precision-7760 kernel:  do_one_initcall+0x48/0x1d0
[   28.145418] u-Precision-7760 kernel:  ? _cond_resched+0x19/0x30
[   28.145420] u-Precision-7760 kernel:  ? kmem_cache_alloc_trace+0x37a/0x430
[   28.145421] u-Precision-7760 kernel:  ? do_init_module+0x28/0x250
[   28.145422] u-Precision-7760 kernel:  do_init_module+0x62/0x250
[   28.145423] u-Precision-7760 kernel:  load_module+0x11ac/0x1370
[   28.145425] u-Precision-7760 kernel:  ? security_kernel_post_read_file+0x5c/0x70
[   28.145425] u-Precision-7760 kernel:  ? security_kernel_post_read_file+0x5c/0x70
[   28.145427] u-Precision-7760 kernel:  __do_sys_finit_module+0xc2/0x120
[   28.145427] u-Precision-7760 kernel:  ? __do_sys_finit_module+0xc2/0x120
[   28.145428] u-Precision-7760 kernel:  __x64_sys_finit_module+0x1a/0x20
[   28.145430] u-Precision-7760 kernel:  do_syscall_64+0x38/0x90
[   28.145431] u-Precision-7760 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   28.145431] u-Precision-7760 kernel: RIP: 0033:0x7f098f9ce89d
[   28.145432] u-Precision-7760 kernel: Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c3 f5 0c 00 f7 d8 64 89 01 48
[   28.145433] u-Precision-7760 kernel: RSP: 002b:00007ffc40ddeaa8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[   28.145433] u-Precision-7760 kernel: RAX: ffffffffffffffda RBX: 0000563528b826a0 RCX: 00007f098f9ce89d
[   28.145434] u-Precision-7760 kernel: RDX: 0000000000000000 RSI: 00007f098f8abded RDI: 0000000000000005
[   28.145434] u-Precision-7760 kernel: RBP: 0000000000020000 R08: 0000000000000000 R09: 0000000000000000
[   28.145435] u-Precision-7760 kernel: R10: 0000000000000005 R11: 0000000000000246 R12: 00007f098f8abded
[   28.145435] u-Precision-7760 kernel: R13: 0000000000000000 R14: 0000563528b8aa40 R15: 0000563528b826a0
Comment 1 Sasha Neftin 2021-07-14 17:57:22 UTC
Why is the NVM checksum bad?
(With the original NVM and a good checksum, the driver won't try to recover the checksum.)
Comment 3 Sasha Neftin 2021-07-14 18:03:18 UTC
I looked at the attached log. The first crash is reported from i915 (graphics).
I would suggest working with your vendor to check your system.
Comment 4 Rex Tsai 2021-07-14 18:44:02 UTC
Created attachment 297863 [details]
v1-0001-e1000e-Do-not-take-care-about-NVM-checksum
Comment 5 Rex Tsai 2021-07-14 18:44:27 UTC
Created attachment 297865 [details]
v1-0001-e1000e-Do-not-take-care-about-NVM-checksumBACK
Comment 6 Rex Tsai 2021-07-14 18:46:45 UTC
Hi,
Intel worked with Dell and their partners to confirm that the issue is related to an incorrect checksum in the GbE NVM. Dell and their partners are updating their process to write the correct checksum to the GbE NVM after they update anything in it. In parallel, Intel will propose removing checksum correction from the e1000e driver because of a design change in GbE on recent platforms.

I have uploaded two patches; I will ask Dell and their partners to verify them.

Rex
Comment 7 AceLan Kao 2021-07-20 03:36:22 UTC
Confirmed the patches fix the issue.
Comment 8 Sasha Neftin 2021-07-20 04:24:38 UTC
Thanks for the confirmation. I will provide the commit ID for tracking.
Please work with your vendor to calculate the checksum properly.
Comment 9 Sasha Neftin 2021-07-20 17:24:41 UTC
commit	7b0827770dee8c5c08f97ea65568b834e60438f6
Comment 10 Marcin 2021-11-15 12:46:45 UTC
On the Dell Precision 7560 the e1000e driver doesn't load due to an error:

e1000e 0000:00:1f.6: The NVM Checksum Is Not Valid

I tested this with Gentoo with kernel 5.10.76 and also tried Ubuntu 20.04.3 LTS
and 21.10 (kernel 5.13).

It seems that with the patch the NVM is no longer updated, but the checksum is
still verified; if it is not correct, the driver does not load and the network
card doesn't work.

+		if (hw->mac.type < e1000_pch_cnp) {
+			data |= valid_csum_mask;
+			ret_val = e1000_write_nvm(hw, word, 1, &data);
+			if (ret_val)
+				return ret_val;
+			ret_val = e1000e_update_nvm_checksum(hw);
+			if (ret_val)
+				return ret_val;
+		}


I added some prints to the driver and it seems that the expected checksum is 
0xbaba but the calculated value is 0xbcba. I do not know if it helps but:

The e1000_probe function selects

const struct e1000_info *ei = e1000_info_tbl[ent->driver_data];
ent->driver_data = 14 => board_pch_tgp

I checked EEPROM and the selected element is:

#define E1000_DEV_ID_PCH_TGP_I219_LM14  0x15F9
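
As a rough illustration of the mismatch above (expected 0xbaba, calculated 0xbcba): as far as I understand the driver's generic validation, the 16-bit words 0x00 through 0x3F of the NVM image are summed and the result must equal 0xBABA. The sketch below models that in plain user-space C. The function names are hypothetical and are not the driver's own; it works on an in-memory copy of the image, not on the hardware.

```c
#include <stddef.h>
#include <stdint.h>

#define NVM_CHECKSUM_WORD 0x3F   /* index of the word holding the checksum */
#define NVM_SUM           0xBABA /* required 16-bit sum of words 0x00..0x3F */

/* 16-bit wrapping sum of the first `count` words of an NVM image. */
static uint16_t nvm_sum16(const uint16_t *words, size_t count)
{
    uint16_t sum = 0;

    for (size_t i = 0; i < count; i++)
        sum = (uint16_t)(sum + words[i]);
    return sum;
}

/* Returns 1 when words 0x00..0x3F sum to NVM_SUM, 0 otherwise. */
static int nvm_checksum_is_valid(const uint16_t *words)
{
    return nvm_sum16(words, NVM_CHECKSUM_WORD + 1) == NVM_SUM;
}

/* Value the checksum word would need to hold for the image to validate. */
static uint16_t nvm_expected_checksum_word(const uint16_t *words)
{
    return (uint16_t)(NVM_SUM - nvm_sum16(words, NVM_CHECKSUM_WORD));
}
```

For an image whose words sum to 0xbcba, `nvm_checksum_is_valid()` returns 0. The total is 0x0200 too high, so the stored checksum word would have to be 0x0200 lower for that image to validate.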


I know I can remove the checksum verification code, but I am curious if there is
any other option. Otherwise, I will have to do this every time my distribution
updates the kernel.

Many thanks
Marcin
Comment 11 Marcin 2021-11-19 17:41:11 UTC
Can anyone advise me at least if this error is important? What does it really mean?
Can this problem be solved somehow or is it better to return the hardware?

-Marcin
Comment 12 Sasha Neftin 2021-11-19 17:42:05 UTC
Created attachment 299645 [details]
attachment-32179-0.html

Out of office. Expected delayed response
Comment 13 pindi 2021-11-28 14:27:00 UTC
I have the same problem on my Precision 7560 with Ubuntu 20.04 OEM, kernel 5.13.0-1019-oem:

+ e1000e 0000:00:1f.6: The NVM Checksum Is Not Valid
+ e1000e: probe of 0000:00:1f.6 failed with error -5 

Are there any plans to fix this at the kernel level, or is it necessary to request a hardware replacement (even though it works correctly despite the wrong checksum)?
Unfortunately, it is not possible for me to compile the module at each update, nor is it possible on this model to overwrite the checksum ("Unable to write default configuration to EEPROM").
Comment 14 Sasha Neftin 2021-11-29 07:28:22 UTC
(In reply to pindi from comment #13)
> I have same identical problem on my Precision 7560 with Ubuntu 20.04 OEM,
> kernel 5.13.0-1019-oem:
> 
> + e1000e 0000:00:1f.6: The NVM Checksum Is Not Valid
> + e1000e: probe of 0000:00:1f.6 failed with error -5 
> 
> Is there any plans to fix this at the kernel level or is it necessary to
> request hardware replacement (even if it works correctly despite the wrong
> checksum)?
> Unfortunately it is not possible for me to compile the form at each update
> nor is it possible on this model to overwrite the checksum ("Unable to write
> default configuration to EEPROM").

I don't think a hardware replacement is needed. You should contact your PC vendor and update the NVM (part of the BIOS).
On new hardware there is no longer an option to update the NVM with software tools (drivers, etc.).
Comment 15 Martin 2022-01-04 11:37:30 UTC
I have a brand new Dell Latitude E5420 with the latest BIOS (1.14.1) and get the same NVM error message as in the previous comments.

Running Ubuntu 20.04.3 LTS and kernel 5.11.0-43-generic #47~20.04.2-Ubuntu SMP

I don't see any download for an NVM update.

What should I do?
Comment 16 pindi 2022-01-04 15:26:37 UTC
(In reply to Sasha Neftin from comment #14)
> (In reply to pindi from comment #13)
> > I have same identical problem on my Precision 7560 with Ubuntu 20.04 OEM,
> > kernel 5.13.0-1019-oem:
> > 
> > + e1000e 0000:00:1f.6: The NVM Checksum Is Not Valid
> > + e1000e: probe of 0000:00:1f.6 failed with error -5 
> > 
> > Is there any plans to fix this at the kernel level or is it necessary to
> > request hardware replacement (even if it works correctly despite the wrong
> > checksum)?
> > Unfortunately it is not possible for me to compile the form at each update
> > nor is it possible on this model to overwrite the checksum ("Unable to
> write
> > default configuration to EEPROM").
> 
> There is no need HW replacement I thought. You should contact your PC vendor
> and update NVM (part of BIOS)
> There is no more option to update NVM by SW tools on new HW (drivers, etc...)

The only solution to the NIC problem was to replace the motherboard, as Dell does not allow overwriting the ROM.
No other solution solves the problem (except rewriting the e1000e module and recompiling the kernel, which is not applicable in my case).
Regards.
Comment 17 Martin 2022-01-04 15:37:37 UTC
On my side, it seems to work with Intel's upstream version 3.8.7 from https://sourceforge.net/projects/e1000/. I installed it as a DKMS module with https://github.com/koljah-de/e1000e-dkms-debian. The question remains what to do to make the Ubuntu driver work natively...

I also find it strange that the bootutil64e utility is now complaining, while the NIC seems to work:

---
Connection to QV driver failed - please reinstall it!

Intel(R) Ethernet Flash Firmware Utility
BootUtil version 1.37.34.3
Copyright (C) 2003-2021 Intel Corporation

ERROR: The adapter (location 0:31.6) cannot be initialized due to inaccessible device memory.
Update the device driver and reboot the system before running this utility again.
Consult the utility documentation for more information.

Type BootUtil -? for help

Port Network Address Location Series  WOL Flash Firmware                Version
==== =============== ======== ======= === ============================= =======
  1   (Cannot initialize adapter)
Comment 18 Martin 2022-01-04 16:08:18 UTC
and then I hit https://bugzilla.kernel.org/show_bug.cgi?id=213651 with a network speed issue...
Comment 19 AceLan Kao 2022-02-25 02:34:12 UTC
This issue has been fixed since v5.14 by the commit below.

commit 4051f68318ca9f3d3becef3b54e70ad2c146df97
Author: Sasha Neftin <sasha.neftin@intel.com>
Date:   Sun Jul 18 07:10:31 2021 +0300

    e1000e: Do not take care about recovery NVM checksum
    
    On new platforms, the NVM is read-only. Attempting to update the NVM
    is causing a lockup to occur. Do not attempt to write to the NVM
    on platforms where it's not supported.
    Emit an error message when the NVM checksum is invalid.
    
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=213667
    Fixes: fb776f5d57ee ("e1000e: Add support for Tiger Lake")
    Suggested-by: Dima Ruinskiy <dima.ruinskiy@intel.com>
    Suggested-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
    Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
    Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
    Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
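
The behaviour described in this commit message, together with the hunk quoted in comment 10, can be modelled roughly as below. This is a simplified sketch and not the driver code: the enum, struct, and function names are hypothetical, and it assumes the repair path is entered only when the "checksum valid" bit (`valid_csum_mask` in the quoted hunk) is clear. It is meant to show why machines with a read-only NVM and a bad checksum no longer lock up but still fail to probe.

```c
#include <stdbool.h>

/* Hypothetical, reduced set of MAC generations, ordered oldest to newest. */
enum mac_gen { MAC_PCH_SPT, MAC_PCH_CNP, MAC_PCH_TGP };

struct nvm_outcome {
    bool attempts_write; /* driver would try to rewrite the NVM */
    bool probe_ok;       /* probe continues past checksum validation */
};

/*
 * Simplified model of the post-fix behaviour.
 * Assumption: the repair path is entered only when the
 * "checksum valid" bit is clear.
 */
static struct nvm_outcome nvm_checksum_outcome(enum mac_gen gen,
                                               bool valid_bit_set,
                                               bool sum_ok)
{
    struct nvm_outcome out = { false, sum_ok };

    if (!valid_bit_set && gen < MAC_PCH_CNP) {
        /* Older parts: NVM is writable, so set the bit and fix the sum. */
        out.attempts_write = true;
        out.probe_ok = true; /* assuming the rewrite itself succeeds */
    }
    /* Newer parts: never write; a bad sum simply fails validation. */
    return out;
}
```

In this model, a Tiger Lake part with a bad sum gives `attempts_write == false` and `probe_ok == false`, which matches the "The NVM Checksum Is Not Valid" / "failed with error -5" reports in the later comments.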
Comment 20 Martin 2022-07-19 07:46:05 UTC
I'm running `5.15.0-41-generic` on Ubuntu 20.04 LTS and still have the issue; the interface is not working (this is a Dell 5420):


[    0.944599] e1000e: Intel(R) PRO/1000 Network Driver
[    0.944601] e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
[    0.945221] e1000e 0000:00:1f.6: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[    1.166829] e1000e 0000:00:1f.6: The NVM Checksum Is Not Valid
[    1.231542] e1000e: probe of 0000:00:1f.6 failed with error -5