Bug 215689

Summary: e1000e: regression for I219-V for kernel 5.14 onwards
Product: Drivers Reporter: James (jahutchinson99)
Component: Network Assignee: drivers_network (drivers_network)
Status: NEW ---    
Severity: high CC: amir.avivi, devora.fuxbrumer, dima.ruinskiy, sasha.neftin
Priority: P1    
Hardware: Intel   
OS: Linux   
Kernel Version: 5.15.x Subsystem:
Regression: Yes Bisected commit-id:
Attachments: Fix possible overflow for LTR decoding

Description James 2022-03-15 13:45:38 UTC
I run Arch Linux on an Intel NUC 8i3BEH which has the following network card:

00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (6) I219-V (rev 30)
        DeviceName:  LAN
        Subsystem: Intel Corporation Device 2074
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 135
        Region 0: Memory at c0b00000 (32-bit, non-prefetchable) [size=128K]
        Capabilities: [c8] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee003d8  Data: 0000
        Kernel driver in use: e1000e
        Kernel modules: e1000e

I found a major regression in recent kernel versions that causes several odd issues. Most notably, I use the machine to stream live TV via Tvheadend, and this became unusable: the picture freezes and the sound breaks up very badly, with continuity errors in the Tvheadend log file.

The issue was introduced in the 5.14 kernel. I eventually got round to performing a git bisect, which landed on the following commit:

44a13a5: e1000e: Fix the max snoop/no-snoop latency for 10M 

Indeed, if I revert this single commit then the problem is resolved.

To help diagnose the issue I applied the following patch to capture the values of the lat_enc, max_ltr_enc vs lat_enc_d, max_ltr_enc_d variables:

diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.c b/drivers/net/ethernet/intel/e1000e/ich8lan.c
index d60e2016d..f4e5ffbcd 100644
--- a/drivers/net/ethernet/intel/e1000e/ich8lan.c
+++ b/drivers/net/ethernet/intel/e1000e/ich8lan.c
@@ -1012,6 +1012,7 @@ static s32 e1000_platform_pm_pch_lpt(struct e1000_hw *hw, bool link)
        u16 max_ltr_enc_d = 0;  /* maximum LTR decoded by platform */
        u16 lat_enc_d = 0;      /* latency decoded */
        u16 lat_enc = 0;        /* latency encoded */
+       struct e1000_adapter *adapter = hw->adapter;

        if (link) {
                u16 speed, duplex, scale = 0;
@@ -1074,6 +1075,9 @@ static s32 e1000_platform_pm_pch_lpt(struct e1000_hw *hw, bool link)
                                 ((max_ltr_enc & E1000_LTRV_SCALE_MASK)
                                 >> E1000_LTRV_SCALE_SHIFT)));

+               e_info("e1000e: lat_enc=%d max_ltr_enc=%d", lat_enc, max_ltr_enc);
+               e_info("e1000e: lat_enc_d=%u max_ltr_enc_d=%u", lat_enc_d, max_ltr_enc_d);
+
                if (lat_enc_d > max_ltr_enc_d)
                        lat_enc = max_ltr_enc;
        }

With this in place I see the following in dmesg:

[    3.241215] e1000e: Intel(R) PRO/1000 Network Driver
[    3.241217] e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
[    3.243382] e1000e 0000:00:1f.6: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[    3.749009] e1000e 0000:00:1f.6 0000:00:1f.6 (uninitialized): registered PHC clock
[    3.824751] e1000e 0000:00:1f.6 eth0: (PCI Express:2.5GT/s:Width x1) 94:c6:91:ae:b3:7b
[    3.824765] e1000e 0000:00:1f.6 eth0: Intel(R) PRO/1000 Network Connection
[    3.824849] e1000e 0000:00:1f.6 eth0: MAC: 13, PHY: 12, PBA No: FFFFFF-0FF
[    6.949327] e1000e 0000:00:1f.6 eth0: e1000e: lat_enc=2233 max_ltr_enc=4099
[    6.949331] e1000e 0000:00:1f.6 eth0: e1000e: lat_enc_d=58368 max_ltr_enc_d=0
[    6.951165] e1000e 0000:00:1f.6 eth0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

Notice that lat_enc_d=58368 and max_ltr_enc_d=0!

Because lat_enc_d is greater than max_ltr_enc_d, the driver sets the snoop latency to max_ltr_enc (4099 here), whereas before the commit it would have been left at lat_enc (2233 in this example). This seems to be where the problem lies.

Prior to commit 44a13a5:

if (lat_enc > max_ltr_enc)
  lat_enc = max_ltr_enc;

After commit 44a13a5:

if (lat_enc_d > max_ltr_enc_d)
  lat_enc = max_ltr_enc;


I'm not sure whether this new code was intended to take effect on an I219, since the commit message on 44a13a5 indicates it was aimed at the I217/I218. It also seems strange that max_ltr_enc_d ends up as 0.
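
The logged values can be reproduced in user space. The following is a minimal sketch, not driver code: the helper names are hypothetical, and the field layout (10-bit value in bits 9:0, 3-bit scale in bits 12:10, multiplier 2^(5*scale)) is an assumption taken from the PCIe LTR encoding, chosen because it matches the numbers printed in dmesg above.

```c
#include <stdint.h>

/* Assumed layout of a 16-bit encoded LTR value (not copied from the driver). */
#define LTRV_VALUE_MASK   0x03FFu  /* bits 9:0   */
#define LTRV_SCALE_MASK   0x1C00u  /* bits 12:10 */
#define LTRV_SCALE_SHIFT  10
#define LTRV_SCALE_FACTOR 5

/* Decode an encoded LTR value and store the result in 16 bits,
 * mirroring the u16 lat_enc_d / max_ltr_enc_d variables in the patch. */
static uint16_t ltr_decode_u16(uint16_t enc)
{
    uint64_t value = enc & LTRV_VALUE_MASK;
    unsigned int scale = (enc & LTRV_SCALE_MASK) >> LTRV_SCALE_SHIFT;

    /* Computed in 64 bits, then truncated to 16 bits on return. */
    return (uint16_t)(value << (LTRV_SCALE_FACTOR * scale));
}

/* Clamp used before 44a13a5: compare the encoded values directly. */
static uint16_t clamp_encoded(uint16_t lat_enc, uint16_t max_ltr_enc)
{
    return lat_enc > max_ltr_enc ? max_ltr_enc : lat_enc;
}

/* Clamp used after 44a13a5: compare the truncated decoded values. */
static uint16_t clamp_decoded_u16(uint16_t lat_enc, uint16_t max_ltr_enc)
{
    return ltr_decode_u16(lat_enc) > ltr_decode_u16(max_ltr_enc)
               ? max_ltr_enc
               : lat_enc;
}
```

Under these assumptions, 2233 (0x8B9) is value 185 at scale 2, i.e. 189440 ns, which truncates to 58368 in 16 bits; 4099 (0x1003) is value 3 at scale 4, i.e. 3145728 ns, which truncates to 0. So clamp_encoded(2233, 4099) keeps 2233, while clamp_decoded_u16(2233, 4099) returns 4099.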
Comment 1 Sasha Neftin 2022-03-24 20:43:33 UTC
Created attachment 300613 [details]
Fix possible overflow for LTR decoding
Comment 2 Sasha Neftin 2022-03-25 07:29:21 UTC
Decoding should be:
scale 0 - 1         (2^(5*0)) = 2^0
scale 1 - 32        (2^(5*1)) = 2^5
scale 2 - 1024      (2^(5*2)) = 2^10
scale 3 - 32768     (2^(5*3)) = 2^15
scale 4 - 1048576   (2^(5*4)) = 2^20
scale 5 - 33554432  (2^(5*5)) = 2^25

We need u32 (not u16):
u32 max_ltr_enc_d = 0;	/* maximum LTR decoded by platform */
u32 lat_enc_d = 0;	/* latency decoded */

With a 1G link, the LTR should be:
LTR: RAW: 0x88b988b9 Non-Snoop(ns): 189440              Snoop(ns): 189440
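
The decoding table and the wider type can be checked with a small stand-alone sketch. The function names are hypothetical and the bit layout (value in bits 9:0, scale in bits 12:10) is assumed from the PCIe LTR encoding; this is an illustration, not the attached patch.

```c
#include <stdint.h>

/* Multiplier for an LTR scale field: 2^(5 * scale) for scale 0..5.
 * Scales 6 and 7 are treated as invalid here and yield 0. */
static uint32_t ltr_scale_multiplier(unsigned int scale)
{
    return scale <= 5 ? (uint32_t)1 << (5 * scale) : 0;
}

/* Decode a 16-bit encoded LTR value into nanoseconds using a 32-bit
 * result. This is wide enough for the values seen in this report;
 * very large encodings (high values at scale 5) would need 64 bits. */
static uint32_t ltr_decode_u32(uint16_t enc)
{
    uint32_t value = enc & 0x03FFu;              /* bits 9:0   */
    unsigned int scale = (enc & 0x1C00u) >> 10;  /* bits 12:10 */

    return value * ltr_scale_multiplier(scale);
}
```

With a 32-bit result, ltr_decode_u32(2233) gives 189440 and ltr_decode_u32(4099) gives 3145728, so the decoded latency no longer exceeds the decoded platform maximum and lat_enc is left unchanged.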
Comment 4 James 2022-03-25 08:41:54 UTC
Hi Sasha,

I reported the issue originally, and have just been testing the proposed patch:
v1-0001-e1000e-Fix-possible-overflow-in-LTR-decoding.patch

I'm happy to report that this resolves the issue - my live TV streams via Tvheadend are back to normal operation with this patch in place  :-)

I can also see that the LTR is now being set to the value you'd anticipated:

$ sudo cat /sys/kernel/debug/pmc_core/ltr_show
SOUTHPORT_A       LTR: RAW: 0x0         Non-Snoop(ns): 0       Snoop(ns): 0
SOUTHPORT_B       LTR: RAW: 0x0         Non-Snoop(ns): 0       Snoop(ns): 0
SATA              LTR: RAW: 0x883c      Non-Snoop(ns): 0       Snoop(ns): 61440
GIGABIT_ETHERNET  LTR: RAW: 0x88b988b9  Non-Snoop(ns): 189440  Snoop(ns): 189440
XHCI              LTR: RAW: 0x8b54      Non-Snoop(ns): 0       Snoop(ns): 872448
...
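
For reference, the RAW column above can be decoded by hand. The sketch below is an assumption-laden illustration rather than the pmc_core implementation: it assumes the low 16 bits hold the snoop field and the high 16 bits the non-snoop field, each with a requirement bit in bit 15, a scale in bits 12:10 and a value in bits 9:0. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* Decoded snoop / non-snoop latencies in nanoseconds. */
struct ltr_ns {
    uint64_t snoop_ns;
    uint64_t non_snoop_ns;
};

/* Decode one 16-bit half of the raw LTR register (assumed layout). */
static uint64_t ltr_half_to_ns(uint16_t half)
{
    unsigned int scale = (half >> 10) & 0x7u;   /* bits 12:10 */

    if (!(half & 0x8000u))                      /* requirement bit clear */
        return 0;
    if (scale > 5)                              /* scales 6 and 7 treated as invalid */
        return 0;

    return (uint64_t)(half & 0x03FFu) << (5 * scale);
}

/* Split the 32-bit raw value into its snoop and non-snoop halves. */
static struct ltr_ns ltr_raw_to_ns(uint32_t raw)
{
    struct ltr_ns out;

    out.snoop_ns = ltr_half_to_ns((uint16_t)(raw & 0xFFFFu));
    out.non_snoop_ns = ltr_half_to_ns((uint16_t)(raw >> 16));
    return out;
}
```

Under these assumptions, ltr_raw_to_ns(0x88b988b9) yields 189440 ns for both fields, and 0x883c and 0x8b54 yield snoop values of 61440 ns and 872448 ns with a non-snoop value of 0, which agrees with the output above.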


Regards,
James Hutchinson <jahutchinson99@googlemail.com>
Comment 5 Sasha Neftin 2022-03-27 05:42:52 UTC
Thanks, James (I will add the "Reported-by" tag).