Bug 205067

Summary: e1000e: regression, Networkmanager isn't able to bring interface up until a manual down/up
Product: Networking Reporter: Thomas Mann (rauchwolke)
Component: IPV4Assignee: Stephen Hemminger (stephen)
Status: NEW ---    
Severity: normal CC: andymann375, c.hargr, detlev.casanova, dvalter, medhefgo, rauchwolke, sam.saffron, tabaire
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 5.3.x Subsystem:
Regression: Yes Bisected commit-id:
Attachments: Proposed fix for watchdog
Proposed fix for watchdog and hibernate

Description Thomas Mann 2019-10-01 18:44:44 UTC
after the update to 5.3 i get a weird Networkmanger behaviour.

Sometimes the interface goes up and get an IP and instantly goes down after a few seconds or the interface doesn't get an IP anymore until i down and up the interface.

It seems to be a regression as i can't reproduce the error with linux 5.2

I always get this message before the interface goes down:

device (eth0): state change: activated -> unavailable (reason 'carrier-changed', sys-iface-state: 'managed')

between it's an e1000e card
Comment 1 Thomas Mann 2019-10-01 20:15:12 UTC
reverting 

commit 59653e6497d16f7ac1d9db088f3959f57ee8c3db
Author: Detlev Casanova <detlev.casanova@gmail.com>
Date:   Sat Jun 22 23:14:37 2019 -0400

    e1000e: Make watchdog use delayed work


fixes the problem
Comment 2 Sam Saffron 2019-10-02 23:05:40 UTC
also related:

https://bugzilla.kernel.org/show_bug.cgi?id=205047
Comment 3 Thomas Mann 2019-10-05 23:15:34 UTC
bug still present in 5.3.4 - reverting 59653e6497d16f7ac1d9db088f3959f57ee8c3db fixes the problem

lspci -k

00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (Lewisville) (rev 04)
        Subsystem: Lenovo ThinkPad T520
        Kernel driver in use: e1000e
Comment 4 Thomas Mann 2019-10-09 17:40:41 UTC
bug still present in 5.3.5 - reverting 59653e6497d16f7ac1d9db088f3959f57ee8c3db fixes the problem
Comment 5 Thomas Mann 2019-10-12 12:04:01 UTC
bug still present in 5.3.6 - reverting 59653e6497d16f7ac1d9db088f3959f57ee8c3db fixes the problem
Comment 6 Dmitry Valter 2019-10-13 12:06:41 UTC
I can confirm this issue affects I219-V on KabyLake-R platform (ThinkPad T480) and MSI Z370 boards.

Additional info:
1. No real outgoing DHCP packets could be captured either on the affected machine or any other in the network.
2. Some set of actions including connections with NetworkManager using different settings may lead to complete e1000e outage (until the module will be reloaded)
Comment 7 Thomas Mann 2019-10-30 02:09:01 UTC
5.3.8: bug still unfixed

Reverting 59653e6497d16f7ac1d9db088f3959f57ee8c3db fixes the problem.
Comment 8 Dmitry Valter 2019-11-03 00:24:38 UTC
I went through NetworkManager logs and code, and some assumptions about this bug are surprising to say the least.

So NM actually waits some time for NIC to establish a link and if carrier is still not available, NM resets link over and over again until retry counter is over. Since NetworkManager 1.11 this timeout is exactly 6 seconds (5s previously) and specified at src/devices/nm-device.c +94.

Since 59653e6 6 seconds is just not enough for Intel NIC's to get configured on 1000 Mbps. It's a so fine margin, that sometimes excessive `trace` logging may slow NM down enough to e1000e get up and work.

After changing this timeout to 7.5s NM works fine. Tested with NetworkManager 1.20.4 git, Linux 5.3.8-arch1-1 and Intel I219-V.

So the regression is "e1000e works too slow with gigabit links on certain hardware, that some network setup software runs out of time to wait for it".
Comment 9 Thomas Mann 2019-11-03 12:09:57 UTC
i tried the networkmanager patch you mentioned and i end up with the same behavior, only reverting 59653e6497d16f7ac1d9db088f3959f57ee8c3db fixes the bug for me

--- a/src/devices/nm-device.c
+++ b/src/devices/nm-device.c
@@ -92,7 +92,7 @@ _LOG_DECLARE_SELF (NMDevice);
 #define DEFAULT_AUTOCONNECT    TRUE
 #define DHCP_GRACE_PERIOD_SEC  480
 
-#define CARRIER_WAIT_TIME_MS 6000
+#define CARRIER_WAIT_TIME_MS 7500
 #define CARRIER_WAIT_TIME_AFTER_MTU_MS 10000
 
 #define NM_DEVICE_AUTH_RETRIES_UNSET    -1
Comment 10 Dmitry Valter 2019-11-04 11:56:27 UTC
Sounds interesting. On my laptop this patch works perfectly fine.
I guess, on your hardware it could either require longer delay for ethernet properties to be applied or be broken by 59653e6 in a very different way.

There's another workaround I've found working in my case. If NetworkManager does not have both duplex and speed properties for the connection, it won't set them via ethtool interface. NIC defaults would be used instead. In my case defaults include auto from 10/Half to 1000/Full and therefore no timeout problems with NM happen.

It could be achieved by manual editing an nmconnection file (NM restart required), setting "Ignored" negotiation mode in XFCE NM GUI (or KDE with https://phabricator.kde.org/D24866 patch by jgrulich) or deleting all wired connections. New wired connections use these by default.
I do not know how could it be done with any other NM frontend.
Comment 11 Thomas Mann 2019-11-04 17:55:45 UTC
Even setting a timeout of 10 seconds doesn't solve problem. I'm using the latest kde version which allows to set "Allow auto-negotiation" but this doesn't solve the problem.

Something was changed in the kernel that breaks software in the user space that is working normally without the patch. So in my opinion this patch should get reverted or fixed.
Comment 12 Dmitry Valter 2019-11-04 20:04:45 UTC
(In reply to Thomas Mann from comment #11)
> Even setting a timeout of 10 seconds doesn't solve problem.

Interesting. Here's another way to reproduce this bug in my case.
Running this as root

date +%s.%N &&\
  ip link set $ETH up &&\
    date +%s.%N &&\
  ethtool -s $ETH advertise $MODE &&\
    date +%s.%N &&\
  ip link set $ETH down &&\
    date +%s.%N &&\
  dhclient $ETH &&\
    date +%s.%N

where ETH is the interface e.g. eth0 or enp1s0
and mode is advertise mode according to man 8 ethtool.
In my case difference between (MODE & 0x020) (1000base, like 0x02c) and !(MODE & 0x020) (10/100base like 0x0c) is pretty visible (like 11-13s vs 6-8s)

> I'm using the latest kde version which allows to set "Allow auto-negotiation"
> but this doesn't solve the problem.

In NetworkManager "auto-negotiation" does not mean it won't set anything. For this case NetworkManager has a specific mode called "Ignore" in XFCE. Upstream KDE is yet to include the same. 

 
> Something was changed in the kernel that breaks software in the user space
> that is working normally without the patch. So in my opinion this patch
> should get reverted or fixed.

I'm not sure about reverting it since patch looks straight forward and solving that TODOs may lead to the same later. But it definitely should be fixed
Comment 13 Thomas Mann 2019-11-04 22:47:48 UTC
you are right, the problem is the link negotiation.

All these interfaces just configure nm and when i disable it with

auto-negotiate=false, without defining "speed" and "duplex" properties it works.

It's explained here

https://developer.gnome.org/NetworkManager/unstable/nm-settings.html

under auto-negotiate

"When FALSE, "speed" and "duplex" properties should be both set or link configuration will be skipped."

which should be the equivalent to your "Ignored" value.
Comment 14 Thomas Mann 2019-11-04 22:51:37 UTC
This is the work around but how can we find out what the actual problem is?
Comment 15 Dmitry Valter 2019-11-13 14:57:35 UTC
Now I'm trying to figure out, where the time goes between code with 59653e64 and with 59653e64 reverted.

No clear answer so far, but looks like in my case watchdog changes add just that 1 second likely due to `del_timer_sync` in `e1000e_down`.

P.S.
5.3.9: bug still unfixed

Reverting 59653e6497d16f7ac1d9db088f3959f57ee8c3db fixes the problem.
Comment 16 Detlev Casanova 2019-11-17 01:04:19 UTC
Created attachment 285953 [details]
Proposed fix for watchdog

Hi all,

After having a better look at the patch and with the help of Thomas to run some tests, I propose the following patch.

Please test so that we can fix this issue mainline.

Detlev.
Comment 17 Jan Janssen 2019-11-17 11:34:28 UTC
With the patch applied network still sometimes doesn't come up when I resume form suspend.
Comment 18 Thomas Mann 2019-11-17 12:11:45 UTC
(In reply to Jan Janssen from comment #17)
> With the patch applied network still sometimes doesn't come up when I resume
> form suspend.

I didn't test hibernation by now, same for me, when i hibernate the interface doesn't come up again until i do a

interface down and up
Comment 19 Detlev Casanova 2019-11-18 00:17:16 UTC
Created attachment 285963 [details]
Proposed fix for watchdog and hibernate

Please try this new patch
Comment 20 Jan Janssen 2019-11-18 15:44:40 UTC
This patch doesn't fix it either. Though, I feel like it reduced the chance of it happening to about 1 out of 10 tries.
Comment 21 Thomas Mann 2019-11-25 17:06:48 UTC
I could reproduce the hibernate bug.

Bug is still unfixed with 5.3.13, even with the patches applied.
Comment 22 Thomas Mann 2019-11-30 17:04:20 UTC
bug still present in 5.4.1
Comment 23 Thomas Mann 2019-12-05 19:23:55 UTC
bug still unfixed in 5.4.2
Comment 24 Thomas Mann 2019-12-21 12:16:41 UTC
5.4.5 - bug still exists
Comment 25 Thomas Mann 2020-01-08 19:53:48 UTC
5.4.8 - bug still exists
Comment 26 Thomas Mann 2020-02-03 21:20:14 UTC
5.5.1 - can't reproduce the bug anymore, seems the regression is fixed
Comment 27 Jan Janssen 2020-02-04 16:43:36 UTC
Can't say the same. 5.5.1 made it much less likely to happen. But it still happens around 1 in 10 times.
Comment 28 Jan Janssen 2020-04-13 15:56:17 UTC
Seemt 5.6 fixed it for me.
Comment 29 Andy Mann 2020-06-03 16:03:04 UTC
I note you have "carrier-changed" disconnections in the log.

I had "carrier-changed" errors, frequently disconnecting my ethernet, and after trying every solution in multiple forums, I finally upgraded the CAT5 cable to a CAT7 cable (10 meter run - cheap on ebay) which cured the problem. Apparently, the old CAT5 cable couldn't handle my ISP's upgraded speeds (Virgin Media, UK). The new CAT7 cable has not only stopped the dropouts, but also now allows the speed to auto-configure from 100Mb/s to 1000Mb/s. Hope that helps anyone looking to fix "carrier-changed" problems.
Comment 30 Stephen Hemminger 2020-06-03 16:20:40 UTC
(In reply to Andy Mann from comment #29)
> I note you have "carrier-changed" disconnections in the log.
> 
> I had "carrier-changed" errors, frequently disconnecting my ethernet, and
> after trying every solution in multiple forums, I finally upgraded the CAT5
> cable to a CAT7 cable (10 meter run - cheap on ebay) which cured the
> problem. Apparently, the old CAT5 cable couldn't handle my ISP's upgraded
> speeds (Virgin Media, UK). The new CAT7 cable has not only stopped the
> dropouts, but also now allows the speed to auto-configure from 100Mb/s to
> 1000Mb/s. Hope that helps anyone looking to fix "carrier-changed" problems.

1000Mb/s needs all 8 wires. Most likely your old cable had bad connections.
CAT5 will work fine.
Comment 31 Andy Mann 2020-06-03 16:47:55 UTC
(In reply to Stephen Hemminger from comment #30)
> 
> 1000Mb/s needs all 8 wires. Most likely your old cable had bad connections.
> CAT5 will work fine.

Just reporting that the CAT5 to CAT7 upgrade fixed my carrier-changed dropouts. My research shows that CAT5 is rated up to 100Mb/s - CAT5e and upwards are rated for 1000 + Mb/s. Plus, CAT7 cables are shielded - maybe that helps.
Comment 32 Andy Mann 2020-06-03 16:56:52 UTC
(In reply to Andy Mann from comment #31)
> (In reply to Stephen Hemminger from comment #30)
> > 
> > 1000Mb/s needs all 8 wires. Most likely your old cable had bad connections.
> > CAT5 will work fine.
> 
> Just reporting that the CAT5 to CAT7 upgrade fixed my carrier-changed
> dropouts. My research shows that CAT5 is rated up to 100Mb/s - CAT5e and
> upwards are rated for 1000 + Mb/s. Plus, CAT7 cables are shielded - maybe
> that helps.

With cables so cheap now, it's probably worth upgrading if not just to eliminate the cable as the problem.