Bug 204079 - r8169: no link after resume from sleep, possible regression from commit fa6821cbf1d9724284ef0906c9a01a5fbf13a35c
Summary: r8169: no link after resume from sleep, possible regression from commit fa682...
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network (show other bugs)
Hardware: x86-64 Linux
: P1 normal
Assignee: drivers_network@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-07-06 11:30 UTC by Dušan Dragić
Modified: 2023-04-26 02:48 UTC (History)
2 users (show)

See Also:
Kernel Version: 5.1.15
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Kernel log with sleep/resume cycle and no link on nic (78.32 KB, text/plain)
2019-07-06 11:30 UTC, Dušan Dragić
Details
Log with sleep/resume after enabling WoL with ethtool, network works after resume. (77.19 KB, text/plain)
2019-07-08 18:04 UTC, Dušan Dragić
Details
Log with the patch applied. (78.71 KB, text/plain)
2019-07-10 23:16 UTC, Dušan Dragić
Details
Kernel log with the second patch applied, WoL set before suspend (78.05 KB, text/plain)
2019-07-12 10:15 UTC, Dušan Dragić
Details
Kernel log with second patch applied, without WoL (77.64 KB, text/plain)
2019-07-12 10:16 UTC, Dušan Dragić
Details
lspci -vv output (39.58 KB, text/plain)
2019-07-16 16:18 UTC, Dušan Dragić
Details
dmesg with patch from comment #15 (76.86 KB, text/plain)
2019-07-21 16:23 UTC, Dušan Dragić
Details
Output of "ethtool -d enp5s0" before suspend (2.68 KB, text/plain)
2019-07-22 18:52 UTC, Dušan Dragić
Details

Description Dušan Dragić 2019-07-06 11:30:37 UTC
Created attachment 283571 [details]
Kernel log with sleep/resume cycle and no link on nic

I'm on Fedora 29 and ever since upgrading to kernel 5.1 series I've had this problem, resuming from sleep results in no link and the only thing that helps is unloading and reloading the r8169 module.

After resume there is this message in the log (kernel log with sleep/resume is attached):
do_IRQ: 3.38 No irq handler for vector

This only started happening in 5.1.x, there were no issues in 5.0.x and earlier kernels.

I bisected the r8169 driver and it seems to be a regression from fa6821cbf1d9 (r8169: improve WoL handling), after reverting this commit it works fine again (please note, I reverted 5ea25b125289 first and then fa6821cbf1d9).


Hardware info (this is an old system):
Motherboard: Gigabyte GA-P35-DS3L
CPU: Xeon E5450


Ethtool after resume (no link):
Settings for enp5s0:
        Supported ports: [ TP MII ]
        Supported link modes:   10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Half 1000baseT/Full 
        Supported pause frame use: Symmetric Receive-only
        Supports auto-negotiation: Yes
        Supported FEC modes: Not reported
        Advertised link modes:  10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Half 1000baseT/Full 
        Advertised pause frame use: Symmetric Receive-only
        Advertised auto-negotiation: Yes
        Advertised FEC modes: Not reported
        Speed: Unknown!
        Duplex: Unknown! (255)
        Port: MII
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: pumbg
        Wake-on: d
        Current message level: 0x00000033 (51)
                               drv probe ifdown ifup
        Link detected: no

lspci info:
05:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 01)
        Subsystem: Gigabyte Technology Co., Ltd Onboard Ethernet [1458:e000]
        Flags: bus master, fast devsel, latency 0, IRQ 31
        I/O ports at c000 [size=256]
        Memory at f8000000 (64-bit, non-prefetchable) [size=4K]
        [virtual] Expansion ROM at f7000000 [disabled] [size=128K]
        Capabilities: [40] Power Management version 2
        Capabilities: [48] Vital Product Data
        Capabilities: [50] MSI: Enable+ Count=1/2 Maskable- 64bit+
        Capabilities: [60] Express Endpoint, MSI 00
        Capabilities: [84] Vendor Specific Information: Len=4c <?>
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [12c] Virtual Channel
        Capabilities: [148] Device Serial Number 00-00-00-00-10-ec-81-68
        Capabilities: [154] Power Budgeting <?>
        Kernel driver in use: r8169
        Kernel modules: r8169
Comment 1 Heiner Kallweit 2019-07-08 14:03:03 UTC
The "attached PHY driver ..." message is diplayed on opening the network device only. There may be some userspace tool involved.
Could you try suspending the device manually via "systemctl suspend" ? Does the same issue happen an wakeup (via power button)?

And another test: If you set WoL before suspending (ethtool -s enp5s0 wol g), issue still there or gone?
Comment 2 Dušan Dragić 2019-07-08 18:01:53 UTC
(In reply to Heiner Kallweit from comment #1)
> The "attached PHY driver ..." message is diplayed on opening the network
> device only. There may be some userspace tool involved.
> Could you try suspending the device manually via "systemctl suspend" ? Does
> the same issue happen an wakeup (via power button)?
> 
> And another test: If you set WoL before suspending (ethtool -s enp5s0 wol
> g), issue still there or gone?

I tried suspending with "systemctl suspend" and waking up and there is no change, still no network, but enabling WoL with "ethtool -s enp5s0 wol g" before suspend did the trick!

Now everything looks ok after resume.
Comment 3 Dušan Dragić 2019-07-08 18:04:03 UTC
Created attachment 283583 [details]
Log with sleep/resume after enabling WoL with ethtool, network works after resume.
Comment 4 Heiner Kallweit 2019-07-09 09:12:33 UTC
All enabling WoL does is preventing the PHY from being powered-down. That should not cause the issue, so I assume a subtle race, leading to this "do_IRQ: 3.38 No irq handler for vector" at the time the link-ip interrrupt is expected.

There have been some recent fixes to the interrupt handling code in question, e.g.  b7107a67f0d1 ("x86/irq: Handle spurious interrupt after shutdown gracefully").
Therefore it would be helpful if you could re-test (with WoL Disabled) with latest linux-next: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
Comment 5 Dušan Dragić 2019-07-09 23:05:03 UTC
I tried linux-next, but it's the same: "do_IRQ: 3.38 No irq handler for vector" and no link after wake-up.

P.S. 
Probably doesn't matter, but amdgpu from linux-next was freaking out and I couldn't get a gui or tty login (disabling modesetting got me a virtual console login but after resume there was no output on the screen), so I connected via serial console to the system and suspended with systemctl suspend since I didn't have time to deal with this.
Comment 6 Heiner Kallweit 2019-07-10 20:47:21 UTC
The following adds some debug info and should apply on 5.1.
Could you please run it with WoL disabled and post the dmesg of a suspend/resume cycle?

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index d06a61f00..09324be54 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -6493,6 +6493,8 @@ static int rtl8169_close(struct net_device *dev)
 	struct rtl8169_private *tp = netdev_priv(dev);
 	struct pci_dev *pdev = tp->pci_dev;
 
+	netdev_info(dev, "start rtl8169_close\n");
+
 	pm_runtime_get_sync(&pdev->dev);
 
 	/* Update counters before going down */
@@ -6520,6 +6522,8 @@ static int rtl8169_close(struct net_device *dev)
 
 	pm_runtime_put_sync(&pdev->dev);
 
+	netdev_info(dev, "end rtl8169_close\n");
+
 	return 0;
 }
 
@@ -6538,6 +6542,8 @@ static int rtl_open(struct net_device *dev)
 	struct pci_dev *pdev = tp->pci_dev;
 	int retval = -ENOMEM;
 
+	netdev_info(dev, "start rtl_open\n");
+
 	pm_runtime_get_sync(&pdev->dev);
 
 	/*
@@ -6591,6 +6597,7 @@ static int rtl_open(struct net_device *dev)
 
 	pm_runtime_put_sync(&pdev->dev);
 out:
+	netdev_info(dev, "end rtl_open, retval = %d\n", retval);
 	return retval;
 
 err_free_irq:
@@ -6673,6 +6680,8 @@ static void rtl8169_net_suspend(struct net_device *dev)
 	if (!netif_running(dev))
 		return;
 
+	netdev_info(dev, "in rtl8169_net_suspend\n");
+
 	phy_stop(tp->phydev);
 	netif_device_detach(dev);
 
@@ -6693,9 +6702,13 @@ static int rtl8169_suspend(struct device *device)
 	struct net_device *dev = dev_get_drvdata(device);
 	struct rtl8169_private *tp = netdev_priv(dev);
 
+	netdev_info(dev, "start rtl8169_suspend\n");
+
 	rtl8169_net_suspend(dev);
 	clk_disable_unprepare(tp->clk);
 
+	netdev_info(dev, "end rtl8169_suspend\n");
+
 	return 0;
 }
 
@@ -6722,12 +6735,20 @@ static int rtl8169_resume(struct device *device)
 	struct net_device *dev = dev_get_drvdata(device);
 	struct rtl8169_private *tp = netdev_priv(dev);
 
+	netdev_info(dev, "start rtl8169_resume\n");
+
 	rtl_rar_set(tp, dev->dev_addr);
 
 	clk_prepare_enable(tp->clk);
 
-	if (netif_running(dev))
+	if (netif_running(dev)) {
+		netdev_info(dev, "calling __rtl8169_resume\n");
 		__rtl8169_resume(dev);
+	} else {
+		netdev_info(dev, "not calling __rtl8169_resume\n");
+	}
+
+	netdev_info(dev, "end rtl8169_resume\n");
 
 	return 0;
 }
-- 
2.22.0
Comment 7 Dušan Dragić 2019-07-10 23:16:30 UTC
Created attachment 283603 [details]
Log with the patch applied.

dmesg shows boot, suspend/resume, rmmod r8169 and modprobe r8169
Comment 8 Heiner Kallweit 2019-07-11 08:04:09 UTC
Hmm, this looks totally normal. If you manually close and open the interface (ip link set enp5s0 down/up) w/o suspending, does the same problem occur?
Comment 9 Dušan Dragić 2019-07-11 08:08:49 UTC
(In reply to Heiner Kallweit from comment #8)
> Hmm, this looks totally normal. If you manually close and open the interface
> (ip link set enp5s0 down/up) w/o suspending, does the same problem occur?

No, no problem with manual link down/up:

[ 3152.064584] r8169 0000:05:00.0 enp5s0: start rtl8169_close
[ 3152.064610] r8169 0000:05:00.0 enp5s0: Link is Down
[ 3152.077962] r8169 0000:05:00.0 enp5s0: end rtl8169_close
[ 3177.556546] r8169 0000:05:00.0 enp5s0: start rtl_open
[ 3177.557051] RTL8211B Gigabit Ethernet r8169-500:00: attached PHY driver [RTL8211B Gigabit Ethernet] (mii_bus:phy_addr=r8169-500:00, irq=IGNORE)
[ 3177.634558] r8169 0000:05:00.0 enp5s0: end rtl_open, retval = 0
[ 3177.635763] r8169 0000:05:00.0 enp5s0: Link is Down
[ 3179.278388] r8169 0000:05:00.0 enp5s0: Link is Up - 100Mbps/Full - flow control rx/tx
[ 3179.278402] IPv6: ADDRCONF(NETDEV_CHANGE): enp5s0: link becomes ready
Comment 10 Heiner Kallweit 2019-07-11 18:23:24 UTC
Thanks. Then there may be some erratum with this chip version. To check why enabling WoL avoids the issue: Can you apply the following and again check whether it helps and attach the dmesg log.


diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 09324be54..9b2229df0 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -1440,7 +1440,14 @@ static void __rtl8169_set_wol(struct rtl8169_private *tp, u32 wolopts)
 
 	rtl_lock_config_regs(tp);
 
-	device_set_wakeup_enable(tp_to_dev(tp), wolopts);
+	switch (tp->mac_version) {
+	case RTL_GIGA_MAC_VER_12:
+		netdev_info(tp->dev, "don't set wakeup\n");
+		break;
+	default:
+		device_set_wakeup_enable(tp_to_dev(tp), wolopts);
+		break;
+	}
 }
 
 static int rtl8169_set_wol(struct net_device *dev, struct ethtool_wolinfo *wol)
@@ -7360,6 +7367,15 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	if (r8168_check_dash(tp))
 		rtl8168_driver_start(tp);
 
+	switch (tp->mac_version) {
+	case RTL_GIGA_MAC_VER_12:
+		device_wakeup_enable(&pdev->dev);
+		netdev_info(dev, "enabling wakeup\n");
+		break;
+	default:
+		break;
+	}
+
 	if (pci_dev_run_wake(pdev))
 		pm_runtime_put_sync(&pdev->dev);
 
-- 
2.22.0
Comment 11 Dušan Dragić 2019-07-12 10:15:16 UTC
Created attachment 283641 [details]
Kernel log with the second patch applied, WoL set before suspend

Here's the dmesg for the second patch with WoL set. I applied the patch on top of the earlier one.
Comment 12 Dušan Dragić 2019-07-12 10:16:59 UTC
Created attachment 283643 [details]
Kernel log with second patch applied, without WoL

This is dmesg without WoL set before suspend.
Comment 13 Heiner Kallweit 2019-07-16 10:43:39 UTC
Could you please also attach the full output of "lspci -vv" ?
Comment 14 Dušan Dragić 2019-07-16 16:18:40 UTC
Created attachment 283743 [details]
lspci -vv output

Here is the lspci output. There was an error reading VPD of the realtek nic:
"pcilib: sysfs_read_vpd: read failed: Input/output error"

and there was this in dmesg:

[  268.620085] r8169 0000:05:00.0: invalid short VPD tag 05 at offset 2

Not sure if it's relevant.
Comment 15 Heiner Kallweit 2019-07-18 19:16:43 UTC
Could you please check whether any of the two following hunks, or both together, fix the issue (on top of debug patch from comment 6)?
If not please post dmesg output of this patch on top of the debug patch from comment 6. Thanks!

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 12700d840..632015143 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -7556,6 +7556,8 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	if (pci_dev_run_wake(pdev))
 		pm_runtime_put_sync(&pdev->dev);
 
+	dev->wol_enabled = 1;
+
 	return 0;
 
 err_mdio_unregister:
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index f6a6cc5bf..219d5e56c 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -281,10 +281,13 @@ static int mdio_bus_phy_suspend(struct device *dev)
 	if (phydev->attached_dev && phydev->adjust_link)
 		phy_stop_machine(phydev);
 
-	if (!mdio_bus_phy_may_suspend(phydev))
+	if (!mdio_bus_phy_may_suspend(phydev)) {
+		dev_info(dev, "phy suspend not required\n");
 		return 0;
+	}
 
-	return phy_suspend(phydev);
+	dev_info(dev, "dummy phy suspend\n");
+	return 0;
 }
 
 static int mdio_bus_phy_resume(struct device *dev)
-- 
2.22.0
Comment 16 Dušan Dragić 2019-07-21 16:21:08 UTC
Ok, I finally tested this, results:

1) hunk #1 (wol_enable) only
No change, link is down after resume

2) hunk #2 (phy suspend) only
Link is down after resume, dmesg says:

[   72.536201] RTL8211B Gigabit Ethernet r8169-500:00: phy suspend not required

3) Whole patch
Link is down, dmesg says:

[  133.787143] RTL8211B Gigabit Ethernet r8169-500:00: dummy phy suspend


I'll attach the whole log.
Comment 17 Dušan Dragić 2019-07-21 16:23:04 UTC
Created attachment 283885 [details]
dmesg with patch from comment #15
Comment 18 Heiner Kallweit 2019-07-22 09:21:14 UTC
Thanks. Still it's not clear to me why explicitly enabling WoL fixes the issue.
Could you please post the "ethtool -d enp5s0" output before suspend in the non-working case?
Also you may try 5.1.19, it includes some IRQ-related fixes.
Comment 19 Dušan Dragić 2019-07-22 10:56:59 UTC
It's just my luck, I seem to attract weird, fringe hardware issues, but usually I just live with them :) (doesn't help that the hardware is ancient).

I'll get you the info later today and I'll try 5.1.19 (I tried 5.1.18 a couple of days ago, but nothing changed).
Comment 20 Dušan Dragić 2019-07-22 18:52:46 UTC
Created attachment 283921 [details]
Output of "ethtool -d enp5s0" before suspend
Comment 21 Dušan Dragić 2019-07-23 08:45:41 UTC
5.1.19 doesn't help, still no link after resume.
Comment 22 Heiner Kallweit 2019-07-26 18:27:29 UTC
A similar issue occurs with sky2, there MSI has to be disabled on affected systems. We could do the same and like the r8168 vendor driver don't use MSI on chip versions before RTL8168d. Could you please test the following patch?

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 6d176be51..038a034ee 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -7105,13 +7105,18 @@ static int rtl_alloc_irq(struct rtl8169_private *tp)
 {
 	unsigned int flags;
 
-	if (tp->mac_version <= RTL_GIGA_MAC_VER_06) {
+	switch (tp->mac_version) {
+	case RTL_GIGA_MAC_VER_02 ... RTL_GIGA_MAC_VER_06:
 		rtl_unlock_config_regs(tp);
 		RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~MSIEnable);
 		rtl_lock_config_regs(tp);
+		/* fall through */
+	case RTL_GIGA_MAC_VER_07 ... RTL_GIGA_MAC_VER_24:
 		flags = PCI_IRQ_LEGACY;
-	} else {
+		break;
+	default:
 		flags = PCI_IRQ_ALL_TYPES;
+		break;
 	}
 
 	return pci_alloc_irq_vectors(tp->pci_dev, 1, 1, flags);
-- 
2.22.0
Comment 23 Dušan Dragić 2019-07-26 20:34:57 UTC
This patch works, link is up after resume (applied on top of the patch from comment 6).
Comment 24 Heiner Kallweit 2019-07-28 14:54:37 UTC
Fixed with 003bd5b4a7b4 ("r8169: don't use MSI before RTL8168d").
Comment 25 Dušan Dragić 2019-07-30 16:51:08 UTC
Awesome, thanks for sticking with this.

Note You need to log in before you can comment on or make changes to this bug.