Created attachment 283571 [details] Kernel log with sleep/resume cycle and no link on nic I'm on Fedora 29 and ever since upgrading to kernel 5.1 series I've had this problem, resuming from sleep results in no link and the only thing that helps is unloading and reloading the r8169 module. After resume there is this message in the log (kernel log with sleep/resume is attached): do_IRQ: 3.38 No irq handler for vector This only started happening in 5.1.x, there were no issues in 5.0.x and earlier kernels. I bisected the r8169 driver and it seems to be a regression from fa6821cbf1d9 (r8169: improve WoL handling), after reverting this commit it works fine again (please note, I reverted 5ea25b125289 first and then fa6821cbf1d9). Hardware info (this is an old system): Motherboard: Gigabyte GA-P35-DS3L CPU: Xeon E5450 Ethtool after resume (no link): Settings for enp5s0: Supported ports: [ TP MII ] Supported link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Half 1000baseT/Full Supported pause frame use: Symmetric Receive-only Supports auto-negotiation: Yes Supported FEC modes: Not reported Advertised link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Half 1000baseT/Full Advertised pause frame use: Symmetric Receive-only Advertised auto-negotiation: Yes Advertised FEC modes: Not reported Speed: Unknown! Duplex: Unknown! (255) Port: MII PHYAD: 0 Transceiver: internal Auto-negotiation: on Supports Wake-on: pumbg Wake-on: d Current message level: 0x00000033 (51) drv probe ifdown ifup Link detected: no lspci info: 05:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 01) Subsystem: Gigabyte Technology Co., Ltd Onboard Ethernet [1458:e000] Flags: bus master, fast devsel, latency 0, IRQ 31 I/O ports at c000 [size=256] Memory at f8000000 (64-bit, non-prefetchable) [size=4K] [virtual] Expansion ROM at f7000000 [disabled] [size=128K] Capabilities: [40] Power Management version 2 Capabilities: [48] Vital Product Data Capabilities: [50] MSI: Enable+ Count=1/2 Maskable- 64bit+ Capabilities: [60] Express Endpoint, MSI 00 Capabilities: [84] Vendor Specific Information: Len=4c <?> Capabilities: [100] Advanced Error Reporting Capabilities: [12c] Virtual Channel Capabilities: [148] Device Serial Number 00-00-00-00-10-ec-81-68 Capabilities: [154] Power Budgeting <?> Kernel driver in use: r8169 Kernel modules: r8169
The "attached PHY driver ..." message is diplayed on opening the network device only. There may be some userspace tool involved. Could you try suspending the device manually via "systemctl suspend" ? Does the same issue happen an wakeup (via power button)? And another test: If you set WoL before suspending (ethtool -s enp5s0 wol g), issue still there or gone?
(In reply to Heiner Kallweit from comment #1) > The "attached PHY driver ..." message is diplayed on opening the network > device only. There may be some userspace tool involved. > Could you try suspending the device manually via "systemctl suspend" ? Does > the same issue happen an wakeup (via power button)? > > And another test: If you set WoL before suspending (ethtool -s enp5s0 wol > g), issue still there or gone? I tried suspending with "systemctl suspend" and waking up and there is no change, still no network, but enabling WoL with "ethtool -s enp5s0 wol g" before suspend did the trick! Now everything looks ok after resume.
Created attachment 283583 [details] Log with sleep/resume after enabling WoL with ethtool, network works after resume.
All enabling WoL does is preventing the PHY from being powered-down. That should not cause the issue, so I assume a subtle race, leading to this "do_IRQ: 3.38 No irq handler for vector" at the time the link-ip interrrupt is expected. There have been some recent fixes to the interrupt handling code in question, e.g. b7107a67f0d1 ("x86/irq: Handle spurious interrupt after shutdown gracefully"). Therefore it would be helpful if you could re-test (with WoL Disabled) with latest linux-next: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
I tried linux-next, but it's the same: "do_IRQ: 3.38 No irq handler for vector" and no link after wake-up. P.S. Probably doesn't matter, but amdgpu from linux-next was freaking out and I couldn't get a gui or tty login (disabling modesetting got me a virtual console login but after resume there was no output on the screen), so I connected via serial console to the system and suspended with systemctl suspend since I didn't have time to deal with this.
The following adds some debug info and should apply on 5.1. Could you please run it with WoL disabled and post the dmesg of a suspend/resume cycle? diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c index d06a61f00..09324be54 100644 --- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -6493,6 +6493,8 @@ static int rtl8169_close(struct net_device *dev) struct rtl8169_private *tp = netdev_priv(dev); struct pci_dev *pdev = tp->pci_dev; + netdev_info(dev, "start rtl8169_close\n"); + pm_runtime_get_sync(&pdev->dev); /* Update counters before going down */ @@ -6520,6 +6522,8 @@ static int rtl8169_close(struct net_device *dev) pm_runtime_put_sync(&pdev->dev); + netdev_info(dev, "end rtl8169_close\n"); + return 0; } @@ -6538,6 +6542,8 @@ static int rtl_open(struct net_device *dev) struct pci_dev *pdev = tp->pci_dev; int retval = -ENOMEM; + netdev_info(dev, "start rtl_open\n"); + pm_runtime_get_sync(&pdev->dev); /* @@ -6591,6 +6597,7 @@ static int rtl_open(struct net_device *dev) pm_runtime_put_sync(&pdev->dev); out: + netdev_info(dev, "end rtl_open, retval = %d\n", retval); return retval; err_free_irq: @@ -6673,6 +6680,8 @@ static void rtl8169_net_suspend(struct net_device *dev) if (!netif_running(dev)) return; + netdev_info(dev, "in rtl8169_net_suspend\n"); + phy_stop(tp->phydev); netif_device_detach(dev); @@ -6693,9 +6702,13 @@ static int rtl8169_suspend(struct device *device) struct net_device *dev = dev_get_drvdata(device); struct rtl8169_private *tp = netdev_priv(dev); + netdev_info(dev, "start rtl8169_suspend\n"); + rtl8169_net_suspend(dev); clk_disable_unprepare(tp->clk); + netdev_info(dev, "end rtl8169_suspend\n"); + return 0; } @@ -6722,12 +6735,20 @@ static int rtl8169_resume(struct device *device) struct net_device *dev = dev_get_drvdata(device); struct rtl8169_private *tp = netdev_priv(dev); + netdev_info(dev, "start rtl8169_resume\n"); + rtl_rar_set(tp, dev->dev_addr); clk_prepare_enable(tp->clk); - if (netif_running(dev)) + if (netif_running(dev)) { + netdev_info(dev, "calling __rtl8169_resume\n"); __rtl8169_resume(dev); + } else { + netdev_info(dev, "not calling __rtl8169_resume\n"); + } + + netdev_info(dev, "end rtl8169_resume\n"); return 0; } -- 2.22.0
Created attachment 283603 [details] Log with the patch applied. dmesg shows boot, suspend/resume, rmmod r8169 and modprobe r8169
Hmm, this looks totally normal. If you manually close and open the interface (ip link set enp5s0 down/up) w/o suspending, does the same problem occur?
(In reply to Heiner Kallweit from comment #8) > Hmm, this looks totally normal. If you manually close and open the interface > (ip link set enp5s0 down/up) w/o suspending, does the same problem occur? No, no problem with manual link down/up: [ 3152.064584] r8169 0000:05:00.0 enp5s0: start rtl8169_close [ 3152.064610] r8169 0000:05:00.0 enp5s0: Link is Down [ 3152.077962] r8169 0000:05:00.0 enp5s0: end rtl8169_close [ 3177.556546] r8169 0000:05:00.0 enp5s0: start rtl_open [ 3177.557051] RTL8211B Gigabit Ethernet r8169-500:00: attached PHY driver [RTL8211B Gigabit Ethernet] (mii_bus:phy_addr=r8169-500:00, irq=IGNORE) [ 3177.634558] r8169 0000:05:00.0 enp5s0: end rtl_open, retval = 0 [ 3177.635763] r8169 0000:05:00.0 enp5s0: Link is Down [ 3179.278388] r8169 0000:05:00.0 enp5s0: Link is Up - 100Mbps/Full - flow control rx/tx [ 3179.278402] IPv6: ADDRCONF(NETDEV_CHANGE): enp5s0: link becomes ready
Thanks. Then there may be some erratum with this chip version. To check why enabling WoL avoids the issue: Can you apply the following and again check whether it helps and attach the dmesg log. diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c index 09324be54..9b2229df0 100644 --- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -1440,7 +1440,14 @@ static void __rtl8169_set_wol(struct rtl8169_private *tp, u32 wolopts) rtl_lock_config_regs(tp); - device_set_wakeup_enable(tp_to_dev(tp), wolopts); + switch (tp->mac_version) { + case RTL_GIGA_MAC_VER_12: + netdev_info(tp->dev, "don't set wakeup\n"); + break; + default: + device_set_wakeup_enable(tp_to_dev(tp), wolopts); + break; + } } static int rtl8169_set_wol(struct net_device *dev, struct ethtool_wolinfo *wol) @@ -7360,6 +7367,15 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) if (r8168_check_dash(tp)) rtl8168_driver_start(tp); + switch (tp->mac_version) { + case RTL_GIGA_MAC_VER_12: + device_wakeup_enable(&pdev->dev); + netdev_info(dev, "enabling wakeup\n"); + break; + default: + break; + } + if (pci_dev_run_wake(pdev)) pm_runtime_put_sync(&pdev->dev); -- 2.22.0
Created attachment 283641 [details] Kernel log with the second patch applied, WoL set before suspend Here's the dmesg for the second patch with WoL set. I applied the patch on top of the earlier one.
Created attachment 283643 [details] Kernel log with second patch applied, without WoL This is dmesg without WoL set before suspend.
Could you please also attach the full output of "lspci -vv" ?
Created attachment 283743 [details] lspci -vv output Here is the lspci output. There was an error reading VPD of the realtek nic: "pcilib: sysfs_read_vpd: read failed: Input/output error" and there was this in dmesg: [ 268.620085] r8169 0000:05:00.0: invalid short VPD tag 05 at offset 2 Not sure if it's relevant.
Could you please check whether any of the two following hunks, or both together, fix the issue (on top of debug patch from comment 6)? If not please post dmesg output of this patch on top of the debug patch from comment 6. Thanks! diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c index 12700d840..632015143 100644 --- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -7556,6 +7556,8 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) if (pci_dev_run_wake(pdev)) pm_runtime_put_sync(&pdev->dev); + dev->wol_enabled = 1; + return 0; err_mdio_unregister: diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c index f6a6cc5bf..219d5e56c 100644 --- a/drivers/net/phy/phy_device.c +++ b/drivers/net/phy/phy_device.c @@ -281,10 +281,13 @@ static int mdio_bus_phy_suspend(struct device *dev) if (phydev->attached_dev && phydev->adjust_link) phy_stop_machine(phydev); - if (!mdio_bus_phy_may_suspend(phydev)) + if (!mdio_bus_phy_may_suspend(phydev)) { + dev_info(dev, "phy suspend not required\n"); return 0; + } - return phy_suspend(phydev); + dev_info(dev, "dummy phy suspend\n"); + return 0; } static int mdio_bus_phy_resume(struct device *dev) -- 2.22.0
Ok, I finally tested this, results: 1) hunk #1 (wol_enable) only No change, link is down after resume 2) hunk #2 (phy suspend) only Link is down after resume, dmesg says: [ 72.536201] RTL8211B Gigabit Ethernet r8169-500:00: phy suspend not required 3) Whole patch Link is down, dmesg says: [ 133.787143] RTL8211B Gigabit Ethernet r8169-500:00: dummy phy suspend I'll attach the whole log.
Created attachment 283885 [details] dmesg with patch from comment #15
Thanks. Still it's not clear to me why explicitly enabling WoL fixes the issue. Could you please post the "ethtool -d enp5s0" output before suspend in the non-working case? Also you may try 5.1.19, it includes some IRQ-related fixes.
It's just my luck, I seem to attract weird, fringe hardware issues, but usually I just live with them :) (doesn't help that the hardware is ancient). I'll get you the info later today and I'll try 5.1.19 (I tried 5.1.18 a couple of days ago, but nothing changed).
Created attachment 283921 [details] Output of "ethtool -d enp5s0" before suspend
5.1.19 doesn't help, still no link after resume.
A similar issue occurs with sky2, there MSI has to be disabled on affected systems. We could do the same and like the r8168 vendor driver don't use MSI on chip versions before RTL8168d. Could you please test the following patch? diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c index 6d176be51..038a034ee 100644 --- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -7105,13 +7105,18 @@ static int rtl_alloc_irq(struct rtl8169_private *tp) { unsigned int flags; - if (tp->mac_version <= RTL_GIGA_MAC_VER_06) { + switch (tp->mac_version) { + case RTL_GIGA_MAC_VER_02 ... RTL_GIGA_MAC_VER_06: rtl_unlock_config_regs(tp); RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~MSIEnable); rtl_lock_config_regs(tp); + /* fall through */ + case RTL_GIGA_MAC_VER_07 ... RTL_GIGA_MAC_VER_24: flags = PCI_IRQ_LEGACY; - } else { + break; + default: flags = PCI_IRQ_ALL_TYPES; + break; } return pci_alloc_irq_vectors(tp->pci_dev, 1, 1, flags); -- 2.22.0
This patch works, link is up after resume (applied on top of the patch from comment 6).
Fixed with 003bd5b4a7b4 ("r8169: don't use MSI before RTL8168d").
Awesome, thanks for sticking with this.