Created attachment 276753
lspci -vvv output

Setup:
* Using Archlinux
* Using wired network

The only difference is using kernel 4.17.2 vs kernel 4.16.13.

When using 4.17.2, the network does not work after resuming from suspend. With 4.16.13 it works as expected.

My guess is that this is driver/hardware dependent, since I have another machine with the same setup and it does not seem to have that regression.

I'm attaching the lspci output of 4.16.13.

If there is anything else I can provide or test to pinpoint where the regression happened, please do not hesitate to ask.
Same issue with 4.17.3
More data points:
* Works with 4.16.18
* Fails with 4.17.4
Adding Heiner to CC in case this is a regression in the r8169 driver
Could you please provide a full dmesg output incl. boot, suspend, resume? If the issue really affects a particular chip version only, it may be hard for me to reproduce the error. Can you bisect to check for the commit which broke network on your system? To get the full picture: "Network doesn't work" means link doesn't come up or link is up but no data traffic? Does unloading and re-loading the r8169 module fix the network?
Created attachment 277307
dmesg output

(In reply to Heiner Kallweit from comment #4)
> Could you please provide a full dmesg output incl. boot, suspend, resume?

Attached. At 82.645476 is when I rmmod and modprobe the r8169 module.

> If the issue really affects a particular chip version only, it may be hard
> for me to reproduce the error. Can you bisect to check for the commit which
> broke network on your system?

I tried building the kernel from git and ended up with an unbootable system :D Compiling tarballed versions is very easy in Archlinux, so testing those is trivial; I need to figure out how to do it from git versions.

> To get the full picture:
> "Network doesn't work" means link doesn't come up or link is up but no data
> traffic?

I think so, yes. I.e. networkmanager and ip addr will show things "fine" but pings will never get anywhere. But if you tell me commands to try I can give you the output.

> Does unloading and re-loading the r8169 module fix the network?

Yes it does.
The dmesg output looks normal; the link comes up properly after resume from suspend.

One thing you could try: 4.17 uses MSI-X interrupts if available, instead of MSI. In rtl_alloc_irq() you could replace "flags = PCI_IRQ_ALL_TYPES" with "flags = PCI_IRQ_MSI" to check whether the network works again, which would indicate that your system has an MSI-X-related issue.
I have found a way to bisect the kernels, so I'm trying that now. Let's hope it finds the culprit commit. It will take a while though: I'm away from the computer at the end of this week, so it may either be tomorrow or will have to wait until next week.
So I found that the regression is caused by 7edf6d314cd061e1d0a1b7bc0b511d64322c3f72

My issue has nothing to do with WakeOnLan, it's just me closing and opening the lid of my laptop (i.e. suspend/resume).
Thanks for bisecting! So far it's not clear to me where the link is between suspend/resume and the WoL change.

A few more questions:
- Is WoL activated in your BIOS?
- Does suspend/resume behavior change if you change the BIOS WoL setting?

Does suspend/resume behavior change if you add the following before the call to __rtl8169_set_wol(tp, 0) in rtl_init_one()?

RTL_W8(Cfg9346, Cfg9346_Unlock);
RTL_W8(Config1, RTL_R8(Config1) | PMEnable);
RTL_W8(Cfg9346, Cfg9346_Lock);
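The unlock/write/lock sequence suggested above can be sketched as a small userspace model. This is only an illustration and not driver code: the struct fake_nic type, the register indices, the bit values and the helper names are all hypothetical, and the model assumes (as the Cfg9346_Unlock / Cfg9346_Lock names suggest) that config registers ignore writes while locked.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model, NOT the r8169 driver.
 * Register indices and bit values are illustrative stand-ins. */
enum { CFG9346 = 0, CONFIG1 = 1, NUM_REGS = 2 };
enum { CFG9346_LOCK = 0x00, CFG9346_UNLOCK = 0xc0 };
enum { PM_ENABLE = 1 << 0 };

struct fake_nic {
    uint8_t regs[NUM_REGS];
};

static uint8_t reg_read(const struct fake_nic *nic, int reg)
{
    assert(reg >= 0 && reg < NUM_REGS);
    return nic->regs[reg];
}

/* Assumption: writes to config registers are dropped while locked. */
static void reg_write(struct fake_nic *nic, int reg, uint8_t val)
{
    assert(reg >= 0 && reg < NUM_REGS);
    if (reg != CFG9346 && nic->regs[CFG9346] != CFG9346_UNLOCK)
        return;
    nic->regs[reg] = val;
}

/* Mirrors the three-step sequence from the comment:
 * unlock, read-modify-write Config1, lock again. */
static void set_pm_enable(struct fake_nic *nic)
{
    reg_write(nic, CFG9346, CFG9346_UNLOCK);
    reg_write(nic, CONFIG1, reg_read(nic, CONFIG1) | PM_ENABLE);
    reg_write(nic, CFG9346, CFG9346_LOCK);
}
```

In this model a plain write to CONFIG1 has no effect, while set_pm_enable() sets the bit and leaves the register file locked afterwards, which is the reason the suggested snippet brackets the write with the two Cfg9346 writes.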
(In reply to Heiner Kallweit from comment #9)
> Thanks for bisecting! So far it's not clear to me where the link is between
> suspend/resume and the WoL change.
>
> Few more questions:
> - Is WoL activated in your BIOS?
> - Does suspend/resume behavior change if you change the BIOS WoL setting?

My BIOS doesn't seem to have a Wake on Lan setting.

> Does suspend/resume behavior change if you add the following before the call
> to __rtl8169_set_wol(tp, 0) in rtl_init_one() ?
>
> RTL_W8(Cfg9346, Cfg9346_Unlock);
> RTL_W8(Config1, RTL_R8(Config1) | PMEnable);
> RTL_W8(Cfg9346, Cfg9346_Lock);

No, same problem.
OK, thanks for testing. Then two more tests:

1. If you comment out the call to __rtl8169_set_wol(tp, 0) in rtl_init_one(), does it work again?

2. If you just comment out "RTL_W8(Config2, options)" in __rtl8169_set_wol(), does that help?
(In reply to Heiner Kallweit from comment #11)
> OK, thanks for testing. Then two more tests:
>
> 1. If you comment out the call to __rtl8169_set_wol(tp, 0) in
> rtl_init_one(), does it work again?
>
> 2. If you just comment out "RTL_W8(Config2, options)" in
> __rtl8169_set_wol(), does that help?

Should I try this with or without the changes suggested in comment #9 (or both)?
BTW, is there an easy way of just recompiling the r8169 module? Right now I'm compiling the whole kernel, which takes a 40 min round trip, so it's really not very optimal :D
Ah, seems make SUBDIRS=drivers/net/ethernet/realtek modules does the trick
Please do each test independently, apply just the change mentioned in the respective test. If you change just file r8169.c, then make automatically rebuilds only the r8169 module, just do a "make modules; make modules_install" in the kernel source root. If you use some distribution-provided script to (re-)build the kernel it may do a "make clean" or similar upfront.
(In reply to Heiner Kallweit from comment #11)
> OK, thanks for testing. Then two more tests:
>
> 1. If you comment out the call to __rtl8169_set_wol(tp, 0) in
> rtl_init_one(), does it work again?

Yes, this helps.

> 2. If you just comment out "RTL_W8(Config2, options)" in
> __rtl8169_set_wol(), does that help?

No, this does not help.
Good, it needs some patience, but we're getting closer. In __rtl8169_set_wol() there's the cfg[] array with 6 entries. If you comment out single entries, does it help? Best start with LanWake entry.
If I comment out { WAKE_MAGIC, Config3, MagicPacket } it works; the others don't seem to make any difference.
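To illustrate why a single cfg[] entry can matter, here is a minimal userspace sketch of a table-driven WoL setter. It is an assumption-laden model, not the code from r8169.c: the table has only two entries instead of six, and every constant, the struct wol_map type and the set_wol() name are hypothetical stand-ins. The one thing taken from the thread is the idea that each entry maps a wake option to one bit in one config register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real values live in kernel headers and r8169.c */
enum { WOL_MAGIC = 1 << 0, WOL_ANY = 1 << 1 };          /* wake options */
enum { REG_CONFIG3 = 0, REG_CONFIG5 = 1, REG_COUNT = 2 }; /* fake registers */
enum { BIT_MAGIC_PACKET = 1 << 5, BIT_LAN_WAKE = 1 << 1 };

struct wol_map {
    uint32_t opt;  /* wake option requested by the caller */
    int reg;       /* register holding the matching bit */
    uint8_t mask;  /* the bit itself */
};

static const struct wol_map cfg[] = {
    { WOL_ANY,   REG_CONFIG5, BIT_LAN_WAKE },
    { WOL_MAGIC, REG_CONFIG3, BIT_MAGIC_PACKET },
};

/* For each table entry: clear the bit, then set it again only if the
 * corresponding option was requested. Calling this with wolopts == 0
 * therefore clears every bit listed in the table. */
static void set_wol(uint8_t regs[REG_COUNT], uint32_t wolopts)
{
    for (size_t i = 0; i < sizeof(cfg) / sizeof(cfg[0]); i++) {
        uint8_t options;

        assert(cfg[i].reg < REG_COUNT);
        options = regs[cfg[i].reg] & (uint8_t)~cfg[i].mask;
        if (wolopts & cfg[i].opt)
            options |= cfg[i].mask;
        regs[cfg[i].reg] = options;
    }
}
```

In this model, if the firmware left the MagicPacket bit set, set_wol(regs, 0) clears it, whereas removing the WOL_MAGIC entry from the table would leave that bit exactly as the firmware configured it. That matches the observation that commenting out only this one entry restores the old behavior.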
Huh, this is weird. I see two possible reasons:

1. An erratum in your network chip variant. However then others should have complained too, and also the r8168 vendor driver doesn't include any related quirk. So I think this option is less likely.

2. Your BIOS is broken and somehow relies on the MagicPacket bit being set even if WoL isn't used.

We could set this bit in general in the driver, however this may have side effects and doesn't seem to be the right approach to fix an issue with just one broken BIOS.

What should help as a workaround for you: Place an "ethtool -s <if> wol g" or similar in your startup scripts. Then the MagicPacket bit is set and suspend/resume should work. Could you please test this?
I'd say this is affecting other people:
https://www.reddit.com/r/archlinux/comments/8tsz8o/anyone_else_suffers_from_this_wired_network/
https://bugs.archlinux.org/task/59090

No idea if the first person is also using r8169, but the second one is.

And at least I have two laptops suffering from this problem.

I'll try to test the workaround later today (kind of busy now), but honestly I don't think asking people to work around the issue in their init scripts is great when this used to work perfectly fine with older kernels.
(In reply to Albert Astals Cid from comment #20)
> I'll try to test the workaround later today (kind of busy now) but honestly
> i don't think asking people to workaround the issue on their init scripts is
> great when this used to work perfectly fine with older kernels.

This is a valid point. A workaround may be acceptable if it didn't work so far, but it did and we shouldn't break it. So I think I will restore the state we had in 4.16 with the following patch:

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index a3f69901..eaedc11e 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -7734,8 +7734,7 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 		return rc;
 	}
 
-	/* override BIOS settings, use userspace tools to enable WOL */
-	__rtl8169_set_wol(tp, 0);
+	tp->saved_wolopts = __rtl8169_get_wol(tp);
 
 	if (rtl_tbi_enabled(tp)) {
 		tp->set_speed = rtl8169_set_speed_tbi;

This isn't perfect either because it accepts the BIOS WoL flags but doesn't mark the device as wakeup-enabled (due to a previous patch working around another broken BIOS). But people seemed to be fine with it.
Should be fixed with commit 18041b523692 "r8169: restore previous behavior to accept BIOS WoL settings" in 4.18, and in 4.17 once it has been applied to stable.
Very much appreciated. Should I mark the bug as Resolved now? Or do you prefer to wait for the patch to land in 4.17?
It would be good if you could apply this patch to the latest 4.17 kernel and confirm that it fixes the issue. Then the bug can be marked as resolved.
I can confirm that 4.17.10 + this patch fixes the regression for me.
*** Bug 200315 has been marked as a duplicate of this bug. ***
I'd say that 4.18-rc7 did not do the trick for me. Neither did reloading the r8169 module.

(In reply to Heiner Kallweit from comment #19)
> What should help as a workaround for you: Place an "ethtool -s <if> wol g"
> or similar in your startup scripts. Then the MagicPacket bit is set and
> suspend/resume should work. Could you please test this?

Will the effect be immediate or only after the next system boot?
(In reply to Lou Reed from comment #27)
> I'd say that 4.18-rc7 did not do the trick for me. Neither did reloading
> the r8169 module.
> (In reply to Heiner Kallweit from comment #19)
> > What should help as a workaround for you: Place an "ethtool -s <if> wol g"
> > or similar in your startup scripts. Then the MagicPacket bit is set and
> > suspend/resume should work. Could you please test this?
> Will the effect be immediate or only after the next system boot?

After executing the ethtool command the effect should be immediate.

4.18-rc7 includes the fix for this bug. Having said that, the issue you're facing may have a different cause. Therefore I'd prefer that this bug be closed and that you create a new one, attaching the logs from your system.
I'm struggling with this as well (I'm an Ubuntu user, and they don't seem to be making much headway); have tested with net-next directly (981467033a37d916649647fa3afe1fe99bba1817) today and still no joy. Did anyone ever open another bug report?
(In reply to Steve Dodd from comment #29)
> I'm struggling with this as well (I'm an Ubuntu user, and they don't seem to
> be making much headway); have tested with net-next directly
> (981467033a37d916649647fa3afe1fe99bba1817) today and still no joy.
>
> Did anyone ever open another bug report?

It's hard to say anything w/o sufficient details. The r8169 driver supports ~50 chip variants, more or less each one behaving slightly differently.

Please open a new bug with full dmesg output and at least the following info:
- How to reproduce the error
- Which kernel versions work and which don't (IOW: Which kernel version introduced the regression?)
- Does unloading / re-loading module r8169 help?
- Does bringing the interface down and up again help?
- Does rebooting help?
- Does changing WoL options (as described here) help?

Ideally all this using a mainline kernel, because I don't know which modifications may have been made to distribution-specific kernels.

And ideally (part 2) it would help if you could bisect the issue (requires some experience with git and in kernel building).
Steve's issue is caused by a broken BIOS which can't handle MSI-X properly. See also here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779817 A possible workaround is to disable MSI for the device via sysfs.
(In reply to Heiner Kallweit from comment #31)
> Steve's issue is caused by a broken BIOS which can't handle MSI-X properly.
> See also here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779817
> A possible workaround is to disable MSI for the device via sysfs.

Looks like I have the same(?) chip version as Steve and *finally* my NIC is waking up just fine after editing r8169.c
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779817/comments/45
(sending 0 to /sys/devices/pci*/*/msi_bus was not tested yet)

Big thanks, is there any chance to merge the workaround into upstream in some form?

> Aug 11 06:09:33 soder kernel: PM: suspend exit
> Aug 11 06:09:33 soder kernel: r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
> Aug 11 06:09:33 soder kernel: r8169 0000:01:00.0: can't disable ASPM; OS doesn't have ASPM control
> Aug 11 06:09:33 soder kernel: r8169 0000:01:00.0 eth0: RTL8168g/8111g at 0x000000008c70da64, 1c:1b:0d:c7:f7:f8, XID 0c000800 IRQ 124
> Aug 11 06:09:33 soder kernel: r8169 0000:01:00.0 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
> Aug 11 06:09:33 soder kernel: r8169 0000:01:00.0 enp1s0: renamed from eth0
> Aug 11 06:09:34 soder kernel: r8169 0000:01:00.0 enp1s0: link down
> Aug 11 06:09:34 soder kernel: r8169 0000:01:00.0 enp1s0: link down
> Aug 11 06:09:34 soder kernel: IPv6: ADDRCONF(NETDEV_UP): enp1s0: link is not ready
> Aug 11 06:09:34 soder kernel: ata3: SATA link down (SStatus 4 SControl 300)
> Aug 11 06:09:34 soder kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> Aug 11 06:09:34 soder kernel: ata1.00: supports DRM functions and may not be fully accessible
> Aug 11 06:09:34 soder kernel: ata1.00: supports DRM functions and may not be fully accessible
> Aug 11 06:09:34 soder kernel: ata1.00: configured for UDMA/133
> Aug 11 06:09:34 soder kernel: ahci 0000:00:17.0: port does not support device sleep
> Aug 11 06:09:36 soder kernel: r8169 0000:01:00.0 enp1s0: link up
> Aug 11 06:09:36 soder kernel: IPv6: ADDRCONF(NETDEV_CHANGE): enp1s0: link becomes ready
@Lou, what kind of system is yours? MAC seems to indicate a Gigabyte board. Just to have the full picture, can you provide a full dmesg output incl. boot and suspend/resume?
Created attachment 277821
/var/log/messages

(In reply to Heiner Kallweit from comment #33)
> @Lou, what kind of system is yours? MAC seems to indicate a Gigabyte board.
> Just to have the full picture, can you provide a full dmesg output incl.
> boot and suspend/resume?

Gigabyte indeed, it's a PC. Here is /var/log/messages from the current boot.
I assume that Steve and I probably have the same chip version:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779817/comments/47
Yeah, looks the same. Wonder if it's a chipset bug? Incidentally, regarding the sysfs fix, you just need to disable MSI for the NIC, not the other devices - but needs doing before the module loads. On my Ubuntu system the module loads in initramfs, so I'm thinking of either using an "install" line in /etc/modprobe.d or chucking a script in /etc/initramfs-tools/scripts/init-top.
Even though two reports don't really provide statistical confidence yet, it looks like we have to blame the network chip version, not the BIOS.
Could both of you please confirm that the following fixes the issue for you? Then I'll submit the fix to mainline.

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 344d77d9..d103e4dd 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -7076,6 +7076,11 @@ static int rtl_alloc_irq(struct rtl8169_private *tp)
 		RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~MSIEnable);
 		RTL_W8(tp, Cfg9346, Cfg9346_Lock);
 		flags = PCI_IRQ_LEGACY;
+	} else if (tp->mac_version == RTL_GIGA_MAC_VER_40) {
+		/* This version was reported to have issues with resume
+		 * from suspend when using MSI-X
+		 */
+		flags = PCI_IRQ_LEGACY | PCI_IRQ_MSI;
 	} else {
 		flags = PCI_IRQ_ALL_TYPES;
 	}
--
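The effect of the proposed change can be sketched outside the kernel as a pure function that picks the allowed interrupt types per chip. This is a simplified model under assumptions: the IRQ_* constants are local stand-ins for the kernel's PCI_IRQ_* flags, and enum chip, pick_irq_flags() and may_use_msix() are hypothetical names. In particular, the condition for the legacy-only branch is not shown in the patch context, so it is reduced to a made-up CHIP_LEGACY_ONLY value here.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the kernel's PCI_IRQ_* flags (illustrative only) */
enum {
    IRQ_LEGACY    = 1 << 0,
    IRQ_MSI       = 1 << 1,
    IRQ_MSIX      = 1 << 2,
    IRQ_ALL_TYPES = IRQ_LEGACY | IRQ_MSI | IRQ_MSIX,
};

/* Hypothetical chip classes; only "version 40" is named in the patch */
enum chip { CHIP_LEGACY_ONLY, CHIP_VER_40, CHIP_OTHER };

/* Returns the set of interrupt types the driver would be allowed to use. */
static unsigned int pick_irq_flags(enum chip chip)
{
    unsigned int flags;

    switch (chip) {
    case CHIP_LEGACY_ONLY:
        flags = IRQ_LEGACY;
        break;
    case CHIP_VER_40:
        /* Quirk from the patch: allow legacy and MSI, but not MSI-X */
        flags = IRQ_LEGACY | IRQ_MSI;
        break;
    default:
        flags = IRQ_ALL_TYPES;
        break;
    }

    assert((flags & ~(unsigned int)IRQ_ALL_TYPES) == 0);
    return flags;
}

static bool may_use_msix(unsigned int flags)
{
    return (flags & IRQ_MSIX) != 0;
}
```

The point of the sketch is that the quirk narrows the allowed set for one chip version only: may_use_msix(pick_irq_flags(CHIP_VER_40)) is false, while every chip that falls through to the default branch keeps all three interrupt types.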
(In reply to Heiner Kallweit from comment #36)
> Even though two reports don't really provide statistical confidence yet, it
> looks like we have to blame the network chip version, not the BIOS.
> Could both of you please confirm that the following fixes the issue for you?
> Then I'll submit the fix to mainline.
>
> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> index 344d77d9..d103e4dd 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -7076,6 +7076,11 @@ static int rtl_alloc_irq(struct rtl8169_private *tp)
>  		RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~MSIEnable);
>  		RTL_W8(tp, Cfg9346, Cfg9346_Lock);
>  		flags = PCI_IRQ_LEGACY;
> +	} else if (tp->mac_version == RTL_GIGA_MAC_VER_40) {
> +		/* This version was reported to have issues with resume
> +		 * from suspend when using MSI-X
> +		 */
> +		flags = PCI_IRQ_LEGACY | PCI_IRQ_MSI;
>  	} else {
>  		flags = PCI_IRQ_ALL_TYPES;
>  	}
> --

Yes, I can confirm that.
(In reply to Heiner Kallweit from comment #36)
> Even though two reports don't really provide statistical confidence yet, it
> looks like we have to blame the network chip version, not the BIOS.
> Could both of you please confirm that the following fixes the issue for you?

Yup, that sorts it - thank you very much!
Fixed with commit 7c53a722459c ("r8169: don't use MSI-X on RTL8168g"). It will take a little bit until the fix shows up in the stable kernel.
A very similar regression just appeared in Linux 5.1.2 (compared to 5.0.13). Does anyone subscribed here know what may be the cause? Should I just start a bisect again?

Also, sorry I never closed this ^_^ I guess the best would be to close this and open a new bug for the regression in 5.1.2?
Root cause of the original issue was a problem with Intel PCI bridge code, the problem with the network card was just a symptom and the workarounds have been removed meanwhile. Having said that I suppose your issue should be reported in a new ticket.
There is possibly something going on. I've just pulled Ubuntu's package of upstream 5.1.3. After suspend/resume, I only get 100mbit/s performance, despite ethtool showing connection at 1000mbit/s. Disabling MSI doesn't make a difference...
(In reply to Heiner Kallweit from comment #41)
> Root cause of the original issue was a problem with Intel PCI bridge code,
> the problem with the network card was just a symptom and the workarounds
> have been removed meanwhile. Having said that I suppose your issue should be
> reported in a new ticket.

Agreed, closing this one. I will create a new one once I have bisected when it appeared.
(In reply to Steve Dodd from comment #42)
> There is possibly something going on. I've just pulled Ubuntu's package of
> upstream 5.1.3. After suspend/resume, I only get 100mbit/s performance,
> despite ethtool showing connection at 1000mbit/s. Disabling MSI doesn't make
> a difference...

This is completely unrelated to the original issue. Best create a new ticket with full dmesg, last working version etc.
(In reply to Heiner Kallweit from comment #44)
> (In reply to Steve Dodd from comment #42)
> > There is possibly something going on. I've just pulled Ubuntu's package of
> > upstream 5.1.3. After suspend/resume, I only get 100mbit/s performance,
> > despite ethtool showing connection at 1000mbit/s. Disabling MSI doesn't
> > make a difference...
>
> This is completely unrelated to the original issue. Best create a new ticket
> with full dmesg, last working version etc.

Maybe it's related to the following.
https://bugzilla.kernel.org/show_bug.cgi?id=202851

When saying suspend/resume do you mean S3 or hibernation?