Bug 200195 - [Regression] Network does not come back after suspend
Summary: [Regression] Network does not come back after suspend
Status: RESOLVED CODE_FIX
Alias: None
Product: Networking
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 normal
Assignee: Stephen Hemminger
URL:
Keywords:
Duplicates: 200315
Depends on:
Blocks:
 
Reported: 2018-06-22 18:04 UTC by Albert Astals Cid
Modified: 2019-05-21 18:57 UTC
CC List: 6 users

See Also:
Kernel Version: 4.17.2
Subsystem:
Regression: No
Bisected commit-id:


Attachments
lspci -vvv output (38.87 KB, text/plain)
2018-06-22 18:04 UTC, Albert Astals Cid
dmesg output (63.89 KB, application/x-troff-man)
2018-07-09 23:18 UTC, Albert Astals Cid
/var/log/messages (63.62 KB, text/plain)
2018-08-11 08:22 UTC, Lou Reed

Description Albert Astals Cid 2018-06-22 18:04:00 UTC
Created attachment 276753
lspci -vvv output

Setup:
 * Using Archlinux
 * Using wired network

Only difference is using kernel 4.17.2 vs kernel 4.16.13

When using 4.17.2, the network does not work after resuming from suspend. With 4.16.13 it works as expected.

My guess is that this is driver/hardware dependent, since I have another machine with the same setup and it does not seem to have that regression.

I'm attaching the lspci output of 4.16.13.

If there is anything else I can provide or test to pinpoint where the regression happened, please do not hesitate to ask.
Comment 1 Albert Astals Cid 2018-06-27 21:38:20 UTC
Same issue with 4.17.3
Comment 2 Albert Astals Cid 2018-07-05 08:12:07 UTC
More data points:
 * Works with 4.16.18
 * Fails with 4.17.4
Comment 3 Albert Astals Cid 2018-07-05 08:19:21 UTC
Adding Heiner to CC in case this is a regression in the r8169 driver
Comment 4 Heiner Kallweit 2018-07-09 07:59:48 UTC
Could you please provide a full dmesg output incl. boot, suspend, resume?
If the issue really affects a particular chip version only, it may be hard for me to reproduce the error. Can you bisect to check for the commit which broke network on your system?

To get the full picture:
"Network doesn't work" means link doesn't come up or link is up but no data traffic?
Does unloading and re-loading the r8169 module fix the network?
Comment 5 Albert Astals Cid 2018-07-09 23:18:39 UTC
Created attachment 277307
dmesg output

(In reply to Heiner Kallweit from comment #4)
> Could you please provide a full dmesg output incl. boot, suspend, resume?

Attached. The timestamp 82.645476 is when I rmmod and modprobe the r8169 module.

> If the issue really affects a particular chip version only, it may be hard
> for me to reproduce the error. Can you bisect to check for the commit which
> broke network on your system?

I tried building the kernel from git and ended up with an unbootable system :D Compiling tarballed versions is very easy in Archlinux, so testing those is trivial, but I need to figure out how to do it from git versions.

> To get the full picture:
> "Network doesn't work" means link doesn't come up or link is up but no data
> traffic?

I think it's the latter: NetworkManager and ip addr show things as "fine", but pings never get anywhere. If you tell me which commands to try, I can give you the output.

> Does unloading and re-loading the r8169 module fix the network?

Yes it does.
Comment 6 Heiner Kallweit 2018-07-10 09:00:58 UTC
The dmesg output looks normal, link comes up properly after resume from suspend.
You could try:

4.17 uses MSI-X interrupts instead of MSI if available. In rtl_alloc_irq() you could replace "flags = PCI_IRQ_ALL_TYPES" with "flags = PCI_IRQ_MSI" to check whether the network works again, which would indicate that your system has an MSI-X-related issue.
Comment 7 Albert Astals Cid 2018-07-10 21:38:49 UTC
I have found a way to bisect the kernels, so I'm trying that now. Let's hope it finds the culprit commit. It will take a while though: I'm away from the computer at the end of this week, so it will either be tomorrow or have to wait until next week.
Comment 8 Albert Astals Cid 2018-07-23 10:02:41 UTC
So I found that the regression is caused by commit 7edf6d314cd061e1d0a1b7bc0b511d64322c3f72

My issue has nothing to do with WakeOnLan, it's just me closing and opening the lid of my laptop (i.e. suspend/resume).
Comment 9 Heiner Kallweit 2018-07-23 15:12:29 UTC
Thanks for bisecting! So far it's not clear to me where the link is between suspend/resume and the WoL change.

Few more questions:
- Is WoL activated in your BIOS?
- Does suspend/resume behavior change if you change the BIOS WoL setting?

Does suspend/resume behavior change if you add the following before the call to __rtl8169_set_wol(tp, 0) in rtl_init_one() ?

RTL_W8(Cfg9346, Cfg9346_Unlock);
RTL_W8(Config1, RTL_R8(Config1) | PMEnable);
RTL_W8(Cfg9346, Cfg9346_Lock);
Comment 10 Albert Astals Cid 2018-07-23 18:51:20 UTC
(In reply to Heiner Kallweit from comment #9)
> Thanks for bisecting! So far it's not clear to me where the link is between
> suspend/resume and the WoL change.
> 
> Few more questions:
> - Is WoL activated in your BIOS?
> - Does suspend/resume behavior change if you change the BIOS WoL setting?

My BIOS doesn't seem to have a Wake on Lan setting.

> 
> Does suspend/resume behavior change if you add the following before the call
> to __rtl8169_set_wol(tp, 0) in rtl_init_one() ?
> 
> RTL_W8(Cfg9346, Cfg9346_Unlock);
> RTL_W8(Config1, RTL_R8(Config1) | PMEnable);
> RTL_W8(Cfg9346, Cfg9346_Lock);

No, same problem.
Comment 11 Heiner Kallweit 2018-07-23 19:25:42 UTC
OK, thanks for testing. Then two more tests:

1. If you comment out the call to __rtl8169_set_wol(tp, 0) in rtl_init_one(), does it work again?

2. If you just comment out "RTL_W8(Config2, options)" in __rtl8169_set_wol(), does that help?
Comment 12 Albert Astals Cid 2018-07-23 19:40:44 UTC
(In reply to Heiner Kallweit from comment #11)
> OK, thanks for testing. Then two more tests:
> 
> 1. If you comment out the call to __rtl8169_set_wol(tp, 0) in
> rtl_init_one(), does it work again?
> 
> 2. If you just comment out "RTL_W8(Config2, options)" in
> __rtl8169_set_wol(), does that help?

Should i try this with or without the changes suggested in Comment #9 (or both?)
Comment 13 Albert Astals Cid 2018-07-23 19:43:02 UTC
BTW, is there an easy way of recompiling just the r8169 module? Right now I'm compiling the whole kernel, which takes a 40-minute round trip, so it's really not optimal :D
Comment 14 Albert Astals Cid 2018-07-23 19:45:03 UTC
Ah, seems 
make SUBDIRS=drivers/net/ethernet/realtek modules
does the trick
Comment 15 Heiner Kallweit 2018-07-23 19:50:28 UTC
Please do each test independently, apply just the change mentioned in the respective test.

If you change only the file r8169.c, make automatically rebuilds only the r8169 module; just run "make modules; make modules_install" in the kernel source root.

If you use some distribution-provided script to (re-)build the kernel it may do a "make clean" or similar upfront.
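
The rebuild-and-reload loop described above could be wrapped roughly as follows. This is only a sketch: it assumes a configured kernel tree that matches the running kernel and root privileges for the install and modprobe steps. The function name rebuild_r8169 and the DRY_RUN switch are made up for illustration and are not part of make or modprobe.

```shell
#!/bin/sh
# Hypothetical helper: rebuild modules after editing r8169.c, install them,
# then unload and reload the driver. Run it from the kernel source root.
# DRY_RUN is an illustrative switch: when set to a non-empty value, the
# commands are printed instead of executed.
rebuild_r8169() {
    run=${DRY_RUN:+echo}
    # make only rebuilds what changed, so this is quick when only r8169.c
    # was edited. The chain stops at the first step that fails.
    $run make modules &&
        $run make modules_install &&
        $run modprobe -r r8169 &&
        $run modprobe r8169
}
```

Calling "DRY_RUN=1 rebuild_r8169" first shows the four commands without touching the system.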
Comment 16 Albert Astals Cid 2018-07-23 20:42:52 UTC
(In reply to Heiner Kallweit from comment #11)
> OK, thanks for testing. Then two more tests:
> 
> 1. If you comment out the call to __rtl8169_set_wol(tp, 0) in
> rtl_init_one(), does it work again?

Yes, this helps.

> 
> 2. If you just comment out "RTL_W8(Config2, options)" in
> __rtl8169_set_wol(), does that help?

No, this does not help
Comment 17 Heiner Kallweit 2018-07-23 21:07:50 UTC
Good, it needs some patience, but we're getting closer.
In __rtl8169_set_wol() there's the cfg[] array with 6 entries.
If you comment out single entries, does it help? Best start with LanWake entry.
Comment 18 Albert Astals Cid 2018-07-23 22:43:23 UTC
If I comment out
 { WAKE_MAGIC, Config3, MagicPacket }
it works, the others don't seem to make any difference.
Comment 19 Heiner Kallweit 2018-07-24 09:38:13 UTC
Huh, this is weird. I see two possible reasons:

1. An erratum in your network chip variant. However then others should have complained too, and also the r8168 vendor driver doesn't include any related quirk. So I think this option is less likely.

2. Your BIOS is broken and somehow relies on the MagicPacket bit being set even if WoL isn't used.

We could set this bit in general in the driver, however this may have side effects and doesn't seem to be the right approach to fix an issue with just one broken BIOS.

What should help as a workaround for you: Place an "ethtool -s <if> wol g" or similar in your startup scripts. Then the MagicPacket bit is set and suspend/resume should work. Could you please test this?
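
A minimal sketch of that workaround as a startup-script snippet follows. The function name enable_magic_wol and the DRY_RUN switch are hypothetical, and the interface name has to be replaced with the one on the affected system.

```shell
#!/bin/sh
# Hypothetical helper for the workaround suggested above.
# DRY_RUN is an illustrative switch: when set to a non-empty value, the
# command is printed instead of executed.
enable_magic_wol() {
    iface=$1
    run=${DRY_RUN:+echo}
    # "wol g" enables wake-on-MagicPacket, which sets the MagicPacket bit.
    $run ethtool -s "$iface" wol g
}
```

After running, for example, "enable_magic_wol enp1s0" as root (enp1s0 is only an example name), "ethtool enp1s0" should list "Wake-on: g".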
Comment 20 Albert Astals Cid 2018-07-24 09:55:46 UTC
I'd say this is affecting other people

https://www.reddit.com/r/archlinux/comments/8tsz8o/anyone_else_suffers_from_this_wired_network/

https://bugs.archlinux.org/task/59090

No idea if the first person is also using r8169, but the second one is.

And at least I have two laptops suffering from this problem.

I'll try to test the workaround later today (kind of busy now), but honestly I don't think asking people to work around the issue in their init scripts is great when this used to work perfectly fine with older kernels.
Comment 21 Heiner Kallweit 2018-07-24 19:53:25 UTC
(In reply to Albert Astals Cid from comment #20)
> I'll try to test the workaround later today (kind of busy now), but honestly
> I don't think asking people to work around the issue in their init scripts is
> great when this used to work perfectly fine with older kernels.

This is a valid point. A workaround may be acceptable if it didn't work so far, but it did and we shouldn't break it. So I think I will restore the state we had in 4.16 with the following patch:

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index a3f69901..eaedc11e 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -7734,8 +7734,7 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
                return rc;
        }

-       /* override BIOS settings, use userspace tools to enable WOL */
-       __rtl8169_set_wol(tp, 0);
+       tp->saved_wolopts = __rtl8169_get_wol(tp);

        if (rtl_tbi_enabled(tp)) {
                tp->set_speed = rtl8169_set_speed_tbi;

This isn't perfect either because it accepts the BIOS WoL flags but doesn't mark the device as wakeup-enabled (due to a previous patch working around another broken BIOS). But people seemed to be fine with it.
Comment 22 Heiner Kallweit 2018-07-24 21:37:01 UTC
Should be fixed with commit 18041b523692 "r8169: restore previous behavior to accept BIOS WoL settings" in 4.18, and in 4.17 once it has been applied to stable.
Comment 23 Albert Astals Cid 2018-07-24 22:24:18 UTC
Very much appreciated. 

Should i mark the bug as Resolved now? Or do you prefer to wait for the patch to land in 4.17?
Comment 24 Heiner Kallweit 2018-07-25 08:14:23 UTC
It would be good if you could apply this patch to the latest 4.17 kernel and confirm that it fixes the issue. Then the bug can be marked as resolved.
Comment 25 Albert Astals Cid 2018-07-26 09:52:35 UTC
I can confirm that 4.17.10 + this patch fixes the regression for me.
Comment 26 Nícholas Lima de Souza Silva 2018-07-26 14:46:23 UTC
*** Bug 200315 has been marked as a duplicate of this bug. ***
Comment 27 Lou Reed 2018-07-30 13:30:10 UTC
I'd say that 4.18-rc7 did not do the trick for me. Neither did reloading the r8169 module.
(In reply to Heiner Kallweit from comment #19)
> What should help as a workaround for you: Place an "ethtool -s <if> wol g"
> or similar in your startup scripts. Then the MagicPacket bit is set and
> suspend/resume should work. Could you please test this?
Will the effect be immediate or just with next system boot?
Comment 28 Heiner Kallweit 2018-07-30 14:45:03 UTC
(In reply to Lou Reed from comment #27)
> I'd say that 4.18-rc7 did not do the trick for me. Neither did reloading the
> r8169 module.
> (In reply to Heiner Kallweit from comment #19)
> > What should help as a workaround for you: Place an "ethtool -s <if> wol g"
> > or similar in your startup scripts. Then the MagicPacket bit is set and
> > suspend/resume should work. Could you please test this?
> Will the effect be immediate or just with next system boot?

After executing the ethtool command the effect should be immediate.
4.18-rc7 includes the fix for this bug. That said, the issue you're facing may have a different cause. Therefore I'd prefer that this bug be closed and that you create a new one, attaching the logs from your system.
Comment 29 Steve Dodd 2018-08-06 16:46:56 UTC
I'm struggling with this as well (I'm an Ubuntu user, and they don't seem to be making much headway); have tested with net-next directly (981467033a37d916649647fa3afe1fe99bba1817) today and still no joy.

Did anyone ever open another bug report?
Comment 30 Heiner Kallweit 2018-08-06 19:17:48 UTC
(In reply to Steve Dodd from comment #29)
> I'm struggling with this as well (I'm an Ubuntu user, and they don't seem to
> be making much headway); have tested with net-next directly
> (981467033a37d916649647fa3afe1fe99bba1817) today and still no joy.
> 
> Did anyone ever open another bug report?

It's hard to say anything w/o sufficient details. The r8169 driver supports ~ 50 chip variants, more or less each one behaving slightly differently.

Please open a new bug with full dmesg output and at least the following info:
- How to reproduce the error
- Which kernel versions work and which don't
  (IOW: Which kernel version introduced the regression?)
- Does unloading / re-loading module r8169 help ?
- Does bringing the interface down and up again help ?
- Does rebooting help?
- Does changing WoL options (as described here) help ?

Ideally all this using mainline kernel, because I don't know which modifications may have been made to distribution-specific kernels.

And ideally (part 2) it would help if you could bisect the issue (requires some experience with git and in kernel building).
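
For readers new to bisecting, the start of such a session might look roughly like this, assuming a clone of the mainline kernel repository. The function name bisect_start and the DRY_RUN switch are hypothetical. The tags v4.16 and v4.17 in the usage note correspond to the working and failing series reported at the top of this bug.

```shell
#!/bin/sh
# Hypothetical helper: start a bisect between a known-good and a known-bad
# revision. Run it inside the kernel git tree.
# DRY_RUN is an illustrative switch: when set to a non-empty value, the
# command is printed instead of executed.
bisect_start() {
    good=$1
    bad=$2
    run=${DRY_RUN:+echo}
    # "git bisect start" takes the bad revision first, then the good one.
    $run git bisect start "$bad" "$good"
}
```

After "bisect_start v4.16 v4.17", each step means building and booting the checked-out revision, testing suspend/resume, and reporting the result with "git bisect good" or "git bisect bad" until git names the first bad commit. "git bisect reset" returns to the original checkout.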
Comment 31 Heiner Kallweit 2018-08-09 14:29:29 UTC
Steve's issue is caused by a broken BIOS which can't handle MSI-X properly.
See also here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779817
A possible workaround is to disable MSI for the device via sysfs.
Comment 32 Lou Reed 2018-08-11 02:02:25 UTC
(In reply to Heiner Kallweit from comment #31)
> Steve's issue is caused by a broken BIOS which can't handle MSI-X properly.
> See also here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779817
> A possible workaround is to disable MSI for the device via sysfs.
It looks like I have the same(?) chip version as Steve, and my NIC is *finally* waking up just fine after editing r8169.c as described in https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779817/comments/45 (I have not yet tested writing 0 to /sys/devices/pci*/*/msi_bus). Big thanks! Is there any chance of merging the workaround upstream in some form?
> Aug 11 06:09:33 soder kernel: PM: suspend exit
> Aug 11 06:09:33 soder kernel: r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
> Aug 11 06:09:33 soder kernel: r8169 0000:01:00.0: can't disable ASPM; OS doesn't have ASPM control
> Aug 11 06:09:33 soder kernel: r8169 0000:01:00.0 eth0: RTL8168g/8111g at 0x000000008c70da64, 1c:1b:0d:c7:f7:f8, XID >0c000800 IRQ 124
> Aug 11 06:09:33 soder kernel: r8169 0000:01:00.0 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
> Aug 11 06:09:33 soder kernel: r8169 0000:01:00.0 enp1s0: renamed from eth0
> Aug 11 06:09:34 soder kernel: r8169 0000:01:00.0 enp1s0: link down
> Aug 11 06:09:34 soder kernel: r8169 0000:01:00.0 enp1s0: link down
> Aug 11 06:09:34 soder kernel: IPv6: ADDRCONF(NETDEV_UP): enp1s0: link is not ready
> Aug 11 06:09:34 soder kernel: ata3: SATA link down (SStatus 4 SControl 300)
> Aug 11 06:09:34 soder kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> Aug 11 06:09:34 soder kernel: ata1.00: supports DRM functions and may not be fully accessible
> Aug 11 06:09:34 soder kernel: ata1.00: supports DRM functions and may not be fully accessible
> Aug 11 06:09:34 soder kernel: ata1.00: configured for UDMA/133
> Aug 11 06:09:34 soder kernel: ahci 0000:00:17.0: port does not support device sleep
> Aug 11 06:09:36 soder kernel: r8169 0000:01:00.0 enp1s0: link up
> Aug 11 06:09:36 soder kernel: IPv6: ADDRCONF(NETDEV_CHANGE): enp1s0: link becomes ready
Comment 33 Heiner Kallweit 2018-08-11 07:07:25 UTC
@Lou, what kind of system is yours? MAC seems to indicate a Gigabyte board.
Just to have the full picture, can you provide a full dmesg output incl. boot and suspend/resume?
Comment 34 Lou Reed 2018-08-11 08:22:17 UTC
Created attachment 277821
/var/log/messages

(In reply to Heiner Kallweit from comment #33)
> @Lou, what kind of system is yours? MAC seems to indicate a Gigabyte board.
> Just to have the full picture, can you provide a full dmesg output incl.
> boot and suspend/resume?

Gigabyte indeed, it's a PC, here is /var/log/messages from current boot.
I assume that Steve and I probably have the same chip version: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779817/comments/47
Comment 35 Steve Dodd 2018-08-11 08:29:08 UTC
Yeah, looks the same. Wonder if it's a chipset bug?

Incidentally, regarding the sysfs fix, you just need to disable MSI for the NIC, not the other devices - but needs doing before the module loads. On my Ubuntu system the module loads in initramfs, so I'm thinking of either using an "install" line in /etc/modprobe.d or chucking a script in /etc/initramfs-tools/scripts/init-top.
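
The sysfs workaround discussed in comments 31 and 35 might be scripted roughly like this. It is a sketch under assumptions: the function name disable_msi and the SYSFS_ROOT override are hypothetical, the script must run as root, and it has to run before r8169 binds to the device.

```shell
#!/bin/sh
# Hypothetical helper: disallow MSI/MSI-X for a single PCI device via sysfs.
# SYSFS_ROOT is an illustrative override for trying the script against a
# scratch directory; it defaults to the real /sys.
disable_msi() {
    bdf=$1    # full PCI address, for example 0000:01:00.0
    attr="${SYSFS_ROOT:-/sys}/bus/pci/devices/$bdf/msi_bus"
    # Writing 0 to msi_bus disallows MSI and MSI-X for drivers that bind
    # to this device afterwards; an already loaded driver is not affected.
    echo 0 > "$attr"
}
```

The PCI address of the NIC can be looked up with "lspci -D". The address 0000:01:00.0 above is taken from the log in comment 32 and will differ on other machines.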
Comment 36 Heiner Kallweit 2018-08-11 12:30:45 UTC
Even though two reports don't really provide statistical confidence yet, it looks like we have to blame the network chip version, not the BIOS.
Could both of you please confirm that the following fixes the issue for you?
Then I'll submit the fix to mainline.

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 344d77d9..d103e4dd 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -7076,6 +7076,11 @@ static int rtl_alloc_irq(struct rtl8169_private *tp)
 		RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~MSIEnable);
 		RTL_W8(tp, Cfg9346, Cfg9346_Lock);
 		flags = PCI_IRQ_LEGACY;
+	} else if (tp->mac_version == RTL_GIGA_MAC_VER_40) {
+		/* This version was reported to have issues with resume
+		 * from suspend when using MSI-X
+		 */
+		flags = PCI_IRQ_LEGACY | PCI_IRQ_MSI;
 	} else {
 		flags = PCI_IRQ_ALL_TYPES;
 	}
--
Comment 37 Lou Reed 2018-08-11 21:20:44 UTC
(In reply to Heiner Kallweit from comment #36)
> Even though two reports don't really provide statistical confidence yet, it
> looks like we have to blame the network chip version, not the BIOS.
> Could both of you please confirm that the following fixes the issue for you?
> Then I'll submit the fix to mainline.
> 
> diff --git a/drivers/net/ethernet/realtek/r8169.c
> b/drivers/net/ethernet/realtek/r8169.c
> index 344d77d9..d103e4dd 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -7076,6 +7076,11 @@ static int rtl_alloc_irq(struct rtl8169_private *tp)
>               RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~MSIEnable);
>               RTL_W8(tp, Cfg9346, Cfg9346_Lock);
>               flags = PCI_IRQ_LEGACY;
> +     } else if (tp->mac_version == RTL_GIGA_MAC_VER_40) {
> +             /* This version was reported to have issues with resume
> +              * from suspend when using MSI-X
> +              */
> +             flags = PCI_IRQ_LEGACY | PCI_IRQ_MSI;
>       } else {
>               flags = PCI_IRQ_ALL_TYPES;
>       }
> --
Yes, I can confirm that.
Comment 38 Steve Dodd 2018-08-12 10:14:57 UTC
(In reply to Heiner Kallweit from comment #36)

> Even though two reports don't really provide statistical confidence yet, it
> looks like we have to blame the network chip version, not the BIOS.
> Could both of you please confirm that the following fixes the issue for you?

Yup, that sorts it - thank you very much!
Comment 39 Heiner Kallweit 2018-08-13 19:55:13 UTC
Fixed with commit 7c53a722459c ("r8169: don't use MSI-X on RTL8168g").
It will take a little bit until the fix shows up in the stable kernel.
Comment 40 Albert Astals Cid 2019-05-16 21:29:59 UTC
A very similar regression just appeared in Linux 5.1.2 (compared to 5.0.13). Does anyone subscribed know or suspect what the cause may be?

Should I just start a bisect again?

Also, sorry I never closed this ^_^

I guess the best would be to close this and open a new bug for the regression in 5.1.2?
Comment 41 Heiner Kallweit 2019-05-17 08:00:49 UTC
Root cause of the original issue was a problem with Intel PCI bridge code, the problem with the network card was just a symptom and the workarounds have been removed meanwhile. Having said that I suppose your issue should be reported in a new ticket.
Comment 42 Steve Dodd 2019-05-17 12:22:23 UTC
There is possibly something going on. I've just pulled Ubuntu's package of upstream 5.1.3. After suspend/resume, I only get 100mbit/s performance, despite ethtool showing connection at 1000mbit/s. Disabling MSI doesn't make a difference...
Comment 43 Albert Astals Cid 2019-05-17 21:13:05 UTC
(In reply to Heiner Kallweit from comment #41)
> Root cause of the original issue was a problem with Intel PCI bridge code,
> the problem with the network card was just a symptom and the workarounds
> have been removed meanwhile. Having said that I suppose your issue should be
> reported in a new ticket.

Agreed. I'm closing this one and will create a new one once I can bisect when the new regression appeared.
Comment 44 Heiner Kallweit 2019-05-21 18:53:24 UTC
(In reply to Steve Dodd from comment #42)
> There is possibly something going on. I've just pulled Ubuntu's package of
> upstream 5.1.3. After suspend/resume, I only get 100mbit/s performance,
> despite ethtool showing connection at 1000mbit/s. Disabling MSI doesn't make
> a difference...

This is completely unrelated to the original issue. Best create a new ticket with full dmesg, last working version etc.
Comment 45 Heiner Kallweit 2019-05-21 18:57:41 UTC
(In reply to Heiner Kallweit from comment #44)
> (In reply to Steve Dodd from comment #42)
> > There is possibly something going on. I've just pulled Ubuntu's package of
> > upstream 5.1.3. After suspend/resume, I only get 100mbit/s performance,
> > despite ethtool showing connection at 1000mbit/s. Disabling MSI doesn't
> > make a difference...
> 
> This is completely unrelated to the original issue. Best create a new ticket
> with full dmesg, last working version etc.

Maybe it's related to the following.
https://bugzilla.kernel.org/show_bug.cgi?id=202851
When saying suspend/resume do you mean S3 or hibernation?
