Bug 206969

Summary: r8169: issues on RTL8110s with TSO enabled
Product: Drivers Reporter: timo
Component: Network Assignee: drivers_network (drivers_network)
Status: RESOLVED CODE_FIX    
Severity: normal CC: hkallweit1
Priority: P1    
Hardware: All   
OS: Linux   
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=206891
https://bugzilla.kernel.org/show_bug.cgi?id=207049
Kernel Version: 5.4 and higher; fixed in 5.6.4 Subsystem:
Regression: Yes Bisected commit-id:

Description timo 2020-03-25 23:11:50 UTC
One of my NICs is causing failures in data transfers when TSO is enabled (default). The problems disappear when I disable TSO using 'ethtool'.

The NIC (output from 'lspci -nn'):

04:02.0 Ethernet controller [0200]: U.S. Robotics USR997902 10/100/1000 Mbps PCI Network Card [16ec:0116] (rev 10)

Further details (output from 'dmesg'):

[ 1.076528] r8169 0000:04:02.0 eth2: RTL8110s, 00:c0:49:f2:7c:11, XID 040, IRQ 17
[ 1.076530] r8169 0000:04:02.0 eth2: jumbo features [frames: 7152 bytes, tx checksumming: ok]

I suspect that this is caused by commit 93681cd7d94f83903cb3f0f95433d10c28a7e9a5:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/net/ethernet/realtek?h=v5.5.13&id=93681cd7d94f83903cb3f0f95433d10c28a7e9a5

And possibly, the patch in commit a0783cd0c810504427777e8aae20d5f4f8b652a0 should be extended to disable TSO by default for this NIC:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/net/ethernet/realtek?h=v5.5.13&id=a0783cd0c810504427777e8aae20d5f4f8b652a0
Comment 1 Heiner Kallweit 2020-03-27 12:31:22 UTC
Issue may be related to bug ticket 206891. Therefore the question: Do you run a 32 bit or a 64 bit kernel?
Comment 2 Heiner Kallweit 2020-03-27 12:38:16 UTC
One more question as you said "one of my NIC's": Do you have multiple RTL8169/RTL8168 chips/cards in the same system?

Also interesting would be whether you face the same issue if you manually enable TSO on an older kernel version, let's say 4.19.
Comment 3 timo 2020-03-27 17:37:33 UTC
Hi Heiner,

1) I'm running a 64-bit kernel (x86_64).

2) I have the following NICs installed:

00:19.0 Ethernet controller [0200]: Intel Corporation 82579V Gigabit Network Connection [8086:1503] (rev 04)
01:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller [10ec:8125]
03:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 02)
04:02.0 Ethernet controller [0200]: U.S. Robotics USR997902 10/100/1000 Mbps PCI Network Card [16ec:0116] (rev 10)

3) Regarding kernel versions:

I have used the affected NIC (USR997902) for many years, with several different kernel versions, and without ever manually disabling TSO. (I assume it was disabled by default, previously -- but I never checked that.)

The issues started somewhere in January. Looking back in my log, the switch from kernel 5.3 to 5.4 may have triggered this change in behavior.

/var/log/dpkg.log.2.gz:2020-01-25 12:28:05 install linux-image-5.4.0-3-amd64:amd64 <none> 5.4.13-1
/var/log/dpkg.log.2.gz:2020-01-12 17:33:40 install linux-image-5.4.0-2-amd64:amd64 <none> 5.4.8-1
/var/log/dpkg.log.3.gz:2019-12-13 22:57:06 install linux-image-5.3.0-3-amd64:amd64 <none> 5.3.15-1
/var/log/dpkg.log.4.gz:2019-11-26 07:32:13 install linux-image-5.3.0-2-amd64:amd64 <none> 5.3.9-3
/var/log/dpkg.log.5.gz:2019-10-03 10:42:22 install linux-image-5.2.0-3-amd64:amd64 <none> 5.2.17-1

This would also match with my suspicion that commit 93681cd7d94f83903cb3f0f95433d10c28a7e9a5 ("r8169: enable HW csum and TSO") caused this change in behavior, as this was introduced in 5.4.

I hope this helps. Let me know if I should provide further information.
Comment 4 Heiner Kallweit 2020-03-27 17:48:33 UTC
"and without ever manually disabling TSO. (I assume it was disabled by default, previously -- but I never checked that.)"

Right, tx checksumming and TSO were disabled by default.
Therefore it would be interesting to see whether you face the same issue if you enable SG/TSO on older kernel versions.
Comment 5 timo 2020-03-27 21:23:05 UTC
I just performed the following experiments:

1) boot linux-image-5.4.0-2-amd64:

- TSO is by default on and I see data transfer failures again.
- The data transfer errors disappear when I manually turn TSO off again
=> this confirms that this bug exists at least in kernel 5.4.8.

2) boot linux-image-5.3.0-3-amd64:

- TSO is by default off and I get no data transfer failures.
- I tried turning TSO on (using 'ethtool -K eth2 tso on'), but this results in an error ('Could not change any device features').
Comment 6 timo 2020-03-27 21:52:14 UTC
Some further info on the NICs with realtek chipsets:

[    1.075345] r8169 0000:01:00.0 eth0: RTL8125, 00:8e:25:70:01:42, XID 609, IRQ 28
[    1.075346] r8169 0000:01:00.0 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]

This is a relatively new PCI-e NIC with RTL8125 chipset (XID 0x609 = RTL_GIGA_MAC_VER_61 I believe).

[    1.078405] r8169 0000:03:00.0 eth1: RTL8168c/8111c, 00:e0:4c:68:21:ed, XID 3c4, IRQ 16
[    1.078407] r8169 0000:03:00.0 eth1: jumbo features [frames: 6128 bytes, tx checksumming: ko]

This is an older PCI-e NIC with RTL8168c/8111c chipset (XID 0x3c4 = RTL_GIGA_MAC_VER_22 I believe).

[    1.080646] r8169 0000:04:02.0 eth2: RTL8110s, 00:c0:49:f2:7c:11, XID 040, IRQ 17
[    1.080648] r8169 0000:04:02.0 eth2: jumbo features [frames: 7152 bytes, tx checksumming: ok]
[    1.312056] e1000e 0000:00:19.0 eth0: (PCI Express:2.5GT/s:Width x1) 00:1e:8c:f1:7a:cd

This is an old PCI NIC with RTL8110s chipset (XID 0x040 = RTL_GIGA_MAC_VER_03 I believe).

On the kernel I'm currently running (a self-compiled vanilla 5.5.13 kernel), TSO is by default disabled on all NICs, except the RTL8110s which is causing the issues.
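
For anyone who wants to collect the same chip/XID details from their own boot log, here is a minimal sketch that extracts them from r8169 probe lines. It assumes the line format is exactly the one quoted in this comment; the names PROBE_RE and parse_probe_lines are made up for this illustration and are not part of any tool. It deliberately does not map an XID to a RTL_GIGA_MAC_VER_xx value, since that table lives in the driver source.

```python
import re

# Matches r8169 probe lines in the format quoted in this report, e.g.
# "r8169 0000:04:02.0 eth2: RTL8110s, 00:c0:49:f2:7c:11, XID 040, IRQ 17"
# (the format is an assumption based on the log excerpts above).
PROBE_RE = re.compile(
    r"r8169 (?P<pci>[0-9a-f:.]+) (?P<iface>\w+): "
    r"(?P<chip>[^,]+), (?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2}), "
    r"XID (?P<xid>[0-9a-f]+), IRQ (?P<irq>\d+)"
)


def parse_probe_lines(log_text):
    """Return one dict (pci, iface, chip, mac, xid, irq) per probe line."""
    return [m.groupdict() for m in PROBE_RE.finditer(log_text)]


# Probe lines copied from this comment.
sample = """\
[    1.075345] r8169 0000:01:00.0 eth0: RTL8125, 00:8e:25:70:01:42, XID 609, IRQ 28
[    1.078405] r8169 0000:03:00.0 eth1: RTL8168c/8111c, 00:e0:4c:68:21:ed, XID 3c4, IRQ 16
[    1.080646] r8169 0000:04:02.0 eth2: RTL8110s, 00:c0:49:f2:7c:11, XID 040, IRQ 17
"""

for nic in parse_probe_lines(sample):
    print(nic["iface"], nic["chip"], "XID", nic["xid"])
```

The "jumbo features" lines are skipped because they carry no MAC address after the first comma.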
Comment 7 Heiner Kallweit 2020-03-27 22:06:20 UTC
"I tried turning TSO on (using 'ethtool -K eth2 tso on'), but this results in an error ('Could not change any device features')."

Thanks for the further analysis.
TSO depends on SG. To switch TSO on: ethtool -K eth2 tso on sg on
Comment 8 timo 2020-03-27 22:35:15 UTC
I tried that, too, but it doesn't work:

$ ethtool -K eth2 tso on sg on
Actual changes:
scatter-gather: on
        tx-scatter-gather: on
generic-segmentation-offload: on

$ ethtool -k eth2 
Features for eth2:
rx-checksumming: on
tx-checksumming: off
        tx-checksum-ipv4: off
        tx-checksum-ip-generic: off [fixed]
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
        tx-tcp-segmentation: off [requested on]
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off [requested on]
        tx-tcp6-segmentation: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
...

$ ethtool -K eth2 tso on sg on
Could not change any device features
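
The "[requested on]" markers in the listing above are the useful hint here: they mark features that were asked for but that the kernel left off. A small sketch for pulling those out of 'ethtool -k' output follows. It assumes the output layout shown in this comment; parse_features and refused are hypothetical helper names, not ethtool functionality.

```python
import re

# One feature per line: "<name>: on|off" with an optional "[note]" suffix.
FEATURE_RE = re.compile(
    r"^\s*(?P<name>[\w-]+):\s+(?P<state>on|off)(?:\s+\[(?P<note>[^\]]+)\])?\s*$"
)


def parse_features(text):
    """Map feature name -> (state, note) for `ethtool -k` style output."""
    features = {}
    for line in text.splitlines():
        m = FEATURE_RE.match(line)
        if m:
            features[m.group("name")] = (m.group("state"), m.group("note"))
    return features


def refused(features):
    """Names of features that were requested on but are still off."""
    return sorted(
        name
        for name, (state, note) in features.items()
        if state == "off" and note == "requested on"
    )


# Excerpt of the listing from this comment.
sample = """\
Features for eth2:
rx-checksumming: on
tx-checksumming: off
        tx-checksum-ipv4: off
scatter-gather: on
        tx-scatter-gather: on
tcp-segmentation-offload: off
        tx-tcp-segmentation: off [requested on]
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off [requested on]
"""

print(refused(parse_features(sample)))
```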
Comment 9 Heiner Kallweit 2020-03-27 22:39:26 UTC
Right, because tx checksumming is also off. Then:
ethtool -K eth2 tso on sg on tx on
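
The dependency described in comments 7 and 9 can be written down as a toy model: TSO only sticks if both scatter-gather and TX checksumming are enabled. This is a sketch of that rule as stated in this thread, not the kernel's actual feature-negotiation code; REQUIRES, effective and ethtool_args are names invented for this illustration.

```python
# Prerequisites as stated in this thread (assumption: no other
# dependencies matter for this NIC): TSO needs SG and TX checksumming.
REQUIRES = {"tso": ("sg", "tx")}


def effective(requested):
    """Return the requested features minus those missing a prerequisite."""
    active = set(requested)
    for feat, deps in REQUIRES.items():
        if feat in active and not all(dep in active for dep in deps):
            active.discard(feat)
    return active


def ethtool_args(iface, goal):
    """Build an `ethtool -K` argument list for goal plus its prerequisites."""
    args = ["ethtool", "-K", iface]
    for feat in (goal, *REQUIRES.get(goal, ())):
        args += [feat, "on"]
    return args


# Requesting only tso+sg leaves TSO off, as seen in comment 8.
print(sorted(effective({"tso", "sg"})))
# Requesting all three keeps TSO on.
print(sorted(effective({"tso", "sg", "tx"})))
# The resulting command line matches the one suggested in this comment.
print(" ".join(ethtool_args("eth2", "tso")))
```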
Comment 10 timo 2020-03-27 23:09:22 UTC
Thanks, that works:

$ ethtool -K eth2 tso on sg on tx on
Actual changes:
tx-checksumming: on
        tx-checksum-ipv4: on
scatter-gather: on
        tx-scatter-gather: on
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-mangleid-segmentation: on
generic-segmentation-offload: on

After that, I see the same transfer failures on kernel 5.3.
Comment 11 timo 2020-03-28 09:47:43 UTC
Not sure if it helps. But when such transfer failures happen, the kernel log contains the following line:

NETDEV WATCHDOG: eth2 (r8169): transmit queue 0 timed out

I'm wondering whether this NIC is buggy (like the RTL8168e-vl), or whether enabling TSO simply reveals another bug in the r8169 driver. I somewhat suspect the latter, as bug 206891 indeed seems related, although the behavior is different (there it only happens on a 32-bit kernel).

If that is the case, then maybe commit a0783cd0c810504427777e8aae20d5f4f8b652a0, which disables TSO for RTL_GIGA_MAC_VER_22, is (also) only a workaround.
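
To check whether a log contains the watchdog message quoted above, and for which interface and queue, something like the following could be used. It is only a sketch: the pattern is derived from the single line in this comment, and WATCHDOG_RE and find_tx_timeouts are hypothetical names.

```python
import re

# Pattern derived from the one watchdog line quoted in this report.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(log_text):
    """Return (interface, driver, queue) for each TX timeout in log_text."""
    return [
        (m["iface"], m["driver"], int(m["queue"]))
        for m in WATCHDOG_RE.finditer(log_text)
    ]


line = "NETDEV WATCHDOG: eth2 (r8169): transmit queue 0 timed out"
print(find_tx_timeouts(line))
```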
Comment 12 Heiner Kallweit 2020-03-28 10:05:15 UTC
That's exactly what I'm trying to find out: whether a few chip versions have HW issues with TSO/SG, or whether there's a software issue somewhere.
So far Realtek confirmed a HW TSO issue for chip version 34 only.

To bisect a potential regression first we'd need to find an older kernel version that doesn't show the issue with manually enabled CSUM/SG/TSO.
Or a SW issue has been there forever ..

However, most chip versions seem to work fine with TSO, which makes it harder to explain why a driver issue should affect only certain chip versions.
Comment 13 timo 2020-03-28 10:06:43 UTC
When googling for r8169 and "transmit queue 0 timed out", I came across bug 199549. The root cause for that bug seems to be that the PHY isn't fully resumed yet when configuration starts.

Which reminds me that I enable runtime PM for these NICs by setting the following parameters to "auto":

/sys/bus/pci/devices/0000:01:00.0/power/control
/sys/bus/pci/devices/0000:03:00.0/power/control
/sys/bus/pci/devices/0000:04:02.0/power/control

Probably far-fetched... but might there be some relation to that?
Comment 14 Heiner Kallweit 2020-03-28 10:10:33 UTC
Unlikely, but I can't rule it out. It should be easy to test by disabling runtime PM for the affected card.
Comment 15 timo 2020-03-28 10:31:52 UTC
I booted kernel 5.3, left the runtime PM settings at their default value (which is "on" by the way -- "off" is not a value that's accepted), and manually enabled TSO. The transfer failures occurred again, so I think that rules it out.
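
For reference, the runtime PM setting discussed in comments 13 to 15 can be read and changed through the power/control files listed in comment 13. The sketch below assumes that sysfs layout and the two values mentioned in this thread ("auto" and "on"); runtime_pm_control and set_runtime_pm are made-up helper names, and writing to the real /sys requires root.

```python
from pathlib import Path

# Values mentioned in this thread; "off" is reported as not accepted.
VALID_VALUES = ("auto", "on")


def _control_path(pci_addr, sysfs_root):
    return Path(sysfs_root) / "bus/pci/devices" / pci_addr / "power/control"


def runtime_pm_control(pci_addr, sysfs_root="/sys"):
    """Return the current power/control value for a PCI device."""
    return _control_path(pci_addr, sysfs_root).read_text().strip()


def set_runtime_pm(pci_addr, value, sysfs_root="/sys"):
    """Write 'auto' or 'on' to the device's power/control file."""
    if value not in VALID_VALUES:
        raise ValueError(f"unsupported power/control value: {value!r}")
    _control_path(pci_addr, sysfs_root).write_text(value)
```

Calling set_runtime_pm("0000:04:02.0", "on") would correspond to the test described in this comment, i.e. runtime PM disabled for the affected card.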
Comment 16 timo 2020-03-28 17:17:26 UTC
"To bisect a potential regression first we'd need to find an older kernel version that doesn't show the issue with manually enabled CSUM/SG/TSO.
Or a SW issue has been there forever .."

How old do you think the kernel should be?

I could try to download an older Debian kernel package and see if it still boots my computer. What about 4.19, 4.9, or even 3.16?

https://packages.debian.org/search?suite=default&section=all&arch=amd64&searchon=names&keywords=linux-image-amd64
Comment 17 Heiner Kallweit 2020-03-28 19:56:45 UTC
4.19 and 4.9 should be checked, you don't have to go back as far as 3.16.
Comment 18 Heiner Kallweit 2020-03-28 20:37:59 UTC
I'd like to check whether the issue might be related to fragmented frame handling. Can you test the following patch on top of 5.4 or 5.5?
Thank you.

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 791d99b9e..ba822e70d 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -4101,11 +4101,14 @@ static int rtl8169_xmit_frags(struct rtl8169_private *tp, struct sk_buff *skb,
 			goto err_out;
 		}
 
-		txd->opts1 = rtl8169_get_txd_opts1(opts[0], len, entry);
-		txd->opts2 = cpu_to_le32(opts[1]);
+		tp->tx_skb[entry].len = len;
 		txd->addr = cpu_to_le64(mapping);
+		txd->opts2 = cpu_to_le32(opts[1]);
 
-		tp->tx_skb[entry].len = len;
+		/* Complete memory writes before releasing descriptor */
+		dma_wmb();
+
+		txd->opts1 = rtl8169_get_txd_opts1(opts[0], len, entry);
 	}
 
 	if (cur_frag) {
-- 
2.26.0
Comment 19 timo 2020-03-29 08:14:43 UTC
I recompiled kernel 5.5.13 (vanilla) including the patch. But the issue is still there.
Comment 20 timo 2020-03-29 08:46:31 UTC
I just booted into the Debian oldstable 4.9 kernel.

Status after boot:

$ ethtool -k eth2
Features for eth2:
rx-checksumming: on
tx-checksumming: off
        tx-checksum-ipv4: off
        tx-checksum-ip-generic: off [fixed]
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: off
        tx-scatter-gather: off
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
        tx-tcp-segmentation: off
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: off [fixed]
...


In this configuration, I saw no data transfer failures.


I then enabled the offloading:

$ ethtool -K eth2 tso on sg on tx on
Actual changes:
tx-checksumming: on
        tx-checksum-ipv4: on
scatter-gather: on
        tx-scatter-gather: on
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-mangleid-segmentation: on
generic-segmentation-offload: on
...


And the errors are back again.

Hence, this issue is either a hardware bug, or a driver issue that already existed in kernel 4.9.
Comment 21 timo 2020-03-29 08:48:10 UTC
For completeness:

$ uname -a
Linux <anonymized> 4.9.0-12-amd64 #1 SMP Debian 4.9.210-1 (2020-01-20) x86_64 GNU/Linux
Comment 22 Heiner Kallweit 2020-04-03 10:12:22 UTC
One more question for better understanding:
Do you observe the issue with TSO over IPv4 and/or IPv6?
Comment 23 Heiner Kallweit 2020-04-03 10:34:02 UTC
Sorry, as you have the issue with RTL8110s: This chip version doesn't support TSO6 in general.
Comment 24 timo 2020-04-03 11:21:16 UTC
No issue.

But I'm only using IPv4 in general. The file /etc/sysctl.conf contains the following lines:

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Comment 25 Heiner Kallweit 2020-04-13 16:11:18 UTC
TSO has been disabled again due to problems with a few chip versions.
You can re-test with the latest kernel versions, e.g. 5.6.4.
Comment 26 timo 2020-04-13 20:21:07 UTC
Today I compiled kernel 5.6.4 and can confirm that TSO is off by default for this adapter and that the issue therefore no longer occurs.