Bug 206969 - r8169: issues on RTL8110s with TSO enabled
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_network@kernel-bugs.osdl.org
Reported: 2020-03-25 23:11 UTC by timo
Modified: 2020-04-13 20:22 UTC
Kernel Version: 5.4 and higher; fixed in 5.6.4
Regression: Yes

Description timo 2020-03-25 23:11:50 UTC
One of my NICs is causing failures in data transfers when TSO is enabled (default). The problems disappear when I disable TSO using 'ethtool'.

The NIC (output from 'lspci -nn'):

04:02.0 Ethernet controller [0200]: U.S. Robotics USR997902 10/100/1000 Mbps PCI Network Card [16ec:0116] (rev 10)

Further details (output from 'dmesg'):

[ 1.076528] r8169 0000:04:02.0 eth2: RTL8110s, 00:c0:49:f2:7c:11, XID 040, IRQ 17
[ 1.076530] r8169 0000:04:02.0 eth2: jumbo features [frames: 7152 bytes, tx checksumming: ok]

I suspect that this is caused by commit 93681cd7d94f83903cb3f0f95433d10c28a7e9a5:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/net/ethernet/realtek?h=v5.5.13&id=93681cd7d94f83903cb3f0f95433d10c28a7e9a5

And possibly, the patch in commit a0783cd0c810504427777e8aae20d5f4f8b652a0 should be extended to disable TSO by default for this NIC:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/net/ethernet/realtek?h=v5.5.13&id=a0783cd0c810504427777e8aae20d5f4f8b652a0
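
For reference, the workaround mentioned at the top of this description can be sketched as a small shell helper. This is only an illustration: the function name set_tso and the RUN variable are hypothetical, and eth2 is the interface name taken from the dmesg output above. By default the helper only prints the ethtool command; with RUN=1 it executes it, which needs root.

```shell
#!/bin/sh
# Hypothetical helper (not part of ethtool): print, or with RUN=1 execute,
# the ethtool call that switches TSO on or off for one interface.
set_tso() {
    iface=$1
    state=$2
    case $state in
        on|off) ;;
        *) echo "state must be 'on' or 'off'" >&2; return 2 ;;
    esac
    if [ "${RUN:-0}" = 1 ]; then
        ethtool -K "$iface" tso "$state"
    else
        # Dry run: show what would be executed.
        echo "ethtool -K $iface tso $state"
    fi
}
```

For example, `RUN=1 set_tso eth2 off` would switch TSO off for eth2. The setting is not persistent, so it has to be reapplied after a reboot or driver reload.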
Comment 1 Heiner Kallweit 2020-03-27 12:31:22 UTC
This issue may be related to bug 206891, hence the question: do you run a 32-bit or a 64-bit kernel?
Comment 2 Heiner Kallweit 2020-03-27 12:38:16 UTC
One more question, as you said "one of my NICs": do you have multiple RTL8169/RTL8168 chips/cards in the same system?

It would also be interesting to know whether you see the same issue if you manually enable TSO on an older kernel version, say 4.19.
Comment 3 timo 2020-03-27 17:37:33 UTC
Hi Heiner,

1) I'm running a 64-bit kernel (x86_64).

2) I have the following NICs installed:

00:19.0 Ethernet controller [0200]: Intel Corporation 82579V Gigabit Network Connection [8086:1503] (rev 04)
01:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller [10ec:8125]
03:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 02)
04:02.0 Ethernet controller [0200]: U.S. Robotics USR997902 10/100/1000 Mbps PCI Network Card [16ec:0116] (rev 10)

3) Regarding kernel versions:

I have used the affected NIC (USR997902) for many years, with several different kernel versions, and without ever manually disabling TSO. (I assume it was disabled by default, previously -- but I never checked that.)

The issues started sometime in January. Looking back through my logs, the switch from kernel 5.3 to 5.4 may have triggered this change in behavior.

/var/log/dpkg.log.2.gz:2020-01-25 12:28:05 install linux-image-5.4.0-3-amd64:amd64 <none> 5.4.13-1
/var/log/dpkg.log.2.gz:2020-01-12 17:33:40 install linux-image-5.4.0-2-amd64:amd64 <none> 5.4.8-1
/var/log/dpkg.log.3.gz:2019-12-13 22:57:06 install linux-image-5.3.0-3-amd64:amd64 <none> 5.3.15-1
/var/log/dpkg.log.4.gz:2019-11-26 07:32:13 install linux-image-5.3.0-2-amd64:amd64 <none> 5.3.9-3
/var/log/dpkg.log.5.gz:2019-10-03 10:42:22 install linux-image-5.2.0-3-amd64:amd64 <none> 5.2.17-1

This would also match my suspicion that commit 93681cd7d94f83903cb3f0f95433d10c28a7e9a5 ("r8169: enable HW csum and TSO") caused this change in behavior, as it was introduced in 5.4.

I hope this helps. Let me know if I should provide further information.
Comment 4 Heiner Kallweit 2020-03-27 17:48:33 UTC
"and without ever manually disabling TSO. (I assume it was disabled by default, previously -- but I never checked that.)"

Right, tx checksumming and TSO were disabled by default.
Therefore it would be interesting to see whether you face the same issue if you enable SG/TSO on older kernel versions.
Comment 5 timo 2020-03-27 21:23:05 UTC
I just performed the following experiments:

1) boot linux-image-5.4.0-2-amd64:

- TSO is on by default and I see data transfer failures again.
- The data transfer errors disappear when I manually turn TSO off again.
=> This confirms that the bug exists at least in kernel 5.4.8.

2) boot linux-image-5.3.0-3-amd64:

- TSO is off by default and I get no data transfer failures.
- I tried turning TSO on (using 'ethtool -K eth2 tso on'), but this results in an error ('Could not change any device features').
Comment 6 timo 2020-03-27 21:52:14 UTC
Some further info on the NICs with realtek chipsets:

[    1.075345] r8169 0000:01:00.0 eth0: RTL8125, 00:8e:25:70:01:42, XID 609, IRQ 28
[    1.075346] r8169 0000:01:00.0 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]

This is a relatively new PCI-e NIC with RTL8125 chipset (XID 0x609 = RTL_GIGA_MAC_VER_61 I believe).

[    1.078405] r8169 0000:03:00.0 eth1: RTL8168c/8111c, 00:e0:4c:68:21:ed, XID 3c4, IRQ 16
[    1.078407] r8169 0000:03:00.0 eth1: jumbo features [frames: 6128 bytes, tx checksumming: ko]

This is an older PCI-e NIC with RTL8168c/8111c chipset (XID 0x3c4 = RTL_GIGA_MAC_VER_22 I believe).

[    1.080646] r8169 0000:04:02.0 eth2: RTL8110s, 00:c0:49:f2:7c:11, XID 040, IRQ 17
[    1.080648] r8169 0000:04:02.0 eth2: jumbo features [frames: 7152 bytes, tx checksumming: ok]
[    1.312056] e1000e 0000:00:19.0 eth0: (PCI Express:2.5GT/s:Width x1) 00:1e:8c:f1:7a:cd

This is an old PCI NIC with RTL8110s chipset (XID 0x040 = RTL_GIGA_MAC_VER_03 I believe).

On the kernel I'm currently running (a self-compiled vanilla 5.5.13 kernel), TSO is disabled by default on all NICs except the RTL8110s, which is the one causing the issues.
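
To compare the defaults across several NICs as described above, the TSO state can be pulled out of `ethtool -k` output. A minimal sketch, assuming the `ethtool -k` layout in which top-level feature lines start in the first column; the function name tso_state is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `ethtool -k IFACE` output on stdin and print the
# state ("on" or "off") of the top-level tcp-segmentation-offload line.
# Indented sub-features such as tx-tcp-segmentation are ignored.
tso_state() {
    awk -F': *' '$1 == "tcp-segmentation-offload" { split($2, w, " "); print w[1] }'
}

# Example loop (interface names are the ones used in this report):
#   for i in eth0 eth1 eth2; do printf '%s: ' "$i"; ethtool -k "$i" | tso_state; done
```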
Comment 7 Heiner Kallweit 2020-03-27 22:06:20 UTC
"I tried turning TSO on (using 'ethtool -K eth2 tso on'), but this results in an error ('Could not change any device features')."

Thanks for the further analysis.
TSO depends on SG. To switch TSO on: ethtool -K eth2 tso on sg on
Comment 8 timo 2020-03-27 22:35:15 UTC
I tried that, too, but it doesn't work:

$ ethtool -K eth2 tso on sg on
Actual changes:
scatter-gather: on
        tx-scatter-gather: on
generic-segmentation-offload: on

$ ethtool -k eth2 
Features for eth2:
rx-checksumming: on
tx-checksumming: off
        tx-checksum-ipv4: off
        tx-checksum-ip-generic: off [fixed]
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
        tx-tcp-segmentation: off [requested on]
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off [requested on]
        tx-tcp6-segmentation: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
...

$ ethtool -K eth2 tso on sg on
Could not change any device features
Comment 9 Heiner Kallweit 2020-03-27 22:39:26 UTC
Right, because tx checksumming is also off. Then:
ethtool -K eth2 tso on sg on tx on
Comment 10 timo 2020-03-27 23:09:22 UTC
Thanks, that works:

$ ethtool -K eth2 tso on sg on tx on
Actual changes:
tx-checksumming: on
        tx-checksum-ipv4: on
scatter-gather: on
        tx-scatter-gather: on
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-mangleid-segmentation: on
generic-segmentation-offload: on

After that, I see the same transfer failures on kernel 5.3.
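
The dependency chain worked out in the last few comments (TSO needs scatter-gather, which in turn needs tx checksumming) can be sketched as a helper that inspects `ethtool -k` output and prints the arguments needed to switch TSO on. This is an illustration only and not part of ethtool; the function name tso_prereqs is hypothetical, and it assumes that top-level feature lines start in the first column, as in the output quoted earlier.

```shell
#!/bin/sh
# Hypothetical helper: read `ethtool -k IFACE` output on stdin and print the
# `ethtool -K` keywords required to enable TSO, adding "tx on" and "sg on"
# only when the corresponding top-level feature is currently off.
tso_prereqs() {
    awk -F': *' '
        $1 == "tx-checksumming" && $2 ~ /^off/ { need = need " tx on" }
        $1 == "scatter-gather"  && $2 ~ /^off/ { need = need " sg on" }
        END { print "tso on" need }
    '
}

# Possible use (as root); the unquoted substitution is intentional so that
# the printed keywords are split into separate arguments:
#   ethtool -K eth2 $(ethtool -k eth2 | tso_prereqs)
```

With tx checksumming off and scatter-gather already on, the helper prints `tso on tx on`, which contains the keywords that made the command in comment 10 succeed.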
Comment 11 timo 2020-03-28 09:47:43 UTC
Not sure if it helps. But when such transfer failures happen, the kernel log contains the following line:

NETDEV WATCHDOG: eth2 (r8169): transmit queue 0 timed out

I'm wondering whether this NIC is buggy (like the RTL8168e-vl), or whether enabling TSO simply reveals another bug in the r8169 driver. I somewhat suspect the latter, as bug 206891 does seem related, although the behavior is different (there it only happens on a 32-bit kernel).

If that is the case, then maybe commit a0783cd0c810504427777e8aae20d5f4f8b652a0, which disables TSO for RTL_GIGA_MAC_VER_22, is (also) only a workaround.
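
To check whether a machine is hitting the same symptom, the watchdog message quoted in this comment can be counted per interface. A minimal sketch, assuming the log line format shown above; the function name count_tx_timeouts is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print "IFACE COUNT"
# for every interface that logged a transmit-queue timeout.
count_tx_timeouts() {
    sed -n 's/.*NETDEV WATCHDOG: \([^ ]*\) ([^)]*): transmit queue [0-9]* timed out.*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Possible use:
#   dmesg | count_tx_timeouts
```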
Comment 12 Heiner Kallweit 2020-03-28 10:05:15 UTC
That's exactly what I'm trying to find out: whether a few chip versions have HW issues with TSO/SG, or whether there's a software issue somewhere.
So far Realtek has confirmed a HW TSO issue for chip version 34 only.

To bisect a potential regression first we'd need to find an older kernel version that doesn't show the issue with manually enabled CSUM/SG/TSO.
Or a SW issue has been there forever ..

However, most chip versions seem to work fine with TSO, which makes it harder to explain why a driver issue would affect only certain chip versions.
Comment 13 timo 2020-03-28 10:06:43 UTC
When googling for r8169 and "transmit queue 0 timed out", I came across bug 199549. The root cause for that bug seems to be that the PHY isn't fully resumed yet when configuration starts.

Which reminds me that I enable runtime PM for these NICs by setting the following parameters to "auto":

/sys/bus/pci/devices/0000:01:00.0/power/control
/sys/bus/pci/devices/0000:03:00.0/power/control
/sys/bus/pci/devices/0000:04:02.0/power/control

Probably far-fetched... but might there be some relation to that?
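
The runtime PM policy of the three devices can be inspected with a few lines of shell. A minimal sketch: the function name show_runtime_pm and the SYSFS_PCI override are hypothetical (the override exists only so the function can be tried without the real hardware), while the sysfs paths are the ones listed above.

```shell
#!/bin/sh
# Hypothetical helper: print the runtime-PM policy stored in power/control
# for each PCI address given as an argument.
show_runtime_pm() {
    root=${SYSFS_PCI:-/sys/bus/pci/devices}
    for dev in "$@"; do
        f=$root/$dev/power/control
        if [ -r "$f" ]; then
            printf '%s %s\n' "$dev" "$(cat "$f")"
        else
            printf '%s unavailable\n' "$dev"
        fi
    done
}

# Possible use:
#   show_runtime_pm 0000:01:00.0 0000:03:00.0 0000:04:02.0
```

To rule runtime PM out for the affected card, writing "on" to /sys/bus/pci/devices/0000:04:02.0/power/control (as root) keeps the device from being runtime-suspended.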
Comment 14 Heiner Kallweit 2020-03-28 10:10:33 UTC
Unlikely, but I can't rule it out. It should be easy to test by disabling runtime PM for the affected card.
Comment 15 timo 2020-03-28 10:31:52 UTC
I booted kernel 5.3, left the runtime PM settings at their default value (which is "on", by the way; "off" is not an accepted value), and manually enabled TSO. The transfer failures occurred again, so I think that rules it out.
Comment 16 timo 2020-03-28 17:17:26 UTC
"To bisect a potential regression first we'd need to find an older kernel version that doesn't show the issue with manually enabled CSUM/SG/TSO.
Or a SW issue has been there forever .."

How old do you think the kernel should be?

I could try to download an older Debian kernel package and see if it still boots my computer. What about 4.19, 4.9, or even 3.16?

https://packages.debian.org/search?suite=default&section=all&arch=amd64&searchon=names&keywords=linux-image-amd64
Comment 17 Heiner Kallweit 2020-03-28 19:56:45 UTC
4.19 and 4.9 should be checked; you don't have to go back as far as 3.16.
Comment 18 Heiner Kallweit 2020-03-28 20:37:59 UTC
I'd like to check whether the issue might be related to fragmented frame handling. Can you test the following patch on top of 5.4 or 5.5?
Thank you.

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 791d99b9e..ba822e70d 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -4101,11 +4101,14 @@ static int rtl8169_xmit_frags(struct rtl8169_private *tp, struct sk_buff *skb,
 			goto err_out;
 		}
 
-		txd->opts1 = rtl8169_get_txd_opts1(opts[0], len, entry);
-		txd->opts2 = cpu_to_le32(opts[1]);
+		tp->tx_skb[entry].len = len;
 		txd->addr = cpu_to_le64(mapping);
+		txd->opts2 = cpu_to_le32(opts[1]);
 
-		tp->tx_skb[entry].len = len;
+		/* Complete memory writes before releasing descriptor */
+		dma_wmb();
+
+		txd->opts1 = rtl8169_get_txd_opts1(opts[0], len, entry);
 	}
 
 	if (cur_frag) {
-- 
2.26.0
Comment 19 timo 2020-03-29 08:14:43 UTC
I recompiled kernel 5.5.13 (vanilla) including the patch. But the issue is still there.
Comment 20 timo 2020-03-29 08:46:31 UTC
I just booted into the Debian oldstable 4.9 kernel.

Status after boot:

$ ethtool -k eth2
Features for eth2:
rx-checksumming: on
tx-checksumming: off
        tx-checksum-ipv4: off
        tx-checksum-ip-generic: off [fixed]
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: off
        tx-scatter-gather: off
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
        tx-tcp-segmentation: off
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: off [fixed]
...


In this configuration, I saw no data transfer failures.


I then enabled the offloading:

$ ethtool -K eth2 tso on sg on tx on
Actual changes:
tx-checksumming: on
        tx-checksum-ipv4: on
scatter-gather: on
        tx-scatter-gather: on
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-mangleid-segmentation: on
generic-segmentation-offload: on
...


And the errors are back again.

Hence, this issue is either a hardware bug, or a driver issue that already existed in kernel 4.9.
Comment 21 timo 2020-03-29 08:48:10 UTC
For completeness:

$ uname -a
Linux <anonymized> 4.9.0-12-amd64 #1 SMP Debian 4.9.210-1 (2020-01-20) x86_64 GNU/Linux
Comment 22 Heiner Kallweit 2020-04-03 10:12:22 UTC
One more question for better understanding:
Do you observe the issue with TSO over IPv4 and/or IPv6?
Comment 23 Heiner Kallweit 2020-04-03 10:34:02 UTC
Sorry, as you have the issue with RTL8110s: This chip version doesn't support TSO6 in general.
Comment 24 timo 2020-04-03 11:21:16 UTC
No issue.

But I'm only using IPv4 in general. The file /etc/sysctl.conf contains the following lines:

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Comment 25 Heiner Kallweit 2020-04-13 16:11:18 UTC
TSO has been disabled by default again due to problems with a few chip versions.
You can re-test with the latest kernel versions, e.g. 5.6.4.
Comment 26 timo 2020-04-13 20:21:07 UTC
Today I compiled kernel 5.6.4 and can confirm that TSO is off by default for this adapter, and that the issue therefore no longer occurs.
