One of my NICs is causing failures in data transfers when TSO is enabled (default). The problems disappear when I disable TSO using 'ethtool'.
The NIC (output from 'lspci -nn'):
04:02.0 Ethernet controller [0200]: U.S. Robotics USR997902 10/100/1000 Mbps PCI Network Card [16ec:0116] (rev 10)
Further details (output from 'dmesg'):
[ 1.076528] r8169 0000:04:02.0 eth2: RTL8110s, 00:c0:49:f2:7c:11, XID 040, IRQ 17
[ 1.076530] r8169 0000:04:02.0 eth2: jumbo features [frames: 7152 bytes, tx checksumming: ok]
I suspect that this is caused by commit 93681cd7d94f83903cb3f0f95433d10c28a7e9a5:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/net/ethernet/realtek?h=v5.5.13&id=93681cd7d94f83903cb3f0f95433d10c28a7e9a5
And possibly, the patch in commit a0783cd0c810504427777e8aae20d5f4f8b652a0 should be extended to disable TSO by default for this NIC:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/net/ethernet/realtek?h=v5.5.13&id=a0783cd0c810504427777e8aae20d5f4f8b652a0
Issue may be related to bug ticket 206891. Therefore the question: Do you run a 32 bit or a 64 bit kernel?
One more question as you said "one of my NIC's": Do you have multiple RTL8169/RTL8168 chips/cards in the same system? Also interesting would be whether you face the same issue if you manually enable TSO on an older kernel version, let's say 4.19.
Hi Heiner,
1) I'm running a 64-bit kernel (x86_64).
2) I have the following NICs installed:
00:19.0 Ethernet controller [0200]: Intel Corporation 82579V Gigabit Network Connection [8086:1503] (rev 04)
01:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller [10ec:8125]
03:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 02)
04:02.0 Ethernet controller [0200]: U.S. Robotics USR997902 10/100/1000 Mbps PCI Network Card [16ec:0116] (rev 10)
3) Regarding kernel versions: I have used the affected NIC (USR997902) for many years, with several different kernel versions, and without ever manually disabling TSO. (I assume it was disabled by default, previously -- but I never checked that.) The issues started somewhere in January. Looking back in my log, the switch from kernel 5.3 to 5.4 may have triggered this change in behavior.
/var/log/dpkg.log.2.gz:2020-01-25 12:28:05 install linux-image-5.4.0-3-amd64:amd64 <none> 5.4.13-1
/var/log/dpkg.log.2.gz:2020-01-12 17:33:40 install linux-image-5.4.0-2-amd64:amd64 <none> 5.4.8-1
/var/log/dpkg.log.3.gz:2019-12-13 22:57:06 install linux-image-5.3.0-3-amd64:amd64 <none> 5.3.15-1
/var/log/dpkg.log.4.gz:2019-11-26 07:32:13 install linux-image-5.3.0-2-amd64:amd64 <none> 5.3.9-3
/var/log/dpkg.log.5.gz:2019-10-03 10:42:22 install linux-image-5.2.0-3-amd64:amd64 <none> 5.2.17-1
This would also match with my suspicion that commit 93681cd7d94f83903cb3f0f95433d10c28a7e9a5 ("r8169: enable HW csum and TSO") caused this change in behavior, as this was introduced in 5.4. I hope this helps. Let me know if I should provide further information.
"and without ever manually disabling TSO. (I assume it was disabled by default, previously -- but I never checked that.)" Right, tx checksumming and TSO were disabled by default. Therefore it would be interesting to see whether you face the same issue if you enable SG/TSO on older kernel versions.
I just performed the following experiments:
1) boot linux-image-5.4.0-2-amd64:
- TSO is by default on and I see data transfer failures again.
- The data transfer errors disappear when I manually turn TSO off again => this confirms that this bug exists at least in kernel 5.4.8.
2) boot linux-image-5.3.0-3-amd64:
- TSO is by default off and I get no data transfer failures.
- I tried turning TSO on (using 'ethtool -K eth2 tso on'), but this results in an error ('Could not change any device features').
Some further info on the NICs with realtek chipsets:
[ 1.075345] r8169 0000:01:00.0 eth0: RTL8125, 00:8e:25:70:01:42, XID 609, IRQ 28
[ 1.075346] r8169 0000:01:00.0 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
This is a relatively new PCI-e NIC with RTL8125 chipset (XID 0x609 = RTL_GIGA_MAC_VER_61 I believe).
[ 1.078405] r8169 0000:03:00.0 eth1: RTL8168c/8111c, 00:e0:4c:68:21:ed, XID 3c4, IRQ 16
[ 1.078407] r8169 0000:03:00.0 eth1: jumbo features [frames: 6128 bytes, tx checksumming: ko]
This is an older PCI-e NIC with RTL8168c/8111c chipset (XID 0x3c4 = RTL_GIGA_MAC_VER_22 I believe).
[ 1.080646] r8169 0000:04:02.0 eth2: RTL8110s, 00:c0:49:f2:7c:11, XID 040, IRQ 17
[ 1.080648] r8169 0000:04:02.0 eth2: jumbo features [frames: 7152 bytes, tx checksumming: ok]
[ 1.312056] e1000e 0000:00:19.0 eth0: (PCI Express:2.5GT/s:Width x1) 00:1e:8c:f1:7a:cd
This is an old PCI NIC with RTL8110s chipset (XID 0x040 = RTL_GIGA_MAC_VER_03 I believe).
On the kernel I'm currently running (a self-compiled vanilla 5.5.13 kernel), TSO is by default disabled on all NICs, except the RTL8110s which is causing the issues.
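For anyone collecting the same details on another machine, here is a minimal sketch that pulls interface name, chip name and XID out of the r8169 probe lines in 'dmesg'. The function name r8169_chips is made up for illustration, and the pattern only assumes the line format shown in the log excerpts above; other kernel versions may print these lines differently.

```shell
#!/bin/sh
# r8169_chips (hypothetical helper): read kernel log text on stdin and print
# "<iface> <chip> <xid>" for every r8169 probe line.
# Assumed line format (taken from the excerpts in this thread):
#   r8169 <pci-addr> <iface>: <chip>, <mac>, XID <xid>, IRQ <n>
# The "jumbo features" lines do not match the pattern and are skipped.
r8169_chips() {
    sed -n 's/.*r8169 [0-9a-f:.]* \([a-z0-9]*\): \([^,]*\), [0-9a-f:]*, XID \([0-9a-f]*\),.*/\1 \2 \3/p'
}
```

Typical use would be: dmesg | r8169_chips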
"I tried turning TSO on (using 'ethtool -K eth2 tso on'), but this results in an error ('Could not change any device features')." Thanks for the further analysis. TSO depends on SG. To switch TSO on: ethtool -K eth2 tso on sg on
I tried that, too, but it doesn't work:
$ ethtool -K eth2 tso on sg on
Actual changes:
scatter-gather: on
    tx-scatter-gather: on
generic-segmentation-offload: on
$ ethtool -k eth2
Features for eth2:
rx-checksumming: on
tx-checksumming: off
    tx-checksum-ipv4: off
    tx-checksum-ip-generic: off [fixed]
    tx-checksum-ipv6: off [fixed]
    tx-checksum-fcoe-crc: off [fixed]
    tx-checksum-sctp: off [fixed]
scatter-gather: on
    tx-scatter-gather: on
    tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
    tx-tcp-segmentation: off [requested on]
    tx-tcp-ecn-segmentation: off [fixed]
    tx-tcp-mangleid-segmentation: off [requested on]
    tx-tcp6-segmentation: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
...
$ ethtool -K eth2 tso on sg on
Could not change any device features
Right, because tx checksumming is also off. Then: ethtool -K eth2 tso on sg on tx on
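To summarize the dependency chain from the last few comments (TSO needs SG, and both need tx checksumming), here is a small sketch that reads 'ethtool -k' output and lists which prerequisites are still off. The helper names feature_state and missing_tso_prereqs are hypothetical; the feature names are the ones ethtool printed above.

```shell
#!/bin/sh
# feature_state NAME (hypothetical helper): read `ethtool -k` output on stdin
# and print the state ("on" or "off") of the feature called NAME.
feature_state() {
    awk -v name="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)      # drop indentation of sub-features
            split(line, kv, ": ")
            if (kv[1] == name) {
                split(kv[2], st, " ")     # strip suffixes like "[fixed]"
                print st[1]
                exit
            }
        }'
}

# missing_tso_prereqs (hypothetical helper): read `ethtool -k` output on stdin
# and print every prerequisite of TSO that is not currently "on".
missing_tso_prereqs() {
    input=$(cat)
    for f in tx-checksumming scatter-gather; do
        state=$(printf '%s\n' "$input" | feature_state "$f")
        [ "$state" = "on" ] || echo "$f"
    done
}
```

Typical use would be: ethtool -k eth2 | missing_tso_prereqs
An empty result means that 'ethtool -K eth2 tso on' alone should be accepted; otherwise the listed features have to be switched on in the same call.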
Thanks, that works:
$ ethtool -K eth2 tso on sg on tx on
Actual changes:
tx-checksumming: on
    tx-checksum-ipv4: on
scatter-gather: on
    tx-scatter-gather: on
tcp-segmentation-offload: on
    tx-tcp-segmentation: on
    tx-tcp-mangleid-segmentation: on
generic-segmentation-offload: on
After that, I see the same transfer failures on kernel 5.3.
Not sure if it helps, but when such transfer failures happen, the kernel log contains the following line:
NETDEV WATCHDOG: eth2 (r8169): transmit queue 0 timed out
I'm wondering whether this NIC is buggy (like the RTL8168e-vl), or whether enabling TSO simply reveals another bug in the r8169 driver. I somewhat suspect the latter, as bug 206891 indeed seems related, although the behavior is different (there it only happens on a 32-bit kernel). If that is the case, then maybe commit a0783cd0c810504427777e8aae20d5f4f8b652a0, which disables TSO for RTL_GIGA_MAC_VER_22, is (also) only a workaround.
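A quick way to check whether the watchdog fired during a test run is to count these messages in the kernel log. The following is only a sketch: count_tx_timeouts is a made-up helper name, and it matches the message text exactly as quoted in this comment.

```shell
#!/bin/sh
# count_tx_timeouts IFACE (hypothetical helper): read kernel log text on
# stdin and print how many r8169 transmit-queue watchdog messages mention
# IFACE. The "|| true" keeps the exit status at 0 when the count is zero.
count_tx_timeouts() {
    grep -c -F "NETDEV WATCHDOG: $1 (r8169): transmit queue" || true
}
```

Typical use would be: dmesg | count_tx_timeouts eth2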
That's exactly what I'm trying to find out: whether a few chip versions have HW issues with TSO/SG, or whether there's a software issue somewhere. So far Realtek confirmed a HW TSO issue for chip version 34 only. To bisect a potential regression first we'd need to find an older kernel version that doesn't show the issue with manually enabled CSUM/SG/TSO. Or a SW issue has been there forever .. However, most chip versions seem to work fine with TSO, which makes it harder to explain why a driver issue should affect certain chip versions only.
When googling for r8169 and "transmit queue 0 timed out", I came across bug 199549. The root cause for that bug seems to be that the PHY isn't fully resumed yet when configuration starts. Which reminds me that I enable runtime PM for these NICs by setting the following parameters to "auto": /sys/bus/pci/devices/0000:01:00.0/power/control /sys/bus/pci/devices/0000:03:00.0/power/control /sys/bus/pci/devices/0000:04:02.0/power/control Probably far-fetched... but might there be some relation to that?
Unlikely, but I can't rule it out. It should be easy to test by disabling runtime PM for the affected card.
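For reference, a sketch of how the runtime PM setting of a single PCI device could be read and switched through sysfs. The helper names runtime_pm_get and runtime_pm_set and the SYSFS_PCI variable are assumptions made for this example; the path is the one quoted in the previous comment, and "on" (device always powered, i.e. runtime PM disabled) and "auto" are the two values the kernel accepts for power/control.

```shell
#!/bin/sh
# SYSFS_PCI is a hypothetical override so the helpers can be tried against a
# scratch directory; by default it points at the real sysfs tree.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

# runtime_pm_get PCI_ADDR: print the current power/control value.
runtime_pm_get() {
    cat "$SYSFS_PCI/$1/power/control"
}

# runtime_pm_set PCI_ADDR VALUE: write "on" (runtime PM disabled) or "auto"
# (runtime PM enabled). Any other value is rejected before touching sysfs.
runtime_pm_set() {
    case "$2" in
        on|auto) printf '%s\n' "$2" > "$SYSFS_PCI/$1/power/control" ;;
        *) echo "invalid value: $2 (expected on or auto)" >&2; return 1 ;;
    esac
}
```

For the affected card this would be (as root): runtime_pm_set 0000:04:02.0 on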
I booted kernel 5.3, left the runtime PM settings at their default value (which is "on" by the way -- "off" is not a value that's accepted), and manually enabled TSO. The transfer failures occurred again, so I think that rules it out.
"To bisect a potential regression first we'd need to find an older kernel version that doesn't show the issue with manually enabled CSUM/SG/TSO. Or a SW issue has been there forever .." How old do you think the kernel should be? I could try to download an older Debian kernel package and see if it still boots my computer. What about 4.19, 4.9, or even 3.16? https://packages.debian.org/search?suite=default§ion=all&arch=amd64&searchon=names&keywords=linux-image-amd64
4.19 and 4.9 should be checked, you don't have to go back as far as 3.16.
I'd like to check whether the issue might be related to fragmented frame handling. Can you test the following patch on top of 5.4 or 5.5? Thank you.
diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 791d99b9e..ba822e70d 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -4101,11 +4101,14 @@ static int rtl8169_xmit_frags(struct rtl8169_private *tp, struct sk_buff *skb,
 			goto err_out;
 		}
 
-		txd->opts1 = rtl8169_get_txd_opts1(opts[0], len, entry);
-		txd->opts2 = cpu_to_le32(opts[1]);
+		tp->tx_skb[entry].len = len;
 		txd->addr = cpu_to_le64(mapping);
+		txd->opts2 = cpu_to_le32(opts[1]);
 
-		tp->tx_skb[entry].len = len;
+		/* Complete memory writes before releasing descriptor */
+		dma_wmb();
+
+		txd->opts1 = rtl8169_get_txd_opts1(opts[0], len, entry);
 	}
 
 	if (cur_frag) {
-- 
2.26.0
I recompiled kernel 5.5.13 (vanilla) including the patch. But the issue is still there.
I just booted into the Debian oldstable 4.9 kernel. Status after boot:
$ ethtool -k eth2
Features for eth2:
rx-checksumming: on
tx-checksumming: off
    tx-checksum-ipv4: off
    tx-checksum-ip-generic: off [fixed]
    tx-checksum-ipv6: off [fixed]
    tx-checksum-fcoe-crc: off [fixed]
    tx-checksum-sctp: off [fixed]
scatter-gather: off
    tx-scatter-gather: off
    tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
    tx-tcp-segmentation: off
    tx-tcp-ecn-segmentation: off [fixed]
    tx-tcp-mangleid-segmentation: off
    tx-tcp6-segmentation: off [fixed]
...
In this configuration, I saw no data transfer failures. I then enabled the offloading:
$ ethtool -K eth2 tso on sg on tx on
Actual changes:
tx-checksumming: on
    tx-checksum-ipv4: on
scatter-gather: on
    tx-scatter-gather: on
tcp-segmentation-offload: on
    tx-tcp-segmentation: on
    tx-tcp-mangleid-segmentation: on
generic-segmentation-offload: on
...
And the errors are back again. Hence, this issue is either a hardware bug, or a driver issue that already existed in kernel 4.9.
For completeness: $ uname -a Linux <anonymized> 4.9.0-12-amd64 #1 SMP Debian 4.9.210-1 (2020-01-20) x86_64 GNU/Linux
One more question for better understanding: Do you observe the issue with TSO over IPv4 and/or IPv6?
Sorry, as you have the issue with RTL8110s: This chip version doesn't support TSO6 in general.
No issue. But I'm only using IPv4 in general. The file /etc/sysctl.conf contains the following lines:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
TSO has been disabled by default again due to problems with a few chip versions. You can re-test with the latest kernel versions, e.g. 5.6.4.
Today I compiled kernel 5.6.4 and can confirm that TSO is off by default for this adapter and that the issue therefore no longer occurs.