After update to 6.13, I see periodic r8169 tx timeouts on RTL8111H controller. During such events, it loses connection for a few seconds.

03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller (rev 15)
	Subsystem: ASUSTeK Computer Inc. Onboard RTL8111H Ethernet
	Kernel driver in use: r8169

[  422.114210] r8169 0000:03:00.0 enp3s0: NETDEV WATCHDOG: CPU: 4: transmit queue 0 timed out 5580 ms
[  422.114271] r8169 0000:03:00.0 enp3s0: ASPM disabled on Tx timeout
[  422.126986] r8169 0000:03:00.0 enp3s0: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
[  816.098500] r8169 0000:03:00.0 enp3s0: NETDEV WATCHDOG: CPU: 4: transmit queue 0 timed out 5140 ms
[  816.112159] r8169 0000:03:00.0 enp3s0: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).

The offending commit is b8bf38440ba94e8ed8e2ae55c5dfb0276d30e843:

    r8169: enable SG/TSO on selected chip versions per default

    Due to problem reports in the past SG and TSO/TSO6 are disabled per
    default. It's not fully clear which chip versions are affected, so we
    may impact also users of unaffected chip versions, unless they know
    how to use ethtool for enabling SG/TSO/TSO6.
    Vendor drivers r8168/r8125 enable SG/TSO/TSO6 for selected chip
    versions per default, I'd interpret this as confirmation that these
    chip versions are unaffected. So let's do the same here.

    Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Reverting it fixes the issue. So I assume RTL8111H is one of the affected chips.
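Editorial sketch: the watchdog lines in the report above have a regular shape, so counting timeouts and measuring the time between them can be automated. The Python below is only an illustration under that assumption. The regular expression is modelled on the lines quoted in this report, and the function names `find_tx_timeouts` and `gaps_between` are hypothetical helpers, not part of any existing tool.

```python
import re

# Pattern modelled on the watchdog lines quoted in this report; other kernel
# versions or drivers may word the message differently (assumption).
WATCHDOG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+r8169\s+\S+\s+(?P<iface>\S+):\s+"
    r"NETDEV WATCHDOG: CPU: \d+: "
    r"transmit queue (?P<queue>\d+) timed out (?P<ms>\d+) ms"
)


def find_tx_timeouts(log_text):
    """Return one dict per r8169 watchdog line found in log_text."""
    events = []
    for line in log_text.splitlines():
        m = WATCHDOG_RE.search(line)
        if m:
            events.append({
                "uptime_s": float(m["ts"]),
                "iface": m["iface"],
                "queue": int(m["queue"]),
                "stalled_ms": int(m["ms"]),
            })
    return events


def gaps_between(events):
    """Seconds of uptime between consecutive timeouts."""
    times = [e["uptime_s"] for e in events]
    return [later - earlier for earlier, later in zip(times, times[1:])]


# Lines copied from the report above.
SAMPLE = """\
[  422.114210] r8169 0000:03:00.0 enp3s0: NETDEV WATCHDOG: CPU: 4: transmit queue 0 timed out 5580 ms
[  422.114271] r8169 0000:03:00.0 enp3s0: ASPM disabled on Tx timeout
[  816.098500] r8169 0000:03:00.0 enp3s0: NETDEV WATCHDOG: CPU: 4: transmit queue 0 timed out 5140 ms
"""

if __name__ == "__main__":
    events = find_tx_timeouts(SAMPLE)
    for event in events:
        print(event)
    print(gaps_between(events))
```

Feeding it the text of `dmesg` (or a saved kernel log) instead of `SAMPLE` would list every timeout on the machine, which may help correlate the events with load or time of day.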
Thanks for the report. On my test system with RTL8168h I never experienced such issues. Do you have any special type of traffic? Can you say whether it's IPv4 or IPv6 traffic triggering the issue? For the time being you can use ethtool to disable SG/TSO/TSO6 on boot.
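Editorial sketch of the workaround suggested above: it builds an `ethtool -K` call that switches off SG, TSO and TSO6. The feature names `tx-tcp-segmentation` and `tx-tcp6-segmentation` are the ones `ethtool -k` prints later in this thread, and `sg` is ethtool's short name for scatter-gather. The interface name `enp3s0` comes from the report. The helper names are hypothetical, and how to run this at boot (systemd unit, udev rule, network manager hook) depends on the distribution and is not shown.

```python
import subprocess


def offload_off_command(iface):
    """Build an ethtool call that turns SG, TSO and TSO6 off for iface."""
    return [
        "ethtool", "-K", iface,
        "sg", "off",
        "tx-tcp-segmentation", "off",
        "tx-tcp6-segmentation", "off",
    ]


def disable_offloads(iface, dry_run=True):
    """Print the command; run it only when dry_run is False (needs root)."""
    cmd = offload_off_command(iface)
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # "enp3s0" is the interface name from the report; adjust as needed.
    disable_offloads("enp3s0")
```

The default is a dry run so that the command can be inspected first. Checking the result with `ethtool -k` afterwards is advisable, since the kernel may adjust dependent features on its own.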
> Do you have any special type of traffic? Can you say whether it's IPv4 or
> IPv6 traffic triggering the issue?

That's a server with a fair amount of traffic, serving multiple requests per second. I can't identify any particular workload that triggers the issue, and it doesn't seem to be tied to bandwidth; it just happens occasionally. Network is IPv4-only.

> For the time being you can use ethtool to disable SG/TSO/TSO6 on boot.

For now I just applied a revert patch for my kernel build.
I performed additional testing on other machines with RTL8111GR and RTL8125BG: they are not affected. So this seems to be specific to this particular piece of hardware.
Can you test also with vendor driver r8168? It enables TSO/TSO6 per default as well on RTL8168h.
Hmm, looks like r8168 does not produce the same behavior. The network works just fine. r8168 - 8.054.00, kernel - 6.13.1.

'ethtool -k' output is completely identical for both drivers. But just to be sure, here is some output.

03:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 15)
	Subsystem: ASUSTeK Computer Inc. Onboard RTL8111H Ethernet [1043:8677]
	Kernel driver in use: r8168

scatter-gather: on
	tx-scatter-gather: on
	tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
	tx-tcp-segmentation: on
	tx-tcp-ecn-segmentation: off [fixed]
	tx-tcp-mangleid-segmentation: off
	tx-tcp6-segmentation: on
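Editorial sketch: comparing `ethtool -k` output captured under two drivers can be done mechanically rather than by eye. The Python below assumes the `name: value` layout shown in the comment above; `parse_features` and `diff_features` are hypothetical helper names. The second sample is made up purely to show what a difference would look like and does not come from this report.

```python
def parse_features(text):
    """Map feature name -> value from 'name: value' lines of ethtool -k."""
    features = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        # Skip blank lines and headers such as "Features for enp3s0:".
        if sep and value.strip():
            features[name] = value.strip()
    return features


def diff_features(first, second):
    """Return {feature: (value_in_first, value_in_second)} where they differ."""
    names = sorted(set(first) | set(second))
    return {
        name: (first.get(name), second.get(name))
        for name in names
        if first.get(name) != second.get(name)
    }


# Output quoted in the comment above (r8168 driver).
R8168_OUTPUT = """\
scatter-gather: on
\ttx-scatter-gather: on
\ttx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
\ttx-tcp-segmentation: on
\ttx-tcp-ecn-segmentation: off [fixed]
\ttx-tcp-mangleid-segmentation: off
\ttx-tcp6-segmentation: on
"""

# Made-up variant with TSO6 switched off, for illustration only.
HYPOTHETICAL_OUTPUT = R8168_OUTPUT.replace(
    "tx-tcp6-segmentation: on", "tx-tcp6-segmentation: off"
)

if __name__ == "__main__":
    a = parse_features(R8168_OUTPUT)
    b = parse_features(HYPOTHETICAL_OUTPUT)
    print(diff_features(a, a))
    print(diff_features(a, b))
```

An empty result from `diff_features` corresponds to the "completely identical" observation above.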
Just for a sanity check, I tested (unpatched) r8169 again on 6.13.1 - it still produces the timeouts. I also ensured that both drivers load the same firmware (rtl_nic/rtl8168h-2.fw).

As far as I understand, RTL_GIGA_MAC_VER_46 stands for both 8111h and 8168h. Is there a way to distinguish them at runtime? If yes, maybe exempting 8111h from SG/TSO for now would be a good idea.
8111h and 8168h are basically the same chips. 8168h supports connecting a SPI flash chip, that's the only difference. One more question: Are you using jumbo packets or standard MTU?
Standard MTU.

2: enp3s0: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
Could you please test whether the following fixes the issue for you?

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 7306c8e32..931035a99 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -4185,6 +4185,9 @@ static int rtl8169_tx_map(struct rtl8169_private *tp, const u32 *opts, u32 len,
 	txd->addr = cpu_to_le64(mapping);
 	txd->opts2 = cpu_to_le32(opts[1]);

+	/* analog to rtl8169_mark_to_asic */
+	dma_wmb();
+
 	opts1 = opts[0] | len;
 	if (entry == NUM_TX_DESC - 1)
 		opts1 |= RingEnd;
--
2.48.1
Unfortunately, nothing changed.
I got a hint from Realtek. Could you please test with the following?
What's not clear yet is whether your report is the only one, as this chip version is very common.

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 7306c8e32..e14219d6a 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -519,7 +519,7 @@ enum rtl_tx_desc_bit_1 {
 	TD1_GTSENV4	= (1 << 26),		/* Giant Send for IPv4 */
 	TD1_GTSENV6	= (1 << 25),		/* Giant Send for IPv6 */
 #define GTTCPHO_SHIFT			18
-#define GTTCPHO_MAX			0x7f
+#define GTTCPHO_MAX			0x70

 	/* Second doubleword. */
 #define TCPHO_SHIFT			18
@@ -4436,17 +4436,13 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,

 	txd_last = tp->TxDescArray + entry;
 	txd_last->opts1 |= cpu_to_le32(LastFrag);
+	txd_first->opts1 |= cpu_to_le32(DescOwn | FirstFrag);
 	tp->tx_skb[entry].skb = skb;

 	skb_tx_timestamp(skb);

-	/* Force memory writes to complete before releasing descriptor */
-	dma_wmb();
-
 	door_bell = __netdev_sent_queue(dev, skb->len, netdev_xmit_more());

-	txd_first->opts1 |= cpu_to_le32(DescOwn | FirstFrag);
-
 	/* rtl_tx needs to see descriptor changes before updated tp->cur_tx */
 	smp_wmb();

--
2.48.1
Nope, still produces the same behavior.

> What's not clear yet is whether your report is the only one, as this chip
> version is very common.

Well, if I had to guess:

* It is hard to reproduce; I still don't know what exactly triggers it. It would be good to have a stable reproduction pattern. But for now regular desktop usage doesn't seem to be problematic, and not a lot of people use such consumer-grade boards as a server with fairly high congestion.
* It is hard to notice. It doesn't crash the system and just disrupts the connection for a few seconds. I actually discovered it not because I was having trouble, but because I have a habit of monitoring dmesg.
* Even if someone does notice it, connecting the dots is not trivial here.
* 6.13 is still very new; most people don't use bleeding-edge kernels, especially on servers.

---

A bit offtopic, but as I said before, my other board with the 8111G chip ironically doesn't have SG/TSO enabled by default, but works totally fine when I enable it manually.
Created attachment 307584 [details]
dmesg output of upstream kernel

I think I have the same issue here. It started when upgrading from 5.14.21 to 6.3.0.
After posting this to my distro bugzilla, I was told to try the latest upstream kernel (done), but still facing the issue with 6.13.0.

This is my NIC:
-----
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller (rev 03)
	Subsystem: Realtek Semiconductor Co., Ltd. RTL8111/8168 PCI Express Gigabit Ethernet controller
	Flags: bus master, fast devsel, latency 0, IRQ 16, NUMA node 0
	I/O ports at e800 [size=256]
	Memory at fe8ff000 (64-bit, non-prefetchable) [size=4K]
	Memory at fdffc000 (64-bit, prefetchable) [size=16K]
	Expansion ROM at fe8c0000 [disabled] [size=128K]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [70] Express Endpoint, MSI 01
	Capabilities: [ac] MSI-X: Enable+ Count=4 Masked-
	Capabilities: [cc] Vital Product Data
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [140] Virtual Channel
	Capabilities: [160] Device Serial Number 4e-3b-00-00-68-4c-e0-00
	Kernel driver in use: r8169
-----

IPv4 only setup. I'm attaching the dmesg output from the upstream kernel.
(In reply to Alexander Maus from comment #13)
> Created attachment 307584 [details]
> dmesg output of upstream kernel
>
> I think I have the same issue here. It started when upgrading from 5.14.21
> to 6.3.0.
> After posting this to my distro bugzilla, I was told to try the latest
> upstream kernel (done), but still facing the issue with 6.13.0.
>
> This is my NIC:
> -----
> 02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
> RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller (rev 03)
> Subsystem: Realtek Semiconductor Co., Ltd. RTL8111/8168 PCI Express
> Gigabit Ethernet controller
> Flags: bus master, fast devsel, latency 0, IRQ 16, NUMA node 0
> I/O ports at e800 [size=256]
> Memory at fe8ff000 (64-bit, non-prefetchable) [size=4K]
> Memory at fdffc000 (64-bit, prefetchable) [size=16K]
> Expansion ROM at fe8c0000 [disabled] [size=128K]
> Capabilities: [40] Power Management version 3
> Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
> Capabilities: [70] Express Endpoint, MSI 01
> Capabilities: [ac] MSI-X: Enable+ Count=4 Masked-
> Capabilities: [cc] Vital Product Data
> Capabilities: [100] Advanced Error Reporting
> Capabilities: [140] Virtual Channel
> Capabilities: [160] Device Serial Number 4e-3b-00-00-68-4c-e0-00
> Kernel driver in use: r8169
> -----
>
> IPv4 only setup. I'm attaching the dmesg output from the upstream kernel.

It's not the same issue (just same symptom) because you have different chip version RTL8168d, where TSO isn't enabled per default. And you apparently have the issue since 6.3. If it was ok with 5.14, then please bisect to find the offending commit:
https://docs.kernel.org/admin-guide/bug-bisect.html
Ok, weird situation. I got a timeout on a completely different machine with 8111G. And SG/TSO is *off* on it!

[11699.331067] r8169 0000:01:00.0 eth0: NETDEV WATCHDOG: CPU: 1: transmit queue 0 timed out 5530 ms

01:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 11)
	Subsystem: ASRock Incorporation Motherboard (one of many) [1849:8168]
	Kernel driver in use: r8169

It's only a single event during a few days of operation though, so not nearly as critical, in contrast with the 8111H bugging out every half an hour. Maybe r8169 has some subtle flaw and SG/TSO on my board with 8111H just triggers it more visibly? And I never had this issue before 6.13.

I'll try running r8168 for a longer time to make sure it is not affected.
> It's not the same issue (just same symptom) because you have different chip
> version RTL8168d, where TSO isn't enabled per default. And you apparently
> have the issue since 6.3. If it was ok with 5.14, then please bisect to find
> the offending commit:
> https://docs.kernel.org/admin-guide/bug-bisect.html

OK, will try to bisect (which might take a while as the problem pops up only about once a week) and then open an extra bug. Thx!
FYI, the single timeout in my earlier message seemingly was just a fluke, probably caused by problems on the other side. After a couple of weeks of operation, 8111G seems to be totally stable regardless of the driver (r8169 or r8168) and of SG/TSO being on or off.

So I replaced this machine completely. Unfortunately, that means I won't be able to provide further experimental data.

In case you wonder what would have happened had this issue been resolved: it was only partially responsible for the replacement decision. That board also had other problems, like a flaky onboard SATA controller (which is also a known issue, https://bugzilla.kernel.org/show_bug.cgi?id=201693).
Thanks a lot for the follow-up!