Bug 209839 - r8169 (RTL8125B): "rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100)" and connectivity loss, caused by small fragmented datagrams
Status: REOPENED
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_network@kernel-bugs.osdl.org
Reported: 2020-10-24 10:15 UTC by WGH
Modified: 2021-06-02 12:04 UTC

Kernel Version: 5.9.1
Tree: Mainline
Regression: No


Attachments
pcap when the hang occurred (322.83 KB, application/gzip)
2020-10-28 02:00 UTC, WGH
trace when the hang occurred (27.69 KB, application/gzip)
2020-10-28 02:01 UTC, WGH
distilled hang pcap (3.00 KB, application/vnd.tcpdump.pcap)
2020-10-29 11:24 UTC, WGH
full dmesg (133.99 KB, text/plain)
2021-01-05 10:02 UTC, Priit O.
30 packets to cause crash (20.43 KB, application/octet-stream)
2021-01-05 18:01 UTC, Priit O.
Test patch v1 (3.36 KB, patch)
2021-01-22 10:18 UTC, Heiner Kallweit
Test patch v2 (2.66 KB, patch)
2021-01-22 20:21 UTC, Heiner Kallweit

Description WGH 2020-10-24 10:15:05 UTC
This is a new RTL8125B chip that is supported only since Linux 5.9.

eth0: RTL8125B, [mac address redacted], XID 641, IRQ 42

I can trigger this problem semi-reliably by attempting to connect via RDP to a Windows 7 VM running on this machine. The VM in question uses macvtap (I'll check with a bridge whether that's something important or not, and report back).

The general symptom is that connectivity is completely lost on the host machine, and it recovers by itself after a minute or so.

The first time, I got a generic stack trace:

[94171.732285] ------------[ cut here ]------------
[94171.732301] NETDEV WATCHDOG: enp6s0 (r8169): transmit queue 0 timed out
[94171.732319] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:442 dev_watchdog+0x232/0x240
[94171.732321] Modules linked in: macvtap macvlan tap veth xt_MASQUERADE xt_CHECKSUM xt_comment bridge stp llc ip6table_raw ip6table_nat iptable_raw iptable_nat bpfilter fuse btrfs blake2b_generic xor zstd_compress lzo_compress raid6_pq sctp kvm_amd kvm amdgpu irqbypass mfd_core gpu_sched ttm ghash_clmulni_intel nct6775 hwmon_vid k10temp efivarfs
[94171.732345] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.9.1-gentoo #6
[94171.732348] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B550 Extreme4, BIOS P1.20 08/13/2020
[94171.732353] RIP: 0010:dev_watchdog+0x232/0x240
[94171.732357] Code: 85 c0 75 e5 eb 9c 4c 89 ef c6 05 c5 e6 c9 00 01 e8 b3 1d fc ff 44 89 e1 48 89 c2 4c 89 ee 48 c7 c7 c8 12 a8 91 e8 1e 9d 7e ff <0f> 0b e9 7a ff ff ff 0f 1f 80 00 00 00 00 0f 1f 44 00 00 48 c7 47
[94171.732360] RSP: 0018:ffff9e8d00003ea0 EFLAGS: 00010286
[94171.732363] RAX: 0000000000000000 RBX: ffff8c1aa331d600 RCX: 0000000000000000
[94171.732365] RDX: ffff8c1aaea278a0 RSI: ffff8c1aaea17820 RDI: 0000000000000300
[94171.732368] RBP: ffff8c1aa306a440 R08: ffff8c1aaea17820 R09: 00000000000006e3
[94171.732370] R10: ffffffff922cdd78 R11: ffff9e8d00003d48 R12: 0000000000000000
[94171.732372] R13: ffff8c1aa306a000 R14: ffff8c1aa306a440 R15: 0000000000000000
[94171.732374] FS:  0000000000000000(0000) GS:ffff8c1aaea00000(0000) knlGS:0000000000000000
[94171.732377] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[94171.732380] CR2: 00007f8d50009048 CR3: 0000001e9b98a000 CR4: 0000000000350ef0
[94171.732382] Call Trace:
[94171.732385]  <IRQ>
[94171.732390]  ? qdisc_put_unlocked+0x30/0x30
[94171.732395]  call_timer_fn+0x2d/0x130
[94171.732399]  run_timer_softirq+0x393/0x450
[94171.732403]  ? tick_sched_handle.isra.0+0x40/0x40
[94171.732405]  ? __hrtimer_run_queues+0xfd/0x260
[94171.732408]  ? ktime_get+0x4a/0xc0
[94171.732412]  __do_softirq+0xe1/0x2bf
[94171.732416]  asm_call_irq_on_stack+0x12/0x20
[94171.732418]  </IRQ>
[94171.732423]  do_softirq_own_stack+0x36/0x40
[94171.732427]  irq_exit_rcu+0x9a/0xa0
[94171.732432]  sysvec_apic_timer_interrupt+0x2e/0x80
[94171.732435]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[94171.732440] RIP: 0010:cpuidle_enter_state+0xd5/0x380
[94171.732443] Code: c4 0f 1f 44 00 00 31 ff e8 88 77 91 ff 80 7c 24 0f 00 74 12 9c 58 f6 c4 02 0f 85 8d 02 00 00 31 ff e8 7f c0 96 ff fb 45 85 f6 <0f> 88 20 01 00 00 49 63 c6 be 68 00 00 00 4c 2b 24 24 48 89 c2 48
[94171.732446] RSP: 0018:ffffffff91c03e58 EFLAGS: 00000202
[94171.732448] RAX: ffff8c1aaea2a5c0 RBX: ffff8c1aa5d6b000 RCX: 000000000000001f
[94171.732451] RDX: 0000000000000000 RSI: 0000000021bf5c7a RDI: 0000000000000000
[94171.732453] RBP: ffffffff91cdce00 R08: 000055a610a6578d R09: ffff8c1aa77c9000
[94171.732455] R10: ffff8c1aaea29584 R11: ffff8c1aaea29564 R12: 000055a610a6578d
[94171.732458] R13: ffffffff91cdcee8 R14: 0000000000000002 R15: ffff8c1aa5d6b000
[94171.732463]  ? cpuidle_enter_state+0xb8/0x380
[94171.732466]  cpuidle_enter+0x37/0x60
[94171.732470]  do_idle+0x1c9/0x240
[94171.732473]  cpu_startup_entry+0x19/0x20
[94171.732476]  start_kernel+0x50a/0x52c
[94171.732480]  secondary_startup_64+0xa4/0xb0
[94171.732484] ---[ end trace 896922ae98389a20 ]---

However, this stack trace was also followed by:
[94171.744374] r8169 0000:06:00.0 enp6s0: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
[94171.752789] r8169 0000:06:00.0 enp6s0: rtl_rxtx_empty_cond_2 == 0 (loop: 42, delay: 100).

The next times this was triggered, no stack trace was reported, but these two messages invariably appeared.

I didn't see this problem before installing a VM and trying to connect to it, even though I had relatively heavy network activity for a week (it's a new machine).

I'll continue debugging, and try to eliminate some extra variables. As Heiner Kallweit suggested in bug 203703, comment 27, I'll also try various 5.9-rc kernels, though bisecting might take some time.
Comment 1 WGH 2020-10-24 10:44:45 UTC
The problem also occurs when the VM and the host are connected through a bridge interface, so that likely rules out macvtap as the cause.
Comment 2 Heiner Kallweit 2020-10-27 20:19:49 UTC
On tx timeout a chip reset is done and connectivity should recover. However, depending on how confused the chip is, even this may not help 100% (indicated by the error messages that the tx/rx FIFOs aren't empty).
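
The "(loop: 42, delay: 100)" part of the two messages reads like the parameters of a bounded polling loop: the condition was checked 42 times, with a pause of 100 (presumably microseconds) between checks, and never became true. A minimal Python sketch of that pattern, assuming this reading of the message format; loop_wait and its parameters are hypothetical names, not the driver's code:

```python
import time


def loop_wait(check, delay_us=100, loops=42, sleep=time.sleep):
    """Poll check() up to `loops` times, pausing `delay_us` microseconds
    after each failed attempt. Returns True as soon as the condition holds,
    False if it never did (the case the log messages report)."""
    for _ in range(loops):
        if check():
            return True
        sleep(delay_us / 1_000_000)
    return False


if __name__ == "__main__":
    # A condition that never becomes true, like a FIFO that never drains.
    if not loop_wait(lambda: False):
        print("condition == 0 (loop: 42, delay: 100)")
```

With these numbers the reset path gives up after roughly 4.2 ms of waiting per condition, which would explain why both messages show up within milliseconds of the watchdog warning.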

I wonder what could be special with RDP traffic. It should be normal TCP traffic. What may help to analyze the issue:
- trace when the tx timeout occurs. Enable trace events (under /sys/kernel/debug/tracing/events/net):
  - net_dev_start_xmit
  - net_dev_xmit
  - net_dev_xmit_timeout
- a capture of the network traffic when tx times out
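
A minimal sketch of how the three events listed above could be switched on, assuming debugfs is mounted at /sys/kernel/debug and the script runs as root. Each event directory contains an "enable" file, and writing 1 to it turns the event on; the helper names are made up for illustration:

```python
from pathlib import Path

TRACING_ROOT = Path("/sys/kernel/debug/tracing")
EVENTS = ("net_dev_start_xmit", "net_dev_xmit", "net_dev_xmit_timeout")


def enable_files(root=TRACING_ROOT, events=EVENTS):
    """Return the per-event 'enable' files under events/net."""
    return [Path(root) / "events" / "net" / name / "enable" for name in events]


def enable_events(root=TRACING_ROOT, events=EVENTS):
    """Write '1' to every 'enable' file (needs root on a real system)."""
    for path in enable_files(root, events):
        path.write_text("1")


if __name__ == "__main__":
    enable_events()
    print("events enabled; read the result from", TRACING_ROOT / "trace")
```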
Comment 3 WGH 2020-10-28 02:00:53 UTC
Created attachment 293271 [details]
pcap when the hang occurred
Comment 4 WGH 2020-10-28 02:01:18 UTC
Created attachment 293273 [details]
trace when the hang occurred
Comment 5 WGH 2020-10-28 02:01:53 UTC
The hang doesn't occur immediately. It appears to happen almost as soon as the desktop is rendered for the first time, and only rarely much later.

The network topology is as follows: br0 bridge interface with enp6s0 (ethernet adapter) and vnet0 (VM) enslaved. I captured the traffic on the enp6s0 interface.

I have edited the trace by adding the realtime column so it's easier to match trace with the pcap.

I have randomized MAC addresses in the pcap, and removed a stray IPv6 multicast packet that has my public address in it. I can send you an unedited dump privately, if you need it.

I didn't fiddle with ethtool settings, everything was left at their defaults (I know that GSO/TSO sometimes produce funny oversized packets in pcaps, though it doesn't seem to be the case this time).
Comment 6 Heiner Kallweit 2020-10-28 10:08:23 UTC
Thanks for the logs. The only thing that looks interesting at first glance is the fragmented packets in the pcap log. Not sure, however, whether this could conflict with the following in the r8169 driver:

		/* The driver does not support incoming fragmented frames.
		 * They are seen as a symptom of over-mtu sized frames.
		 */
		if (unlikely(rtl8169_fragmented_frame(status))) {
			dev->stats.rx_dropped++;
			dev->stats.rx_length_errors++;
			goto release_descriptor;
		}
Comment 7 WGH 2020-10-29 11:21:59 UTC
Your intuition was right, it does appear it has something to do with fragmented packets.

I have distilled the test case to 3 packets you can simply send with tcpreplay to get the hang with 100% reliability (I wish every driver/hardware bug was like that!).

(As a side note, I find it odd that Windows decided to use fragmented IP datagrams. To my knowledge, IP fragmentation is something that is usually avoided. It seems that the qemu device advertises TSO support, but instead of being split into TCP segments, oversized packets get to the bridge device and get fragmented on Ethernet output. That is likely a bug on its own, but still, IP fragments are not something that should hang the network card driver. Disabling Large Send Offload on the Windows side prevents this fragmentation.)
Comment 8 WGH 2020-10-29 11:24:01 UTC
Created attachment 293293 [details]
distilled hang pcap

sudo tcpreplay -i enp6s0 distilled_hang1.pcap
Comment 9 Heiner Kallweit 2020-10-29 12:16:57 UTC
The tx timeout is triggered by the net core if the tx queue has been stopped for too long and there are pending packets. Can't say how this relates to the dropped incoming fragmented packets, IOW: who stops the tx queue.
The fragmented rx packets are dropped only in the driver, therefore I'd expect that the rx/tx hw should be fine. Well, unless this chip version has a hw issue with fragmented rx packets.
Not sure whether an upper layer could have stopped the tx queue due to the missing (dropped) packets.
Comment 10 Heiner Kallweit 2020-10-29 12:18:31 UTC
What we could do for now is add a WARN_ONCE if we detect a fragmented incoming packet, knowing that this can cause trouble.
Comment 11 WGH 2020-10-29 12:26:31 UTC
I don't really have expertise with this driver and hardware, but at first glance that particular comment about fragmented frames seems to be concerned with receiving large Ethernet frames when each ring buffer entry is too small (smaller than the actual MTU). Fragmentation on the IP level has nothing to do with it. Unless there's some weird fragmentation/segmentation offload interaction that causes trouble, of course.
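
To illustrate the distinction: a check like the one quoted in comment 6 looks at per-descriptor status flags, which say whether one Ethernet frame fit into a single receive buffer, and not at anything in the IP header. A minimal Python sketch, assuming a status word with separate "first" and "last" flag bits; the bit positions and names below are illustrative assumptions, not taken from the driver:

```python
FIRST_FRAG = 1 << 29  # assumed: this descriptor holds the start of a frame
LAST_FRAG = 1 << 28   # assumed: this descriptor holds the end of a frame


def spans_multiple_descriptors(status: int) -> bool:
    """True when a frame did not fit into one receive buffer, i.e. the
    descriptor is not both the first and the last piece of the frame."""
    whole_frame = FIRST_FRAG | LAST_FRAG
    return (status & whole_frame) != whole_frame


if __name__ == "__main__":
    print(spans_multiple_descriptors(FIRST_FRAG | LAST_FRAG))  # frame fits
    print(spans_multiple_descriptors(FIRST_FRAG))              # frame continues
```

An IP fragment that fits into one buffer has both flags set and passes such a check, whatever its IP fragment offset is.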
Comment 12 WGH 2020-10-29 12:28:51 UTC
Oh, just to clarify, I'm running tcpreplay on the machine that has the RTL8125B network card, i.e. I'm sending, not receiving.
Comment 13 Heiner Kallweit 2020-11-02 10:12:47 UTC
To verify whether the mentioned fragmented packets rx path is hit:
If the issue occurs, do you see rx length errors here?
/sys/class/net/<if>/statistics/rx_length_errors
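
A small sketch for checking this counter together with its neighbours, assuming the usual sysfs layout where every file under statistics/ holds one decimal number. The function name and the sysfs_root parameter are hypothetical, added so the code can be pointed at a test directory:

```python
from pathlib import Path


def nonzero_stats(iface, sysfs_root="/sys/class/net"):
    """Return {counter_name: value} for every non-zero interface counter."""
    stats_dir = Path(sysfs_root) / iface / "statistics"
    result = {}
    for entry in sorted(stats_dir.iterdir()):
        value = int(entry.read_text().strip())
        if value:
            result[entry.name] = value
    return result


if __name__ == "__main__":
    import sys
    # Interface name is passed on the command line, e.g. enp6s0.
    for name, value in nonzero_stats(sys.argv[1]).items():
        print(f"{name}: {value}")
```

If rx_length_errors is missing from the output after a hang, the receive-side drop path quoted in comment 6 was not hit.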
Comment 14 WGH 2020-11-02 10:54:07 UTC
It's zero. The only non-zero, in fact, are multicast, rx_bytes, rx_packets, tx_bytes and tx_packets.
Comment 15 WGH 2020-11-02 11:04:22 UTC
FWIW, tcpreplaying the pcap from another machine, and watching tcpdump on the affected one, has no ill effects. Packets are visible in tcpdump, but nothing else happens. No error counters increase either.
Comment 16 Heiner Kallweit 2020-11-02 11:46:16 UTC
Ah, then we don't talk about the rx path. If you tcpreplay the "distilled hang" pcap, does the traffic look ok on the wire?
Comment 17 WGH 2020-11-02 11:52:20 UTC
> Ah, then we don't talk about the rx path.

Sorry for not making it more clear in the beginning.

> If you tcpreplay the "distilled hang" pcap, does the traffic look ok on the
> wire?

The other machine reliably sees only the first 2 packets (out of 3).
Comment 18 WGH 2020-11-02 11:56:55 UTC
I mean it's always 2 out of 3, not that the 3rd packet appears unreliably.
Comment 19 Heiner Kallweit 2020-11-02 13:51:55 UTC
OK. Hmm, maybe a HW bug. Can you test whether the same issue occurs with the r8125 vendor driver?
There's no errata info available from Realtek, but I can check with a contact in Realtek whether they are aware of any such issue.
Comment 20 WGH 2020-11-02 15:25:34 UTC
It's exactly the same with the out-of-tree r8125, except it doesn't say anything in dmesg when the hang happens, and never recovers on its own.
Comment 21 Heiner Kallweit 2020-11-02 19:39:07 UTC
Realtek was able to reproduce the issue. RTL8125B may suffer from a similar issue with short packets like RTL8168evl.
Could you please check whether the following fixes the issue for you?

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 7e0947e29..375f451cc 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -4053,7 +4053,13 @@ static int rtl8169_xmit_frags(struct rtl8169_private *tp, struct sk_buff *skb,
 
 static bool rtl_test_hw_pad_bug(struct rtl8169_private *tp, struct sk_buff *skb)
 {
-	return skb->len < ETH_ZLEN && tp->mac_version == RTL_GIGA_MAC_VER_34;
+	switch (tp->mac_version) {
+	case RTL_GIGA_MAC_VER_34:
+	case RTL_GIGA_MAC_VER_63:
+		return skb->len < ETH_ZLEN;
+	default:
+		return false;
+	}
 }
 
 static void rtl8169_tso_csum_v1(struct sk_buff *skb, u32 *opts)
-- 
2.29.2
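
The decision made by the patched rtl_test_hw_pad_bug() can be modelled in a few lines. The sketch below is a user-space Python illustration, not driver code: ETH_ZLEN is the kernel's 60-byte minimum frame length (without FCS), the numbers 34 and 63 stand for RTL_GIGA_MAC_VER_34 and RTL_GIGA_MAC_VER_63 from the diff, and the function names and the zero-byte padding are assumptions made for the example:

```python
ETH_ZLEN = 60                      # minimum Ethernet frame length without FCS
AFFECTED_MAC_VERSIONS = {34, 63}   # chip versions listed in the patch


def needs_sw_padding(frame_len: int, mac_version: int) -> bool:
    """Mirror of the patched check: short frame on an affected chip."""
    return mac_version in AFFECTED_MAC_VERSIONS and frame_len < ETH_ZLEN


def pad_frame(frame: bytes, mac_version: int) -> bytes:
    """Pad a short frame to ETH_ZLEN in software before handing it to the
    chip, so the hardware never has to pad it itself."""
    if needs_sw_padding(len(frame), mac_version):
        return frame + bytes(ETH_ZLEN - len(frame))
    return frame


if __name__ == "__main__":
    short = bytes(39)  # 14-byte Ethernet header + 25-byte IP datagram
    print(len(pad_frame(short, 63)), len(pad_frame(short, 61)))
```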
Comment 22 WGH 2020-11-02 19:46:22 UTC
Seems to help! No hangs, and the other machine receives all 3 packets fine as well.
Comment 23 Heiner Kallweit 2020-11-02 19:54:28 UTC
Great! Then I'll check with Realtek whether other chip versions may be affected too before submitting the change.

This way bug analysis is fun: Realtek responds within 2 hrs, and you test within 10 mins.
Comment 24 WGH 2020-11-02 19:56:44 UTC
My own e-mail to Realtek that I sent earlier today seems to have been devnulled, though, given the lack of a positive delivery status notification :)
Comment 25 Heiner Kallweit 2020-11-10 07:16:35 UTC
Fixed with 2aaf09a0e784 ("r8169: work around short packet hw bug on RTL8125").
Comment 26 WGH 2020-11-26 15:53:32 UTC
The fix was released in Linux 5.9.7.
Comment 27 Priit O. 2020-12-17 11:11:30 UTC
Was this fix removed in 5.10.0 RC5?
I still get this error regularly when torrenting. Have not tried 5.9 because ... ryzen 5950x and it works better with 5.10.

[1274670.281163] r8169 0000:25:00.0 huginn: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
[1274670.289901] r8169 0000:25:00.0 huginn: rtl_rxtx_empty_cond_2 == 0 (loop: 42, delay: 100).
[1274680.304429] r8169 0000:25:00.0 huginn: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
[1274680.313171] r8169 0000:25:00.0 huginn: rtl_rxtx_empty_cond_2 == 0 (loop: 42, delay: 100).
[1275010.329033] r8169 0000:25:00.0 huginn: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
[1275010.337773] r8169 0000:25:00.0 huginn: rtl_rxtx_empty_cond_2 == 0 (loop: 42, delay: 100).
[1275020.355642] r8169 0000:25:00.0 huginn: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
[1275020.364378] r8169 0000:25:00.0 huginn: rtl_rxtx_empty_cond_2 == 0 (loop: 42, delay: 100).
[1275375.340078] r8169 0000:25:00.0 huginn: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
[1275375.348817] r8169 0000:25:00.0 huginn: rtl_rxtx_empty_cond_2 == 0 (loop: 42, delay: 100).
[1275390.273322] r8169 0000:25:00.0 huginn: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
[1275390.282058] r8169 0000:25:00.0 huginn: rtl_rxtx_empty_cond_2 == 0 (loop: 42, delay: 100).

$ lspci | grep RTL
25:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller (rev 04)
Comment 28 Heiner Kallweit 2020-12-17 11:18:54 UTC
Not sure whether the fix made it to 5.10-rc5 or to a later rc version only. Please test with 5.10.1. There the fix is included.
Comment 29 Priit O. 2021-01-05 05:54:47 UTC
Nope, not fixed. It's still here.
jaan  05 07:50:16 Zen kernel: r8169 0000:25:00.0 huginn: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
jaan  05 07:50:16 Zen kernel: r8169 0000:25:00.0 huginn: rtl_rxtx_empty_cond_2 == 0 (loop: 42, delay: 100).
jaan  05 07:51:56 Zen kernel: r8169 0000:25:00.0 huginn: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
jaan  05 07:51:56 Zen kernel: r8169 0000:25:00.0 huginn: rtl_rxtx_empty_cond_2 == 0 (loop: 42, delay: 100).
[deemon@Zen ~]$ uname -a
Linux Zen 5.10.2-2-MANJARO #1 SMP PREEMPT Tue Dec 22 08:14:42 UTC 2020 x86_64 GNU/Linux
Comment 30 Heiner Kallweit 2021-01-05 07:40:26 UTC
The patch from comment 21 is included in 5.10.2. And in comment 22 it was confirmed that it fixes the issue.
So maybe we're talking about something different here. Do you also see the transmit queue timeouts? Because the log messages are triggered from the chip reset procedure after a tx timeout.

Best test with a mainline non-preempt kernel to rule out that the issue is caused by downstream kernel patches.
Comment 31 Priit O. 2021-01-05 08:41:46 UTC
This is Manjaro default "mainline" kernel. There is also "real-time" variant for audio production people and such. No non-preempt options.

I have no idea what this "transmit queue" even is, but I do lose connection when this thing is happening (pings get lost for like 5-15 seconds.)

But maybe I don't have this "RTL8125B" but just "RTL8125" without B?
$ lspci | grep -i realtek
25:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller (rev 04)

Also, I'm no wizard who can create such packets for testing myself, but for me the problem appears every time I use qbittorrent to download stuff. It's quite an easy way for me to check whether the problem is still there.
Comment 32 Heiner Kallweit 2021-01-05 09:47:09 UTC
Are you sure that Manjaro didn't add its own patches? Best compile a mainline kernel yourself from kernel.org sources.
And please attach full dmesg output.
Comment 33 Priit O. 2021-01-05 10:02:06 UTC
Created attachment 294501 [details]
full dmesg
Comment 34 Heiner Kallweit 2021-01-05 10:19:00 UTC
OK, so it is a RTL8125B. Please provide the output of "ethtool -i huginn" to check for the right firmware version being loaded.
Comment 35 Priit O. 2021-01-05 10:51:49 UTC
$ ethtool -i huginn
driver: r8169
version: 5.10.2-2-MANJARO
firmware-version: rtl8125b-2_0.0.2 07/13/20
expansion-rom-version: 
bus-info: 0000:25:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no
Comment 36 Heiner Kallweit 2021-01-05 11:07:46 UTC
All this looks normal. Just strange that the link breaks immediately once it's established. You could re-test with a different cable and best also a different link partner. Maybe it's a platform issue as your dmesg shows several other errors. It would be helpful if you could check whether any special network packets trigger the issue.
Last but not least please check whether the same happens with the r8125 vendor driver.
Comment 37 Priit O. 2021-01-05 11:28:49 UTC
I have never experienced such problem when qbittorrent is not running.
And I have a NAS, so I move big files back and forth over the network quite often and never have this link drop problem then -- this should eliminate the cable problem (sadly I can't swap it out easily, as it is kind of built into the wall).
Link partner is my only router, so not really? I could put some switch in-between, but I doubt it will change much.

About platform errors -- maybe. Yes, Zen3 is still pretty raw, and I am eagerly waiting for MSI to provide newer stable BIOSes, but they are slow and release only betas right now, so... I don't feel like being their beta tester with the possibility of bricking the board :D (just yesterday they released a newer beta, with "Update to ComboAM4PIV2 1.1.9.0", but ... it's still a beta, sadly).

About special network packets -- I have no idea how. If you provide some more handholding or guide with what packets to test and how, I could try, maybe.
Comment 38 WGH 2021-01-05 11:37:48 UTC
> About special network packets -- I have no idea how. If you provide some more
> handholding or guide with what packets to test and how, I could try, maybe.

The tcpreplay command I posted earlier is literally all you need to run.
Comment 39 Priit O. 2021-01-05 11:42:05 UTC
Ah yes, and I never experienced this link drop problem with my other (previous) computer on the same router port with the same cable with qbittorrent. So I assume it's an issue with either this NIC or the NIC driver, not the "other end of the link". But I can do the switch in-between test later. I will also look into the tcpreplay thing a bit later when back at home.
Comment 40 Priit O. 2021-01-05 14:34:11 UTC
Downloaded the distilled_hang1.pcap and ran the command on the NIC. 20 times. No hang. How can I create my own version? :)
Comment 41 WGH 2021-01-05 14:53:44 UTC
It's Heiner's call, but I'd say that yours is a different bug.

> How can I create my own version?

Capture the network traffic with Wireshark or tcpdump when the hang occurs. You'll likely notice TCP retransmissions when it happens. If tcpreplaying it gives the same problem - good news, you're mostly there. Wireshark lets you export a selected subset of packets to a new file, which may help to minimize the testcase.

If _incoming_ traffic causes the problem, then you might need to replay the traffic from a different machine.
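
Since the earlier fix pointed at short frames, one way to narrow a large capture down is to list the records that are shorter than the 60-byte Ethernet minimum and start the minimization around them. A minimal sketch using only the Python standard library, assuming a classic pcap file (not pcapng, and not gzip-compressed); short_frames is a hypothetical helper name:

```python
import struct

# First four bytes of a classic pcap file mapped to the struct byte order.
PCAP_MAGICS = {
    b"\xd4\xc3\xb2\xa1": "<",  # little-endian, microsecond timestamps
    b"\xa1\xb2\xc3\xd4": ">",  # big-endian, microsecond timestamps
    b"\x4d\x3c\xb2\xa1": "<",  # little-endian, nanosecond timestamps
    b"\xa1\xb2\x3c\x4d": ">",  # big-endian, nanosecond timestamps
}


def short_frames(pcap_bytes, min_len=60):
    """Return (record_index, on_wire_length) for every record whose
    original length is below min_len bytes."""
    endian = PCAP_MAGICS.get(pcap_bytes[:4])
    if endian is None:
        raise ValueError("not a classic pcap file (pcapng is not handled)")
    offset = 24  # size of the global file header
    index = 0
    found = []
    while offset + 16 <= len(pcap_bytes):
        # Record header: ts_sec, ts_frac, captured length, original length.
        _sec, _frac, incl_len, orig_len = struct.unpack_from(
            endian + "IIII", pcap_bytes, offset)
        if orig_len < min_len:
            found.append((index, orig_len))
        offset += 16 + incl_len
        index += 1
    return found


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as fh:
        for idx, length in short_frames(fh.read()):
            print(f"record {idx}: {length} bytes")
```

Record indices are zero-based here, while Wireshark numbers packets from 1. A capture taken on the sending host usually shows outgoing frames before any padding, so short frames remain visible there.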
Comment 42 Priit O. 2021-01-05 17:30:46 UTC
It's kind of hard to find the correct packet because tcpreplay fails to deal with big packets (does not fragment them?) and just dies. Is there some flag or trick I can use to remedy that?

Warning: Unable to send packet: Error with PF_PACKET send() [119]: Message too long (errno = 90)
Actual: 118 packets (87206 bytes) sent in 0.009997 seconds
Rated: 8723216.9 Bps, 69.78 Mbps, 11803.54 pps
Flows: 19 flows, 1900.57 fps, 119 flow packets, 0 non-flow
Statistics for network device: huginn
	Successful packets:        118
	Failed packets:            1
	Truncated packets:         0
	Retried packets (ENOBUFS): 0
	Retried packets (EAGAIN):  0
Comment 43 WGH 2021-01-05 17:40:16 UTC
This might be caused by TSO/GSO.

(When TSO is active, the in-kernel TCP/IP stack implementation sends seemingly oversized TCP segments to the network card, which cuts it up to appropriately sized segments, offloading the segmentation to the network card instead of CPU. Note that it has nothing to do with IP fragmentation. As a side effect, tcpdump usually captures such impossibly large IP datagrams, even though they will appear normal on the wire.)

Try to temporarily disable TSO/GSO with ethtool, that should fix the capture.
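
A minimal sketch of doing that from a script, assuming ethtool is installed and the script runs as root. "ethtool -K <if> tso off gso off" is the underlying command; the Python wrapper and its function names are only illustrative:

```python
import subprocess


def offload_command(iface, enabled=False):
    """Build the ethtool command line that switches TSO and GSO on or off."""
    state = "on" if enabled else "off"
    return ["ethtool", "-K", iface, "tso", state, "gso", state]


def set_offloads(iface, enabled=False):
    """Run the command; raises CalledProcessError if ethtool fails."""
    subprocess.run(offload_command(iface, enabled), check=True)


if __name__ == "__main__":
    import sys
    # Interface name is passed on the command line, e.g. huginn.
    set_offloads(sys.argv[1], enabled=False)
```

Calling set_offloads(iface, enabled=True) after the capture restores both offloads.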
Comment 44 Priit O. 2021-01-05 18:01:28 UTC
Created attachment 294519 [details]
30 packets to cause crash

probably takes less than all those 30.
Comment 45 WGH 2021-01-05 18:41:05 UTC
Yup, I can also reproduce it with your pcap.
Comment 46 WGH 2021-01-05 19:02:21 UTC
The vendor r8125 driver hangs the same way. However, their driver doesn't even have a fix for my "distilled pcap" yet.

(After several rmmod/insmod cycles the r8169 driver no longer detects the link, but r8125 works fine. I think I'll need a reboot, lol)
Comment 47 Heiner Kallweit 2021-01-05 19:49:29 UTC
It looks weird that a packet of 1480 bytes is fragmented to 1472 + 8 bytes.
Maybe something with your MTU setting is incorrect. Try to reduce your MTU by 8 bytes.
This may still not explain the hw issue, but it may help to avoid triggering it.
Comment 48 Priit O. 2021-01-06 01:28:32 UTC
Not quite sure which one of us you are talking to, but my network MTU is 1500, the default. Normally the unfragmented traffic in Wireshark is also 1514 bytes "on wire" (technically it should be 1518: 14 header and 4 tail checksum; no idea why Wireshark doesn't show this):
Ethernet II: 14 bytes
IP header: 20 bytes
UDP header: 8 bytes
data or "payload": 1472 bytes.
(checksum: 4 bytes which wireshark doesn't count)

https://en.wikipedia.org/wiki/Ethernet_frame#Ethernet_II

1472(data) + 8(udp) + 20(ip) = 1500 MTU size exactly
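
The same arithmetic as a small worked example, assuming an IPv4 header without options and an untagged Ethernet II frame; the function name is made up for illustration:

```python
ETH_HEADER = 14   # destination MAC + source MAC + EtherType
ETH_FCS = 4       # frame check sequence, usually absent from captures
IPV4_HEADER = 20  # IPv4 header without options
UDP_HEADER = 8


def frame_sizes(udp_payload):
    """Sizes of the IP datagram and the Ethernet frame for one UDP payload."""
    ip_datagram = IPV4_HEADER + UDP_HEADER + udp_payload
    captured = ETH_HEADER + ip_datagram
    return {
        "ip_datagram": ip_datagram,        # compared against the MTU
        "captured_frame": captured,        # what a capture typically shows
        "frame_with_fcs": captured + ETH_FCS,
    }


if __name__ == "__main__":
    print(frame_sizes(1472))
```

For a 1472-byte payload this gives 1500, 1514 and 1518 bytes, matching the numbers above.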

I have no clear understanding why it sometimes starts fragmenting normal size packets to smaller pieces -- maybe because the other end asks for smaller (or bigger) chunks? Smaller makes sense if they are connected over PPPoE or VPN or something (those protocol encapsulations need to fit into the frame also) and instead of multiple fragmentations on the way they ask smaller size from the beginning somehow?

Not sure where you got this 1480 and what exactly you mean by it, but when the payload is 1480 (anything over 1472) for some reason, it has to be fragmented to 1472 byte chunks I think.
Comment 49 Priit O. 2021-01-06 16:49:36 UTC
So, will you reopen this bug or let it remain "fixed" and create a new one? :)
Technically it's a problem with the same hardware and same driver.
Comment 50 WGH 2021-01-06 16:57:57 UTC
Your pcap causes the same problem, and looks strikingly similar to mine, except 

1) UDP instead of TCP (inside the fragmented IP datagrams),
2) the minimum IP datagram size (of the last fragment) is 28 instead of 25,
3) the padding workaround implemented earlier doesn't fix it.

In my opinion, the underlying problem is likely the same, it's just the fix was incomplete. I'm reopening it.
Comment 51 Priit O. 2021-01-06 21:45:46 UTC
Thank you WGH.

Just to clarify one thing: the minimum Ethernet frame per IEEE 802.3 is 64 bytes/octets (14 header; 46 "payload"; 4 checksum). When the payload is smaller than 46 octets, padding should be added accordingly, per the specification.
The question is which component in the Linux network stack is responsible for that (I have no clue), but if it is supposed to be the driver, then adding padding is not a "workaround" but an actual "fix", and how it should have been done in the first place. :-)
Comment 52 Heiner Kallweit 2021-01-06 21:53:29 UTC
For chips handled by r8169 the hw is supposed to do the padding.
Comment 53 Heiner Kallweit 2021-01-07 15:28:24 UTC
(In reply to Priit O. from comment #48)
> Not sure where you got this 1480 and what exactly you mean by it, but when
> the payload is 1480 (anything over 1472) for some reason, it has to be
> fragmented to 1472 byte chunks I think.

The 1480 comes from the recorded pcap file. A 1480 byte UDP packet is fragmented to 1472 + 8 bytes. IP fragmentation is bad in general. The data load should be split on a higher level. So the question is why a higher level creates 1480-byte UDP packets instead of 1472-byte packets.
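
A sketch of how such a split comes about, under the assumption that the 1480 bytes are UDP payload (which matches the 28-byte last fragment mentioned in comment 50), with an MTU of 1500 and an IPv4 header without options. Every fragment except the last must carry a multiple of 8 payload bytes; the function names are hypothetical:

```python
ETH_HEADER = 14
ETH_ZLEN = 60     # minimum Ethernet frame length without FCS
IPV4_HEADER = 20  # IPv4 header without options
UDP_HEADER = 8


def fragment_ip_payload(ip_payload_len, mtu=1500):
    """Split an IPv4 payload into per-fragment payload sizes."""
    # Non-final fragments carry a multiple of 8 bytes.
    max_chunk = (mtu - IPV4_HEADER) // 8 * 8
    sizes = []
    remaining = ip_payload_len
    while remaining > max_chunk:
        sizes.append(max_chunk)
        remaining -= max_chunk
    sizes.append(remaining)
    return sizes


def frame_lengths(udp_payload_len, mtu=1500):
    """Ethernet frame length (without FCS) of each resulting fragment."""
    chunks = fragment_ip_payload(UDP_HEADER + udp_payload_len, mtu)
    return [ETH_HEADER + IPV4_HEADER + chunk for chunk in chunks]


if __name__ == "__main__":
    for length in frame_lengths(1480):
        note = "shorter than ETH_ZLEN" if length < ETH_ZLEN else "ok"
        print(length, note)
```

Under these assumptions the second fragment is a 28-byte IP datagram in a 42-byte frame, well below the 60-byte minimum, so the chip has to deal with a short frame just as in the original report.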
Comment 54 WGH 2021-01-07 17:22:54 UTC
It's hard to tell. libtorrent (which qBittorrent is based on) appears to have MTU discovery code built in its uTP implementation.

https://github.com/arvidn/libtorrent/blob/e3f2b016dcd37a9a6e8a94006c7befcf2cb7bfac/src/utp_stream.cpp#L1752

Perhaps the code is simply not written to avoid fragmentation at all costs, so datagrams will be occasionally fragmented before path MTU is probed. "Don't fragment" is set only on the probe datagrams, so other datagrams might still be fragmented.  The kernel might also already know that the effective path MTU to some specific host is lower (e.g. discovered by Path MTU Discovery earlier), which would cause fragmentation right away (as opposed to on some distant router).

Maybe the libtorrent developer had a reason to not enforce "don't fragment" on all datagrams, as ICMP is known to be occasionally blocked by some ISPs for no reason, causing broken PMTUD (hello, Fastly). TCP has a "clamp MSS" workaround for that scenario, but I don't think there's any for UDP.

Anyway, I believe this is tangential to the problem at hand.
Comment 55 xplo 2021-01-19 23:42:11 UTC
For information, I have the same bug:
[Tue Jan 19 22:24:00 2021] r8169 0000:01:00.0 enp1s0: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
[Tue Jan 19 22:24:00 2021] r8169 0000:01:00.0 enp1s0: rtl_rxtx_empty_cond_2 == 0 (loop: 42, delay: 100).
[Tue Jan 19 22:24:20 2021] r8169 0000:01:00.0 enp1s0: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
[Tue Jan 19 22:24:20 2021] r8169 0000:01:00.0 enp1s0: rtl_rxtx_empty_cond_2 == 0 (loop: 42, delay: 100).
[Tue Jan 19 22:24:34 2021] r8169 0000:01:00.0 enp1s0: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
[Tue Jan 19 22:24:34 2021] r8169 0000:01:00.0 enp1s0: rtl_rxtx_empty_cond_2 == 0 (loop: 42, delay: 100).
[Tue Jan 19 22:24:44 2021] r8169 0000:01:00.0 enp1s0: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
[Tue Jan 19 22:24:44 2021] r8169 0000:01:00.0 enp1s0: rtl_rxtx_empty_cond_2 == 0 (loop: 42, delay: 100).

When it happens the server is timing out for a few seconds.

It happens randomly when there is heavy traffic. I also use qBittorrent. But I have never seen this problem on older hardware that used the same connection and the same RJ45 cable.
I tried changing the RJ45 cable, but that didn't fix it.

Following this thread, I tried:
debian kernel 5.9
debian kernel 5.10
ubuntu kernel 5.10.8
ubuntu kernel 5.11.0-051100rc4-generic

I'll try the same setup on Windows 10 Pro (if I can do it with WSL2) to see whether it's hardware related rather than driver related.
Comment 56 Heiner Kallweit 2021-01-20 06:41:22 UTC
Please check whether the same issue happens with the r8125 vendor driver.
Comment 57 xplo 2021-01-20 06:54:32 UTC
The r8125 Realtek driver was worse, and it didn't log an error in dmesg. I could see enp1s0 going up multiple times in dmesg when the server was not responding.
It was also random under heavy traffic, but the timeouts were much longer, around 30 s to 2 min.
The ssh connection would just never recover, whereas with r8169 it would recover after a few seconds.
Comment 58 WGH 2021-01-21 00:58:57 UTC
Hey, Heiner, are you in touch with Realtek regarding the new pcap this time?
Comment 59 Heiner Kallweit 2021-01-21 06:49:48 UTC
I contacted Realtek, however have no feedback yet.
Comment 60 xplo 2021-01-21 09:39:29 UTC
I searched for similar problems and found out that it also happens with UDP connections on Windows 10:
https://forum-en.msi.com/index.php?threads/realtek-pcie-2-5gbe-family-controller-resetting.345823/page-4
It seems Realtek fixed it, but only in the Windows driver (2020/12/31 version):
https://www.realtek.com/en/component/zoo/category/network-interface-controllers-10-100-1000m-gigabit-ethernet-pci-express-software
All the posts strongly suggest that heavy UDP load leads the driver to crash on Windows. It seems to be exactly the same problem on Linux.
Comment 61 xplo 2021-01-22 09:38:13 UTC
Setting qbittorrent to TCP only and blocking UDP seems to have stabilized the connection for one day.
Anyway, for now I just use a USB gigabit Ethernet adapter and everything works fine.
Comment 62 Heiner Kallweit 2021-01-22 10:18:33 UTC
Created attachment 294809 [details]
Test patch v1
Comment 63 Heiner Kallweit 2021-01-22 10:21:41 UTC
Realtek provided a workaround proposal. Please test attached patch.
Comment 64 xplo 2021-01-22 11:28:09 UTC
I don't know how to do that. Is there a guide I could follow?
Comment 65 Heiner Kallweit 2021-01-22 12:12:46 UTC
Apply patch (patch -p1 -i <patch>) on top of linux-next. It should apply also on top of recent stable kernels, but maybe small adjustments are needed.
Then build the kernel as usual, procedure may be distro-dependent.
Comment 66 xplo 2021-01-22 17:15:29 UTC
I think I managed to install it. I compiled the new r8169.ko module and just copied it over the old one so I would not have to recompile the whole kernel (and probably fail doing it).

Thank you, it's been 1 hour without any timeout, so it looks good.
Comment 67 Heiner Kallweit 2021-01-22 17:35:12 UTC
Thanks a lot for testing! Few inquiries, just to be on the safe side:
- After installing the fresh r8169 module, you rmmod/modprobe'd it?
- You reversed the changes described in comment 61?
Comment 68 xplo 2021-01-22 18:35:41 UTC
I did not rmmod/modprobe it. I just rebooted. I saw that the modinfo r8169 output changed: it lost its signature part, so my guess is that it did work, but I have no idea.
So now I did modprobe -r r8169 (and found out the hard way that it takes effect immediately, as it's a headless server and I was connected through the Realtek :o )
modprobe -v r8169
and I rebooted.

Yes, I reverted what I did and turned UDP back on in qbittorrent.

I actually got the
[Fri Jan 22 18:47:35 2021] r8169 0000:01:00.0 enp1s0: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
[Fri Jan 22 18:47:35 2021] r8169 0000:01:00.0 enp1s0: rtl_rxtx_empty_cond_2 == 0 (loop: 42, delay: 100).

But that was before I did the modprobe actions, so I hope I was still on the old version.

I'm not good enough with Linux module compilation to know if what I did was the correct way to update it :/

Here is what I get with modinfo (no more signature part):
xplo@nas:~$ modinfo r8169
filename:       /lib/modules/5.10.9-051009-generic/kernel/drivers/net/ethernet/realtek/r8169.ko
firmware:       rtl_nic/rtl8125b-2.fw
firmware:       rtl_nic/rtl8125a-3.fw
firmware:       rtl_nic/rtl8107e-2.fw
firmware:       rtl_nic/rtl8107e-1.fw
firmware:       rtl_nic/rtl8168fp-3.fw
firmware:       rtl_nic/rtl8168h-2.fw
firmware:       rtl_nic/rtl8168h-1.fw
firmware:       rtl_nic/rtl8168g-3.fw
firmware:       rtl_nic/rtl8168g-2.fw
firmware:       rtl_nic/rtl8106e-2.fw
firmware:       rtl_nic/rtl8106e-1.fw
firmware:       rtl_nic/rtl8411-2.fw
firmware:       rtl_nic/rtl8411-1.fw
firmware:       rtl_nic/rtl8402-1.fw
firmware:       rtl_nic/rtl8168f-2.fw
firmware:       rtl_nic/rtl8168f-1.fw
firmware:       rtl_nic/rtl8105e-1.fw
firmware:       rtl_nic/rtl8168e-3.fw
firmware:       rtl_nic/rtl8168e-2.fw
firmware:       rtl_nic/rtl8168e-1.fw
firmware:       rtl_nic/rtl8168d-2.fw
firmware:       rtl_nic/rtl8168d-1.fw
license:        GPL
softdep:        pre: realtek
description:    RealTek RTL-8169 Gigabit Ethernet driver
author:         Realtek and the Linux r8169 crew <netdev@vger.kernel.org>
srcversion:     7022707DE347B6B940D4E30
alias:          pci:v000010ECd00003000sv*sd*bc*sc*i*
alias:          pci:v000010ECd00008125sv*sd*bc*sc*i*
alias:          pci:v00000001d00008168sv*sd00002410bc*sc*i*
alias:          pci:v00001737d00001032sv*sd00000024bc*sc*i*
alias:          pci:v000016ECd00000116sv*sd*bc*sc*i*
alias:          pci:v00001259d0000C107sv*sd*bc*sc*i*
alias:          pci:v00001186d00004302sv*sd*bc*sc*i*
alias:          pci:v00001186d00004300sv*sd*bc*sc*i*
alias:          pci:v00001186d00004300sv00001186sd00004B10bc*sc*i*
alias:          pci:v000010ECd00008169sv*sd*bc*sc*i*
alias:          pci:v000010FFd00008168sv*sd*bc*sc*i*
alias:          pci:v000010ECd00008168sv*sd*bc*sc*i*
alias:          pci:v000010ECd00008167sv*sd*bc*sc*i*
alias:          pci:v000010ECd00008161sv*sd*bc*sc*i*
alias:          pci:v000010ECd00008136sv*sd*bc*sc*i*
alias:          pci:v000010ECd00008129sv*sd*bc*sc*i*
alias:          pci:v000010ECd00002600sv*sd*bc*sc*i*
alias:          pci:v000010ECd00002502sv*sd*bc*sc*i*
depends:
retpoline:      Y
name:           r8169
vermagic:       5.10.9-051009-generic SMP mod_unload
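
One way to tell whether the running driver is the freshly built one is to compare the srcversion of the loaded module with the srcversion of the file on disk. This is only a sketch: the function name loaded_matches_disk and the SYSFS_MODULE_DIR override are hypothetical, and it assumes the kernel exposes /sys/module/<name>/srcversion for the module.

```shell
#!/bin/sh
# loaded_matches_disk MODULE
# Succeeds only if MODULE is loaded and its srcversion equals the
# srcversion of the .ko that modinfo finds for the running kernel.
loaded_matches_disk() {
    mod="$1"
    # SYSFS_MODULE_DIR is a hypothetical override, useful for testing.
    sysfs="${SYSFS_MODULE_DIR:-/sys/module}"
    loaded=$(cat "$sysfs/$mod/srcversion" 2>/dev/null) || return 1
    ondisk=$(modinfo -F srcversion "$mod" 2>/dev/null) || return 1
    [ -n "$loaded" ] && [ "$loaded" = "$ondisk" ]
}

# Usage:
#   loaded_matches_disk r8169 && echo "running module matches the file on disk"
```

A mismatch would suggest that the loaded module came from somewhere else, for example an older copy inside the initramfs.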
Comment 69 xplo 2021-01-22 18:54:37 UTC
Still happening:
[Fri Jan 22 19:42:30 2021] r8169 0000:01:00.0 enp1s0: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
[Fri Jan 22 19:42:30 2021] r8169 0000:01:00.0 enp1s0: rtl_rxtx_empty_cond_2 == 0 (loop: 42, delay: 100).

I just feel it's less frequent, but I'm not sure at all.
Comment 70 WGH 2021-01-22 19:26:08 UTC
I have applied the patch to 5.10.7 (the first hunk with the includes didn't apply cleanly, but it was easy enough to fix).

Neither my TCP distilled pcap nor fragmented UDP pcap by Priit causes any hangs, so it seems to work.

xplo, could you tcpreplay the attached pcaps on your machine with the patch applied? If they don't cause any hangs, could you try to create a small reproduction pcap the same way we did further up this discussion?
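
A rough sketch of such a replay test, assuming tcpreplay is installed and run as root on the affected machine. The helper name count_hang_messages, the interface name, and the pcap file name are placeholders.

```shell
#!/bin/sh
# count_hang_messages: read kernel log text on stdin and print how many
# lines report the rtl_rxtx_empty_cond timeout discussed in this bug.
# The follow-up rtl_rxtx_empty_cond_2 lines are deliberately not counted.
count_hang_messages() {
    grep -c 'rtl_rxtx_empty_cond == 0'
}

# Usage (interface and file names are placeholders):
#   before=$(dmesg | count_hang_messages)
#   tcpreplay --intf1=enp1s0 hang.pcap
#   after=$(dmesg | count_hang_messages)
#   [ "$after" -gt "$before" ] && echo "hang reproduced"
```

Comparing the count before and after the replay avoids being misled by messages left over from earlier hangs.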
Comment 71 Heiner Kallweit 2021-01-22 20:21:14 UTC
Created attachment 294813 [details]
Test patch v2

I attached a second patch version, also based on a proposal from Realtek.
Maybe the first version fixes most of the problem cases, but not all.
Comment 72 xplo 2021-01-22 20:56:26 UTC
Forget everything I said, I just failed at patching the driver :/
WGH, since I can reproduce the problem with Priit's UDP pcap, it just means I'm still on the old driver. I don't know how to patch / install a module correctly; each Google search gave me a different way.
Comment 73 xplo 2021-01-22 21:21:29 UTC
It seems that if I reboot after modprobe r8169, it falls back to the old driver. Sorry, I still have a lot to learn about the Linux kernel.

Without rebooting, the UDP pcap works fine.
So I won't reboot and will let it run to see if it times out again.
Comment 74 WGH 2021-01-22 21:23:24 UTC
Perhaps, you didn't rebuild your initramfs, and the old module is still in it?
Comment 75 Heiner Kallweit 2021-01-23 09:03:02 UTC
On Ubuntu "update-initramfs -u" seems to be the proper command.
Or you remove r8169 from the initramfs; it's not needed unless your rootfs is e.g. on NFS. On distros that use mkinitcpio this is typically done in /etc/mkinitcpio.conf.
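
To check whether an initramfs still carries a copy of the driver, its file listing can be searched for the module. A minimal sketch for Debian/Ubuntu, where lsinitramfs ships with initramfs-tools; the helper name has_module is hypothetical.

```shell
#!/bin/sh
# has_module NAME: read an initramfs file listing on stdin and succeed
# if it contains NAME.ko, optionally compressed (.ko.xz, .ko.zst, ...).
has_module() {
    grep -Eq "/$1\.ko(\.[a-z0-9]+)?\$"
}

# Usage on Debian/Ubuntu:
#   lsinitramfs "/boot/initrd.img-$(uname -r)" | has_module r8169 \
#       && echo "r8169 is in the initramfs; rerun update-initramfs -u after replacing the module"
```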
Comment 76 xplo 2021-01-23 11:07:23 UTC
I just won't reboot, as it's a home server.
So far so good.
I just got this in dmesg:
[Sat Jan 23 04:19:13 2021] TCP: enp1s0: Driver has suspect GRO implementation, TCP performance may be compromised.

But the smokeping service reports no packet loss all night.

On a side note, I don't really understand why Realtek doesn't fix their own driver. It's pretty much the only 2.5GbE Ethernet device, and it ships on nearly all 2020 motherboards o_o
Comment 77 Heiner Kallweit 2021-01-23 15:03:12 UTC
Sooner or later Realtek is going to come up with a fixed r8125 Linux driver, but apparently their focus is on the Windows driver, especially as their network chips are used primarily on consumer hardware.

Regarding the TCP warning: There have been a number of reports over the last years, with very different network drivers. But I didn't find a good explanation of what actually causes the error. One hint was that it can happen if you receive packets bigger than the MTU. Do you have any device in your network that may send jumbo packets?
Comment 78 xplo 2021-01-23 21:33:49 UTC
I don't know, but if it was a jumbo packet at night, it came from the internet (if that is even possible). And I did set the MTU to 9000.

24h so far on the new driver with Test patch v2. Nothing to report except that it's working perfectly. Thank you so much :)
Comment 79 Heiner Kallweit 2021-01-23 22:39:46 UTC
Good to hear that patch 2 works for you. It has slightly more performance impact than version 1, therefore if possible I'd go with version 1. But only if version 1 definitely fixes the issue. Could you run version 1 again for some time and check whether it's fine as well?
Comment 80 xplo 2021-01-24 00:35:12 UTC
Of course performance is important. I'm now testing patch 1.
Comment 81 xplo 2021-01-24 22:11:35 UTC
After one day, patch 1 looks good.
By the way, update-initramfs -u did work to keep the driver loaded after reboot, thx :)
Comment 82 Heiner Kallweit 2021-01-24 22:52:30 UTC
Great. Just to be sure: You did an update-initramfs after switching from patch v2 to v1?
Comment 83 xplo 2021-01-24 23:25:13 UTC
Yes, I did it after modprobe r8169 with the v1 patch.
Comment 84 xplo 2021-01-26 08:16:32 UTC
Still good after 3 days.
I don't see any problem, and it's far better than the Realtek driver or the current kernel driver.
Comment 85 Heiner Kallweit 2021-01-26 08:39:19 UTC
Great, thanks for the feedback. Then I'll submit patch v1.
Comment 86 Heiner Kallweit 2021-01-30 22:57:00 UTC
Should be fixed with 8d520b4de3ed ("r8169: work around RTL8125 UDP hw bug").
Comment 87 xplo 2021-01-31 20:47:41 UTC
Thank you.
How can we know in which kernel version it will be merged?
Comment 88 Heiner Kallweit 2021-01-31 21:27:29 UTC
It may take 1-2 weeks until it's applied to 5.11-rc and the stable kernel versions. Just check the change log of new kernel versions.
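
Besides reading the change logs, a local git clone of the kernel can answer this. The following is a sketch, assuming a clone of the mainline tree; the function name tags_with_commit and the repository path are placeholders.

```shell
#!/bin/sh
# tags_with_commit REPO COMMIT: list the tags in REPO whose history
# contains COMMIT.
tags_with_commit() {
    git -C "$1" tag --contains "$2"
}

# Usage in a clone of the mainline tree (path is a placeholder):
#   tags_with_commit ~/src/linux 8d520b4de3ed
# Stable backports get a new commit ID, so in a stable tree searching by
# subject may be more useful (branch name is an assumption about the clone):
#   git log --oneline --grep='r8169: work around RTL8125 UDP hw bug' v5.10..linux-5.10.y
```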
Comment 89 curtdept 2021-03-26 21:07:55 UTC
I am also hitting this pretty consistently after some time on an 8111H with all offloads turned off

03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
        Subsystem: Realtek Semiconductor Co., Ltd. Device 0123
        Flags: bus master, fast devsel, latency 0, IRQ 54, IOMMU group 11
        I/O ports at e000 [size=256]
        Memory at fe804000 (64-bit, non-prefetchable) [size=4K]
        Memory at fe800000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [70] Express Endpoint, MSI 01
        Capabilities: [b0] MSI-X: Enable+ Count=4 Masked-
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Virtual Channel
        Capabilities: [160] Device Serial Number 01-00-00-00-68-4c-e0-00
        Capabilities: [170] Latency Tolerance Reporting
        Capabilities: [178] L1 PM Substates
        Kernel driver in use: r8169
Comment 90 Heiner Kallweit 2021-03-26 21:19:07 UTC
Please attach a full dmesg incl. the error. Then:
- Any specific action how you can trigger it?
- Did you test the potential reasons / fixes discussed here?
- Is it a regression?
Comment 91 curtdept 2021-03-26 22:39:13 UTC
Created attachment 296079 [details]
attachment-20686-0.html

For sure. Sorry, I had already cleared it with a power cycle on a volatile file system image. When it comes around again I'll try to capture it.

-Curtis
Comment 92 curtdept 2021-04-08 20:17:49 UTC
No obvious triggers other than time: it seems to happen after a few days or so, sometimes after as little as a day. It causes the NIC to stop responding; a reboot always fixes it.

Managed to capture this.

[237837.769013] ------------[ cut here ]------------
[237837.774087] NETDEV WATCHDOG: eth1 (r8169): transmit queue 0 timed out
[237837.781113] WARNING: CPU: 6 PID: 0 at net/sched/sch_generic.c:442 dev_watchdog+0x1f6/0x200
[237837.789995] Modules linked in: xt_connlimit nf_conncount iptable_nat xt_state xt_nat xt_helper xt_conntrack xt_connmark xt_connbytes xt_REDIRECT xt_MASQUERADE xt_FLOWOFFLOAD xt_CT nf_nat nf_flow_table nf_conntrack ipt_REJECT xt_time xt_tcpudp xt_tcpmss xt_statistic xt_recent xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_ecn xt_dscp xt_comment xt_TCPMSS xt_LOG xt_HL xt_DSCP xt_CLASSIFY sch_cake rfcomm nf_reject_ipv4 nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 iptable_raw iptable_mangle iptable_filter ipt_ECN ip_tables hidp hci_uart btusb btintel bnep bluetooth_6lowpan bluetooth sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_tcindex cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred 6lowpan evdev xt_set ip_set_list_set ip_set_hash_netportnet ip_set_hash_netport ip_set_hash_netnet ip_set_hash_netiface ip_set_hash_net ip_set_hash_mac ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set_hash_ipmark ip_set_hash_ip ip_set_bitmap_port
[237837.790226]  ip_set_bitmap_ipmac ip_set_bitmap_ip ip_set nfnetlink nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 ifb vfat fat nls_utf8 nls_iso8859_1 nls_cp437 ecdh_generic ecc kpp button_hotplug
[237837.907427] CPU: 6 PID: 0 Comm: swapper/6 Not tainted 5.10.27 #0
[237837.913962] Hardware name: Maxtang FP30/FP30, BIOS FP30T306 06/19/2020
[237837.921049] RIP: 0010:dev_watchdog+0x1f6/0x200
[237837.925905] Code: 48 63 75 28 eb 91 4c 89 ef c6 05 0d 3c d1 00 01 e8 2f 09 fd ff 44 89 e1 4c 89 ee 48 c7 c7 d8 22 4d 82 48 89 c2 e8 b9 e1 10 00 <0f> 0b eb bc 66 0f 1f 44 00 00 49 89 f9 48 8d 87 40 01 00 00 31 c9
[237837.945635] RSP: 0018:ffffa128802b0ec0 EFLAGS: 00010282
[237837.951281] RAX: 0000000000000039 RBX: ffff949700934800 RCX: 0000000000000000
[237837.958946] RDX: ffff949818f9e2c8 RSI: ffff949818f973a0 RDI: 0000000000000300
[237837.966824] RBP: ffff949706eb8480 R08: 0000000000000000 R09: ffffa128802b0d10
[237837.974543] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
[237837.982175] R13: ffff949706eb8000 R14: ffff949706eb83dc R15: ffff949700934880
[237837.989942] FS:  0000000000000000(0000) GS:ffff949818f80000(0000) knlGS:0000000000000000
[237837.998721] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[237838.004914] CR2: 0000559c83b1c920 CR3: 0000000107a58000 CR4: 00000000003506e0
[237838.012603] Call Trace:
[237838.015336]  <IRQ>
[237838.017573]  ? dev_trans_start+0x70/0x70
[237838.021868]  ? dev_trans_start+0x70/0x70
[237838.026170]  call_timer_fn.constprop.0+0x11/0x70
[237838.031244]  expire_timers+0x94/0xc0
[237838.035152]  run_timer_softirq+0x21c/0x230
[237838.039597]  ? lapic_next_event+0x18/0x20
[237838.043957]  ? clockevents_program_event+0x8a/0xe0
[237838.049172]  ? hrtimer_interrupt+0x136/0x280
[237838.053830]  __do_softirq+0xb3/0x1de
[237838.057761]  asm_call_irq_on_stack+0x12/0x20
[237838.062423]  </IRQ>
[237838.064781]  do_softirq_own_stack+0x32/0x40
[237838.069331]  irq_exit_rcu+0x83/0xb0
[237838.072987]  sysvec_apic_timer_interrupt+0x2e/0x80
[237838.078254]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[237838.083785] RIP: 0010:cpuidle_enter_state+0xba/0x280
[237838.089209] Code: ac 0d 7c ff 31 ff 49 89 c5 e8 02 21 7c ff 41 83 e7 01 74 12 9c 58 f6 c4 02 0f 85 b6 01 00 00 31 ff e8 ba 57 80 ff fb 45 85 f6 <0f> 88 bd 00 00 00 4c 2b 2c 24 49 63 c6 48 6b c8 68 48 8d 14 40 48
[237838.109497] RSP: 0018:ffffa12880157eb0 EFLAGS: 00000202
[237838.115232] RAX: ffff949818fa0b40 RBX: ffff94970147a400 RCX: 000000000000001f
[237838.122917] RDX: 0000000000000000 RSI: 000000003d112d75 RDI: 0000000000000000
[237838.130642] RBP: ffffffff826c8980 R08: 0000d84febbc0f9b R09: 0000000000000018
[237838.138353] R10: 0000000000027c58 R11: ffff949818f9fde4 R12: 0000000000000002
[237838.146117] R13: 0000d84febbc0f9b R14: 0000000000000002 R15: 0000000000000000
[237838.153849]  cpuidle_enter+0x24/0x40
[237838.157780]  do_idle+0x196/0x1f0
[237838.161293]  cpu_startup_entry+0x14/0x20
[237838.165596]  secondary_startup_64_no_verify+0xc2/0xcb
[237838.171152] ---[ end trace 23fecbc400b1bc82 ]---
[237838.189098] r8169 0000:03:00.0 eth1: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
[237838.201581] r8169 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000c address=0xfffffffdf8d90270 flags=0x0018]
[237843.669099] r8169 0000:03:00.0 eth1: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
[237843.681078] r8169 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000c address=0xfffffffdf8880250 flags=0x0018]
[237848.788563] r8169 0000:03:00.0 eth1: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
[237848.830082] r8169 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000c address=0xfffffffdf8000040 flags=0x0018]
[237858.771434] r8169 0000:03:00.0 eth1: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
[237858.846681] r8169 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000c address=0xfffffffdf8d80200 flags=0x0008]
[237863.890870] r8169 0000:03:00.0 eth1: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
[237863.933365] r8169 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000c address=0xfffffffdf82e0240 flags=0x0018]
[237869.777828] r8169 0000:03:00.0 eth1: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
[237869.991911] r8169 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000c address=0xfffffffdf8c70200 flags=0x0008]
[237874.897662] r8169 0000:03:00.0 eth1: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
[237875.476296] r8169 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000c address=0xfffffffdf8000040 flags=0x0018]
Comment 93 Heiner Kallweit 2021-04-08 20:41:18 UTC
Typically RTL8168h runs fine with r8169, it can be found on a lot of consumer mainboards. The timeout warning provides no further details, "rtl_rxtx_empty_cond == 0" is caused by the chip not responding. Therefore it doesn't say anything about the root cause.

You could check whether it's a regression, IOW whether some previous kernel version is fine. Also you could check whether it works ok with the r8168 vendor driver from Realtek.
Comment 94 curtdept 2021-04-08 20:46:48 UTC
(In reply to Heiner Kallweit from comment #93)
> Typically RTL8168h runs fine with r8169, it can be found on a lot of
> consumer mainboards. The timeout warning provides no further details,
> "rtl_rxtx_empty_cond == 0" is caused by the chip not responding. Therefore
> it doesn't say anything about the root cause.
> 
> You could check whether it's a regression, IOW whether some previous kernel
> version is fine. Also you could check whether it works ok with the r8168
> vendor driver from Realtek.

Yes, I tried the vendor driver; stability on it is significantly worse. It does seem like a regression relative to the mid 5.4 series, but I haven't been able to pinpoint which version introduced it because the error takes so long to reproduce.
