Bug 207049

Summary: Realtek RTL8211E network card is unstable
Product: Drivers
Reporter: oyvinds
Component: Network
Assignee: drivers_network (drivers_network)
Status: RESOLVED CODE_FIX
Severity: normal
CC: hkallweit1, timo
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 5.6
Subsystem:
Regression: No
Bisected commit-id:
Attachments: Full dmesg of box with realtek issue

Description oyvinds 2020-04-01 00:03:59 UTC
1) Have a Realtek network card; dmesg reports: RTL8211E Gigabit Ethernet r8169-700:00: attached PHY driver [RTL8211E Gigabit Ethernet] (mii_bus:phy_addr=r8169-700:00, irq=IGNORE)
2) Use a network audio setup
3) Audio suddenly stops working and then resumes, because the Realtek network card driven by r8169 is unstable.

This happens with Linux 5.6. It is not a new problem in this kernel; there has always been some kind of problem with Realtek network cards.

Please explain what these kernel messages mean:

[150463.765847] ------------[ cut here ]------------
[150463.765858] NETDEV WATCHDOG: enp6s0 (r8169): transmit queue 0 timed out
[150463.765892] WARNING: CPU: 8 PID: 0 at net/sched/sch_generic.c:442 dev_watchdog+0x1e6/0x1f0
[150463.765896] Modules linked in: twofish_generic twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common loop btrfs blake2b_generic ufs hfsplus hfs minix msdos jfs drivetemp it87(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rfcomm xt_DSCP xt_length iptable_mangle nf_conntrack_irc nf_conntrack_sip iptable_raw xt_CT nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rt xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables bnep sunrpc xfs vfat fat wmi_bmof edac_mce_amd kvm_amd kvm irqbypass snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi pcspkr snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd joydev sp5100_tco i2c_piix4 bfq gpio_amdpt gpio_generic wmi acpi_cpufreq dm_crypt crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel r8169 ccp realtek pinctrl_amd fuse k10temp
[150463.765942] CPU: 8 PID: 0 Comm: swapper/8 Tainted: G           OE     5.6.0-Chaekyung #1
[150463.765944] Hardware name: System manufacturer System Product Name/PRIME B350M-A, BIOS 4207 12/07/2018
[150463.765947] RIP: 0010:dev_watchdog+0x1e6/0x1f0
[150463.765950] Code: 48 63 75 28 eb 91 4c 89 ef c6 05 91 f3 12 01 01 e8 af a8 fc ff 44 89 e1 4c 89 ee 48 c7 c7 60 89 a6 af 48 89 c2 e8 ad ab 56 ff <0f> 0b eb bc 66 0f 1f 44 00 00 49 89 f9 48 8d 87 40 01 00 00 31 c9
[150463.765951] RSP: 0018:ffff9e903e605eb0 EFLAGS: 00010282
[150463.765954] RAX: 000000000000003b RBX: ffff9e9031f26600 RCX: 0000000000000000
[150463.765955] RDX: ffff9e903e61d3a0 RSI: ffff9e903e618358 RDI: 0000000000000300
[150463.765956] RBP: ffff9e903ab10440 R08: 0000000000000001 R09: 00000000000005c6
[150463.765957] R10: 000000000001effc R11: 0000000000000003 R12: 0000000000000000
[150463.765958] R13: ffff9e903ab10000 R14: ffffffffafc05108 R15: ffffffffafc05100
[150463.765960] FS:  0000000000000000(0000) GS:ffff9e903e600000(0000) knlGS:0000000000000000
[150463.765962] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[150463.765963] CR2: 000055cf91960774 CR3: 00000007f75cc000 CR4: 00000000003406e0
[150463.765964] Call Trace:
[150463.765966]  <IRQ>
[150463.765971]  ? qdisc_put+0x40/0x40
[150463.765976]  call_timer_fn.constprop.0+0x11/0x70
[150463.765979]  expire_timers+0x7c/0xa0
[150463.765982]  run_timer_softirq+0xe4/0x250
[150463.765986]  ? __hrtimer_run_queues+0x117/0x1b0
[150463.765989]  ? sched_clock_cpu+0xc/0xa0
[150463.765993]  __do_softirq+0xcc/0x214
[150463.765997]  irq_exit+0x97/0xd0
[150463.766000]  smp_apic_timer_interrupt+0x5b/0x90
[150463.766003]  apic_timer_interrupt+0xf/0x20
[150463.766004]  </IRQ>
[150463.766009] RIP: 0010:cpuidle_enter_state+0x10d/0x210
[150463.766011] Code: 79 10 64 ff 31 ff 49 89 c6 e8 5f 26 64 ff 41 83 e7 01 74 12 9c 58 f6 c4 02 0f 85 e3 00 00 00 31 ff e8 d7 9c 69 ff fb 45 85 e4 <0f> 88 a4 00 00 00 49 63 d4 4c 89 f0 48 6b ca 68 4c 29 e8 48 6b f2
[150463.766012] RSP: 0018:ffff9e903b357eb8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[150463.766015] RAX: ffff9e903e620340 RBX: ffff9e903aaa3400 RCX: 000000000000001f
[150463.766016] RDX: 0000000000000000 RSI: 0000000025b7c298 RDI: 0000000000000000
[150463.766017] RBP: ffffffffafc68b00 R08: 000088d8935079e5 R09: 0000000000000018
[150463.766018] R10: 00000000000012a1 R11: 00000000000011ef R12: 0000000000000002
[150463.766019] R13: 000088d892f558dc R14: 000088d8935079e5 R15: 0000000000000000
[150463.766025]  ? cpuidle_enter_state+0xf1/0x210
[150463.766028]  cpuidle_enter+0x24/0x40
[150463.766031]  do_idle+0x190/0x200
[150463.766033]  cpu_startup_entry+0x14/0x20
[150463.766036]  secondary_startup_64+0xa4/0xb0
[150463.766040] ---[ end trace 3294f335a60d8db9 ]---
[151743.849874] r8169 0000:06:00.0 enp6s0: rtl_txcfg_empty_cond == 0 (loop: 666, delay: 100).

I have the technology, I can do more testing and provide more information.
Comment 1 Heiner Kallweit 2020-04-01 13:04:29 UTC
This is a generic tx timeout that can have many different causes. A few questions:

1. RTL8211E is just the integrated PHY. What is the actual network chip model? Ideally, post the dmesg line that includes the XID.

2. Is this a regression? What was the last known good kernel version?

3. You say you have always had problems with this Realtek chip version: which problems did you see in which kernel versions?

Please attach a full dmesg log.
Comment 2 oyvinds 2020-04-01 13:12:32 UTC
Created attachment 288143 [details]
Full dmesg of box with realtek issue

I was very wrong about the RTL chip in question, very sorry. The machine has:

[   26.415537] Generic FE-GE Realtek PHY r8169-600:00: attached PHY driver [Generic FE-GE Realtek PHY] (mii_bus:phy_addr=r8169-600:00, irq=IGNORE)
[   26.563620] RTL8211E Gigabit Ethernet r8169-700:00: attached PHY driver [RTL8211E Gigabit Ethernet] (mii_bus:phy_addr=r8169-700:00, irq=IGNORE)

I was using the ASUS motherboard integrated Realtek network chip when I got the r8169 0000:06:00.0 enp6s0: rtl_txcfg_empty_cond == 0 (loop: 666, delay: 100) error.

2&3: I have had intermittent problems with the integrated Realtek network card since forever, which is why I put a second (Realtek, sadly) card in the machine. I kind of forgot about that the last time I moved the machine and used the integrated one.

lspci identifies the cards as:
03:00.0 USB controller: ASMedia Technology Inc. ASM1143 USB 3.1 Host Controller (prog-if 30 [XHCI])
        Subsystem: ASUSTeK Computer Inc. Device 86f2
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 25
        Region 0: Memory at fe300000 (64-bit, non-prefetchable) [size=32K]
        Capabilities: <access denied>
        Kernel driver in use: xhci_hcd

06:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
        Subsystem: ASUSTeK Computer Inc. Device 8677
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 24
        Region 0: I/O ports at e000 [size=256]
        Region 2: Memory at fe204000 (64-bit, non-prefetchable) [size=4K]
        Region 4: Memory at fe200000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: <access denied>
        Kernel driver in use: r8169
        Kernel modules: r8169

The integrated 06:00.0 is the one which causes problems.
Comment 3 oyvinds 2020-04-01 13:16:57 UTC
> Best post dmesg line incl. XID.
I am not familiar with how to get dmesg output with the XID (what is an XID?). I can absolutely post more incriminating information about the machine if you tell me what command(s) are required to get the information. Are there any options I should add to dmesg to get the XID?
Comment 4 Heiner Kallweit 2020-04-01 13:25:08 UTC
The requested XID info is included in the attached dmesg log.
The affected chip is an RTL8168h, which is quite common on recent consumer mainboards. I'm not aware of any problem reports about this chip version, and if you say you have always had problems with this chip version, it may be a hardware defect or a BIOS bug. You could install the r8168 Realtek vendor driver and check whether the problem persists.
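
For readers who, like the reporter, are unsure where the XID comes from: the following is a minimal sketch, assuming the r8169 probe message in the kernel log contains the literal text "XID" followed by a hex value, as the request in Comment 1 implies. The function name `r8169_xid_lines` is made up for illustration.

```shell
# Print the r8169 log line(s) that carry the chip name and XID.
# Reads kernel log text on stdin, so it works on a saved dmesg file too.
r8169_xid_lines() {
    grep -E 'r8169 .*XID [0-9a-fA-F]+'
}

# Usage on a live system (may need root if dmesg access is restricted):
#   dmesg | r8169_xid_lines
# Usage on an attached log file (file name is a placeholder):
#   r8169_xid_lines < dmesg.txt
```

No special dmesg option should be needed; the line is printed once when the driver binds to the device, so it can be missing if the log buffer has already wrapped.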
Comment 5 timo 2020-04-01 18:46:13 UTC
Try to disable TCP Segmentation Offloading (using 'ethtool -K enp6s0 tso off').

If this solves it, you may have the same issue as I have. See bug 206969.
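
A minimal sketch of how the suggestion above could be checked before and after applying it, assuming the usual `ethtool -k` output layout in which the top-level feature line starts with `tcp-segmentation-offload:`. The helper name `tso_state` is hypothetical, and `enp6s0` is simply the interface name from this report.

```shell
# Interface name taken from the log in this report; adjust as needed.
IFACE=enp6s0

# Extract the TSO state from `ethtool -k` output read on stdin.
# Only the unindented top-level line is matched, not the tx-tcp-* sub-features.
tso_state() {
    awk -F': *' '/^tcp-segmentation-offload:/ { print $2 }'
}

# On a live system (changing settings needs root):
#   ethtool -k "$IFACE" | tso_state     # show the current state
#   ethtool -K "$IFACE" tso off         # disable TSO
#   ethtool -k "$IFACE" | tso_state     # confirm the change
```

A setting changed with `ethtool -K` does not survive a reboot or a driver reload, so it has to be reapplied by whatever network configuration mechanism the distribution uses.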
Comment 6 oyvinds 2020-04-02 13:01:41 UTC
I tried upgrading to the latest BIOS for this board (09/12/2019) and it did not help at all.

[22626.887129] ------------[ cut here ]------------
[22626.887133] NETDEV WATCHDOG: enp6s0 (r8169): transmit queue 0 timed out
[22626.887166] WARNING: CPU: 8 PID: 0 at net/sched/sch_generic.c:442 dev_watchdog+0x1e6/0x1f0
[22626.887166] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rfcomm xt_DSCP xt_length iptable_mangle nf_conntrack_irc nf_conntrack_sip iptable_raw xt_CT nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rt xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables bnep sunrpc xfs vfat fat wmi_bmof edac_mce_amd kvm_amd kvm irqbypass snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel pcspkr snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd joydev sp5100_tco i2c_piix4 bfq gpio_amdpt gpio_generic wmi acpi_cpufreq dm_crypt crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ccp r8169 realtek pinctrl_amd fuse it87(OE) k10temp
[22626.887208] CPU: 8 PID: 0 Comm: swapper/8 Tainted: G           OE     5.6.0-Chaekyung #1
[22626.887210] Hardware name: System manufacturer System Product Name/PRIME B350M-A, BIOS 5220 09/12/2019
[22626.887215] RIP: 0010:dev_watchdog+0x1e6/0x1f0
[22626.887219] Code: 48 63 75 28 eb 91 4c 89 ef c6 05 91 f3 12 01 01 e8 af a8 fc ff 44 89 e1 4c 89 ee 48 c7 c7 60 89 a6 90 48 89 c2 e8 ad ab 56 ff <0f> 0b eb bc 66 0f 1f 44 00 00 49 89 f9 48 8d 87 40 01 00 00 31 c9
[22626.887221] RSP: 0018:ffff8da87e205eb0 EFLAGS: 00010282
[22626.887223] RAX: 000000000000003b RBX: ffff8da87a285400 RCX: 0000000000000007
[22626.887225] RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffff8da87e218350
[22626.887227] RBP: ffff8da87a7e2440 R08: 0000000000000001 R09: 00000000000005b3
[22626.887228] R10: 000000000001e528 R11: 0000000000000003 R12: 0000000000000000
[22626.887229] R13: ffff8da87a7e2000 R14: ffffffff90c05108 R15: ffffffff90c05100
[22626.887232] FS:  0000000000000000(0000) GS:ffff8da87e200000(0000) knlGS:0000000000000000
[22626.887234] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[22626.887236] CR2: 00000f6b11af0140 CR3: 0000000798260000 CR4: 00000000003406e0
[22626.887237] Call Trace:
[22626.887240]  <IRQ>
[22626.887247]  ? qdisc_put+0x40/0x40
[22626.887252]  call_timer_fn.constprop.0+0x11/0x70
[22626.887256]  expire_timers+0x7c/0xa0
[22626.887259]  run_timer_softirq+0xe4/0x250
[22626.887264]  ? __hrtimer_run_queues+0x153/0x1b0
[22626.887268]  ? sched_clock_cpu+0xc/0xa0
[22626.887273]  __do_softirq+0xcc/0x214
[22626.887278]  irq_exit+0x97/0xd0
[22626.887281]  smp_apic_timer_interrupt+0x5b/0x90
[22626.887285]  apic_timer_interrupt+0xf/0x20
[22626.887288]  </IRQ>
[22626.887291] RIP: 0010:acpi_safe_halt+0x1f/0x30
[22626.887295] Code: fb c3 e9 14 e1 23 ff cc cc cc cc 65 48 8b 04 25 00 7d 01 00 48 8b 00 a8 08 74 01 c3 e9 07 00 00 00 0f 00 2d cd a5 4d 00 fb f4 <fa> c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 41 56 49 89 f6
[22626.887296] RSP: 0018:ffff8da87af5be70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[22626.887299] RAX: 0000000080004000 RBX: 0000000000000001 RCX: 000000000000001f
[22626.887300] RDX: 4ec4ec4ec4ec4ec5 RSI: ffffffff90c68b00 RDI: ffff8da87a62fc00
[22626.887302] RBP: ffff8da87a6f9400 R08: 000014943b8cc635 R09: 0000000000000018
[22626.887304] R10: 0000000000000243 R11: 00000000000000b8 R12: ffff8da87a6f9464
[22626.887305] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000
[22626.887315]  acpi_idle_enter+0x1dc/0x2b0
[22626.887319]  ? tick_nohz_get_sleep_length+0x66/0x90
[22626.887325]  cpuidle_enter_state+0xd3/0x210
[22626.887329]  cpuidle_enter+0x24/0x40
[22626.887332]  do_idle+0x190/0x200
[22626.887335]  cpu_startup_entry+0x14/0x20
[22626.887339]  secondary_startup_64+0xa4/0xb0
[22626.887343] ---[ end trace 2b3a3073fafbb8ba ]---


I will try using 

ethtool -K enp6s0 tso off

and see if it happens again. TCP Segmentation Offloading could well be the problem. I don't have any issues transferring large files at gigabit speeds on the LAN. The timeout tends to happen when I use software like qBittorrent to push 80-100 Mbit/s to the Internet, which means that the Realtek card in the machine is at less than 1/10th utilization.
Comment 7 timo 2020-04-04 08:33:57 UTC
Also in my case it doesn't happen under heavy load, but rather when uploading something to SharePoint or Google Drive at somewhere between 5 and 25 Mbps on a gigabit link.
Comment 8 oyvinds 2020-04-06 15:21:03 UTC
It has been 3 days since I started using 

ethtool -K enp6s0 tso off

and I have so far had zero issues. That does not make it certain that this fixes it (I could still get a problem next week), but it seems extremely likely that this does indeed fix the problem.

I am not sure whether I should leave this bug as NEW or mark it RESOLVED; it appears to be RESOLVED for me, but it is obviously not resolved for anyone with similar hardware who is not aware that they need to use ethtool to change that option in order to have a stable Realtek NIC.
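
One way to keep an eye on whether the workaround holds is to scan the kernel log for the two messages quoted earlier in this report. This is only a sketch: the function name `r8169_tx_problems` is made up, and the patterns are copied from the log lines above rather than from any documented message format.

```shell
# Print kernel log lines that match the two r8169 symptoms seen in this report:
# the NETDEV WATCHDOG tx timeout and the rtl_txcfg_empty_cond failure.
# Reads log text on stdin.
r8169_tx_problems() {
    grep -E 'NETDEV WATCHDOG: .*\(r8169\): transmit queue [0-9]+ timed out|rtl_txcfg_empty_cond == 0'
}

# Usage on a live system:
#   dmesg | r8169_tx_problems
# No output means neither message is in the current log buffer.
```

An empty result only covers what is still in the log buffer since the last boot, so it supports the "no issues so far" observation rather than proving the fix.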
Comment 9 Heiner Kallweit 2020-04-06 18:43:59 UTC
With the following commit, the default is changed back to SG/TSO (scatter-gather and TCP segmentation offload) being disabled.
It should show up in the stable kernels soon.

https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=95099c569a9fdbe186a27447dfa8a5a0562d4b7f

Based on that you can set the issue to resolved (even though we still don't know the root cause).
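
After moving to a kernel that contains the commit, the new defaults could be verified roughly as follows. This is a sketch under the assumption that `ethtool -k` prints unindented `scatter-gather:` and `tcp-segmentation-offload:` lines; the helper name `sg_tso_state` is hypothetical.

```shell
# Summarize the SG and TSO feature states from `ethtool -k` output on stdin,
# one "feature=state" pair per line.
sg_tso_state() {
    awk -F': *' '/^(scatter-gather|tcp-segmentation-offload):/ { print $1 "=" $2 }'
}

# On a live system:
#   ethtool -k enp6s0 | sg_tso_state
# Users whose hardware works fine with the offloads can opt back in (needs root):
#   ethtool -K enp6s0 sg on tso on
```

Since the root cause is still unknown, re-enabling the offloads on an affected system may bring the tx timeouts back.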