Bug 207049
Summary: | Realtek RTL8211E network card is unstable | ||
---|---|---|---|
Product: | Drivers | Reporter: | oyvinds |
Component: | Network | Assignee: | drivers_network (drivers_network) |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | hkallweit1, timo |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 5.6 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | Full dmesg of box with realtek issue |
Description
oyvinds
2020-04-01 00:03:59 UTC
This is a generic tx timeout that can be caused by anything. A few questions:
1. RTL8211E is just the integrated PHY. What is the actual network chip model? Best post dmesg line incl. XID.
2. Is this a regression? What was the last known good kernel version?
3. As you say you always had problems with this realtek chip version: In which kernel version did you have which problems?
Please attach a full dmesg log.
Created attachment 288143
Full dmesg of box with realtek issue
I was very wrong about the RTL chip in question, very sorry. The machine has:
[ 26.415537] Generic FE-GE Realtek PHY r8169-600:00: attached PHY driver [Generic FE-GE Realtek PHY] (mii_bus:phy_addr=r8169-600:00, irq=IGNORE)
[ 26.563620] RTL8211E Gigabit Ethernet r8169-700:00: attached PHY driver [RTL8211E Gigabit Ethernet] (mii_bus:phy_addr=r8169-700:00, irq=IGNORE)
I was using the ASUS motherboard integrated Realtek network chip when I got the r8169 0000:06:00.0 enp6s0: rtl_txcfg_empty_cond == 0 (loop: 666, delay: 100) error.
2&3: I have had intermittent problems with the integrated Realtek network card for as long as I can remember, which is why I put a second (Realtek, sadly) card in the machine. I forgot about that the last time I moved the machine, and used the integrated one.
lspci identifies the cards as:
03:00.0 USB controller: ASMedia Technology Inc. ASM1143 USB 3.1 Host Controller (prog-if 30 [XHCI])
Subsystem: ASUSTeK Computer Inc. Device 86f2
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 25
Region 0: Memory at fe300000 (64-bit, non-prefetchable) [size=32K]
Capabilities: <access denied>
Kernel driver in use: xhci_hcd
06:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
Subsystem: ASUSTeK Computer Inc. Device 8677
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 24
Region 0: I/O ports at e000 [size=256]
Region 2: Memory at fe204000 (64-bit, non-prefetchable) [size=4K]
Region 4: Memory at fe200000 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: r8169
Kernel modules: r8169
The integrated 06:00.0 is the one which causes problems.
> Best post dmesg line incl. XID.
I am not familiar with how to get dmesg with the XID (what is an XID?). I can absolutely post more incriminating information about the machine if you tell me what command(s) are required to get the information. Are there any options I should add to dmesg to get the XID?
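For readers with the same question, here is a minimal sketch of pulling the XID out of a saved dmesg log. It assumes the r8169 driver prints a line containing the literal token "XID" followed by a hexadecimal identifier; the exact layout of that line is an assumption and may differ between kernel versions. The function name `find_xid_lines` is made up for illustration.

```python
import re

# Assumption: the driver's probe line contains "XID <hex digits>".
XID_RE = re.compile(r"\bXID\s+([0-9a-fA-F]+)")


def find_xid_lines(dmesg_text):
    """Return (xid, line) pairs for every r8169 line that mentions an XID."""
    hits = []
    for line in dmesg_text.splitlines():
        # Only look at lines logged by the r8169 driver.
        if "r8169" not in line:
            continue
        match = XID_RE.search(line)
        if match:
            hits.append((match.group(1), line.strip()))
    return hits
```

It can be fed the text of an attached log or the captured output of `dmesg`; an empty result only means no r8169 line in that text matched the assumed pattern.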
The requested XID info is included in the attached dmesg log. The affected chip is an RTL8168h that is quite common on recent consumer mainboards. I'm not aware of any problem reports about this chip version, and if you say you always had problems with this chip version it may be some hardware defect or BIOS bug. You could install the r8168 Realtek vendor driver and check whether the problem persists.
Try to disable TCP Segmentation Offloading (using 'ethtool -K enp6s0 tso off'). If this solves it, you may have the same issue as I have. See bug 206969.
I tried upgrading to the latest BIOS for this board (09/12/2019) and it did not help at all.
[22626.887129] ------------[ cut here ]------------
[22626.887133] NETDEV WATCHDOG: enp6s0 (r8169): transmit queue 0 timed out
[22626.887166] WARNING: CPU: 8 PID: 0 at net/sched/sch_generic.c:442 dev_watchdog+0x1e6/0x1f0
[22626.887166] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rfcomm xt_DSCP xt_length iptable_mangle nf_conntrack_irc nf_conntrack_sip iptable_raw xt_CT nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rt xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables bnep sunrpc xfs vfat fat wmi_bmof edac_mce_amd kvm_amd kvm irqbypass snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel pcspkr snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd joydev sp5100_tco i2c_piix4 bfq gpio_amdpt gpio_generic wmi acpi_cpufreq dm_crypt crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ccp r8169 realtek pinctrl_amd fuse it87(OE) k10temp
[22626.887208] CPU: 8 PID: 0 Comm: swapper/8 Tainted: G OE 5.6.0-Chaekyung #1
[22626.887210] Hardware name: System manufacturer System Product Name/PRIME B350M-A, BIOS 5220 09/12/2019
[22626.887215] RIP: 0010:dev_watchdog+0x1e6/0x1f0
[22626.887219] Code: 48 63 75 28 eb 91 4c 89 ef c6 05 91 f3 12 01 01 e8 af a8 fc ff 44 89 e1 4c 89 ee 48 c7 c7 60 89 a6 90 48 89 c2 e8 ad ab 56 ff <0f> 0b eb bc 66 0f 1f 44 00 00 49 89 f9 48 8d 87 40 01 00 00 31 c9
[22626.887221] RSP: 0018:ffff8da87e205eb0 EFLAGS: 00010282
[22626.887223] RAX: 000000000000003b RBX: ffff8da87a285400 RCX: 0000000000000007
[22626.887225] RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffff8da87e218350
[22626.887227] RBP: ffff8da87a7e2440 R08: 0000000000000001 R09: 00000000000005b3
[22626.887228] R10: 000000000001e528 R11: 0000000000000003 R12: 0000000000000000
[22626.887229] R13: ffff8da87a7e2000 R14: ffffffff90c05108 R15: ffffffff90c05100
[22626.887232] FS: 0000000000000000(0000) GS:ffff8da87e200000(0000) knlGS:0000000000000000
[22626.887234] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[22626.887236] CR2: 00000f6b11af0140 CR3: 0000000798260000 CR4: 00000000003406e0
[22626.887237] Call Trace:
[22626.887240] <IRQ>
[22626.887247] ? qdisc_put+0x40/0x40
[22626.887252] call_timer_fn.constprop.0+0x11/0x70
[22626.887256] expire_timers+0x7c/0xa0
[22626.887259] run_timer_softirq+0xe4/0x250
[22626.887264] ? __hrtimer_run_queues+0x153/0x1b0
[22626.887268] ? sched_clock_cpu+0xc/0xa0
[22626.887273] __do_softirq+0xcc/0x214
[22626.887278] irq_exit+0x97/0xd0
[22626.887281] smp_apic_timer_interrupt+0x5b/0x90
[22626.887285] apic_timer_interrupt+0xf/0x20
[22626.887288] </IRQ>
[22626.887291] RIP: 0010:acpi_safe_halt+0x1f/0x30
[22626.887295] Code: fb c3 e9 14 e1 23 ff cc cc cc cc 65 48 8b 04 25 00 7d 01 00 48 8b 00 a8 08 74 01 c3 e9 07 00 00 00 0f 00 2d cd a5 4d 00 fb f4 <fa> c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 41 56 49 89 f6
[22626.887296] RSP: 0018:ffff8da87af5be70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[22626.887299] RAX: 0000000080004000 RBX: 0000000000000001 RCX: 000000000000001f
[22626.887300] RDX: 4ec4ec4ec4ec4ec5 RSI: ffffffff90c68b00 RDI: ffff8da87a62fc00
[22626.887302] RBP: ffff8da87a6f9400 R08: 000014943b8cc635 R09: 0000000000000018
[22626.887304] R10: 0000000000000243 R11: 00000000000000b8 R12: ffff8da87a6f9464
[22626.887305] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000
[22626.887315] acpi_idle_enter+0x1dc/0x2b0
[22626.887319] ? tick_nohz_get_sleep_length+0x66/0x90
[22626.887325] cpuidle_enter_state+0xd3/0x210
[22626.887329] cpuidle_enter+0x24/0x40
[22626.887332] do_idle+0x190/0x200
[22626.887335] cpu_startup_entry+0x14/0x20
[22626.887339] secondary_startup_64+0xa4/0xb0
[22626.887343] ---[ end trace 2b3a3073fafbb8ba ]---
I will try using ethtool -K enp6s0 tso off and see if it happens again. It could be TCP Segmentation Offloading that is the problem. I don't have any issues transferring large files at gigabit speeds on the LAN. This tends to happen if I use software like qBittorrent to push 80-100 Mbit/s to the Internet, which means that the Realtek card in the machine is at less than 1/10th utilization.
Also in my case it doesn't happen under heavy load, but rather when uploading something to SharePoint or Google Drive at somewhere between 5 and 25 Mbps on a gigabit link.
It has been 3 days since I started using ethtool -K enp6s0 tso off and I have so far had zero issues. That does not make it certain that this fixes it (I could get a problem next week), but it seems extremely likely that this does indeed fix the problem. I am not sure whether I should leave this bug as NEW or set it to RESOLVED; it appears to be RESOLVED for me, but it is obviously not resolved for anyone with similar hardware who is not aware that they need to use ethtool to change that option in order to have a stable Realtek NIC.
With the following commit the default is changed back to SG/TSO being disabled. It should show up in the stable kernels soon.
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=95099c569a9fdbe186a27447dfa8a5a0562d4b7f
Based on that you can set the issue to resolved (even though we still don't know the root cause).
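For anyone still on a kernel without that commit, the workaround discussed in this thread can be sketched as a small script. This is only an illustration, not part of the fix: it assumes `ethtool` is installed, that `ethtool -k <iface>` reports a line starting with `tcp-segmentation-offload:`, and that the script runs with enough privileges to change offload settings. The function names are made up for this example.

```python
import subprocess


def tso_enabled(ethtool_k_output):
    """Parse `ethtool -k <iface>` output.

    Returns True or False for the tcp-segmentation-offload feature,
    or None if the feature line is not present.
    """
    for line in ethtool_k_output.splitlines():
        key, sep, value = line.strip().partition(":")
        parts = value.split()  # e.g. ["on"], ["off"], ["off", "[fixed]"]
        if sep and key == "tcp-segmentation-offload" and parts:
            return parts[0] == "on"
    return None


def disable_tso(iface):
    """Apply the workaround from this thread: ethtool -K <iface> tso off."""
    subprocess.run(["ethtool", "-K", iface, "tso", "off"], check=True)


def ensure_tso_off(iface):
    """Query the current state and turn TSO off only if it is reported as on."""
    result = subprocess.run(
        ["ethtool", "-k", iface], capture_output=True, text=True, check=True
    )
    if tso_enabled(result.stdout):
        disable_tso(iface)
```

Calling `ensure_tso_off("enp6s0")` would match the interface name used above. The setting does not survive a reboot, so it would need to be reapplied at boot by whatever mechanism the distribution provides.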