Bug 216884 - NETDEV WATCHDOG: enp6s0 (r8169): transmit queue 0 timed out
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: Other Linux
Importance: P1 normal
Assignee: drivers_network@kernel-bugs.osdl.org
Reported: 2023-01-04 15:43 UTC by HougeLangley
Modified: 2023-08-01 13:52 UTC
Kernel Version: 6.2.0-rc2
Regression: No

Attachments
LoongArch 6.2.0-rc2-dmesg-full-log (77.09 KB, text/plain)
2023-01-04 15:43 UTC, HougeLangley

Description HougeLangley 2023-01-04 15:43:49 UTC
Created attachment 303523
LoongArch 6.2.0-rc2-dmesg-full-log

See the attachment for the full dmesg log.

[  185.824079] ------------[ cut here ]------------
[  185.825695] NETDEV WATCHDOG: enp6s0 (r8169): transmit queue 0 timed out
[  185.826826] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x2d8/0x2e0
[  185.827899] Modules linked in: rfkill tun acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler fuse dm_mirror dm_region_hash dm_log efivarfs
[  185.828992] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.2.0-rc2 #1
[  185.830059] Hardware name: LOONGSON LOONGSON T100 T2208A/Loongson-LS2C5LE, BIOS Loongson-UDK2018-V2.0.0-prebeta9 11/01/2022
[  185.831144] $ 0   : 0000000000000000 9000000000f5bba8 90000000016cc000 900000010004bdd0
[  185.832243] $ 4   : 000000000000003b 90000000017d6410 0000000000000000 900000010004bc50
[  185.833337] $ 8   : 0000000000000040 ffffffffffffffff 0000000000000002 0000000000000001
[  185.834433] $12   : 0000000000000102 0000000000000027 0000000000000102 00000000ffffdfff
[  185.835529] $16   : 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  185.836632] $20   : 0000000000000430 ffff800012008fa8 9000000000f5b8d0 900000010d0e0488
[  185.837730] $24   : 900000010d0e03dc 900000010d0e0000 0000000000000000 90000000016de000
[  185.838836] $28   : ffff800012008fa8 ffffffffffffffff ffff800012008ff0 900000010004be90
[  185.839948] era   : 9000000000f5bba8 dev_watchdog+0x2d8/0x2e0
[  185.841041] ra    : 9000000000f5bba8 dev_watchdog+0x2d8/0x2e0
[  185.842108] CSR crmd: 000000b0	
[  185.842110] CSR prmd: 00000004	
[  185.843162] CSR euen: 00000000	
[  185.844207] CSR ecfg: 00071c1d	
[  185.845255] CSR estat: 000c0000	
[  185.847340] ExcCode : c (SubCode 0)
[  185.848377] PrId  : 0014c011 (Loongson-64bit)
[  185.849414] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.2.0-rc2 #1
[  185.850455] Hardware name: LOONGSON LOONGSON T100 T2208A/Loongson-LS2C5LE, BIOS Loongson-UDK2018-V2.0.0-prebeta9 11/01/2022
[  185.851518] Stack : 0000000000000000 9000000000222be8 90000000016cc000 900000010004ba70
[  185.852594]         900000010004ba70 0000000000000000 0000000000000000 900000010004ba70
[  185.853665]         0000000000000040 ffffffffffffffff 0000000000000002 0000000000000001
[  185.854735]         900000010004ba70 6fff800010e28000 9000000100d4f000 00000000ffffdfff
[  185.855807]         0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  185.856884]         0000000000000430 6fff800010e28000 9000000000f5b8d0 0000000000000000
[  185.857948]         0000000000000000 9000000001597c58 0000000000000000 90000000016de000
[  185.859003]         ffff800012008fa8 ffffffffffffffff ffff800012008ff0 900000010004be90
[  185.860062]         0000000000000000 9000000000222c00 00007ffff34cb210 00000000000000b0
[  185.861127]         0000000000000004 0000000000000000 0000000000071c1d 00000000000c0000
[  185.862181]         ...
[  185.863215] Call Trace:
[  185.863218] [<9000000000222c00>] show_stack+0x4c/0x15c
[  185.865240] [<90000000011491f0>] dump_stack_lvl+0x60/0x88
[  185.866249] [<900000000112fd28>] __warn+0x84/0xc8
[  185.867248] [<9000000001104724>] report_bug+0xa8/0x150
[  185.868245] [<900000000114981c>] do_bp+0x2dc/0x33c
[  185.869238] [<9000000000221600>] __arch_cpu_idle+0x20/0x24

[  185.871185] ---[ end trace 0000000000000000 ]---
Comment 1 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-01-05 08:05:15 UTC
Is this new in 6.2 (e.g. did 6.1 work)?
Comment 2 HougeLangley 2023-01-05 09:23:00 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #1)
> Is this new in 6.2 (e.g. did 6.1 work)?

Yes, this is new in 6.2; 6.1 works.
Comment 3 Heiner Kallweit 2023-01-05 10:14:51 UTC
Then please bisect.
Comment 4 Heiner Kallweit 2023-01-05 10:21:18 UTC
You can also check whether the following makes a difference under 6.2-rc.

echo 0 > /sys/class/net/enp6s0/napi_defer_hard_irqs
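
For anyone who wants to script this check, here is a minimal Python sketch that records the previous value before writing the new one, so the setting can be restored afterwards. Only the sysfs attribute path comes from the command above; the helper name `set_napi_defer_hard_irqs` and the `sysfs_root` parameter are assumptions made for illustration, and writing to the real sysfs needs root.

```python
from pathlib import Path


def set_napi_defer_hard_irqs(iface, value=0, sysfs_root="/sys/class/net"):
    """Write `value` to <sysfs_root>/<iface>/napi_defer_hard_irqs.

    Hypothetical helper: returns the previous value so the caller can
    restore it once the test is done.
    """
    attr = Path(sysfs_root) / iface / "napi_defer_hard_irqs"
    previous = int(attr.read_text().strip())  # remember the old setting
    attr.write_text(f"{value}\n")             # same effect as `echo value > attr`
    return previous
```

Calling `set_napi_defer_hard_irqs("enp6s0", 0)` as root should be equivalent to the `echo` above; passing the returned value back in restores the earlier setting.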
Comment 5 HougeLangley 2023-01-05 14:59:06 UTC
(In reply to Heiner Kallweit from comment #4)
> You can also check whether the following makes a difference under 6.2-rc.
> 
> echo 0 > /sys/class/net/enp6s0/napi_defer_hard_irqs

Thanks, I will try tomorrow
Comment 6 HougeLangley 2023-01-06 13:54:56 UTC
(In reply to Heiner Kallweit from comment #4)
> You can also check whether the following makes a difference under 6.2-rc.
> 
> echo 0 > /sys/class/net/enp6s0/napi_defer_hard_irqs

I have tried it. It does not help.
Comment 7 Heiner Kallweit 2023-01-06 17:50:34 UTC
Apart from using software interrupt coalescing by default, there has been no relevant change to r8169 since 6.1. You can bisect to find the offending commit.

Something that hasn't changed since 6.1 but might be related:
I saw the following in your dmesg log:
r8169 0000:06:00.0: limiting MRRS to 256
This message comes from loongson_mrrs_quirk().

However, this quirk reduces the MRRS only at device enable time; it doesn't prevent the MRRS value from being increased again later.
That's what r8169 does in rtl_jumbo_config().
So the current quirk may not be sufficient.
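
To make the mechanism concrete: the MRRS lives in bits 14:12 of the PCIe Device Control register and is encoded as 128 << n bytes. The sketch below is a Python illustration only, not kernel code. It shows why a one-time reduction is not enough: a later unclamped write raises the field again, whereas clamping every update keeps it at the limit. The function names are hypothetical, the 256-byte limit is taken from the dmesg message above, and the 4096-byte request is just an example value.

```python
DEVCTL_READRQ_MASK = 0x7000   # Max_Read_Request_Size field, bits 14:12
DEVCTL_READRQ_SHIFT = 12
MRRS_LIMIT = 256              # bytes, from "limiting MRRS to 256"


def decode_mrrs(devctl):
    """Return the MRRS in bytes encoded in a Device Control value."""
    return 128 << ((devctl & DEVCTL_READRQ_MASK) >> DEVCTL_READRQ_SHIFT)


def encode_mrrs(devctl, mrrs):
    """Return devctl with its MRRS field set to `mrrs` bytes."""
    if mrrs < 128 or mrrs > 4096 or mrrs & (mrrs - 1):
        raise ValueError("MRRS must be a power of two in 128..4096")
    field = (mrrs.bit_length() - 8) << DEVCTL_READRQ_SHIFT  # 128 -> 0, 256 -> 1, ...
    return (devctl & ~DEVCTL_READRQ_MASK) | field


def set_mrrs_clamped(devctl, requested, limit=MRRS_LIMIT):
    """Apply a requested MRRS, but never exceed `limit`."""
    return encode_mrrs(devctl, min(requested, limit))


devctl = encode_mrrs(0x2810, MRRS_LIMIT)    # quirk at enable time: 512 -> 256
unclamped = encode_mrrs(devctl, 4096)       # later driver write undoes the quirk
clamped = set_mrrs_clamped(devctl, 4096)    # clamping every write keeps the limit
print(decode_mrrs(unclamped), decode_mrrs(clamped))
```

The design point is that the restriction has to be enforced wherever the MRRS is written, not only once when the device is enabled.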
Comment 8 HougeLangley 2023-01-07 07:30:09 UTC
(In reply to Heiner Kallweit from comment #7)
> Apart from using software interrupt coalescing per default there has been no
> relevant change to r8169 since 6.1. You can bisect to find the offending
> commit.
> 
> Something that hasn't changed since 6.1 but might be related:
> I saw the following in your dmesg log:
> r8169 0000:06:00.0: limiting MRRS to 256
> This message comes from loongson_mrrs_quirk().
> 
> However this quirk reduces the MRRS only at device enable time, it doesn't
> prevent that the device increases the MRRS value later.
> That's what r8169 does in rtl_jumbo_config().
> So the current quirk may not be sufficient.

Thanks, I will ask for the reason.
Comment 9 WANG Xuerui 2023-01-07 08:35:30 UTC
Hi, according to the LoongArch maintainer (Huacai), the issue was confirmed a long time ago (it was already present back when Loongson products were still MIPS), but the fix is still under review: https://lore.kernel.org/all/20230106095143.3158998-2-chenhuacai@loongson.cn/

(And it's exactly because the MRRS was increased after the device was enabled. Thanks for the analysis.)
Comment 10 HougeLangley 2023-01-07 08:37:11 UTC
(In reply to WANG Xuerui from comment #9)
> Hi, according to LoongArch maintainer (Huacai) the issue was confirmed long
> time ago (it was already present back when Loongson products were still
> MIPS), but the fix is still under review:
> https://lore.kernel.org/all/20230106095143.3158998-2-chenhuacai@loongson.cn/
> 
> (And it's exactly because the MRRS was increased after device was enabled.
> Thanks for the analysis.)

Thanks for your reply.
Comment 11 Bjorn Helgaas 2023-01-31 00:07:44 UTC
Can somebody explain why this behavior changed between v6.1 and v6.2?

I applied the MRRS restriction patch (https://lore.kernel.org/r/20230106095143.3158998-2-chenhuacai@loongson.cn), but if it's really a regression in v6.2, I can try to merge it earlier.
Comment 12 HougeLangley 2023-02-01 15:34:34 UTC
(In reply to Bjorn Helgaas from comment #11)
> Can somebody explain why this behavior changed between v6.1 and v6.2?
> 
> I applied the MRRS restriction patch
> (https://lore.kernel.org/r/20230106095143.3158998-2-chenhuacai@loongson.cn),
> but if it's really a regression in v6.2, I can try to merge it earlier.

I have tested these patches; they work on 6.2-rc6.

$ uname -a
Linux Gentoo-Loongson 6.2.0-rc6 #1 SMP PREEMPT Wed Feb  1 23:20:01 CST 2023 loongarch64 GNU/Linux
Comment 13 HougeLangley 2023-02-01 15:38:09 UTC
[    8.436684] RTL8211DN Gigabit Ethernet r8169-0-600:00: attached PHY driver (mii_bus:phy_addr=r8169-0-600:00, irq=MAC)
[    8.436939] ipmi_si IPI0001:00: IPMI message handler: Found new BMC (man_id: 0x000157, prod_id: 0x0000, dev_id: 0x23)
[    8.719772] r8169 0000:06:00.0 enp6s0: Link is Down
[    9.479422] ipmi_si IPI0001:00: IPMI kcs interface initialized
[   11.459350] r8169 0000:06:00.0 enp6s0: Link is Up - 1Gbps/Full - flow control rx/tx
[   11.459446] IPv6: ADDRCONF(NETDEV_CHANGE): enp6s0: link becomes ready
Comment 14 WANG Xuerui 2023-02-02 04:45:11 UTC
(In reply to Bjorn Helgaas from comment #11)
> Can somebody explain why this behavior changed between v6.1 and v6.2?
> 
> I applied the MRRS restriction patch
> (https://lore.kernel.org/r/20230106095143.3158998-2-chenhuacai@loongson.cn),
> but if it's really a regression in v6.2, I can try to merge it earlier.

Sorry for not replying immediately, but IMO this is technically not a regression: the hardware erratum has always been there and Linux behavior never changed in this regard. So I personally think it's fine to wait until the next merge window, which is close enough not to bother people. And thanks for your response!
Comment 15 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-02-03 09:23:06 UTC
(In reply to WANG Xuerui from comment #14)
> Sorry for not replying immediately, but IMO this is technically not a
> regression: the hardware errata is there forever and Linux behavior never
> changed in this regard.

Well, kinda, as the reporter in comment #2 clearly states that 6.1 has been working. So I assume some other change exposed the issue, which afaics leaves us with these options:

* apply the fix soon to resolve the issue, even if it's strictly just a side effect that the change fixes the regression reported here
* force the reporter to bisect, so we find the change that exposed it
* once reviewed and acked, merge the fix in the next merge window with a CC <stable.. to get it backported to 6.1.y -- which likely means that it will be backported right after 6.3-rc1 is out (which in the end might mean that the change gets less tested than merging it quickly now)
Comment 16 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-02-03 09:32:44 UTC
(In reply to HougeLangley from comment #2)

> Yes, this is 6.2, 6.1 is working

FWIW, in 
https://lore.kernel.org/all/CAAhV-H6L3V8M4igCWBH=PzuDcoH0KreWkfqHexQwB2v+2TSi=A@mail.gmail.com/
this was said:
```
Yes, this patch can fix that issue. But I don't think this is a
regression, vanila 6.1 kernel also has this problem, maybe the
reporter uses a patched 6.1 kernel.
```
Comment 17 timo 2023-02-28 00:11:15 UTC
I think I have the same issue on 6.1, with two different NICs. 

-----

First one (lan0 = Intel 82579V NIC):

[900148.768008] ------------[ cut here ]------------
[900148.768029] NETDEV WATCHDOG: lan0 (e1000e): transmit queue 0 timed out
[900148.768049] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x207/0x210
[900148.768059] Modules linked in: ppp_async ppp_generic slhc mptcp_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag ip_set_hash_net xt_set ip_set_hash_ip ip_set xt_REDIRECT xt_nat xt_TCPMSS xt_MASQUERADE xt_conntrack xt_tcpudp nft_compat nft_chain_nat nf_tables nfnetlink binfmt_misc snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio nls_ascii nls_cp437 intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp vfat fat coretemp kvm_intel kvm i915 joydev snd_hda_intel irqbypass snd_intel_dspcfg snd_intel_sdw_acpi drm_buddy snd_hda_codec drm_display_helper snd_hda_core ghash_clmulni_intel cryptd snd_hwdep hid_generic ppdev cec sha512_ssse3 mei_wdt mei_hdcp evdev rc_core sha512_generic usbhid ttm snd_pcm rapl intel_cstate iTCO_wdt intel_pmc_bxt iTCO_vendor_support hid intel_uncore mei_me at24 pcspkr snd_timer drm_kms_helper watchdog snd mei i2c_algo_bit soundcore parport_pc parport button sg tcp_bbr sch_fq wireguard libchacha20poly1305
[900148.768273]  chacha_x86_64 poly1305_x86_64 curve25519_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel nf_nat_pptp nf_conntrack_pptp nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c loop msr drivetemp fuse drm efi_pstore configfs efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic sd_mod t10_pi crc64_rocksoft crc64 crc_t10dif crct10dif_generic ahci libahci xhci_pci libata xhci_hcd crct10dif_pclmul crct10dif_common crc32_pclmul scsi_mod crc32c_intel e1000e i2c_i801 i2c_smbus ehci_pci ehci_hcd scsi_common lpc_ich usbcore r8169 realtek mdio_devres ptp pps_core libphy usb_common fan video wmi
[900148.768427] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.1.0-5-amd64 #1  Debian 6.1.12-1
[900148.768433] Hardware name:  /DB75EN, BIOS ENB7510H.86A.0046.2013.0704.1354 07/04/2013
[900148.768437] RIP: 0010:dev_watchdog+0x207/0x210
[900148.768443] Code: 00 e9 40 ff ff ff 48 89 df c6 05 9d ea 5d 01 01 e8 fe 7a f9 ff 44 89 e9 48 89 de 48 c7 c7 40 3e 1a b1 48 89 c2 e8 3f c2 1a 00 <0f> 0b e9 22 ff ff ff 66 90 0f 1f 44 00 00 55 53 48 89 fb 48 8b 6f
[900148.768448] RSP: 0018:ffffc3a9400e4e80 EFLAGS: 00010286
[900148.768454] RAX: 0000000000000000 RBX: ffffa0f7e215c000 RCX: 0000000000000000
[900148.768458] RDX: 0000000000000103 RSI: ffffffffb1133216 RDI: 00000000ffffffff
[900148.768462] RBP: ffffa0f7e215c488 R08: 0000000000000000 R09: ffffc3a9400e4d08
[900148.768465] R10: 0000000000000003 R11: ffffffffb1ad43a8 R12: ffffa0f7e215c3dc
[900148.768469] R13: 0000000000000000 R14: ffffffffb06255c0 R15: ffffa0f7e215c488
[900148.768473] FS:  0000000000000000(0000) GS:ffffa0f8d6500000(0000) knlGS:0000000000000000
[900148.768477] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[900148.768481] CR2: 000056024825e280 CR3: 00000001d0010003 CR4: 00000000000606e0
[900148.768486] Call Trace:
[900148.768491]  <IRQ>
[900148.768496]  ? pfifo_fast_reset+0x140/0x140
[900148.768504]  call_timer_fn+0x27/0x130
[900148.768511]  __run_timers+0x21c/0x2a0
[900148.768519]  run_timer_softirq+0x2b/0x50
[900148.768525]  __do_softirq+0xf0/0x2fe
[900148.768531]  __irq_exit_rcu+0xc7/0x130
[900148.768538]  sysvec_apic_timer_interrupt+0x9e/0xc0
[900148.768545]  </IRQ>
[900148.768548]  <TASK>
[900148.768551]  asm_sysvec_apic_timer_interrupt+0x16/0x20
[900148.768559] RIP: 0010:cpuidle_enter_state+0xde/0x420
[900148.768566] Code: 00 00 31 ff e8 03 2e 98 ff 45 84 ff 74 16 9c 58 0f 1f 40 00 f6 c4 02 0f 85 25 03 00 00 31 ff e8 08 e9 9e ff fb 0f 1f 44 00 00 <45> 85 f6 0f 88 85 01 00 00 49 63 d6 48 8d 04 52 48 8d 04 82 49 8d
[900148.768570] RSP: 0018:ffffc3a9400afe90 EFLAGS: 00000246
[900148.768575] RAX: ffffa0f8d6500000 RBX: ffffa0f8d653bd00 RCX: 0000000000000000
[900148.768579] RDX: 0000000000000001 RSI: ffffffffb1133216 RDI: ffffffffb110c815
[900148.768582] RBP: 0000000000000004 R08: 0000000000000004 R09: 000000002ac37c0f
[900148.768585] R10: 0000000000000008 R11: 0000000000008285 R12: ffffffffb1b9e6c0
[900148.768589] R13: 000332ae378fc1ed R14: 0000000000000004 R15: 0000000000000000
[900148.768596]  cpuidle_enter+0x29/0x40
[900148.768601]  do_idle+0x20c/0x2b0
[900148.768608]  cpu_startup_entry+0x19/0x20
[900148.768615]  start_secondary+0x11a/0x140
[900148.768621]  secondary_startup_64_no_verify+0xe5/0xeb
[900148.768630]  </TASK>
[900148.768633] ---[ end trace 0000000000000000 ]---
[900148.771441] e1000e 0000:00:19.0 lan0: Reset adapter unexpectedly
[900152.629952] e1000e 0000:00:19.0 lan0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

-----

Second one (lan1 = Realtek RTL8168 NIC):

[  405.046724] ------------[ cut here ]------------
[  405.046748] NETDEV WATCHDOG: lan1 (r8169): transmit queue 0 timed out
[  405.046769] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x207/0x210
[  405.046779] Modules linked in: ip_set_hash_net xt_set ip_set_hash_ip ip_set xt_REDIRECT xt_nat xt_TCPMSS xt_MASQUERADE xt_conntrack xt_tcpudp nft_compat nft_chain_nat nf_tables nfnetlink intel_rapl_msr intel_rapl_common binfmt_misc x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_realtek coretemp nls_ascii snd_hda_codec_generic ledtrig_audio nls_cp437 kvm_intel vfat fat kvm joydev irqbypass i915 snd_hda_intel snd_intel_dspcfg ghash_clmulni_intel snd_intel_sdw_acpi cryptd snd_hda_codec hid_generic drm_buddy sha512_ssse3 drm_display_helper mei_hdcp snd_hda_core usbhid mei_wdt snd_hwdep ppdev cec hid rc_core evdev sha512_generic snd_pcm rapl ttm intel_cstate iTCO_wdt intel_pmc_bxt mei_me iTCO_vendor_support intel_uncore at24 snd_timer mei pcspkr watchdog drm_kms_helper snd i2c_algo_bit soundcore parport_pc parport button sg tcp_bbr sch_fq wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 curve25519_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel nf_nat_pptp
[  405.046969]  nf_conntrack_pptp nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c loop msr drivetemp drm configfs fuse efi_pstore efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic sd_mod t10_pi crc64_rocksoft crc64 crc_t10dif crct10dif_generic ahci xhci_pci ehci_pci libahci xhci_hcd crct10dif_pclmul crct10dif_common e1000e r8169 ehci_hcd libata crc32_pclmul realtek mdio_devres usbcore scsi_mod crc32c_intel i2c_i801 i2c_smbus scsi_common lpc_ich ptp libphy pps_core usb_common fan video wmi
[  405.047090] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.1.0-5-amd64 #1  Debian 6.1.12-1
[  405.047095] Hardware name:  /DB75EN, BIOS ENB7510H.86A.0046.2013.0704.1354 07/04/2013
[  405.047099] RIP: 0010:dev_watchdog+0x207/0x210
[  405.047104] Code: 00 e9 40 ff ff ff 48 89 df c6 05 9d ea 5d 01 01 e8 fe 7a f9 ff 44 89 e9 48 89 de 48 c7 c7 40 3e 3a 96 48 89 c2 e8 3f c2 1a 00 <0f> 0b e9 22 ff ff ff 66 90 0f 1f 44 00 00 55 53 48 89 fb 48 8b 6f
[  405.047109] RSP: 0018:ffffb423c0003e80 EFLAGS: 00010286
[  405.047115] RAX: 0000000000000000 RBX: ffff98d28cb64000 RCX: 0000000000000000
[  405.047119] RDX: 0000000000000103 RSI: ffffffff96333216 RDI: 00000000ffffffff
[  405.047123] RBP: ffff98d28cb64488 R08: 0000000000000000 R09: ffffb423c0003d08
[  405.047126] R10: 0000000000000003 R11: ffffffff96cd43a8 R12: ffff98d28cb643dc
[  405.047130] R13: 0000000000000000 R14: ffffffff958255c0 R15: ffff98d28cb64488
[  405.047133] FS:  0000000000000000(0000) GS:ffff98d396400000(0000) knlGS:0000000000000000
[  405.047138] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  405.047141] CR2: 000055e2f7997c00 CR3: 0000000102810004 CR4: 00000000000606f0
[  405.047146] Call Trace:
[  405.047151]  <IRQ>
[  405.047156]  ? pfifo_fast_reset+0x140/0x140
[  405.047161]  call_timer_fn+0x27/0x130
[  405.047168]  __run_timers+0x21c/0x2a0
[  405.047176]  run_timer_softirq+0x2b/0x50
[  405.047181]  __do_softirq+0xf0/0x2fe
[  405.047188]  __irq_exit_rcu+0xc7/0x130
[  405.047195]  sysvec_apic_timer_interrupt+0x9e/0xc0
[  405.047202]  </IRQ>
[  405.047205]  <TASK>
[  405.047208]  asm_sysvec_apic_timer_interrupt+0x16/0x20
[  405.047216] RIP: 0010:cpuidle_enter_state+0xde/0x420
[  405.047222] Code: 00 00 31 ff e8 03 2e 98 ff 45 84 ff 74 16 9c 58 0f 1f 40 00 f6 c4 02 0f 85 25 03 00 00 31 ff e8 08 e9 9e ff fb 0f 1f 44 00 00 <45> 85 f6 0f 88 85 01 00 00 49 63 d6 48 8d 04 52 48 8d 04 82 49 8d
[  405.047226] RSP: 0018:ffffffff96c03e48 EFLAGS: 00000246
[  405.047232] RAX: ffff98d396400000 RBX: ffff98d39643bd00 RCX: 0000000000000000
[  405.047235] RDX: 0000000000000000 RSI: ffffffff96333216 RDI: ffffffff9630c815
[  405.047239] RBP: 0000000000000004 R08: 0000000000000002 R09: 000000002ac37c0f
[  405.047242] R10: 0000000000000018 R11: 0000000000008087 R12: ffffffff96d9e6c0
[  405.047245] R13: 0000005e4eaa57cd R14: 0000000000000004 R15: 0000000000000000
[  405.047252]  ? cpuidle_enter_state+0xbd/0x420
[  405.047258]  cpuidle_enter+0x29/0x40
[  405.047263]  do_idle+0x20c/0x2b0
[  405.047270]  cpu_startup_entry+0x19/0x20
[  405.047275]  rest_init+0xcb/0xd0
[  405.047281]  arch_call_rest_init+0xa/0x14
[  405.047290]  start_kernel+0x6fe/0x727
[  405.047297]  secondary_startup_64_no_verify+0xe5/0xeb
[  405.047306]  </TASK>
[  405.047308] ---[ end trace 0000000000000000 ]---
Comment 18 Heiner Kallweit 2023-02-28 05:26:57 UTC
(In reply to timo from comment #17)
> I think I have the same issue on 6.1, with two different NICs. 
> 
I don't think so. This bug report is about the Loongson architecture.
You have an old system and presumably some ASPM incompatibility.
Comment 19 Bjorn Helgaas 2023-03-10 16:59:52 UTC
Based on comment #9, I assume the problem HougeLangley reported is related to the Loongson MRRS restrictions.  There are several commits related to that:

  https://git.kernel.org/linus/1f58cca5cf2b ("PCI: Add Loongson PCI Controller support")
  https://git.kernel.org/linus/8b3517f88ff2 ("PCI: loongson: Prevent LS7A MRRS increases")
  https://git.kernel.org/linus/c768f8c5f40f ("PCI: loongson: Add more devices that need MRRS quirk")

1f58cca5cf2b appeared in v5.8.  The others were recently merged in v6.3-rc1, so as far as I know, the issue on Loongson *should* now be resolved, and we can close this unless HougeLangley says the problem still happens in v6.3-rc1 or later.

I assume the problem timo is seeing is different.  Timo, can you open a new report (either bugzilla or, if you suspect a network issue, directly on netdev@vger.kernel.org and cc: the relevant NIC maintainers)?

If you suspect a PCI issue, e.g., ASPM, you can try to verify that by booting with "pcie_aspm=off" or by disabling ASPM with the sysfs interface: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/ABI/testing/sysfs-bus-pci?id=v6.2#n420
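
As a rough illustration of the sysfs route, the Python sketch below writes 0 to the per-device ASPM attributes under the device's `link/` directory. The attribute names (`l0s_aspm`, `l1_aspm`, `l1_1_aspm`, `l1_2_aspm`) are the ones described in the linked ABI document; the helper name `disable_aspm` and the `sysfs_root` parameter are assumptions for illustration. It needs root, and attributes the device does not expose are simply skipped.

```python
from pathlib import Path

# ASPM state attributes documented in Documentation/ABI/testing/sysfs-bus-pci
ASPM_ATTRS = ("l0s_aspm", "l1_aspm", "l1_1_aspm", "l1_2_aspm")


def disable_aspm(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Write 0 to every ASPM attribute the device exposes.

    Hypothetical helper: `bdf` is the PCI address, e.g. "0000:02:00.0".
    Returns the names of the attributes that were changed.
    """
    link_dir = Path(sysfs_root) / bdf / "link"
    changed = []
    for name in ASPM_ATTRS:
        attr = link_dir / name
        if attr.exists():          # only supported states are present
            attr.write_text("0\n")
            changed.append(name)
    return changed
```

If the timeouts disappear after `disable_aspm("<device address>")` or after booting with `pcie_aspm=off`, that points toward an ASPM problem rather than a driver bug.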
Comment 20 timo 2023-03-12 15:25:15 UTC
Sorry for the confusion; my reply was clearly posted in the wrong thread.

In my case, the errors happened when I connected a wireless router (TP-Link Archer C9) to these NICs of my server. Initially, I thought the error was in the Linux kernel of the server. However, I recently replaced the wireless router with a different model, and since then I don't see the 'transmit queue 0 timed out' errors anymore. Not sure what happened; maybe the old device was sending repeated pause frames? Anyhow, it seems to be solved.
Comment 21 medv 2023-08-01 12:12:36 UTC
Some notes.
The NIC is:
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller (rev 05)

The CPU is an AMD Ryzen 7 7735HS with Radeon Graphics.

How to reproduce: after boot, we configure enp2s0 with an IP address by hand (so the interface is up but not yet linked with TP; ts=43.329641). Then we connect TP and get the watchdog error. The interface works: pinging and data transfer are OK. Then (at the bottom of the log) we disconnect TP and get Link is Down. We repeat this three times (up/down) and everything works fine. So the error occurs only once, at the first link-up event. It looks like something is not in the correct state at start, or some structure is not initialized as the hardware requires.


[   43.117152] RTL8226B_RTL8221B 2.5Gbps PHY r8169-0-200:00: attached PHY driver (mii_bus:phy_addr=r8169-0-200:00, irq=MAC)
[   43.329641] r8169 0000:02:00.0 enp2s0: Link is Down
[   60.382329] r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control rx/tx
[   60.382355] IPv6: ADDRCONF(NETDEV_CHANGE): enp2s0: link becomes ready
[   65.857255] ------------[ cut here ]------------
[   65.857279] NETDEV WATCHDOG: enp2s0 (r8169): transmit queue 0 timed out
[   65.857297] WARNING: CPU: 6 PID: 0 at net/sched/sch_generic.c:477 dev_watchdog+0x277/0x280
[   65.857311] Modules linked in: rfcomm cmac algif_hash algif_skcipher af_alg bnep zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) intel_rapl_msr snd_hda_codec_generic intel_rapl_common ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi iwlmvm snd_hda_codec edac_mce_amd snd_hda_core snd_soc_dmic snd_soc_acp6x_mach snd_acp6x_pdm_dma kvm_amd mac80211 snd_hwdep btusb snd_soc_core kvm snd_seq_midi libarc4 snd_seq_midi_event snd_compress ac97_bus snd_rawmidi snd_pcm_dmaengine snd_pci_acp6x btrtl snd_seq btbcm nls_iso8859_1 btintel rapl iwlwifi wmi_bmof joydev snd_pcm input_leds snd_seq_device bluetooth snd_timer k10temp ecdh_generic snd_pci_acp5x ecc cfg80211 snd snd_rn_pci_acp3x snd_pci_acp3x ccp soundcore mac_hid acpi_tad sch_fq_codel msr parport_pc ppdev lp parport ramoops pstore_blk reed_solomon pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c r8153_ecm
[   65.857539]  cdc_ether usbnet r8152 mii dm_mirror dm_region_hash dm_log hid_generic usbhid hid amdgpu iommu_v2 gpu_sched i2c_algo_bit drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt crct10dif_pclmul fb_sys_fops crc32_pclmul ghash_clmulni_intel cec nvme r8169 aesni_intel rc_core xhci_pci crypto_simd drm cryptd thunderbolt i2c_piix4 xhci_pci_renesas nvme_core realtek wmi video
[   65.857642] CPU: 6 PID: 0 Comm: swapper/6 Tainted: P           O      5.15.0-56-generic #62-Ubuntu
[   65.857648] Hardware name: AZW SER/SER, BIOS 113 02/21/2023
[   65.857651] RIP: 0010:dev_watchdog+0x277/0x280
[   65.857658] Code: eb 97 48 8b 5d d0 c6 05 d7 23 69 01 01 48 89 df e8 2e 67 f9 ff 44 89 e1 48 89 de 48 c7 c7 28 5a 0d 88 48 89 c2 e8 a0 d0 19 00 <0f> 0b eb 80 e9 d3 39 23 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56
[   65.857663] RSP: 0018:ffffac1000338e70 EFLAGS: 00010282
[   65.857669] RAX: 0000000000000000 RBX: ffff935c9d068000 RCX: 0000000000000000
[   65.857673] RDX: ffff9362a1fac240 RSI: ffff9362a1fa0580 RDI: 0000000000000300
[   65.857676] RBP: ffffac1000338ea8 R08: 0000000000000003 R09: fffffffffffd56d0
[   65.857679] R10: 0000000000ffff0a R11: 0000000000000001 R12: 0000000000000000
[   65.857682] R13: ffff935c9d063680 R14: 0000000000000001 R15: ffff935c9d0684c0
[   65.857685] FS:  0000000000000000(0000) GS:ffff9362a1f80000(0000) knlGS:0000000000000000
[   65.857689] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   65.857693] CR2: 00007f37948f7ae0 CR3: 0000000399610000 CR4: 0000000000750ee0
[   65.857697] PKRU: 55555554
[   65.857699] Call Trace:
[   65.857703]  <IRQ>
[   65.857708]  ? pfifo_fast_enqueue+0x160/0x160
[   65.857715]  call_timer_fn+0x2c/0x120
[   65.857722]  __run_timers.part.0+0x1e3/0x270
[   65.857726]  ? ktime_get+0x46/0xc0
[   65.857732]  ? native_x2apic_icr_read+0x20/0x20
[   65.857739]  ? lapic_next_event+0x20/0x30
[   65.857745]  ? clockevents_program_event+0xad/0x130
[   65.857752]  run_timer_softirq+0x2a/0x60
[   65.857757]  __do_softirq+0xd9/0x2e7
[   65.857764]  irq_exit_rcu+0x94/0xc0
[   65.857773]  sysvec_apic_timer_interrupt+0x80/0x90
[   65.857779]  </IRQ>
[   65.857782]  <TASK>
[   65.857785]  asm_sysvec_apic_timer_interrupt+0x1b/0x20
[   65.857789] RIP: 0010:cpuidle_enter_state+0xd9/0x620
[   65.857797] Code: 3d 44 d8 b9 78 e8 d7 c2 68 ff 49 89 c7 0f 1f 44 00 00 31 ff e8 18 d0 68 ff 80 7d d0 00 0f 85 61 01 00 00 fb 66 0f 1f 44 00 00 <45> 85 f6 0f 88 6d 01 00 00 4d 63 ee 49 83 fd 09 0f 87 e7 03 00 00
[   65.857801] RSP: 0018:ffffac10001b7e28 EFLAGS: 00000246
[   65.857806] RAX: ffff9362a1fb0b80 RBX: ffff935c84fe4800 RCX: 0000000000000000
[   65.857809] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
[   65.857811] RBP: ffffac10001b7e78 R08: 0000000f5565c7c4 R09: 0000000000000000
[   65.857815] R10: 0000000000000000 R11: 071c71c71c71c71c R12: ffffffff88ae6aa0
[   65.857818] R13: 0000000000000003 R14: 0000000000000003 R15: 0000000f5565c7c4
[   65.857824]  ? cpuidle_enter_state+0xc8/0x620
[   65.857829]  ? tick_nohz_stop_tick+0x16a/0x1d0
[   65.857833]  cpuidle_enter+0x2e/0x50
[   65.857839]  cpuidle_idle_call+0x142/0x1e0
[   65.857845]  do_idle+0x83/0xf0
[   65.857849]  cpu_startup_entry+0x20/0x30
[   65.857854]  start_secondary+0x12a/0x180
[   65.857858]  secondary_startup_64_no_verify+0xc2/0xcb
[   65.857866]  </TASK>
[   65.857869] ---[ end trace 2f006d71b249a325 ]---



[  162.848076] r8169 0000:02:00.0 enp2s0: Link is Down
[  168.385108] r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control rx/tx
[  222.525513] r8169 0000:02:00.0 enp2s0: Link is Down
[  230.368124] r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control rx/tx
[  232.189332] r8169 0000:02:00.0 enp2s0: Link is Down
[  235.466225] r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control rx/tx
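
The timing in a log like this can be checked with a small script. This is a minimal Python sketch, assuming dmesg lines in the `[ seconds] message` format shown in this thread; the function name `watchdog_delays` and the regular expression are illustrative choices, not an existing tool.

```python
import re

# "[   65.857279] message" -> timestamp in seconds, then the message text
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def watchdog_delays(lines):
    """Return, for each NETDEV WATCHDOG line, the seconds elapsed since
    the most recent 'Link is Up' line before it."""
    last_up = None
    delays = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        ts, msg = float(m.group(1)), m.group(2)
        if "Link is Up" in msg:
            last_up = ts
        elif "NETDEV WATCHDOG" in msg and last_up is not None:
            delays.append(round(ts - last_up, 6))
    return delays


sample = [
    "[   60.382329] r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control rx/tx",
    "[   65.857279] NETDEV WATCHDOG: enp2s0 (r8169): transmit queue 0 timed out",
    "[  168.385108] r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control rx/tx",
]
print(watchdog_delays(sample))
```

On the lines quoted from this comment, the watchdog fires roughly 5.5 seconds after the first link-up, and no later link-up is followed by a timeout.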
Comment 22 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-08-01 12:29:33 UTC
(In reply to medv from comment #21)
> Some notes.

Thx for your report, but afaics it won't lead to anything, as your notes tick at least three important boxes on this list: https://linux-regtracking.leemhuis.info/post/frequent-reasons-why-linux-kernel-bug-reports-are-ignored/

* you reported it as a comment to an existing bug that is likely different
* you use a heavily patched vendor kernel
* you use out-of tree modules

Please report it to your vendor, or try to reproduce with a really fresh vanilla kernel and, if you can, report it in a new bug. thx!
Comment 23 medv 2023-08-01 13:52:05 UTC
Notes on the message above.
That log was from the distro kernel (5.15.0.*).
The same happens with a vanilla kernel (unpatched, just compiled from kernel.org sources):

[   10.587721] Bluetooth: RFCOMM ver 1.11
[   53.203942] RTL8226B_RTL8221B 2.5Gbps PHY r8169-0-200:00: attached PHY driver (mii_bus:phy_addr=r8169-0-200:00, irq=MAC)
[   53.384456] r8169 0000:02:00.0 enp2s0: Link is Down
[   66.222707] r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control rx/tx
[   66.222737] IPv6: ADDRCONF(NETDEV_CHANGE): enp2s0: link becomes ready
[   72.000674] ------------[ cut here ]------------
[   72.000693] NETDEV WATCHDOG: enp2s0 (r8169): transmit queue 0 timed out 5768 ms
[   72.000713] WARNING: CPU: 14 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x238/0x240
[   72.000730] Modules linked in: rfcomm(E) cmac(E) algif_hash(E) algif_skcipher(E) af_alg(E) bnep(E) iwlmvm(E) mac80211(E) snd_hda_codec_conexant(E) snd_hda_codec_generic(E) ledtrig_audio(E) snd_hda_codec_hdmi(E) snd_hda_intel(E) snd_intel_dspcfg(E) libarc4(E) snd_intel_sdw_acpi(E) snd_hda_codec(E) snd_acp6x_pdm_dma(E) snd_soc_dmic(E) intel_rapl_msr(E) snd_soc_acp6x_mach(E) intel_rapl_common(E) snd_soc_core(E) snd_hda_core(E) snd_hwdep(E) iwlwifi(E) edac_mce_amd(E) btusb(E) snd_compress(E) snd_seq_midi(E) ac97_bus(E) snd_seq_midi_event(E) btrtl(E) snd_pcm_dmaengine(E) kvm_amd(E) btbcm(E) btintel(E) snd_pci_acp6x(E) snd_rawmidi(E) btmtk(E) nls_iso8859_1(E) snd_pcm(E) snd_seq(E) cfg80211(E) snd_seq_device(E) kvm(E) bluetooth(E) snd_timer(E) snd_pci_acp5x(E) rapl(E) joydev(E) input_leds(E) snd_rn_pci_acp3x(E) snd(E) ecdh_generic(E) snd_acp_config(E) ccp(E) wmi_bmof(E) k10temp(E) ecc(E) snd_soc_acpi(E) snd_pci_acp3x(E) soundcore(E) mac_hid(E) acpi_tad(E) sch_fq_codel(E) msr(E) parport_pc(E) ppdev(E) lp(E) parport(E)
[   72.000937]  ramoops(E) pstore_blk(E) reed_solomon(E) pstore_zone(E) efi_pstore(E) ip_tables(E) x_tables(E) autofs4(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) libcrc32c(E) dm_mirror(E) dm_region_hash(E) dm_log(E) r8153_ecm(E) cdc_ether(E) usbnet(E) hid_generic(E) usbhid(E) amdgpu(E) r8152(E) hid(E) mii(E) iommu_v2(E) drm_buddy(E) gpu_sched(E) i2c_algo_bit(E) drm_suballoc_helper(E) drm_ttm_helper(E) ttm(E) drm_display_helper(E) cec(E) rc_core(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) drm(E) thunderbolt(E) nvme(E) nvme_core(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) sha512_ssse3(E) aesni_intel(E) r8169(E) crypto_simd(E) cryptd(E) xhci_pci(E) video(E) i2c_piix4(E) xhci_pci_renesas(E) realtek(E) wmi(E)
[   72.001136] CPU: 14 PID: 0 Comm: swapper/14 Tainted: G            E      6.4.7 #1
[   72.001143] Hardware name: AZW SER/SER, BIOS 113 02/21/2023
[   72.001147] RIP: 0010:dev_watchdog+0x238/0x240
[   72.001156] Code: ff ff 4c 89 e7 c6 05 f7 18 04 01 01 e8 f1 c8 f9 ff 44 8b 45 d4 44 89 f9 4c 89 e6 48 89 c2 48 c7 c7 a8 01 c2 8b e8 e8 82 36 ff <0f> 0b e9 25 ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[   72.001161] RSP: 0018:ffffafdd404e0e68 EFLAGS: 00010282
[   72.001168] RAX: 0000000000000000 RBX: ffff9fe0730984c8 RCX: 0000000000000000
[   72.001171] RDX: ffff9fe5e21adc40 RSI: ffff9fe5e21a1540 RDI: 0000000000000300
[   72.001175] RBP: ffffafdd404e0e98 R08: 0000000000000003 R09: fffffffffffd5cd8
[   72.001178] R10: ffffffff8d3373c0 R11: 0000000000000045 R12: ffff9fe073098000
[   72.001181] R13: 0000000000000000 R14: ffff9fe07309841c R15: 0000000000000000
[   72.001185] FS:  0000000000000000(0000) GS:ffff9fe5e2180000(0000) knlGS:0000000000000000
[   72.001189] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   72.001193] CR2: 000055c9909bff48 CR3: 00000004a683c000 CR4: 0000000000750ee0
[   72.001197] PKRU: 55555554
[   72.001199] Call Trace:
[   72.001203]  <IRQ>
[   72.001208]  ? show_regs+0x6e/0x80
[   72.001217]  ? dev_watchdog+0x238/0x240
[   72.001223]  ? __warn+0x91/0x150
[   72.001231]  ? dev_watchdog+0x238/0x240
[   72.001237]  ? report_bug+0x19d/0x1b0
[   72.001243]  ? handle_bug+0x46/0x80
[   72.001251]  ? exc_invalid_op+0x1d/0x80
[   72.001256]  ? asm_exc_invalid_op+0x1f/0x30
[   72.001266]  ? dev_watchdog+0x238/0x240
[   72.001272]  ? dev_watchdog+0x238/0x240
[   72.001278]  ? __pfx_dev_watchdog+0x10/0x10
[   72.001285]  call_timer_fn+0x2c/0x150
[   72.001293]  ? __pfx_dev_watchdog+0x10/0x10
[   72.001298]  __run_timers.part.0+0x1eb/0x2a0
[   72.001304]  ? ktime_get+0x4a/0xc0
[   72.001309]  ? __pfx_native_apic_mem_write+0x10/0x10
[   72.001315]  ? lapic_next_event+0x24/0x30
[   72.001322]  ? clockevents_program_event+0xb1/0x130
[   72.001329]  run_timer_softirq+0x2e/0x60
[   72.001335]  __do_softirq+0xe1/0x31e
[   72.001340]  ? hrtimer_interrupt+0x12f/0x240
[   72.001344]  __irq_exit_rcu+0x83/0xb0
[   72.001351]  irq_exit_rcu+0x12/0x20
[   72.001357]  sysvec_apic_timer_interrupt+0x84/0x90
[   72.001363]  </IRQ>
[   72.001365]  <TASK>
[   72.001368]  asm_sysvec_apic_timer_interrupt+0x1f/0x30
[   72.001373] RIP: 0010:cpuidle_enter_state+0xde/0x710
[   72.001380] Code: 6b 17 ff e8 04 f6 ff ff 8b 53 04 49 89 c7 0f 1f 44 00 00 31 ff e8 62 45 16 ff 80 7d d0 00 0f 85 50 02 00 00 fb 0f 1f 44 00 00 <45> 85 f6 0f 88 e6 01 00 00 4d 63 ee 49 83 fd 09 0f 87 03 05 00 00
[   72.001384] RSP: 0018:ffffafdd401f7e28 EFLAGS: 00000246
[   72.001389] RAX: ffff9fe5e21b3700 RBX: ffff9fdfc51cbc00 RCX: 000000000000001f
[   72.001393] RDX: 000000000000000e RSI: 0000000000000002 RDI: 0000000000000000
[   72.001396] RBP: ffffafdd401f7e78 R08: 00000010c392452f R09: 0000000000000000
[   72.001399] R10: 0000000000000000 R11: ffffafdd401f7d18 R12: ffffffff8c0d43a0
[   72.001402] R13: 0000000000000003 R14: 0000000000000003 R15: 00000010c392452f
[   72.001407]  ? cpuidle_enter_state+0xce/0x710
[   72.001413]  cpuidle_enter+0x32/0x50
[   72.001423]  call_cpuidle+0x23/0x50
[   72.001431]  do_idle+0x1e7/0x240
[   72.001437]  cpu_startup_entry+0x24/0x30
[   72.001441]  start_secondary+0x13c/0x170
[   72.001448]  secondary_startup_64_no_verify+0x10b/0x10b
[   72.001456]  </TASK>
[   72.001460] ---[ end trace 0000000000000000 ]---
[   72.005611] r8169 0000:02:00.0 enp2s0: ASPM disabled on Tx timeout
[  230.097841] r8169 0000:02:00.0 enp2s0: Link is Down
[  233.757357] r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control rx/tx
root@ser6:~# uname -a
Linux ser6 6.4.7 #1 SMP PREEMPT_DYNAMIC Tue Aug  1 16:17:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
