Bug 217596
Description
Adilson Dantas
2023-06-26 15:09:13 UTC
(In reply to Adilson Dantas from comment #0)
> I recently compiled kernel 6.4 and, after rebooting, the offboard network
> adapter fails to work with this message:
>
> Jun 26 10:31:58 yoda kernel: [ 335.867894] r8169 0000:05:00.0 eth1: rtl_ocp_gphy_cond == 1 (loop: 10, delay: 25).
> Jun 26 10:31:58 yoda kernel: [ 335.890061] r8169 0000:05:00.0 eth1: rtl_ocp_gphy_cond == 1 (loop: 10, delay: 25).
> Jun 26 10:31:58 yoda kernel: [ 335.917102] r8169 0000:05:00.0 eth1: rtl_ocp_gphy_cond == 1 (loop: 10, delay: 25).
> Jun 26 10:31:58 yoda kernel: [ 335.920085] r8169 0000:05:00.0 eth1: rtl_ocp_gphy_cond == 1 (loop: 10, delay: 25).
> Jun 26 10:31:58 yoda kernel: [ 335.920699] r8169 0000:05:00.0 eth1: rtl_ocp_gphy_cond == 1 (loop: 10, delay: 25).
> Jun 26 10:31:58 yoda kernel: [ 335.921271] r8169 0000:05:00.0 eth1: rtl_ocp_gphy_cond == 1 (loop: 10, delay: 25).
> Jun 26 10:31:58 yoda kernel: [ 335.921832] r8169 0000:05:00.0 eth1: rtl_ocp_gphy_cond == 1 (loop: 10, delay: 25).
> Jun 26 10:31:58 yoda kernel: [ 335.922403] r8169 0000:05:00.0 eth1: rtl_ocp_gphy_cond == 1 (loop: 10, delay: 25).
> Jun 26 10:31:58 yoda kernel: [ 335.922975] r8169 0000:05:00.0 eth1: rtl_ocp_gphy_cond == 1 (loop: 10, delay: 25).
> Jun 26 10:31:58 yoda kernel: [ 335.923547] r8169 0000:05:00.0 eth1: rtl_ocp_gphy_cond == 1 (loop: 10, delay: 25).
> Jun 26 10:32:00 yoda kernel: [ 337.546188] r8169 0000:05:00.0 eth1: Link is Down
>
> My hardware:
> 04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 16) -> Onboard Ethernet. It works without any problem.
> 05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15) -> TP-LINK offboard. This one is affected by the problem.
>
> This problem does not occur with 6.3.9 and below.

Can you perform bisection between v6.3 and v6.4 to find the culprit?
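For readers unfamiliar with the message format: each `rtl_ocp_gphy_cond == 1 (loop: 10, delay: 25)` line appears to report a bounded polling loop that gave up, i.e. the condition was still 1 after 10 checks spaced by a delay of 25 (presumably microseconds). A minimal user-space sketch in C of that pattern follows, assuming this reading is correct. `fake_phy`, `phy_busy`, and `poll_cond` are hypothetical names for illustration, not driver code.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the hardware: reports "busy" for a fixed
 * number of reads, then goes idle. Not a real driver structure. */
struct fake_phy {
	int busy_reads_left;
};

/* Condition callback: returns true while the device still reports busy. */
static bool phy_busy(struct fake_phy *phy)
{
	if (phy->busy_reads_left > 0) {
		phy->busy_reads_left--;
		return true;
	}
	return false;
}

/* Check the condition up to `loops` times. A real driver would sleep
 * `delay_us` microseconds between checks; this sketch only carries the
 * value along so it can be printed. Returns true if the condition
 * cleared in time, false on timeout. */
static bool poll_cond(struct fake_phy *phy, int loops, unsigned long delay_us)
{
	for (int i = 0; i < loops; i++) {
		if (!phy_busy(phy))
			return true;
	}
	fprintf(stderr, "phy_busy == 1 (loop: %d, delay: %lu).\n",
		loops, delay_us);
	return false;
}
```

Under this model, a device that never leaves the busy state produces one such message per failed wait, which would match the burst of identical lines in the report.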
Created attachment 304484 [details]
attachment-21682-0.html

I have built 6.4-rc1 and I got the same error.

On Mon, Jun 26, 2023 at 22:32, <bugzilla-daemon@kernel.org> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=217596
>
> Bagas Sanjaya (bagasdotme@gmail.com) changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |bagasdotme@gmail.com
>
> --- Comment #1 from Bagas Sanjaya (bagasdotme@gmail.com) ---
> Can you perform bisection between v6.3 and v6.4 to find the culprit?

On 6/27/23 17:45, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=217596
>
> --- Comment #2 from Adilson Dantas (adilson@adilson.net.br) ---
> I have built 6.4-rc1 and I got the same error.

Have you done v6.3..v6.4 bisection as I have requested earlier?

Created attachment 304489 [details]
attachment-26880-0.html

On Tue, Jun 27, 2023 at 09:53, <bugzilla-daemon@kernel.org> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=217596
>
> --- Comment #3 from Bagas Sanjaya (bagasdotme@gmail.com) ---
> Have you done v6.3..v6.4 bisection as I have requested earlier?

I have done between v6.3 and v6.4-rc1. None of the bisections, apparently, has any relation to r8169, except the first one.
# bad: [ac9a78681b921877518763ba0e89202254349d1b] Linux 6.4-rc1
# good: [457391b0380335d5e9a5babdec90ac53928b23b4] Linux 6.3
git bisect start 'ac9a78681b92' '457391b03803'
# good: [6e98b09da931a00bf4e0477d0fa52748bf28fcce] Merge tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
git bisect good 6e98b09da931a00bf4e0477d0fa52748bf28fcce
- Maybe - RealTek (r8169): refactor to address ASPM issues during NAPI poll
# good: [70cc1b5307e8ee3076fdf2ecbeb89eb973aa0ff7] Merge tag 'powerpc-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
git bisect good 70cc1b5307e8ee3076fdf2ecbeb89eb973aa0ff7
# good: [865fdb08197e657c59e74a35fa32362b12397f58] Merge tag 'input-for-v6.4-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
git bisect good 865fdb08197e657c59e74a35fa32362b12397f58
# good: [78b421b6a7c6dbb6a213877c742af52330f5026d] Merge tag 'linux-watchdog-6.4-rc1' of git://www.linux-watchdog.org/linux-watchdog
git bisect good 78b421b6a7c6dbb6a213877c742af52330f5026d
# good: [0e20f4311254193fbf9eebafb4dc5c922a885397] perf script: Print raw ip instead of binary offset for callchain
git bisect good 0e20f4311254193fbf9eebafb4dc5c922a885397
# good: [ed23734c23d2fc1e6a1ff80f8c2b82faeed0ed0c] Merge tag 'net-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
git bisect good ed23734c23d2fc1e6a1ff80f8c2b82faeed0ed0c
# good: [994e2419f1e77724479f0ffd5ad4eeae060dec95] nfs: fix mis-merged __filemap_get_folio() error check
git bisect good 994e2419f1e77724479f0ffd5ad4eeae060dec95
# good: [1c1094e47ef10be267a982fb1c69dbb80aa4f257] Merge tag 'mailbox-v6.4' of git://git.linaro.org/landing-teams/working/fujitsu/integration
git bisect good 1c1094e47ef10be267a982fb1c69dbb80aa4f257
# good: [6f69c981811c8b019d7882839e31c34ea8330860] Merge tag 'v6.4-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
git bisect good 6f69c981811c8b019d7882839e31c34ea8330860
# good: [ecc68ee216c6c5b2f84915e1441adf436f1b019b] perf stat: Separate bperf from bpf_profiler
git bisect good ecc68ee216c6c5b2f84915e1441adf436f1b019b
# good: [9a2d5178b9d51e1c5f9e08989ff97fc8d4893f31] Revert "perf build: Make BUILD_BPF_SKEL default, rename to NO_BPF_SKEL"
git bisect good 9a2d5178b9d51e1c5f9e08989ff97fc8d4893f31
# good: [17784de648be93b4eef0ef8fe28a16ff04feecc7] Merge tag 'core-debugobjects-2023-05-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 17784de648be93b4eef0ef8fe28a16ff04feecc7
# good: [f085df1be60abf670315c11036261cfaec16b2eb] Merge tag 'perf-tools-for-v6.4-3-2023-05-06' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
git bisect good f085df1be60abf670315c11036261cfaec16b2eb
# first bad commit: [ac9a78681b921877518763ba0e89202254349d1b] Linux 6.4-rc1

I don't know how to get the right code from git since the first commit doesn't show r8169_main.c. So I copied this file from v6.3.9 into v6.4 and it worked. Maybe it is some regression from ASPM or other code related to the first bisection step.

You can try the CONFIG_PSTORE_RAM option in the kernel configuration. It helps you collect the last boot logs after a soft reboot. You can retrieve the logs from /sys/fs/pstore/.

This bug is still present in 6.4.1, as you can see here: https://imgur.com/a/xbpSZGp
The only workaround, for now, is to copy r8169_main.c from v6.3.9 into v6.4.1 before building.

Hi, after upgrading the kernel to 6.4.1 I'm having a problem with my NIC that I pass through (Macvtap) to a virtual machine.

Reverting to kernel 6.3.9 resolves the issue.
[ 540.441287] ------------[ cut here ]------------
[ 540.441293] NETDEV WATCHDOG: netwan0 (r8169): transmit queue 0 timed out 5430 ms
[ 540.441314] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x232/0x240
[ 540.441325] Modules linked in: vhost_net tun vhost vhost_iotlb macvtap macvlan tap exfat wireguard curve25519_x86_64 libchacha
[ 540.441390] hid_holtek_mouse snd_soc_core usbnet snd_compress ac97_bus kvm snd_hda_codec_realtek snd_hda_codec_generic snd_pc
[ 540.441472] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 6.4.1-arch1-1 #1 cf145a0250459022493747c0d1c289a70a2c7109
[ 540.441476] Hardware name: Dell Inc. Wyse 5070 Thin Client/0K6VXP, BIOS 1.14.0 11/11/2021
[ 540.441477] RIP: 0010:dev_watchdog+0x232/0x240
[ 540.441482] Code: ff ff ff 48 89 df c6 05 04 b6 45 01 01 e8 c6 28 fa ff 45 89 f8 44 89 f1 48 89 de 48 89 c2 48 c7 c7 60 59 6c
[ 540.441484] RSP: 0018:ffffa9eb00194e78 EFLAGS: 00010286
[ 540.441487] RAX: 0000000000000000 RBX: ffff9a9c4abdc000 RCX: 0000000000000027
[ 540.441489] RDX: ffff9a9fafda16c8 RSI: 0000000000000001 RDI: ffff9a9fafda16c0
[ 540.441490] RBP: ffff9a9c4abdc4c8 R08: 0000000000000000 R09: ffffa9eb00194d08
[ 540.441492] R10: 0000000000000003 R11: ffffffff95eca868 R12: ffff9a9c40eb2400
[ 540.441493] R13: ffff9a9c4abdc41c R14: 0000000000000000 R15: 0000000000001536
[ 540.441495] FS: 0000000000000000(0000) GS:ffff9a9fafd80000(0000) knlGS:0000000000000000
[ 540.441497] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 540.441498] CR2: 00007fc72432f5d0 CR3: 00000001cfe20000 CR4: 0000000000352ee0
[ 540.441500] Call Trace:
[ 540.441504] <IRQ>
[ 540.441505] ? dev_watchdog+0x232/0x240
[ 540.441509] ? __warn+0x81/0x130
[ 540.441516] ? dev_watchdog+0x232/0x240
[ 540.441520] ? report_bug+0x171/0x1a0
[ 540.441525] ? prb_read_valid+0x1b/0x30
[ 540.441530] ? handle_bug+0x3c/0x80
[ 540.441533] ? exc_invalid_op+0x17/0x70
[ 540.441536] ? asm_exc_invalid_op+0x1a/0x20
[ 540.441542] ? dev_watchdog+0x232/0x240
[ 540.441546] ? dev_watchdog+0x232/0x240
[ 540.441549] ? __pfx_dev_watchdog+0x10/0x10
[ 540.441552] call_timer_fn+0x24/0x130
[ 540.441557] ? __pfx_dev_watchdog+0x10/0x10
[ 540.441560] __run_timers+0x222/0x2c0
[ 540.441564] run_timer_softirq+0x1d/0x40
[ 540.441567] __do_softirq+0xd1/0x2c8
[ 540.441573] __irq_exit_rcu+0xbb/0xf0
[ 540.441577] sysvec_apic_timer_interrupt+0x72/0x90
[ 540.441582] </IRQ>
[ 540.441583] <TASK>
[ 540.441584] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 540.441588] RIP: 0010:cpuidle_enter_state+0xcc/0x440
[ 540.441591] Code: 5a 22 3c ff e8 c5 f3 ff ff 8b 53 04 49 89 c5 0f 1f 44 00 00 31 ff e8 93 24 3b ff 45 84 ff 0f 85 56 02 00 00
[ 540.441593] RSP: 0018:ffffa9eb000efe90 EFLAGS: 00000246
[ 540.441595] RAX: ffff9a9fafdb3f40 RBX: ffff9a9fafdbf440 RCX: 0000000000000000
[ 540.441596] RDX: 0000000000000003 RSI: fffffffb347b8900 RDI: 0000000000000000
[ 540.441598] RBP: 0000000000000004 R08: 0000000000000002 R09: 0000000055785785
[ 540.441599] R10: ffff9a9fafdb2944 R11: 00000000000000f6 R12: ffffffff95f45960
[ 540.441600] R13: 0000007dd4cf7b9b R14: 0000000000000004 R15: 0000000000000000
[ 540.441605] cpuidle_enter+0x2d/0x40
[ 540.441609] do_idle+0x1d8/0x230
[ 540.441613] cpu_startup_entry+0x1d/0x20
[ 540.441615] start_secondary+0x12b/0x150
[ 540.441620] secondary_startup_64_no_verify+0x10b/0x10b
[ 540.441627] </TASK>
[ 540.441628] ---[ end trace 0000000000000000 ]---
[ 541.694899] r8169 0000:02:00.0: not ready 1023ms after bus reset; waiting
[ 542.734928] r8169 0000:02:00.0: not ready 2047ms after bus reset; waiting
[ 544.921384] r8169 0000:02:00.0: not ready 4095ms after bus reset; waiting
[ 549.187696] r8169 0000:02:00.0: not ready 8191ms after bus reset; waiting
[ 557.507543] r8169 0000:02:00.0: not ready 16383ms after bus reset; waiting
[ 574.787182] r8169 0000:02:00.0: not ready 32767ms after bus reset; waiting
[ 608.919873] r8169 0000:02:00.0: not ready 65535ms after bus reset; giving up
[ 608.920766] r8169 0000:02:00.0 netwan0: Can't reset secondary PCI bus, detach NIC
[ 610.172703] r8169 0000:02:00.0: not ready 1023ms after bus reset; waiting
[ 611.216035] r8169 0000:02:00.0: not ready 2047ms after bus reset; waiting
[ 613.399528] r8169 0000:02:00.0: not ready 4095ms after bus reset; waiting
[ 617.666092] r8169 0000:02:00.0: not ready 8191ms after bus reset; waiting
[ 625.986007] r8169 0000:02:00.0: not ready 16383ms after bus reset; waiting
[ 643.051701] r8169 0000:02:00.0: not ready 32767ms after bus reset; waiting
[ 677.187279] r8169 0000:02:00.0: not ready 65535ms after bus reset; giving up
[ 677.188045] r8169 0000:02:00.0 netwan0: Can't reset secondary PCI bus, detach NIC

(In reply to Jay Mann from comment #7)
> Hi, after upgrading the kernel to 6.4.1 I'm having a problem with my NIC
> that I pass through (Macvtap) to a virtual machine.
>
> Reverting to kernel 6.3.9 resolves the issue.

Can you copy r8169_main.c from kernel 6.3.9 into 6.4.1 to see if you get the same issue?

Sorry, the computer would take a very long time to build the kernel. Can you just try to revert this single commit and see if it works correctly? If not, I may be able to build later on a different computer.

commit d6c36cbc5e533f48bd89a7b5f339bd82b8b4378a
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Mon May 22 15:41:21 2023 +0200

    r8169: Use a raw_spinlock_t for the register locks.

    The driver's interrupt service routine is requested with the
    IRQF_NO_THREAD if MSI is available. This means that the routine is
    invoked in hardirq context even on PREEMPT_RT. The routine itself is
    relatively short and schedules a worker, performs register access and
    schedules NAPI. On PREEMPT_RT, scheduling NAPI from hardirq results in
    waking ksoftirqd for further processing so using NAPI threads with this
    driver is highly recommended since it NULL routes the threaded-IRQ
    efforts.

    Adding rtl_hw_aspm_clkreq_enable() to the ISR is problematic on
    PREEMPT_RT because the function uses spinlock_t locks which become
    sleeping locks on PREEMPT_RT. The locks are only used to protect
    register access and don't nest into other functions or locks. They are
    also not used for unbounded period of time. Therefore it looks okay to
    convert them to raw_spinlock_t.

    Convert the three locks which are used from the interrupt service
    routine to raw_spinlock_t.

    Fixes: e1ed3e4d9111 ("r8169: disable ASPM during NAPI poll")
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
    Link: https://lore.kernel.org/r/20230522134121.uxjax0F5@linutronix.de
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

 drivers/net/ethernet/realtek/r8169_main.c | 44 ++++++++++++++++++++++----------------------
 1 file changed, 22 insertions(+), 22 deletions(-)

Created attachment 304546 [details]
attachment-2814-0.html

It didn't work. I checked the differences between this revert and 6.3.9 and maybe it is something related to this commit:

2ab19de62d67e403105ba860971e5ff0d511ad15
r8169: remove ASPM restrictions now that ASPM is disabled during NAPI poll

Part of the code removed by this commit, which the revert above did not bring back, is:

	/* Disable ASPM L1 as that cause random device stop working
	 * problems as well as full system hangs for some PCIe devices users.
	 * Chips from RTL8168h partially have issues with L1.2, but seem
	 * to work fine with L1 and L1.1.
	 */
	if (rtl_aspm_is_safe(tp))
		rc = 0;
	else if (tp->mac_version >= RTL_GIGA_MAC_VER_46)
		rc = pci_disable_link_state(pdev, PCIE_LINK_STATE_L1_2);
	else
		rc = pci_disable_link_state(pdev, PCIE_LINK_STATE_L1);
	tp->aspm_manageable = !rc;

I would like to do a build that reverts this commit and the other 5 commits to the same module right before it. But I don't have much time to do it this week.

On Tue, Jul 4, 2023 at 12:29, <bugzilla-daemon@kernel.org> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=217596
>
> --- Comment #9 from Jay Mann (jmandawg@hotmail.com) ---
> Sorry, the computer would take a very long time to build the kernel. Can you
> just try to revert this single commit and see if it works correctly? If not,
> I may be able to build later on a different computer.
>
> commit d6c36cbc5e533f48bd89a7b5f339bd82b8b4378a
> r8169: Use a raw_spinlock_t for the register locks.
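The removed probe-time block quoted in the comment above chooses which ASPM link state to disable based on the chip version and on the `rtl_aspm_is_safe()` check (whose body appears in the patch in the next comment). A small user-space model in C of that decision follows, assuming only what the quoted snippets show. The numeric version constants, the `disabled_state` enum, and `pick_disabled_state` are hypothetical stand-ins, not kernel definitions.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's RTL_GIGA_MAC_VER_* constants;
 * only their relative order matters for this sketch. */
enum { MAC_VER_46 = 46, MAC_VER_61 = 61 };

/* Which link state the removed code would ask the PCI core to disable. */
enum disabled_state { DISABLE_NONE, DISABLE_L1_2, DISABLE_L1 };

/* Models rtl_aspm_is_safe() as quoted in this thread: a new enough chip
 * whose register 0xc0b2 has any of its low four bits set. */
static bool aspm_is_safe(int mac_version, unsigned int reg_c0b2)
{
	return mac_version >= MAC_VER_61 && (reg_c0b2 & 0xf) != 0;
}

/* Models the if / else-if / else chain from the removed block. */
static enum disabled_state pick_disabled_state(int mac_version,
					       unsigned int reg_c0b2)
{
	if (aspm_is_safe(mac_version, reg_c0b2))
		return DISABLE_NONE;
	if (mac_version >= MAC_VER_46)
		return DISABLE_L1_2;
	return DISABLE_L1;
}
```

In the quoted code the result of the disable call then feeds `tp->aspm_manageable = !rc;`, so the flag is set only when nothing had to be disabled or the disable request succeeded.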
I built the kernel 6.4.0-1 with the following patch, which reverts the "Disable ASPM L1" code removal as suggested by Adilson Dantas, and it has been running for an hour without problems. Will update if it crashes:

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 4b19803a7dd0..aeeb6cd312d7 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -623,6 +623,7 @@ struct rtl8169_private {
 	int cfg9346_usage_count;
 
 	unsigned supports_gmii:1;
+	unsigned aspm_manageable:1;
 	dma_addr_t counters_phys_addr;
 	struct rtl8169_counters *counters;
 	struct rtl8169_tc_offsets tc_offset;
@@ -5158,6 +5159,16 @@ static void rtl_init_mac_address(struct rtl8169_private *tp)
 	rtl_rar_set(tp, mac_addr);
 }
 
+/* register is set if system vendor successfully tested ASPM 1.2 */
+static bool rtl_aspm_is_safe(struct rtl8169_private *tp)
+{
+	if (tp->mac_version >= RTL_GIGA_MAC_VER_61 &&
+	    r8168_mac_ocp_read(tp, 0xc0b2) & 0xf)
+		return true;
+
+	return false;
+}
+
 static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 {
 	struct rtl8169_private *tp;
@@ -5229,6 +5240,19 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 
 	tp->mac_version = chipset;
 
+	/* Disable ASPM L1 as that cause random device stop working
+	 * problems as well as full system hangs for some PCIe devices users.
+	 * Chips from RTL8168h partially have issues with L1.2, but seem
+	 * to work fine with L1 and L1.1.
+	 */
+	if (rtl_aspm_is_safe(tp))
+		rc = 0;
+	else if (tp->mac_version >= RTL_GIGA_MAC_VER_46)
+		rc = pci_disable_link_state(pdev, PCIE_LINK_STATE_L1_2);
+	else
+		rc = pci_disable_link_state(pdev, PCIE_LINK_STATE_L1);
+	tp->aspm_manageable = !rc;
+
 	tp->dash_type = rtl_check_dash(tp);
 
 	tp->cp_cmd = RTL_R16(tp, CPlusCmd) & CPCMD_MASK;

It still ended up crashing. I will try to revert all the commits since 6.3.9 and see if that fixes the issue.

[17326.067824] ------------[ cut here ]------------ [17326.067838] NETDEV WATCHDOG: netwan0 (r8169): transmit queue 0 timed out 6557 ms [17326.067896] WARNING: CPU: 2 PID: 2330 at net/sched/sch_generic.c:525 dev_watchdog+0x232/0x240 [17326.067920] Modules linked in: exfat wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel vhost_net tun vhost vhost_iotlb macvtap macvlan tap veth xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c xt_addrtype iptable_filter br_netfilter bridge stp llc overlay snd_hda_codec_hdmi snd_sof_pci_intel_apl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda_mlink snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils soundwire_bus snd_soc_avs snd_soc_hda_codec snd_soc_skl snd_soc_hdac_hda snd_hda_ext_core snd_soc_sst_ipc snd_soc_sst_dsp snd_soc_acpi_intel_match snd_soc_acpi snd_soc_core intel_rapl_msr intel_rapl_common snd_compress intel_pmc_bxt intel_telemetry_pltdrv mousedev intel_punit_ipc snd_hda_codec_realtek ac97_bus intel_telemetry_core snd_hda_codec_generic [17326.068102] snd_pcm_dmaengine x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_intel snd_intel_dspcfg kvm_intel snd_intel_sdw_acpi snd_hda_codec r8153_ecm joydev hid_holtek_mouse snd_hda_core cdc_ether snd_hwdep kvm usbnet mei_pxp mei_hdcp irqbypass snd_pcm crct10dif_pclmul ee1004 crc32_pclmul polyval_generic gf128mul nls_iso8859_1 snd_timer ghash_clmulni_intel dell_wmi sha512_ssse3 vfat aesni_intel serio crypto_simd usbhid r8169 ucsi_acpi snd rfkill fat typec_ucsi mei_me realtek dell_smbios cryptd ledtrig_audio i2c_i801 intel_lpss_pci rapl dcdbas r8152 intel_lpss mdio_devres wmi_bmof dell_wmi_descriptor mii sparse_keymap intel_cstate typec pcspkr i2c_smbus idma64 libphy mei roles soundcore mac_hid dm_multipath fuse loop 
dm_mod bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 uas usb_storage mmc_block i915 i2c_algo_bit drm_buddy intel_gtt drm_display_helper sdhci_pci cqhci sdhci crc32c_intel xhci_pci cec mmc_core xhci_pci_renesas ttm video wmi [17326.068353] CPU: 2 PID: 2330 Comm: chromium Not tainted 6.4.0-1-mainline #3 fe50d2b946c00ecafa54655e238b01581437b546 [17326.068365] Hardware name: Dell Inc. Wyse 5070 Thin Client/0K6VXP, BIOS 1.14.0 11/11/2021 [17326.068371] RIP: 0010:dev_watchdog+0x232/0x240 [17326.068384] Code: ff ff ff 48 89 df c6 05 13 f9 45 01 01 e8 c6 28 fa ff 45 89 f8 44 89 f1 48 89 de 48 89 c2 48 c7 c7 d0 46 cc ae e8 fe b1 54 ff <0f> 0b e9 2d ff ff ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 [17326.068391] RSP: 0000:ffffc2f4c3273dc0 EFLAGS: 00010282 [17326.068399] RAX: 0000000000000000 RBX: ffff9f1743cb4000 RCX: 0000000000000027 [17326.068405] RDX: ffff9f1aafd21688 RSI: 0000000000000001 RDI: ffff9f1aafd21680 [17326.068411] RBP: ffff9f1743cb44c8 R08: 0000000000000000 R09: ffffc2f4c3273c50 [17326.068416] R10: 0000000000000003 R11: ffffffffaf4ca828 R12: ffff9f17418bb200 [17326.068421] R13: ffff9f1743cb441c R14: 0000000000000000 R15: 000000000000199d [17326.068426] FS: 00007ff94cd38180(0000) GS:ffff9f1aafd00000(0000) knlGS:0000000000000000 [17326.068434] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [17326.068439] CR2: 000025030064c000 CR3: 000000015b8ac000 CR4: 0000000000352ee0 [17326.068446] Call Trace: [17326.068454] <TASK> [17326.068458] ? dev_watchdog+0x232/0x240 [17326.068469] ? __warn+0x81/0x130 [17326.068485] ? dev_watchdog+0x232/0x240 [17326.068495] ? report_bug+0x171/0x1a0 [17326.068506] ? prb_read_valid+0x1b/0x30 [17326.068520] ? handle_bug+0x3c/0x80 [17326.068533] ? exc_invalid_op+0x17/0x70 [17326.068540] ? asm_exc_invalid_op+0x1a/0x20 [17326.068556] ? dev_watchdog+0x232/0x240 [17326.068567] ? __pfx_dev_watchdog+0x10/0x10 [17326.068577] call_timer_fn+0x24/0x130 [17326.068590] ? 
__pfx_dev_watchdog+0x10/0x10 [17326.068598] __run_timers+0x222/0x2c0 [17326.068613] run_timer_softirq+0x1d/0x40 [17326.068622] __do_softirq+0xd1/0x2c8 [17326.068637] __irq_exit_rcu+0xbb/0xf0 [17326.068649] sysvec_apic_timer_interrupt+0x3e/0x90 [17326.068661] asm_sysvec_apic_timer_interrupt+0x1a/0x20 [17326.068672] RIP: 0033:0x5611f5966cee [17326.068744] Code: 4c 8b 65 80 0f 1f 84 00 00 00 00 00 80 7d 90 00 74 63 48 8b 45 80 48 63 04 18 48 8b 0d 1b a2 a0 03 48 01 c0 48 21 c8 48 63 00 <48> 01 c0 48 21 c8 48 63 78 24 48 01 ff 48 21 cf 48 8b 0f 48 89 c8 [17326.068751] RSP: 002b:00007ffe0591f110 EFLAGS: 00000202 [17326.068757] RAX: ffffffff805dec58 RBX: 0000000000000900 RCX: 0000190bffffffff [17326.068762] RDX: 0000190b0020cd88 RSI: 0000190b001fbd98 RDI: 0000190b008ccb10 [17326.068767] RBP: 00007ffe0591f1b0 R08: 00005611f90c1578 R09: 00005611ecb712d8 [17326.068772] R10: 00007ffe059b4080 R11: 00000000009bb64a R12: 0000190b0476f658 [17326.068776] R13: 0000000000000000 R14: 000000000000092c R15: 0000000000000000 [17326.068787] </TASK> [17326.068791] ---[ end trace 0000000000000000 ]--- [17327.321173] r8169 0000:02:00.0: not ready 1023ms after bus reset; waiting [17328.361180] r8169 0000:02:00.0: not ready 2047ms after bus reset; waiting [17330.548391] r8169 0000:02:00.0: not ready 4095ms after bus reset; waiting [17334.818719] r8169 0000:02:00.0: not ready 8191ms after bus reset; waiting [17343.134979] r8169 0000:02:00.0: not ready 16383ms after bus reset; waiting [17360.415001] r8169 0000:02:00.0: not ready 32767ms after bus reset; waiting [17394.548241] r8169 0000:02:00.0: not ready 65535ms after bus reset; giving up [17394.549408] r8169 0000:02:00.0 netwan0: Can't reset secondary PCI bus, detach NIC [17395.801610] r8169 0000:02:00.0: not ready 1023ms after bus reset; waiting [17396.844852] r8169 0000:02:00.0: not ready 2047ms after bus reset; waiting [17399.028494] r8169 0000:02:00.0: not ready 4095ms after bus reset; waiting [17403.298498] r8169 0000:02:00.0: not ready 
8191ms after bus reset; waiting [17411.615204] r8169 0000:02:00.0: not ready 16383ms after bus reset; waiting [17428.681669] r8169 0000:02:00.0: not ready 32767ms after bus reset; waiting [17462.815268] r8169 0000:02:00.0: not ready 65535ms after bus reset; giving up [17462.816405] r8169 0000:02:00.0 netwan0: Can't reset secondary PCI bus, detach NIC
I did some more testing (I missed a code block in my previous revert) and it definitely looks like the issue is caused by:
commit 2ab19de62d67e403105ba860971e5ff0d511ad15
Author: Heiner Kallweit <hkallweit1@gmail.com>
Date: Mon Mar 6 22:28:06 2023 +0100
    r8169: remove ASPM restrictions now that ASPM is disabled during NAPI poll
My system has 3 Realtek NICs; 2 of them are passed through to a VM via MACVTAP.
Looks like the problem is that you removed the "aspm_manageable" flag. Previously my NIC would never go into this code block since it is not ASPM manageable, but now it does since the flag was removed:
ORIG CODE:
    /* Don't enable ASPM in the chip if OS can't control ASPM */
    if (enable && tp->aspm_manageable) {
        rtl_mod_config5(tp, 0, ASPM_en);
        rtl_mod_config2(tp, 0, ClkReqEn);
NEW CODE:
    if (enable) {
        rtl_mod_config5(tp, 0, ASPM_en);
        rtl_mod_config2(tp, 0, ClkReqEn);
(In reply to Jay Mann from comment #13)
> I did some more testing (i missed a code block in my previous revert) and it
> definitely looks like the issue is caused by:
>
> commit 2ab19de62d67e403105ba860971e5ff0d511ad15
> Author: Heiner Kallweit <hkallweit1@gmail.com>
> Date: Mon Mar 6 22:28:06 2023 +0100
>
> r8169: remove ASPM restrictions now that ASPM is disabled during NAPI
> poll
I rebuilt 6.4.2 with this revert and it worked.
>
>
> My system has 3 realtek NICs 2 of them are passed through to a VM via
> MACVTAP.
Could you please provide a full dmesg log and the lspci -vv output? I'd like to avoid increasing power consumption for 99.9% of the users just because of very few systems with broken ASPM.
Created attachment 304598 [details]
dmesg
dmesg from affected system
Created attachment 304599 [details]
lspci -vv
lspci -vv on affected system.
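For readers following the ORIG CODE / NEW CODE comparison in comment #13, here is a minimal sketch of the gating condition only. It is a plain Python model, not driver code: the `Nic` class and both function names are made up for illustration, and the real driver writes chip config registers rather than returning a boolean.

```python
from dataclasses import dataclass


@dataclass
class Nic:
    # Hypothetical stand-in for tp->aspm_manageable in the quoted driver code.
    aspm_manageable: bool


def chip_aspm_enabled_before(nic: Nic, enable: bool) -> bool:
    """Model of the quoted ORIG CODE: ASPM_en/ClkReqEn are set only
    when the OS can control ASPM for this device."""
    return enable and nic.aspm_manageable


def chip_aspm_enabled_after(nic: Nic, enable: bool) -> bool:
    """Model of the quoted NEW CODE: the aspm_manageable check is gone,
    so the bits are set whenever enable is true."""
    return enable


if __name__ == "__main__":
    # A NIC on a system where the OS may not manage ASPM.
    nic = Nic(aspm_manageable=False)
    print("before:", chip_aspm_enabled_before(nic, True))
    print("after: ", chip_aspm_enabled_after(nic, True))
```

Under these assumptions the two versions only differ for devices where `aspm_manageable` is false, which matches the observation that such a NIC never entered this block before the change.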
I thought the removed flag "aspm_manageable" was supposed to flag the systems with broken ASPM. Files attached.
I hit the same issue on kernel 6.4.x with a Dell OptiPlex3000 ThinClient: the eth0 NIC stops working with the error "NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out 6387 ms"
dmesg:
[ 0.000000] DMI: Dell Inc. OptiPlex 3000 Thin Client/07Y42Y, BIOS 1.9.1 05/12/2023
[ 0.632236] r8169 0000:01:00.0 eth0: RTL8168h/8111h, 00:be:43:f1:a1:62, XID 541, IRQ 126
[ 0.632262] r8169 0000:01:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]
[ 1.716836] Generic FE-GE Realtek PHY r8169-0-100:00: attached PHY driver (mii_bus:phy_addr=r8169-0-100:00, irq=MAC)
lspci:
01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
    Subsystem: Dell Device 0ae8
    Flags: bus master, fast devsel, latency 0, IRQ 18
    I/O ports at 3000 [size=256]
    Memory at 7f504000 (64-bit, non-prefetchable) [size=4K]
    Memory at 7f500000 (64-bit, non-prefetchable) [size=16K]
    Capabilities: [40] Power Management version 3
    Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
    Capabilities: [70] Express Endpoint, MSI 01
    Capabilities: [b0] MSI-X: Enable+ Count=4 Masked-
    Capabilities: [100] Advanced Error Reporting
    Capabilities: [140] Virtual Channel
    Capabilities: [160] Device Serial Number 01-00-00-00-68-4c-e0-00
    Capabilities: [170] Latency Tolerance Reporting
    Capabilities: [178] L1 PM Substates
    Kernel driver in use: r8169
So I tested 6.4.3 with the ASPM-related commits to drivers/net/ethernet/realtek/r8169_main.c from 2023-03-08 and 2023-03-20 reverted (https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/drivers/net/ethernet/realtek/r8169_main.c?h=linux-6.4.y), and confirmed that the "NETDEV WATCHDOG" error is gone; the OS is up and looks stable.
(In reply to y_satou from comment #20)
> I hit the same issue on kernel 6.4.x with Dell OptiPlex3000 ThinClient, eth0
> NIC stop working with error: "NETDEV WATCHDOG: eth0 (r8169): transmit queue
> 0 timed out 6387 ms"
>
> dmesg:
>
> [ 0.000000] DMI: Dell Inc. OptiPlex 3000 Thin Client/07Y42Y, BIOS 1.9.1
> 05/12/2023
> [ 0.632236] r8169 0000:01:00.0 eth0: RTL8168h/8111h, 00:be:43:f1:a1:62,
> XID 541, IRQ 126
> [ 0.632262] r8169 0000:01:00.0 eth0: jumbo features [frames: 9194 bytes,
> tx checksumming: ko]
> [ 1.716836] Generic FE-GE Realtek PHY r8169-0-100:00: attached PHY driver
> (mii_bus:phy_addr=r8169-0-100:00, irq=MAC)
>
> lspci:
>
> 01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
> Subsystem: Dell Device 0ae8
> Flags: bus master, fast devsel, latency 0, IRQ 18
> I/O ports at 3000 [size=256]
> Memory at 7f504000 (64-bit, non-prefetchable) [size=4K]
> Memory at 7f500000 (64-bit, non-prefetchable) [size=16K]
> Capabilities: [40] Power Management version 3
> Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
> Capabilities: [70] Express Endpoint, MSI 01
> Capabilities: [b0] MSI-X: Enable+ Count=4 Masked-
> Capabilities: [100] Advanced Error Reporting
> Capabilities: [140] Virtual Channel
> Capabilities: [160] Device Serial Number 01-00-00-00-68-4c-e0-00
> Capabilities: [170] Latency Tolerance Reporting
> Capabilities: [178] L1 PM Substates
> Kernel driver in use: r8169
>
> so I tested 6.4.3 with reverting commits for ASPM related to
> drivers/net/ethernet/realtek/r8169_main.c applied on 2023-03-08 and
> 2023-03-20
> (https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/
> drivers/net/ethernet/realtek/r8169_main.c?h=linux-6.4.y), and confirmed
> "NETDEV WATCHDOG" is gone, the OS is up and looks stable.
Please also provide a full dmesg log and the lspci -vv output. (-vv to get ASPM L1 substate information)
Created attachment 304623 [details]
dmesg output from Dell OptiPlex3000 ThinClient
attached dmesg output from Dell OptiPlex3000 ThinClient as requested.
Created attachment 304624 [details]
lspci -vv output from Dell OptiPlex3000 ThinClient
attached lspci -vv output from Dell OptiPlex3000 ThinClient as requested.
(In reply to y_satou from comment #23)
> Created attachment 304624 [details]
> lspci -vv output from Dell OptiPlex3000 ThinClient
>
> attached lspci -vv output from Dell OptiPlex3000 ThinClient as requested.
Please repeat with root permissions, otherwise a lot of details are missing.
Created attachment 304627 [details]
dmesg from OptiPlex 3080
Created attachment 304628 [details]
lspci from OptiPlex 3080
I was facing the same issue when I was using Google Meet. Reverting 2ab19de62d67e403105ba860971e5ff0d511ad15 fixed the issue.
Created attachment 304629 [details]
dmesg.txt
Here is dmesg and lspci -vv from my machine with an unpatched 6.4.0 kernel.
On Thu, 13 Jul 2023 at 11:25, <bugzilla-daemon@kernel.org> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=217596
>
> --- Comment #27 from Demitrius Belai (demitriusbelai@gmail.com) ---
> I was facing the same issue when I was using Google Meet. Reverting
> 2ab19de62d67e403105ba860971e5ff0d511ad15 fixed the issue.
Created attachment 304630 [details]
lspci.txt
Created attachment 304631 [details]
lspci -vv (by root) output from Dell OptiPlex3000 ThinClient
Attached the Dell Optiplex3000 ThinClient lspci -vv output, run as root, as requested.
Apparently on all affected systems the BIOS claims exclusive access to ASPM settings. Could you please boot with cmd line parameter pcie_aspm=force (to overrule this BIOS setting) and see whether this makes any difference?
> Could you please boot with cmd line parameter pcie_aspm=force (to overrule
> this BIOS setting) and see whether this makes any difference?
I rebuilt the 6.4.3 kernel without changes (the ASPM-related commits in drivers/net/ethernet/realtek/r8169_main.c were not rolled back) and booted it with the "pcie_aspm=force" option.
With this option the host survives without the "NETDEV WATCHDOG - transmit queue timeout" error, but the performance (throughput) appears to degrade significantly.
Attached dmesg (you can confirm boot option pcie_aspm=force) and lspci -vv output.
Created attachment 304632 [details]
dmesg output from Dell OptiPlex3000 ThinClient with pcie_aspm=force
Created attachment 304633 [details]
lspci -vv (by root) output from Dell OptiPlex3000 ThinClient with pcie_aspm=force
(In reply to y_satou from comment #32)
> > Could you please boot with cmd line parameter pcie_aspm=force (to overrule
> > this BIOS setting) and see whether this makes any difference?
>
> I rebuilt 6.4.3 kernel without change (not rollback ASPM related commits in
> drivers/net/ethernet/realtek/r8169_main.c), and boot it with
> "pcie_aspm=force" option.
>
> It looks enable the host survive without "NETDEV WATCHDOG - transmit queue
> timeout" error, but it looks the performance (throughput) got significant
> degradation.
>
> Attached dmesg (you can confirm boot option pcie_aspm=force) and lspci -vv
> output.
Thanks for testing. Regarding the performance degradation:
- How was it measured?
- If not iperf, could you please test with iperf?
- Please provide the ethtool -S <if> output. I'd like to know whether missed rx packets are the cause.
Created attachment 304634 [details]
dmesg-01.txt
I still got the same error with an unpatched 6.4.0 and the pcie_aspm=force parameter. The dmesg and lspci output are attached.
On Fri, 14 Jul 2023 at 02:47, <bugzilla-daemon@kernel.org> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=217596
>
> --- Comment #31 from Heiner Kallweit (hkallweit1@gmail.com) ---
> Apparently on all affected systems the BIOS claims exclusive access to ASPM
> settings.
> Could you please boot with cmd line parameter pcie_aspm=force (to overrule
> this
> BIOS setting) and see whether this makes any difference?
Created attachment 304635 [details]
lspci-01.txt
Created attachment 304636 [details]
journalctl-root-optiplex3060
Troubleshooting an issue I've had recently led me here; hopefully my journalctl output helps in some way. I appear to have the same network adapter; the machine is a Dell Optiplex 3060. I can provide additional logs if necessary.
(In reply to Heiner Kallweit from comment #35)
>
> Regarding the performance degradation:
> - How was it measured?
My image boots over the network (using a ramdisk): vmlinuz and the base packages are downloaded first by iPXE over HTTP, then the rest of the packages are downloaded via HTTP, roughly 400MB in total. The "rest of the packages" download showed a significant difference, as you can see in the dmesg output:
* the kernel with the ASPM-related commits removed = 397sec.
* genuine kernel with the pcie_aspm=force option = 2059sec.
> - If not iperf, could you please test with iperf?
> - Please provide the ethtool -S <if> output. I'd like to know whether missed
> rx packets are the cause.
I'm away for the next few days; once I return I will prepare these tests.
> I'd like to avoid increasing power consumption for 99.9% of the users just
> because of very few systems with broken ASPM.
You would more realistically disable ASPM for the 40% of users who have systems it works on in order to make the kernel actually work as it should for the 60% of users who have systems that are broken - if you were to disable it completely. That scenario is irrelevant anyway, the code that used to be in the kernel - when it wasn't broken and useless - didn't disable ASPM across the board. It only disabled it in cases where it's needed - as the kernel should:
+    /* Disable ASPM L1 as that cause random device stop working
+     * problems as well as full system hangs for some PCIe devices users.
+     * Chips from RTL8168h partially have issues with L1.2, but seem
+     * to work fine with L1 and L1.1.
+     */
+    if (rtl_aspm_is_safe(tp))
+        rc = 0;
+    else if (tp->mac_version >= RTL_GIGA_MAC_VER_46)
+        rc = pci_disable_link_state(pdev, PCIE_LINK_STATE_L1_2);
+    else
+        rc = pci_disable_link_state(pdev, PCIE_LINK_STATE_L1);
+    tp->aspm_manageable = !rc;
Having to encounter this silly bug, visit this bug tracker and write here is a complete waste of my time. This has apparently been broken since 6.4 and we're on 6.4.3 now. The right thing to do when the kernel has worked fine for years and one change to one kernel version breaks the kernel for a large portion of users is to revert the change that caused it. Just do it already.
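As a reading aid for the diff quoted in comment #40, the probe-time decision can be sketched as a small Python model. Everything here is an assumption made for illustration: `probe_aspm`, the string state names, the numeric value 46, and the callback standing in for pci_disable_link_state() are not driver code. The only behaviour taken from the quoted diff is the branch order and that aspm_manageable ends up as `!rc`.

```python
import errno

# Placeholder value; only the ordering relative to other MAC versions matters here.
RTL_GIGA_MAC_VER_46 = 46


def probe_aspm(mac_version, aspm_is_safe, disable_link_state):
    """Return (link_state_asked_to_disable, aspm_manageable).

    disable_link_state is a hypothetical callback modelling
    pci_disable_link_state(): it returns 0 on success and a negative
    errno value on failure.
    """
    if aspm_is_safe:
        state, rc = None, 0
    elif mac_version >= RTL_GIGA_MAC_VER_46:
        state = "L1.2"
        rc = disable_link_state(state)
    else:
        state = "L1"
        rc = disable_link_state(state)
    # Mirrors: tp->aspm_manageable = !rc;
    return state, rc == 0


if __name__ == "__main__":
    def os_controls_aspm(state):
        return 0

    def bios_owns_aspm(state):
        # Assumed failure mode when the OS may not touch ASPM settings.
        return -errno.EPERM

    print(probe_aspm(40, False, os_controls_aspm))
    print(probe_aspm(40, False, bios_owns_aspm))
```

In this model a refused request leaves aspm_manageable false, which is the case the thread associates with systems where the BIOS keeps ASPM control.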
(In reply to oyvinds from comment #40)
> > I'd like to avoid increasing power consumption for 99.9% of the users just
> > because of very few systems with broken ASPM.
>
> You would more realistically disable ASPM for the 40% of users who have
> systems it works on in order to make the kernel actually work as it should
> for the 60% of users who have systems that are broken - if you were to
> disable it completely. That scenario is irrelevant anyway, the code that
> used to be in the kernel - when it wasn't broken and useless - didn't
> disable ASPM across the board. It only disabled it in cases where it's
> needed - as the kernel should:
>
> + /* Disable ASPM L1 as that cause random device stop working
> + * problems as well as full system hangs for some PCIe devices users.
> + * Chips from RTL8168h partially have issues with L1.2, but seem
> + * to work fine with L1 and L1.1.
> + */
> + if (rtl_aspm_is_safe(tp))
> + rc = 0;
> + else if (tp->mac_version >= RTL_GIGA_MAC_VER_46)
> + rc = pci_disable_link_state(pdev, PCIE_LINK_STATE_L1_2);
> + else
> + rc = pci_disable_link_state(pdev, PCIE_LINK_STATE_L1);
> + tp->aspm_manageable = !rc;
>
> Having to encounter this silly bug & visiting this bug tracker & writing
> here is a complete waste of my time. This has apparently been broken since
> 6.4 and we're on 6.4.3 now. The right thing to do when the kernel has worked
> fine for years and one change to one kernel version breaks the kernel for a
> large portion of users it to revert the change that caused it. Just do it
> already.
I agree with this statement 100%. The longer you wait the more systems you break.
Created attachment 304650 [details]
iperf test result on Dell OptiPlex3000 ThinClient kernel 6.4.3
(In reply to y_satou from comment #39)
> >
> > Regarding the performance degradation:
> > - If not iperf, could you please test with iperf?
> > - Please provide the ethtool -S <if> output. I'd like to know whether
> missed rx packets are the cause.
>
> I'm away for next few days, once I return then I'm going to prepare them for
> test.
Attached is the iperf test result for the genuine 6.4.3 kernel code with the "pcie_aspm=force" option. tx and rx show a significant performance difference:
upload (=mainly tx) shows 13389 KBytes/sec
download (=mainly rx) shows 207 KBytes/sec
Created attachment 304662 [details]
iperf test result on Dell OptiPlex3000 ThinClient kernel 6.4.3 with removing ASPM related commits
I performed another iperf test on kernel 6.4.3 with the ASPM-related commits reverted, for comparison (using the same hardware, Dell OptiPlex3000 ThinClient).
genuine 6.4.3 kernel (keeps the ASPM-related commits) with the pcie_aspm=force option:
upload (=mainly tx) shows 13389 KBytes/sec
download (=mainly rx) shows 207 KBytes/sec
rx_misses count continuously increases
6.4.3 kernel with the ASPM-related commits in r8169_main.c removed, without the pcie_aspm=force option:
upload (=mainly tx) shows 13210 KBytes/sec
download (=mainly rx) shows 7533 KBytes/sec
rx_misses count does not increase
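To put the two iperf runs side by side, here is a short calculation using only the four KBytes/sec figures reported above. The dictionary keys and variable names are just labels chosen for this sketch.

```python
# KBytes/sec figures copied from the two iperf runs reported in this thread.
runs = {
    "forced": {"upload": 13389, "download": 207},     # genuine 6.4.3 + pcie_aspm=force
    "reverted": {"upload": 13210, "download": 7533},  # 6.4.3 with ASPM commits removed
}

# How much faster the reverted kernel downloads than the forced one.
download_ratio = runs["reverted"]["download"] / runs["forced"]["download"]

# Upload rates compared the other way round, to show they are nearly equal.
upload_ratio = runs["forced"]["upload"] / runs["reverted"]["upload"]

if __name__ == "__main__":
    print(f"download, reverted vs forced: {download_ratio:.1f}x")
    print(f"upload, forced vs reverted:   {upload_ratio:.2f}x")
```

The upload rates are within about 1% of each other, while the download rate differs by a factor of roughly 36, so the degradation is on the receive side, in line with the increasing rx_misses counter.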
The same issue was reported on openSUSE Bugzilla for the 6.4.2 and 6.4.3 kernels, too: https://bugzilla.suse.com/show_bug.cgi?id=1213491 Interestingly, the pcie_aspm=force option didn't help that reporter, while reverting the commit seems to work.
Created attachment 304692 [details]
lspci -vv
Attached lspci -vv output
I'm sorry to report the same bud for Dell G3 laptop, loosing the only one NIC makes this machine unusable. :( [Št júl 20 18:33:12 2023] ------------[ cut here ]------------ [Št júl 20 18:33:12 2023] NETDEV WATCHDOG: enp60s0 (r8169): transmit queue 0 timed out 7734 ms [Št júl 20 18:33:12 2023] WARNING: CPU: 1 PID: 668 at net/sched/sch_generic.c:525 dev_watchdog+0x232/0x240 [Št júl 20 18:33:12 2023] Modules linked in: xt_LOG nf_log_syslog xt_limit xt_pkttype 8021q garp mrp rfcomm nvidia(POE) snd_ctl_led snd_hda_codec_realtek snd_hda_codec_generic xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter bridge stp llc cmac algif_hash algif_skcipher af_alg bnep snd_sof_pci_intel_cnl snd_sof_intel_hda_common soundwire_intel soundwire_cadence snd_sof_intel_hda_mlink snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils soundwire_generic_allocation soundwire_bus snd_soc_skl snd_soc_hdac_hda intel_tcc_cooling snd_hda_ext_core x86_pkg_temp_thermal snd_soc_sst_ipc intel_powerclamp coretemp snd_soc_sst_dsp snd_soc_acpi_intel_match snd_soc_acpi kvm_intel hid_multitouch i915 snd_soc_core kvm snd_compress ac97_bus irqbypass iwlmvm crct10dif_pclmul crc32_pclmul snd_pcm_dmaengine polyval_clmulni snd_hda_codec_hdmi [Št júl 20 18:33:12 2023] polyval_generic drm_buddy gf128mul mac80211 ghash_clmulni_intel sha512_ssse3 iTCO_wdt dell_laptop aesni_intel snd_hda_intel btusb intel_pmc_bxt ee1004 iTCO_vendor_support btrtl uvc i2c_algo_bit snd_intel_dspcfg crypto_simd libarc4 mei_hdcp mei_pxp snd_intel_sdw_acpi cryptd dell_wmi iwlwifi btbcm r8169 snd_hda_codec ttm i2c_i801 videobuf2_v4l2 dell_smbios rtsx_usb_ms btintel dell_wmi_sysman intel_rapl_msr realtek snd_hda_core rapl spi_nor memstick firmware_attributes_class mdio_devres ledtrig_audio dell_wmi_descriptor wmi_bmof videodev btmtk snd_hwdep 
intel_cstate dcdbas intel_wmi_thunderbolt mxm_wmi drm_display_helper processor_thermal_device_pci_legacy cfg80211 mtd libphy i2c_smbus intel_uncore psmouse bluetooth pcspkr videobuf2_common cec snd_pcm mei_me intel_lpss_pci cdc_acm mc processor_thermal_device intel_gtt ucsi_acpi intel_lpss mei processor_thermal_rfim idma64 video typec_ucsi snd_timer processor_thermal_mbox processor_thermal_rapl ecdh_generic pl2303 snd intel_rapl_common typec rfkill soundcore [Št júl 20 18:33:12 2023] i2c_hid_acpi intel_pch_thermal intel_soc_dts_iosf roles i2c_hid int3403_thermal int340x_thermal_zone intel_hid int3400_thermal wmi mousedev acpi_thermal_rel sparse_keymap joydev acpi_pad mac_hid dm_multipath sg crypto_user fuse loop dm_mod bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 rtsx_usb_sdmmc mmc_core rtsx_usb usbhid nvme serio_raw nvme_core atkbd spi_intel_pci xhci_pci crc32c_intel libps2 spi_intel nvme_common xhci_pci_renesas vivaldi_fmap i8042 serio bbswitch(OE) [last unloaded: videobuf2_memops] [Št júl 20 18:33:12 2023] CPU: 1 PID: 668 Comm: thermald Tainted: P OE 6.4.4-arch1-1 #1 655744e6f70dbd2f57b072f7158d7c5b4468b4ff [Št júl 20 18:33:12 2023] Hardware name: Dell Inc. 
G3 3779/04R93M, BIOS 1.9.0 03/15/2019 [Št júl 20 18:33:12 2023] RIP: 0010:dev_watchdog+0x232/0x240 [Št júl 20 18:33:12 2023] Code: ff ff ff 48 89 df c6 05 a7 a5 45 01 01 e8 f6 26 fa ff 45 89 f8 44 89 f1 48 89 de 48 89 c2 48 c7 c7 78 8c ec aa e8 ee 5a 54 ff <0f> 0b e9 2d ff ff ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 [Št júl 20 18:33:12 2023] RSP: 0018:ffffa148c01b4e78 EFLAGS: 00010286 [Št júl 20 18:33:12 2023] RAX: 0000000000000000 RBX: ffff915b0c738000 RCX: 0000000000000027 [Št júl 20 18:33:12 2023] RDX: ffff91625e2616c8 RSI: 0000000000000001 RDI: ffff91625e2616c0 [Št júl 20 18:33:12 2023] RBP: ffff915b0c7384c8 R08: 0000000000000000 R09: ffffa148c01b4d08 [Št júl 20 18:33:12 2023] R10: 0000000000000003 R11: ffffffffab6ca868 R12: ffff915b05b77000 [Št júl 20 18:33:12 2023] R13: ffff915b0c73841c R14: 0000000000000000 R15: 0000000000001e36 [Št júl 20 18:33:12 2023] FS: 00007f325edfc6c0(0000) GS:ffff91625e240000(0000) knlGS:0000000000000000 [Št júl 20 18:33:12 2023] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [Št júl 20 18:33:12 2023] CR2: 0000329601744010 CR3: 00000001176b0005 CR4: 00000000003706e0 [Št júl 20 18:33:12 2023] Call Trace: [Št júl 20 18:33:12 2023] <IRQ> [Št júl 20 18:33:12 2023] ? dev_watchdog+0x232/0x240 [Št júl 20 18:33:12 2023] ? __warn+0x81/0x130 [Št júl 20 18:33:12 2023] ? dev_watchdog+0x232/0x240 [Št júl 20 18:33:12 2023] ? report_bug+0x171/0x1a0 [Št júl 20 18:33:12 2023] ? prb_read_valid+0x1b/0x30 [Št júl 20 18:33:12 2023] ? handle_bug+0x3c/0x80 [Št júl 20 18:33:12 2023] ? exc_invalid_op+0x17/0x70 [Št júl 20 18:33:12 2023] ? asm_exc_invalid_op+0x1a/0x20 [Št júl 20 18:33:12 2023] ? dev_watchdog+0x232/0x240 [Št júl 20 18:33:12 2023] ? dev_watchdog+0x232/0x240 [Št júl 20 18:33:12 2023] ? __pfx_dev_watchdog+0x10/0x10 [Št júl 20 18:33:12 2023] call_timer_fn+0x24/0x130 [Št júl 20 18:33:12 2023] ? 
__pfx_dev_watchdog+0x10/0x10 [Št júl 20 18:33:12 2023] __run_timers+0x222/0x2c0 [Št júl 20 18:33:12 2023] run_timer_softirq+0x1d/0x40 [Št júl 20 18:33:12 2023] __do_softirq+0xd1/0x2c8 [Št júl 20 18:33:12 2023] __irq_exit_rcu+0xbb/0xf0 [Št júl 20 18:33:12 2023] sysvec_apic_timer_interrupt+0x72/0x90 [Št júl 20 18:33:12 2023] </IRQ> [Št júl 20 18:33:12 2023] <TASK> [Št júl 20 18:33:12 2023] asm_sysvec_apic_timer_interrupt+0x1a/0x20 [Št júl 20 18:33:12 2023] RIP: 0010:acpi_ns_search_one_scope+0x6f/0x250 [Št júl 20 18:33:12 2023] Code: 04 0f 85 84 00 00 00 4c 8b 65 18 4d 85 e4 0f 84 c0 00 00 00 8b 44 24 04 4c 89 e3 eb 0d 48 8b 5b 20 48 85 db 0f 84 94 00 00 00 <39> 43 0c 75 ee 48 89 df e8 24 08 00 00 83 f8 16 75 03 48 8b 1b f6 [Št júl 20 18:33:12 2023] RSP: 0018:ffffa148c4b0f9e8 EFLAGS: 00000286 [Št júl 20 18:33:12 2023] RAX: 00000000584d4345 RBX: ffff915b01acb750 RCX: 0000000000000010 [Št júl 20 18:33:12 2023] RDX: ffffffffaa94ef50 RSI: ffffffffaa94ef30 RDI: ffffa148c4b0f9d0 [Št júl 20 18:33:12 2023] RBP: ffffffffac161140 R08: 0000000000000005 R09: 0000000000000003 [Št júl 20 18:33:12 2023] R10: 0000000000000042 R11: ffffffffac161140 R12: ffff915b001ee660 [Št júl 20 18:33:12 2023] R13: 0000000000000000 R14: ffffa148c4b0fac0 R15: 0000000000000005 [Št júl 20 18:33:12 2023] ? 
acpi_ns_search_one_scope+0x3f/0x250 [Št júl 20 18:33:12 2023] acpi_ns_search_and_enter+0x332/0x570 [Št júl 20 18:33:12 2023] acpi_ns_lookup+0x49a/0xa70 [Št júl 20 18:33:12 2023] acpi_ps_get_next_namepath+0x9d/0x390 [Št júl 20 18:33:12 2023] acpi_ps_get_next_arg+0xd7/0x910 [Št júl 20 18:33:12 2023] acpi_ps_parse_loop+0x45e/0xa30 [Št júl 20 18:33:12 2023] acpi_ps_parse_aml+0x221/0x5e0 [Št júl 20 18:33:12 2023] acpi_ps_execute_method+0x171/0x3e0 [Št júl 20 18:33:12 2023] acpi_ns_evaluate+0x174/0x5d0 [Št júl 20 18:33:12 2023] acpi_evaluate_object+0x16f/0x450 [Št júl 20 18:33:12 2023] acpi_evaluate_integer+0x6f/0x130 [Št júl 20 18:33:12 2023] int340x_thermal_get_zone_temp+0x4a/0xb0 [int340x_thermal_zone c9ebf538f873cd311f4997ede84c93646b5df8e3] [Št júl 20 18:33:12 2023] __thermal_zone_get_temp+0x1e/0x60 [Št júl 20 18:33:12 2023] ? __kmem_cache_alloc_node+0x18d/0x310 [Št júl 20 18:33:12 2023] thermal_zone_get_temp+0x6d/0x90 [Št júl 20 18:33:12 2023] temp_show+0x37/0x70 [Št júl 20 18:33:12 2023] dev_attr_show+0x19/0x60 [Št júl 20 18:33:12 2023] sysfs_kf_seq_show+0xa8/0x100 [Št júl 20 18:33:12 2023] seq_read_iter+0x120/0x480 [Št júl 20 18:33:12 2023] vfs_read+0x1f3/0x320 [Št júl 20 18:33:12 2023] ksys_read+0x6f/0xf0 [Št júl 20 18:33:12 2023] do_syscall_64+0x5d/0x90 [Št júl 20 18:33:12 2023] ? syscall_exit_to_user_mode+0x1b/0x40 [Št júl 20 18:33:12 2023] ? do_syscall_64+0x6c/0x90 [Št júl 20 18:33:12 2023] ? do_syscall_64+0x6c/0x90 [Št júl 20 18:33:12 2023] ? 
do_syscall_64+0x6c/0x90 [Št júl 20 18:33:12 2023] entry_SYSCALL_64_after_hwframe+0x72/0xdc [Št júl 20 18:33:12 2023] RIP: 0033:0x7f326330fb5c [Št júl 20 18:33:12 2023] Code: ec 28 48 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 89 9c f8 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 34 44 89 c7 48 89 44 24 08 e8 df 9c f8 ff 48 [Št júl 20 18:33:12 2023] RSP: 002b:00007f325edfa4c0 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [Št júl 20 18:33:12 2023] RAX: ffffffffffffffda RBX: 00007f325edfa660 RCX: 00007f326330fb5c [Št júl 20 18:33:12 2023] RDX: 0000000000001fff RSI: 00007f3248001500 RDI: 000000000000000d [Št júl 20 18:33:12 2023] RBP: 0000000000001fff R08: 0000000000000000 R09: 0000000000000001 [Št júl 20 18:33:12 2023] R10: 0000000000000004 R11: 0000000000000246 R12: 00007f3248001500 [Št júl 20 18:33:12 2023] R13: 00007f325edfa6c8 R14: 00007f325edfa650 R15: 00005647dbed2950 [Št júl 20 18:33:12 2023] </TASK> [Št júl 20 18:33:12 2023] ---[ end trace 0000000000000000 ]--- [Št júl 20 18:33:14 2023] pcieport 0000:00:1d.5: Data Link Layer Link Active not set in 1000 msec [Št júl 20 18:33:14 2023] r8169 0000:3c:00.0 enp60s0: Can't reset secondary PCI bus, detach NIC [Pi júl 21 14:35:57 2023] perf: interrupt took too long (3143 > 3128), lowering kernel.perf_event_max_sample_rate to 63600 (In reply to Milan Oravec from comment #46) > I'm sorry to report the same bud for Dell G3 laptop, loosing the only one > NIC makes this machine unusable. 
:( > > [Št júl 20 18:33:12 2023] ------------[ cut here ]------------ > [Št júl 20 18:33:12 2023] NETDEV WATCHDOG: enp60s0 (r8169): transmit queue 0 > timed out 7734 ms > [Št júl 20 18:33:12 2023] WARNING: CPU: 1 PID: 668 at > net/sched/sch_generic.c:525 dev_watchdog+0x232/0x240 > [Št júl 20 18:33:12 2023] Modules linked in: xt_LOG nf_log_syslog xt_limit > xt_pkttype 8021q garp mrp rfcomm nvidia(POE) snd_ctl_led > snd_hda_codec_realtek snd_hda_codec_generic xt_CHECKSUM xt_MASQUERADE > xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle > ip6table_nat ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat > nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter bridge > stp llc cmac algif_hash algif_skcipher af_alg bnep snd_sof_pci_intel_cnl > snd_sof_intel_hda_common soundwire_intel soundwire_cadence > snd_sof_intel_hda_mlink snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp > snd_sof snd_sof_utils soundwire_generic_allocation soundwire_bus snd_soc_skl > snd_soc_hdac_hda intel_tcc_cooling snd_hda_ext_core x86_pkg_temp_thermal > snd_soc_sst_ipc intel_powerclamp coretemp snd_soc_sst_dsp > snd_soc_acpi_intel_match snd_soc_acpi kvm_intel hid_multitouch i915 > snd_soc_core kvm snd_compress ac97_bus irqbypass iwlmvm crct10dif_pclmul > crc32_pclmul snd_pcm_dmaengine polyval_clmulni snd_hda_codec_hdmi > [Št júl 20 18:33:12 2023] polyval_generic drm_buddy gf128mul mac80211 > ghash_clmulni_intel sha512_ssse3 iTCO_wdt dell_laptop aesni_intel > snd_hda_intel btusb intel_pmc_bxt ee1004 iTCO_vendor_support btrtl uvc > i2c_algo_bit snd_intel_dspcfg crypto_simd libarc4 mei_hdcp mei_pxp > snd_intel_sdw_acpi cryptd dell_wmi iwlwifi btbcm r8169 snd_hda_codec ttm > i2c_i801 videobuf2_v4l2 dell_smbios rtsx_usb_ms btintel dell_wmi_sysman > intel_rapl_msr realtek snd_hda_core rapl spi_nor memstick > firmware_attributes_class mdio_devres ledtrig_audio dell_wmi_descriptor > wmi_bmof videodev btmtk snd_hwdep intel_cstate dcdbas intel_wmi_thunderbolt > mxm_wmi 
drm_display_helper processor_thermal_device_pci_legacy cfg80211 mtd > libphy i2c_smbus intel_uncore psmouse bluetooth pcspkr videobuf2_common cec > snd_pcm mei_me intel_lpss_pci cdc_acm mc processor_thermal_device intel_gtt > ucsi_acpi intel_lpss mei processor_thermal_rfim idma64 video typec_ucsi > snd_timer processor_thermal_mbox processor_thermal_rapl ecdh_generic pl2303 > snd intel_rapl_common typec rfkill soundcore > [Št júl 20 18:33:12 2023] i2c_hid_acpi intel_pch_thermal intel_soc_dts_iosf > roles i2c_hid int3403_thermal int340x_thermal_zone intel_hid int3400_thermal > wmi mousedev acpi_thermal_rel sparse_keymap joydev acpi_pad mac_hid > dm_multipath sg crypto_user fuse loop dm_mod bpf_preload ip_tables x_tables > ext4 crc32c_generic crc16 mbcache jbd2 rtsx_usb_sdmmc mmc_core rtsx_usb > usbhid nvme serio_raw nvme_core atkbd spi_intel_pci xhci_pci crc32c_intel > libps2 spi_intel nvme_common xhci_pci_renesas vivaldi_fmap i8042 serio > bbswitch(OE) [last unloaded: videobuf2_memops] > [Št júl 20 18:33:12 2023] CPU: 1 PID: 668 Comm: thermald Tainted: P OE > 6.4.4-arch1-1 #1 655744e6f70dbd2f57b072f7158d7c5b4468b4ff > [Št júl 20 18:33:12 2023] Hardware name: Dell Inc. 
G3 3779/04R93M, BIOS > 1.9.0 03/15/2019 > [Št júl 20 18:33:12 2023] RIP: 0010:dev_watchdog+0x232/0x240 > [Št júl 20 18:33:12 2023] Code: ff ff ff 48 89 df c6 05 a7 a5 45 01 01 e8 f6 > 26 fa ff 45 89 f8 44 89 f1 48 89 de 48 89 c2 48 c7 c7 78 8c ec aa e8 ee 5a > 54 ff <0f> 0b e9 2d ff ff ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 > [Št júl 20 18:33:12 2023] RSP: 0018:ffffa148c01b4e78 EFLAGS: 00010286 > [Št júl 20 18:33:12 2023] RAX: 0000000000000000 RBX: ffff915b0c738000 RCX: > 0000000000000027 > [Št júl 20 18:33:12 2023] RDX: ffff91625e2616c8 RSI: 0000000000000001 RDI: > ffff91625e2616c0 > [Št júl 20 18:33:12 2023] RBP: ffff915b0c7384c8 R08: 0000000000000000 R09: > ffffa148c01b4d08 > [Št júl 20 18:33:12 2023] R10: 0000000000000003 R11: ffffffffab6ca868 R12: > ffff915b05b77000 > [Št júl 20 18:33:12 2023] R13: ffff915b0c73841c R14: 0000000000000000 R15: > 0000000000001e36 > [Št júl 20 18:33:12 2023] FS: 00007f325edfc6c0(0000) > GS:ffff91625e240000(0000) knlGS:0000000000000000 > [Št júl 20 18:33:12 2023] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [Št júl 20 18:33:12 2023] CR2: 0000329601744010 CR3: 00000001176b0005 CR4: > 00000000003706e0 > [Št júl 20 18:33:12 2023] Call Trace: > [Št júl 20 18:33:12 2023] <IRQ> > [Št júl 20 18:33:12 2023] ? dev_watchdog+0x232/0x240 > [Št júl 20 18:33:12 2023] ? __warn+0x81/0x130 > [Št júl 20 18:33:12 2023] ? dev_watchdog+0x232/0x240 > [Št júl 20 18:33:12 2023] ? report_bug+0x171/0x1a0 > [Št júl 20 18:33:12 2023] ? prb_read_valid+0x1b/0x30 > [Št júl 20 18:33:12 2023] ? handle_bug+0x3c/0x80 > [Št júl 20 18:33:12 2023] ? exc_invalid_op+0x17/0x70 > [Št júl 20 18:33:12 2023] ? asm_exc_invalid_op+0x1a/0x20 > [Št júl 20 18:33:12 2023] ? dev_watchdog+0x232/0x240 > [Št júl 20 18:33:12 2023] ? dev_watchdog+0x232/0x240 > [Št júl 20 18:33:12 2023] ? __pfx_dev_watchdog+0x10/0x10 > [Št júl 20 18:33:12 2023] call_timer_fn+0x24/0x130 > [Št júl 20 18:33:12 2023] ? 
__pfx_dev_watchdog+0x10/0x10 > [Št júl 20 18:33:12 2023] __run_timers+0x222/0x2c0 > [Št júl 20 18:33:12 2023] run_timer_softirq+0x1d/0x40 > [Št júl 20 18:33:12 2023] __do_softirq+0xd1/0x2c8 > [Št júl 20 18:33:12 2023] __irq_exit_rcu+0xbb/0xf0 > [Št júl 20 18:33:12 2023] sysvec_apic_timer_interrupt+0x72/0x90 > [Št júl 20 18:33:12 2023] </IRQ> > [Št júl 20 18:33:12 2023] <TASK> > [Št júl 20 18:33:12 2023] asm_sysvec_apic_timer_interrupt+0x1a/0x20 > [Št júl 20 18:33:12 2023] RIP: 0010:acpi_ns_search_one_scope+0x6f/0x250 > [Št júl 20 18:33:12 2023] Code: 04 0f 85 84 00 00 00 4c 8b 65 18 4d 85 e4 0f > 84 c0 00 00 00 8b 44 24 04 4c 89 e3 eb 0d 48 8b 5b 20 48 85 db 0f 84 94 00 > 00 00 <39> 43 0c 75 ee 48 89 df e8 24 08 00 00 83 f8 16 75 03 48 8b 1b f6 > [Št júl 20 18:33:12 2023] RSP: 0018:ffffa148c4b0f9e8 EFLAGS: 00000286 > [Št júl 20 18:33:12 2023] RAX: 00000000584d4345 RBX: ffff915b01acb750 RCX: > 0000000000000010 > [Št júl 20 18:33:12 2023] RDX: ffffffffaa94ef50 RSI: ffffffffaa94ef30 RDI: > ffffa148c4b0f9d0 > [Št júl 20 18:33:12 2023] RBP: ffffffffac161140 R08: 0000000000000005 R09: > 0000000000000003 > [Št júl 20 18:33:12 2023] R10: 0000000000000042 R11: ffffffffac161140 R12: > ffff915b001ee660 > [Št júl 20 18:33:12 2023] R13: 0000000000000000 R14: ffffa148c4b0fac0 R15: > 0000000000000005 > [Št júl 20 18:33:12 2023] ? 
acpi_ns_search_one_scope+0x3f/0x250 > [Št júl 20 18:33:12 2023] acpi_ns_search_and_enter+0x332/0x570 > [Št júl 20 18:33:12 2023] acpi_ns_lookup+0x49a/0xa70 > [Št júl 20 18:33:12 2023] acpi_ps_get_next_namepath+0x9d/0x390 > [Št júl 20 18:33:12 2023] acpi_ps_get_next_arg+0xd7/0x910 > [Št júl 20 18:33:12 2023] acpi_ps_parse_loop+0x45e/0xa30 > [Št júl 20 18:33:12 2023] acpi_ps_parse_aml+0x221/0x5e0 > [Št júl 20 18:33:12 2023] acpi_ps_execute_method+0x171/0x3e0 > [Št júl 20 18:33:12 2023] acpi_ns_evaluate+0x174/0x5d0 > [Št júl 20 18:33:12 2023] acpi_evaluate_object+0x16f/0x450 > [Št júl 20 18:33:12 2023] acpi_evaluate_integer+0x6f/0x130 > [Št júl 20 18:33:12 2023] int340x_thermal_get_zone_temp+0x4a/0xb0 > [int340x_thermal_zone c9ebf538f873cd311f4997ede84c93646b5df8e3] > [Št júl 20 18:33:12 2023] __thermal_zone_get_temp+0x1e/0x60 > [Št júl 20 18:33:12 2023] ? __kmem_cache_alloc_node+0x18d/0x310 > [Št júl 20 18:33:12 2023] thermal_zone_get_temp+0x6d/0x90 > [Št júl 20 18:33:12 2023] temp_show+0x37/0x70 > [Št júl 20 18:33:12 2023] dev_attr_show+0x19/0x60 > [Št júl 20 18:33:12 2023] sysfs_kf_seq_show+0xa8/0x100 > [Št júl 20 18:33:12 2023] seq_read_iter+0x120/0x480 > [Št júl 20 18:33:12 2023] vfs_read+0x1f3/0x320 > [Št júl 20 18:33:12 2023] ksys_read+0x6f/0xf0 > [Št júl 20 18:33:12 2023] do_syscall_64+0x5d/0x90 > [Št júl 20 18:33:12 2023] ? syscall_exit_to_user_mode+0x1b/0x40 > [Št júl 20 18:33:12 2023] ? do_syscall_64+0x6c/0x90 > [Št júl 20 18:33:12 2023] ? do_syscall_64+0x6c/0x90 > [Št júl 20 18:33:12 2023] ? 
do_syscall_64+0x6c/0x90 > [Št júl 20 18:33:12 2023] entry_SYSCALL_64_after_hwframe+0x72/0xdc > [Št júl 20 18:33:12 2023] RIP: 0033:0x7f326330fb5c > [Št júl 20 18:33:12 2023] Code: ec 28 48 89 54 24 18 48 89 74 24 10 89 7c 24 > 08 e8 89 9c f8 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 31 c0 > 0f 05 <48> 3d 00 f0 ff ff 77 34 44 89 c7 48 89 44 24 08 e8 df 9c f8 ff 48 > [Št júl 20 18:33:12 2023] RSP: 002b:00007f325edfa4c0 EFLAGS: 00000246 > ORIG_RAX: 0000000000000000 > [Št júl 20 18:33:12 2023] RAX: ffffffffffffffda RBX: 00007f325edfa660 RCX: > 00007f326330fb5c > [Št júl 20 18:33:12 2023] RDX: 0000000000001fff RSI: 00007f3248001500 RDI: > 000000000000000d > [Št júl 20 18:33:12 2023] RBP: 0000000000001fff R08: 0000000000000000 R09: > 0000000000000001 > [Št júl 20 18:33:12 2023] R10: 0000000000000004 R11: 0000000000000246 R12: > 00007f3248001500 > [Št júl 20 18:33:12 2023] R13: 00007f325edfa6c8 R14: 00007f325edfa650 R15: > 00005647dbed2950 > [Št júl 20 18:33:12 2023] </TASK> > [Št júl 20 18:33:12 2023] ---[ end trace 0000000000000000 ]--- > [Št júl 20 18:33:14 2023] pcieport 0000:00:1d.5: Data Link Layer Link Active > not set in 1000 msec > [Št júl 20 18:33:14 2023] r8169 0000:3c:00.0 enp60s0: Can't reset secondary > PCI bus, detach NIC > [Pi júl 21 14:35:57 2023] perf: interrupt took too long (3143 > 3128), > lowering kernel.perf_event_max_sample_rate to 63600

It may be fixed in kernel 6.5; I saw that the offending patches were reverted there. https://github.com/torvalds/linux/commits/master/drivers/net/ethernet/realtek/r8169_main.c If you use Arch, I think you can downgrade the kernel to 6.3.9 and hold off on updating until kernel 6.5 is released. :)

Thank you for the information. I've tried the pcie_aspm=force option, but no luck either; the NIC was lost 6 hours after boot. :( Yes, I'm on Arch and will switch back to the 6.3 tree, as this is the only known solution so far.

Looks like kernel 6.4.7 will include both bug patches that will be in 6.5 :) https://lore.kernel.org/all/20230725104514.821564989@linuxfoundation.org/

(In reply to edwin.frank.loeffler from comment #49)
> Looks like kernel 6.4.7 will include both bug patches that will be in 6.5 :)
>
> https://lore.kernel.org/all/20230725104514.821564989@linuxfoundation.org/

It doesn't fix it. I built 6.4.7 and got the same issue. I had to go back to a patched 6.4.6 to get my offboard NIC working.

(In reply to Adilson Dantas from comment #50)
> (In reply to edwin.frank.loeffler from comment #49)
> > Looks like kernel 6.4.7 will include both bug patches that will be in 6.5 :)
> >
> > https://lore.kernel.org/all/20230725104514.821564989@linuxfoundation.org/
>
> It doesn't fix it. I built 6.4.7 and got the same issue. I had to go back to
> a patched 6.4.6 to get my offboard NIC working.

Ah yes, apologies. 6.4.7 still seems to be missing commit cf2ffde.

(In reply to Edwin from comment #51)
> (In reply to Adilson Dantas from comment #50)
> > (In reply to edwin.frank.loeffler from comment #49)
> > > Looks like kernel 6.4.7 will include both bug patches that will be in 6.5 :)
> > >
> > > https://lore.kernel.org/all/20230725104514.821564989@linuxfoundation.org/
> >
> > It doesn't fix it. I built 6.4.7 and got the same issue. I had to go back to
> > a patched 6.4.6 to get my offboard NIC working.
>
> Ah yes, apologies. 6.4.7 still seems to be missing commit cf2ffde.

And I confirm it. I applied this commit and now I can use 6.4.7 without any issues.

6.4.8 was released today and really fixes this issue for me.

From the changelog:

r8169: revert 2ab19de62d67 ("r8169: remove ASPM restrictions now that ASPM is disabled during NAPI poll")

commit cf2ffdea0839398cb0551762af7f5efb0a6e0fea upstream.

There have been reports that on a number of systems this change breaks network connectivity. Therefore effectively revert it. Mainly affected seem to be systems where BIOS denies ASPM access to OS.
Due to later changes we can't do a direct revert.

Can anyone who was affected by this bug test whether the same errors still occur before I close this bug report?

(In reply to Adilson Dantas from comment #53)
> 6.4.8 was released today and really fixes this issue for me.
>
> From the changelog:
>
> r8169: revert 2ab19de62d67 ("r8169: remove ASPM restrictions now that
> ASPM is disabled during NAPI poll")
>
> commit cf2ffdea0839398cb0551762af7f5efb0a6e0fea upstream.
>
> There have been reports that on a number of systems this change breaks
> network connectivity. Therefore effectively revert it. Mainly affected
> seem to be systems where BIOS denies ASPM access to OS.
> Due to later changes we can't do a direct revert.
>
>
> Can anyone who was affected by this bug test whether the same errors still
> occur before I close this bug report?

Yes, it looks to be fixed. Thanks.

Since there are no more replies, I will consider this bug fixed as of 6.4.8, so I'm closing this bug report.

It seems that, as expected, this is causing higher power consumption on some systems which handled ASPM well before (see e.g. https://forum.odroid.com/viewtopic.php?p=377001&sid=03fc5da54400c7a76a88003cd7212117#p377001 for the odroid H3+, which draws ~2 W at idle; power consumption seems to increase by about 3 W with ASPM disabled for the Realtek cards, i.e. more than twice the idle power draw after the kernel upgrade). Apparently, users suggest kernel downgrades and have not investigated the issue further. Sadly, it seems that this change cannot even be overridden with pcie_aspm=force on systems where ASPM worked fine for the NICs before. Is this expected?

As long as the BIOS allows it, you can use the standard sysfs attributes to enable/disable each individual ASPM state. Please also verify that EEE is enabled.

Thanks, confirmed to work, both with older kernels (e.g. 6.1) and the most recent ones. I was completely unaware of the sysfs override until now.
I'll let the odroid community know, too, and indeed enabling EEE saves even more. Thanks!
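
For anyone landing here later, the sysfs override mentioned above can be sketched roughly as follows. This is only an illustration and was not posted in this thread: the helper names (`link_dir`, `read_aspm_states`, `set_aspm_state`) are made up for the example, the attribute names are the per-device `link/` files described in the kernel's sysfs-bus-pci ABI documentation, and `0000:05:00.0` is simply the address from the original report. Writing needs root and, as noted above, only works when the BIOS lets the OS control ASPM.

```python
from pathlib import Path

# Per-device ASPM attributes under <device>/link/ (names taken from the
# kernel's sysfs-bus-pci ABI documentation). Not every device exposes all.
ASPM_ATTRS = ("l0s_aspm", "l1_aspm", "l1_1_aspm", "l1_2_aspm")


def link_dir(bdf, sysfs_root="/sys"):
    """Return the 'link' directory for a PCI address such as 0000:05:00.0."""
    return Path(sysfs_root) / "bus" / "pci" / "devices" / bdf / "link"


def read_aspm_states(bdf, sysfs_root="/sys"):
    """Map each ASPM attribute that exists to True (enabled) or False."""
    states = {}
    for name in ASPM_ATTRS:
        attr = link_dir(bdf, sysfs_root) / name
        if attr.is_file():
            states[name] = attr.read_text().strip() == "1"
    return states


def set_aspm_state(bdf, name, enabled, sysfs_root="/sys"):
    """Enable or disable one ASPM state by writing 1 or 0 to its attribute.

    Raises OSError if the kernel rejects the write, for example when not
    running as root.
    """
    if name not in ASPM_ATTRS:
        raise ValueError(f"unknown ASPM attribute: {name}")
    attr = link_dir(bdf, sysfs_root) / name
    if not attr.is_file():
        raise FileNotFoundError(f"{attr} is not exposed for this device")
    attr.write_text("1" if enabled else "0")
```

Usage would look like `set_aspm_state("0000:05:00.0", "l1_aspm", True)` followed by `read_aspm_states("0000:05:00.0")` to check the result. EEE is a separate setting and is usually inspected with `ethtool --show-eee <interface>`.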