Bug 217596

Summary: r8169: One network adapter stops working after update to 6.4
Product: Drivers Reporter: Adilson Dantas (adilson)
Component: Network Assignee: drivers_network (drivers_network)
Status: RESOLVED CODE_FIX    
Severity: normal CC: adilson, bagasdotme, demitriusbelai, dev, edwin.frank.loeffler, hkallweit1, jmandawg, mauro.orlic, moravec, o.freyermuth, oyvinds, qasim.majeed20, qwe2968, tiwai, y_satou
Priority: P3    
Hardware: All   
OS: Linux   
Kernel Version: 6.4 Subsystem:
Regression: No Bisected commit-id:
Attachments: attachment-21682-0.html
attachment-26880-0.html
attachment-2814-0.html
dmesg
lspci -vv
dmesg output from Dell OptiPlex3000 ThinClient
lspci -vv output from Dell OptiPlex3000 ThinClient
dmesg from OptiPlex 3080
lspci from OptiPlex 3080
dmesg.txt
lspci.txt
lspci -vv (by root) output from Dell OptiPlex3000 ThinClient
dmesg output from Dell OptiPlex3000 ThinClient with pcie_aspm=force
lspci -vv (by root) output from Dell OptiPlex3000 ThinClient with pcie_aspm=force
dmesg-01.txt
lspci-01.txt
journalctl-root-optiplex3060
iperf test result on Dell OptiPlex3000 ThinClient kernel 6.4.3
iperf test result on Dell OptiPlex3000 ThinClient kernel 6.4.3 with removing ASPM related commits
lspci -vv

Description Adilson Dantas 2023-06-26 15:09:13 UTC
I recently compiled kernel 6.4 and, after rebooting, the offboard network adapter fails to work with this message:

Jun 26 10:31:58 yoda kernel: [  335.867894] r8169 0000:05:00.0 eth1: rtl_ocp_gphy_cond == 1 (loop: 10, delay: 25).
Jun 26 10:31:58 yoda kernel: [  335.890061] r8169 0000:05:00.0 eth1: rtl_ocp_gphy_cond == 1 (loop: 10, delay: 25).
Jun 26 10:31:58 yoda kernel: [  335.917102] r8169 0000:05:00.0 eth1: rtl_ocp_gphy_cond == 1 (loop: 10, delay: 25).
Jun 26 10:31:58 yoda kernel: [  335.920085] r8169 0000:05:00.0 eth1: rtl_ocp_gphy_cond == 1 (loop: 10, delay: 25).
Jun 26 10:31:58 yoda kernel: [  335.920699] r8169 0000:05:00.0 eth1: rtl_ocp_gphy_cond == 1 (loop: 10, delay: 25).
Jun 26 10:31:58 yoda kernel: [  335.921271] r8169 0000:05:00.0 eth1: rtl_ocp_gphy_cond == 1 (loop: 10, delay: 25).
Jun 26 10:31:58 yoda kernel: [  335.921832] r8169 0000:05:00.0 eth1: rtl_ocp_gphy_cond == 1 (loop: 10, delay: 25).
Jun 26 10:31:58 yoda kernel: [  335.922403] r8169 0000:05:00.0 eth1: rtl_ocp_gphy_cond == 1 (loop: 10, delay: 25).
Jun 26 10:31:58 yoda kernel: [  335.922975] r8169 0000:05:00.0 eth1: rtl_ocp_gphy_cond == 1 (loop: 10, delay: 25).
Jun 26 10:31:58 yoda kernel: [  335.923547] r8169 0000:05:00.0 eth1: rtl_ocp_gphy_cond == 1 (loop: 10, delay: 25).
Jun 26 10:32:00 yoda kernel: [  337.546188] r8169 0000:05:00.0 eth1: Link is Down

My hardware: 
04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 16) -> Onboard Ethernet. It works without any problem. 
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15) -> TP-LINK offboard. This one is affected by the problem.

This problem does not occur with 6.3.9 and below.
Comment 1 Bagas Sanjaya 2023-06-27 01:32:31 UTC
(In reply to Adilson Dantas from comment #0)

Can you perform bisection between v6.3 and v6.4 to find the culprit?
Comment 2 Adilson Dantas 2023-06-27 10:45:57 UTC
Created attachment 304484 [details]
attachment-21682-0.html

I have built 6.4-rc1 and I got the same error.

Comment 3 Bagas Sanjaya 2023-06-27 12:53:33 UTC
On 6/27/23 17:45, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=217596
> 
> --- Comment #2 from Adilson Dantas (adilson@adilson.net.br) ---
> I have built 6.4-rc1 and I got the same error.
> 

Have you done v6.3..v6.4 bisection as I have requested earlier?
Comment 4 Adilson Dantas 2023-06-27 18:20:20 UTC
Created attachment 304489 [details]
attachment-26880-0.html

On Tue, Jun 27, 2023 at 09:53, <bugzilla-daemon@kernel.org> wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=217596
>
> --- Comment #3 from Bagas Sanjaya (bagasdotme@gmail.com) ---
> On 6/27/23 17:45, bugzilla-daemon@kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=217596
> >
> > --- Comment #2 from Adilson Dantas (adilson@adilson.net.br) ---
> > I have built 6.4-rc1 and I got the same error.
> >
>
> Have you done v6.3..v6.4 bisection as I have requested earlier?
>

I bisected between v6.3 and v6.4-rc1. Apparently none of the bisection steps
has any relation to r8169, except the first one.

# bad: [ac9a78681b921877518763ba0e89202254349d1b] Linux 6.4-rc1
# good: [457391b0380335d5e9a5babdec90ac53928b23b4] Linux 6.3
git bisect start 'ac9a78681b92' '457391b03803'


# good: [6e98b09da931a00bf4e0477d0fa52748bf28fcce] Merge tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
git bisect good 6e98b09da931a00bf4e0477d0fa52748bf28fcce - Maybe

- RealTek (r8169): refactor to address ASPM issues during NAPI poll

# good: [70cc1b5307e8ee3076fdf2ecbeb89eb973aa0ff7] Merge tag 'powerpc-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
git bisect good 70cc1b5307e8ee3076fdf2ecbeb89eb973aa0ff7
# good: [865fdb08197e657c59e74a35fa32362b12397f58] Merge tag 'input-for-v6.4-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
git bisect good 865fdb08197e657c59e74a35fa32362b12397f58
# good: [78b421b6a7c6dbb6a213877c742af52330f5026d] Merge tag 'linux-watchdog-6.4-rc1' of git://www.linux-watchdog.org/linux-watchdog
git bisect good 78b421b6a7c6dbb6a213877c742af52330f5026d
# good: [0e20f4311254193fbf9eebafb4dc5c922a885397] perf script: Print raw ip instead of binary offset for callchain
git bisect good 0e20f4311254193fbf9eebafb4dc5c922a885397
# good: [ed23734c23d2fc1e6a1ff80f8c2b82faeed0ed0c] Merge tag 'net-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
git bisect good ed23734c23d2fc1e6a1ff80f8c2b82faeed0ed0c
# good: [994e2419f1e77724479f0ffd5ad4eeae060dec95] nfs: fix mis-merged __filemap_get_folio() error check
git bisect good 994e2419f1e77724479f0ffd5ad4eeae060dec95
# good: [1c1094e47ef10be267a982fb1c69dbb80aa4f257] Merge tag 'mailbox-v6.4' of git://git.linaro.org/landing-teams/working/fujitsu/integration
git bisect good 1c1094e47ef10be267a982fb1c69dbb80aa4f257
# good: [6f69c981811c8b019d7882839e31c34ea8330860] Merge tag 'v6.4-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
git bisect good 6f69c981811c8b019d7882839e31c34ea8330860
# good: [ecc68ee216c6c5b2f84915e1441adf436f1b019b] perf stat: Separate bperf from bpf_profiler
git bisect good ecc68ee216c6c5b2f84915e1441adf436f1b019b
# good: [9a2d5178b9d51e1c5f9e08989ff97fc8d4893f31] Revert "perf build: Make BUILD_BPF_SKEL default, rename to NO_BPF_SKEL"
git bisect good 9a2d5178b9d51e1c5f9e08989ff97fc8d4893f31
# good: [17784de648be93b4eef0ef8fe28a16ff04feecc7] Merge tag 'core-debugobjects-2023-05-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 17784de648be93b4eef0ef8fe28a16ff04feecc7
# good: [f085df1be60abf670315c11036261cfaec16b2eb] Merge tag 'perf-tools-for-v6.4-3-2023-05-06' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
git bisect good f085df1be60abf670315c11036261cfaec16b2eb
# first bad commit: [ac9a78681b921877518763ba0e89202254349d1b] Linux 6.4-rc1


I don't know how to find the right commit with git, since the first bad
commit doesn't touch r8169_main.c. So I copied this file from v6.3.9 into
v6.4 and it worked.
Maybe it is a regression in ASPM or in other code related to the first
bisection step.
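
One way to narrow this down, sketched here and not taken from the thread, is to let git consider only the commits that touch the driver file. The helper name commits_touching is made up for illustration, and a checkout that contains both refs is assumed.

```shell
# List the non-merge commits between two refs that touch one path.
# Usage: commits_touching <good-ref> <bad-ref> <path>
# (hypothetical helper; it only wraps "git log" with a path limit)
commits_touching() {
    git log --oneline --no-merges "$1..$2" -- "$3"
}
```

For example, `commits_touching v6.3 v6.4-rc1 drivers/net/ethernet/realtek/r8169_main.c` would list the candidate driver commits. git bisect accepts the same path limit, as in `git bisect start v6.4-rc1 v6.3 -- drivers/net/ethernet/realtek/r8169_main.c`, which should keep the bisection on commits that change that file.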


Comment 5 Qasim 2023-06-28 14:32:37 UTC
You can try enabling the CONFIG_PSTORE_RAM option in the kernel configuration.
It helps collect the logs of the previous boot after a soft reboot. You can retrieve the logs from /sys/fs/pstore/.
Comment 6 Adilson Dantas 2023-07-02 14:15:16 UTC
This bug is still present in 6.4.1, as you can see here:

https://imgur.com/a/xbpSZGp


The only workaround, for now, is to copy r8169_main.c from v6.3.9 into v6.4.1 before building.
Comment 7 Jay Mann 2023-07-04 15:03:08 UTC
Hi, after upgrading the kernel to 6.4.1 I'm having a problem with the NIC that I pass through (macvtap) to a virtual machine.

Reverting to kernel 6.3.9 resolves the issue.


[  540.441287] ------------[ cut here ]------------
[  540.441293] NETDEV WATCHDOG: netwan0 (r8169): transmit queue 0 timed out 5430 ms
[  540.441314] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x232/0x240
[  540.441325] Modules linked in: vhost_net tun vhost vhost_iotlb macvtap macvlan tap exfat wireguard curve25519_x86_64 libchacha
[  540.441390]  hid_holtek_mouse snd_soc_core usbnet snd_compress ac97_bus kvm snd_hda_codec_realtek snd_hda_codec_generic snd_pc
[  540.441472] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 6.4.1-arch1-1 #1 cf145a0250459022493747c0d1c289a70a2c7109
[  540.441476] Hardware name: Dell Inc. Wyse 5070 Thin Client/0K6VXP, BIOS 1.14.0 11/11/2021
[  540.441477] RIP: 0010:dev_watchdog+0x232/0x240
[  540.441482] Code: ff ff ff 48 89 df c6 05 04 b6 45 01 01 e8 c6 28 fa ff 45 89 f8 44 89 f1 48 89 de 48 89 c2 48 c7 c7 60 59 6c
[  540.441484] RSP: 0018:ffffa9eb00194e78 EFLAGS: 00010286
[  540.441487] RAX: 0000000000000000 RBX: ffff9a9c4abdc000 RCX: 0000000000000027
[  540.441489] RDX: ffff9a9fafda16c8 RSI: 0000000000000001 RDI: ffff9a9fafda16c0
[  540.441490] RBP: ffff9a9c4abdc4c8 R08: 0000000000000000 R09: ffffa9eb00194d08
[  540.441492] R10: 0000000000000003 R11: ffffffff95eca868 R12: ffff9a9c40eb2400
[  540.441493] R13: ffff9a9c4abdc41c R14: 0000000000000000 R15: 0000000000001536
[  540.441495] FS:  0000000000000000(0000) GS:ffff9a9fafd80000(0000) knlGS:0000000000000000
[  540.441497] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  540.441498] CR2: 00007fc72432f5d0 CR3: 00000001cfe20000 CR4: 0000000000352ee0
[  540.441500] Call Trace:
[  540.441504]  <IRQ>
[  540.441505]  ? dev_watchdog+0x232/0x240
[  540.441509]  ? __warn+0x81/0x130
[  540.441516]  ? dev_watchdog+0x232/0x240
[  540.441520]  ? report_bug+0x171/0x1a0
[  540.441525]  ? prb_read_valid+0x1b/0x30
[  540.441530]  ? handle_bug+0x3c/0x80
[  540.441533]  ? exc_invalid_op+0x17/0x70
[  540.441536]  ? asm_exc_invalid_op+0x1a/0x20
[  540.441542]  ? dev_watchdog+0x232/0x240
[  540.441546]  ? dev_watchdog+0x232/0x240
[  540.441549]  ? __pfx_dev_watchdog+0x10/0x10
[  540.441552]  call_timer_fn+0x24/0x130
[  540.441557]  ? __pfx_dev_watchdog+0x10/0x10
[  540.441560]  __run_timers+0x222/0x2c0
[  540.441564]  run_timer_softirq+0x1d/0x40
[  540.441567]  __do_softirq+0xd1/0x2c8
[  540.441573]  __irq_exit_rcu+0xbb/0xf0
[  540.441577]  sysvec_apic_timer_interrupt+0x72/0x90
[  540.441582]  </IRQ>
[  540.441583]  <TASK>
[  540.441584]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[  540.441588] RIP: 0010:cpuidle_enter_state+0xcc/0x440
[  540.441591] Code: 5a 22 3c ff e8 c5 f3 ff ff 8b 53 04 49 89 c5 0f 1f 44 00 00 31 ff e8 93 24 3b ff 45 84 ff 0f 85 56 02 00 00
[  540.441593] RSP: 0018:ffffa9eb000efe90 EFLAGS: 00000246
[  540.441595] RAX: ffff9a9fafdb3f40 RBX: ffff9a9fafdbf440 RCX: 0000000000000000
[  540.441596] RDX: 0000000000000003 RSI: fffffffb347b8900 RDI: 0000000000000000
[  540.441598] RBP: 0000000000000004 R08: 0000000000000002 R09: 0000000055785785
[  540.441599] R10: ffff9a9fafdb2944 R11: 00000000000000f6 R12: ffffffff95f45960
[  540.441600] R13: 0000007dd4cf7b9b R14: 0000000000000004 R15: 0000000000000000
[  540.441605]  cpuidle_enter+0x2d/0x40
[  540.441609]  do_idle+0x1d8/0x230
[  540.441613]  cpu_startup_entry+0x1d/0x20
[  540.441615]  start_secondary+0x12b/0x150
[  540.441620]  secondary_startup_64_no_verify+0x10b/0x10b
[  540.441627]  </TASK>
[  540.441628] ---[ end trace 0000000000000000 ]---
[  541.694899] r8169 0000:02:00.0: not ready 1023ms after bus reset; waiting
[  542.734928] r8169 0000:02:00.0: not ready 2047ms after bus reset; waiting
[  544.921384] r8169 0000:02:00.0: not ready 4095ms after bus reset; waiting
[  549.187696] r8169 0000:02:00.0: not ready 8191ms after bus reset; waiting
[  557.507543] r8169 0000:02:00.0: not ready 16383ms after bus reset; waiting
[  574.787182] r8169 0000:02:00.0: not ready 32767ms after bus reset; waiting
[  608.919873] r8169 0000:02:00.0: not ready 65535ms after bus reset; giving up
[  608.920766] r8169 0000:02:00.0 netwan0: Can't reset secondary PCI bus, detach NIC
[  610.172703] r8169 0000:02:00.0: not ready 1023ms after bus reset; waiting
[  611.216035] r8169 0000:02:00.0: not ready 2047ms after bus reset; waiting
[  613.399528] r8169 0000:02:00.0: not ready 4095ms after bus reset; waiting
[  617.666092] r8169 0000:02:00.0: not ready 8191ms after bus reset; waiting
[  625.986007] r8169 0000:02:00.0: not ready 16383ms after bus reset; waiting
[  643.051701] r8169 0000:02:00.0: not ready 32767ms after bus reset; waiting
[  677.187279] r8169 0000:02:00.0: not ready 65535ms after bus reset; giving up
[  677.188045] r8169 0000:02:00.0 netwan0: Can't reset secondary PCI bus, detach NIC
Comment 8 Adilson Dantas 2023-07-04 15:09:28 UTC
(In reply to Jay Mann from comment #7)
> Hi, after upgrading kernel to 6.4.1 I'm having problem with my NIC that i
> pass through (Macvtap) to virtual machine.
> 
> Reverting to kernel 6.3.9 resolves the issue.
> 

Can you copy r8169_main.c from kernel 6.3.9 into 6.4.1 to see if you get the same issue?


Comment 9 Jay Mann 2023-07-04 15:29:13 UTC
Sorry, this computer would take a very long time to build the kernel. Can you just try reverting this single commit and see if it works correctly? If not, I may be able to build later on a different computer.

commit d6c36cbc5e533f48bd89a7b5f339bd82b8b4378a
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Mon May 22 15:41:21 2023 +0200

    r8169: Use a raw_spinlock_t for the register locks.
    
    The driver's interrupt service routine is requested with the
    IRQF_NO_THREAD if MSI is available. This means that the routine is
    invoked in hardirq context even on PREEMPT_RT. The routine itself is
    relatively short and schedules a worker, performs register access and
    schedules NAPI. On PREEMPT_RT, scheduling NAPI from hardirq results in
    waking ksoftirqd for further processing so using NAPI threads with this
    driver is highly recommended since it NULL routes the threaded-IRQ
    efforts.
    
    Adding rtl_hw_aspm_clkreq_enable() to the ISR is problematic on
    PREEMPT_RT because the function uses spinlock_t locks which become
    sleeping locks on PREEMPT_RT. The locks are only used to protect
    register access and don't nest into other functions or locks. They are
    also not used for unbounded period of time. Therefore it looks okay to
    convert them to raw_spinlock_t.
    
    Convert the three locks which are used from the interrupt service
    routine to raw_spinlock_t.
    
    Fixes: e1ed3e4d9111 ("r8169: disable ASPM during NAPI poll")
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
    Link: https://lore.kernel.org/r/20230522134121.uxjax0F5@linutronix.de
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

 drivers/net/ethernet/realtek/r8169_main.c | 44 ++++++++++++++++++++++----------------------
 1 file changed, 22 insertions(+), 22 deletions(-)
Comment 10 Adilson Dantas 2023-07-05 01:59:31 UTC
Created attachment 304546 [details]
attachment-2814-0.html

It didn't work. I checked the differences between this revert and 6.3.9,
and maybe it is something related to this commit:
2ab19de62d67e403105ba860971e5ff0d511ad15 r8169: remove ASPM restrictions
now that ASPM is disabled during NAPI poll

Part of the code removed by this commit, which the revert does not restore,
is:

       /* Disable ASPM L1 as that cause random device stop working
        * problems as well as full system hangs for some PCIe devices users.
        * Chips from RTL8168h partially have issues with L1.2, but seem
        * to work fine with L1 and L1.1.
        */
       if (rtl_aspm_is_safe(tp))
               rc = 0;
       else if (tp->mac_version >= RTL_GIGA_MAC_VER_46)
               rc = pci_disable_link_state(pdev, PCIE_LINK_STATE_L1_2);
       else
               rc = pci_disable_link_state(pdev, PCIE_LINK_STATE_L1);
       tp->aspm_manageable = !rc;

I would like to do a build with this commit and the 5 commits to the same
module right before it reverted, but I don't have much time to do it this week.
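
A minimal sketch of such a series of reverts, assuming a git checkout of the affected release with a clean working tree. The helper name revert_last_touching is hypothetical, and the count and path are whatever the tester chooses; nothing here is taken from the thread.

```shell
# Revert the N most recent non-merge commits that touched a path,
# newest first, stopping at the first revert that does not apply cleanly.
# Usage: revert_last_touching <count> <path>
revert_last_touching() {
    count=$1
    path=$2
    for commit in $(git log --no-merges --format=%H -n "$count" -- "$path"); do
        git revert --no-edit "$commit" || return 1
    done
}
```

For example, `revert_last_touching 6 drivers/net/ethernet/realtek/r8169_main.c` would create up to six revert commits on the current branch. Doing this on a throwaway branch keeps the original tree intact.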



Comment 11 Jay Mann 2023-07-06 01:56:36 UTC
I built kernel 6.4.0-1 with the following patch, which reverts the removal of the "Disable ASPM L1" code as suggested by Adilson Dantas, and it has been running for an hour without problems. I will update if it crashes:

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c                         
index 4b19803a7dd0..aeeb6cd312d7 100644                                                                                    
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -623,6 +623,7 @@ struct rtl8169_private {
        int cfg9346_usage_count;
 
        unsigned supports_gmii:1;
+       unsigned aspm_manageable:1;
        dma_addr_t counters_phys_addr;
        struct rtl8169_counters *counters;
        struct rtl8169_tc_offsets tc_offset;
@@ -5158,6 +5159,16 @@ static void rtl_init_mac_address(struct rtl8169_private *tp)
        rtl_rar_set(tp, mac_addr);
 }
 
+/* register is set if system vendor successfully tested ASPM 1.2 */
+static bool rtl_aspm_is_safe(struct rtl8169_private *tp)
+{
+       if (tp->mac_version >= RTL_GIGA_MAC_VER_61 &&
+           r8168_mac_ocp_read(tp, 0xc0b2) & 0xf)
+               return true;
+
+       return false;
+}
+
 static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 {
        struct rtl8169_private *tp;
@@ -5229,6 +5240,19 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 
        tp->mac_version = chipset;
 
+       /* Disable ASPM L1 as that cause random device stop working
+        * problems as well as full system hangs for some PCIe devices users.
+        * Chips from RTL8168h partially have issues with L1.2, but seem
+        * to work fine with L1 and L1.1.
+        */
+       if (rtl_aspm_is_safe(tp))
+               rc = 0;
+       else if (tp->mac_version >= RTL_GIGA_MAC_VER_46)
+               rc = pci_disable_link_state(pdev, PCIE_LINK_STATE_L1_2);
+       else
+               rc = pci_disable_link_state(pdev, PCIE_LINK_STATE_L1);
+       tp->aspm_manageable = !rc;
+
        tp->dash_type = rtl_check_dash(tp);
 
        tp->cp_cmd = RTL_R16(tp, CPlusCmd) & CPCMD_MASK;
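
To check at runtime which ASPM states the kernel currently allows on the NIC's link, a sketch like the following may help. It assumes the kernel exposes the per-device link/*_aspm sysfs attributes, which only appear when the kernel manages ASPM for that link. The helper name show_aspm_switches is made up for illustration.

```shell
# Print the per-link ASPM switches found under one PCI device directory.
# Usage: show_aspm_switches <device sysfs dir>
# Prints nothing if the directory has no readable link/*aspm attributes.
show_aspm_switches() {
    dev_dir=$1
    for attr in "$dev_dir"/link/*aspm; do
        [ -r "$attr" ] || continue
        printf '%s=%s\n' "$(basename "$attr")" "$(cat "$attr")"
    done
}
```

For example, `show_aspm_switches /sys/bus/pci/devices/0000:02:00.0` would print lines such as l1_aspm=1 where the attributes exist. Writing 0 to link/l1_aspm as root should disable L1 for that device without a rebuild, which may be a quicker experiment than patching the driver.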
Comment 12 Jay Mann 2023-07-06 09:51:16 UTC
It still ended up crashing. I will try to revert all the commits since 6.3.9 and see if that fixes the issue.

[17326.067824] ------------[ cut here ]------------            
[17326.067838] NETDEV WATCHDOG: netwan0 (r8169): transmit queue 0 timed out 6557 ms
[17326.067896] WARNING: CPU: 2 PID: 2330 at net/sched/sch_generic.c:525 dev_watchdog+0x232/0x240
[17326.067920] Modules linked in: exfat wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel vhost_net tun vhost vhost_iotlb macvtap macvlan tap veth xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c xt_addrtype iptable_filter br_netfilter bridge stp llc overlay snd_hda_codec_hdmi snd_sof_pci_intel_apl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda_mlink snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils soundwire_bus snd_soc_avs snd_soc_hda_codec snd_soc_skl snd_soc_hdac_hda snd_hda_ext_core snd_soc_sst_ipc snd_soc_sst_dsp snd_soc_acpi_intel_match snd_soc_acpi snd_soc_core intel_rapl_msr intel_rapl_common snd_compress intel_pmc_bxt intel_telemetry_pltdrv mousedev intel_punit_ipc snd_hda_codec_realtek ac97_bus intel_telemetry_core snd_hda_codec_generic
[17326.068102]  snd_pcm_dmaengine x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_intel snd_intel_dspcfg kvm_intel snd_intel_sdw_acpi snd_hda_codec r8153_ecm joydev hid_holtek_mouse snd_hda_core cdc_ether snd_hwdep kvm usbnet mei_pxp mei_hdcp irqbypass snd_pcm crct10dif_pclmul ee1004 crc32_pclmul polyval_generic gf128mul nls_iso8859_1 snd_timer ghash_clmulni_intel dell_wmi sha512_ssse3 vfat aesni_intel serio crypto_simd usbhid r8169 ucsi_acpi snd rfkill fat typec_ucsi mei_me realtek dell_smbios cryptd ledtrig_audio i2c_i801 intel_lpss_pci rapl dcdbas r8152 intel_lpss mdio_devres wmi_bmof dell_wmi_descriptor mii sparse_keymap intel_cstate typec pcspkr i2c_smbus idma64 libphy mei roles soundcore mac_hid dm_multipath fuse loop dm_mod bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 uas usb_storage mmc_block i915 i2c_algo_bit drm_buddy intel_gtt drm_display_helper sdhci_pci cqhci sdhci crc32c_intel xhci_pci cec mmc_core xhci_pci_renesas ttm video wmi                                         
[17326.068353] CPU: 2 PID: 2330 Comm: chromium Not tainted 6.4.0-1-mainline #3 fe50d2b946c00ecafa54655e238b01581437b546
[17326.068365] Hardware name: Dell Inc. Wyse 5070 Thin Client/0K6VXP, BIOS 1.14.0 11/11/2021         
[17326.068371] RIP: 0010:dev_watchdog+0x232/0x240                                                    
[17326.068384] Code: ff ff ff 48 89 df c6 05 13 f9 45 01 01 e8 c6 28 fa ff 45 89 f8 44 89 f1 48 89 de 48 89 c2 48 c7 c7 d0 46 cc ae e8 fe b1 54 ff <0f> 0b e9 2d ff ff ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90                     
[17326.068391] RSP: 0000:ffffc2f4c3273dc0 EFLAGS: 00010282                                                      
[17326.068399] RAX: 0000000000000000 RBX: ffff9f1743cb4000 RCX: 0000000000000027                                
[17326.068405] RDX: ffff9f1aafd21688 RSI: 0000000000000001 RDI: ffff9f1aafd21680                                
[17326.068411] RBP: ffff9f1743cb44c8 R08: 0000000000000000 R09: ffffc2f4c3273c50                                
[17326.068416] R10: 0000000000000003 R11: ffffffffaf4ca828 R12: ffff9f17418bb200                                
[17326.068421] R13: ffff9f1743cb441c R14: 0000000000000000 R15: 000000000000199d                                      
[17326.068426] FS:  00007ff94cd38180(0000) GS:ffff9f1aafd00000(0000) knlGS:0000000000000000                           
[17326.068434] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033                                                           
[17326.068439] CR2: 000025030064c000 CR3: 000000015b8ac000 CR4: 0000000000352ee0                                           
[17326.068446] Call Trace:                                                                                                 
[17326.068454]  <TASK>                                                                                                         
[17326.068458]  ? dev_watchdog+0x232/0x240                                                                                        
[17326.068469]  ? __warn+0x81/0x130                                                                                                  
[17326.068485]  ? dev_watchdog+0x232/0x240
[17326.068495]  ? report_bug+0x171/0x1a0
[17326.068506]  ? prb_read_valid+0x1b/0x30
[17326.068520]  ? handle_bug+0x3c/0x80
[17326.068533]  ? exc_invalid_op+0x17/0x70
[17326.068540]  ? asm_exc_invalid_op+0x1a/0x20
[17326.068556]  ? dev_watchdog+0x232/0x240
[17326.068567]  ? __pfx_dev_watchdog+0x10/0x10
[17326.068577]  call_timer_fn+0x24/0x130
[17326.068590]  ? __pfx_dev_watchdog+0x10/0x10
[17326.068598]  __run_timers+0x222/0x2c0
[17326.068613]  run_timer_softirq+0x1d/0x40
[17326.068622]  __do_softirq+0xd1/0x2c8
[17326.068637]  __irq_exit_rcu+0xbb/0xf0
[17326.068649]  sysvec_apic_timer_interrupt+0x3e/0x90
[17326.068661]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[17326.068672] RIP: 0033:0x5611f5966cee
[17326.068744] Code: 4c 8b 65 80 0f 1f 84 00 00 00 00 00 80 7d 90 00 74 63 48 8b 45 80 48 63 04 18 48 8b 0d 1b a2 a0 03 48 01 c0 48 21 c8 48 63 00 <48> 01 c0 48 21 c8 48 63 78 24 48 01 ff 48 21 cf 48 8b 0f 48 89 c8
[17326.068751] RSP: 002b:00007ffe0591f110 EFLAGS: 00000202
[17326.068757] RAX: ffffffff805dec58 RBX: 0000000000000900 RCX: 0000190bffffffff
[17326.068762] RDX: 0000190b0020cd88 RSI: 0000190b001fbd98 RDI: 0000190b008ccb10
[17326.068767] RBP: 00007ffe0591f1b0 R08: 00005611f90c1578 R09: 00005611ecb712d8
[17326.068772] R10: 00007ffe059b4080 R11: 00000000009bb64a R12: 0000190b0476f658
[17326.068776] R13: 0000000000000000 R14: 000000000000092c R15: 0000000000000000
[17326.068787]  </TASK>
[17326.068791] ---[ end trace 0000000000000000 ]---
[17327.321173] r8169 0000:02:00.0: not ready 1023ms after bus reset; waiting
[17328.361180] r8169 0000:02:00.0: not ready 2047ms after bus reset; waiting
[17330.548391] r8169 0000:02:00.0: not ready 4095ms after bus reset; waiting
[17334.818719] r8169 0000:02:00.0: not ready 8191ms after bus reset; waiting
[17343.134979] r8169 0000:02:00.0: not ready 16383ms after bus reset; waiting
[17360.415001] r8169 0000:02:00.0: not ready 32767ms after bus reset; waiting
[17394.548241] r8169 0000:02:00.0: not ready 65535ms after bus reset; giving up
[17394.549408] r8169 0000:02:00.0 netwan0: Can't reset secondary PCI bus, detach NIC
[17395.801610] r8169 0000:02:00.0: not ready 1023ms after bus reset; waiting
[17396.844852] r8169 0000:02:00.0: not ready 2047ms after bus reset; waiting
[17399.028494] r8169 0000:02:00.0: not ready 4095ms after bus reset; waiting
[17403.298498] r8169 0000:02:00.0: not ready 8191ms after bus reset; waiting
[17411.615204] r8169 0000:02:00.0: not ready 16383ms after bus reset; waiting
[17428.681669] r8169 0000:02:00.0: not ready 32767ms after bus reset; waiting
[17462.815268] r8169 0000:02:00.0: not ready 65535ms after bus reset; giving up
[17462.816405] r8169 0000:02:00.0 netwan0: Can't reset secondary PCI bus, detach NIC
Comment 13 Jay Mann 2023-07-08 18:44:59 UTC
I did some more testing (I missed a code block in my previous revert), and it definitely looks like the issue is caused by:

commit 2ab19de62d67e403105ba860971e5ff0d511ad15
Author: Heiner Kallweit <hkallweit1@gmail.com>
Date:   Mon Mar 6 22:28:06 2023 +0100

    r8169: remove ASPM restrictions now that ASPM is disabled during NAPI poll


My system has 3 Realtek NICs; 2 of them are passed through to a VM via MACVTAP.
Comment 14 Jay Mann 2023-07-08 20:33:34 UTC
Looks like the problem is that you removed the "aspm_manageable" flag.
Previously my NIC would never enter this code block, since it is not ASPM manageable, but now it does because the flag was removed:

ORIG CODE:

/* Don't enable ASPM in the chip if OS can't control ASPM */                           
if (enable && tp->aspm_manageable) {                                                   
  rtl_mod_config5(tp, 0, ASPM_en);                                                                                                      
  rtl_mod_config2(tp, 0, ClkReqEn);    

NEW CODE:
if (enable) {                                                  
  rtl_mod_config5(tp, 0, ASPM_en);                                                                                           
  rtl_mod_config2(tp, 0, ClkReqEn);
Comment 15 Adilson Dantas 2023-07-08 23:51:20 UTC
(In reply to Jay Mann from comment #13)
> I did some more testing (i missed a code block in my previous revert) and it
> definitely looks like the issue is caused by:
> 
> commit 2ab19de62d67e403105ba860971e5ff0d511ad15
> Author: Heiner Kallweit <hkallweit1@gmail.com>
> Date:   Mon Mar 6 22:28:06 2023 +0100
> 
>     r8169: remove ASPM restrictions now that ASPM is disabled during NAPI
> poll

I rebuilt 6.4.2 with this revert and it worked.
> 
> 
> My system has 3 realtek NICs 2 of them are passed through to a VM via
> MACVTAP.
Comment 16 Heiner Kallweit 2023-07-10 11:43:37 UTC
Could you please provide a full dmesg log of the affected system and the lspci -vv output?
I'd like to avoid increasing power consumption for 99.9% of the users just because of very few systems with broken ASPM.
Comment 17 Jay Mann 2023-07-10 18:10:28 UTC
Created attachment 304598 [details]
dmesg

dmesg from affected system
Comment 18 Jay Mann 2023-07-10 18:11:08 UTC
Created attachment 304599 [details]
lspci -vv

lspci -vv on affected system.
Comment 19 Jay Mann 2023-07-10 18:13:04 UTC
I thought the removed flag "aspm_manageable" was supposed to flag the systems with broken ASPM.

Files attached.
Comment 20 y_satou 2023-07-12 01:33:35 UTC
I hit the same issue on kernel 6.4.x with a Dell OptiPlex3000 ThinClient: the eth0 NIC stops working with the error "NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out 6387 ms".

dmesg:

[    0.000000] DMI: Dell Inc. OptiPlex 3000 Thin Client/07Y42Y, BIOS 1.9.1 05/12/2023
[    0.632236] r8169 0000:01:00.0 eth0: RTL8168h/8111h, 00:be:43:f1:a1:62, XID 541, IRQ 126
[    0.632262] r8169 0000:01:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]
[    1.716836] Generic FE-GE Realtek PHY r8169-0-100:00: attached PHY driver (mii_bus:phy_addr=r8169-0-100:00, irq=MAC)

lspci:

01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
	Subsystem: Dell Device 0ae8
	Flags: bus master, fast devsel, latency 0, IRQ 18
	I/O ports at 3000 [size=256]
	Memory at 7f504000 (64-bit, non-prefetchable) [size=4K]
	Memory at 7f500000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [70] Express Endpoint, MSI 01
	Capabilities: [b0] MSI-X: Enable+ Count=4 Masked-
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [140] Virtual Channel
	Capabilities: [160] Device Serial Number 01-00-00-00-68-4c-e0-00
	Capabilities: [170] Latency Tolerance Reporting
	Capabilities: [178] L1 PM Substates
	Kernel driver in use: r8169

So I tested 6.4.3 with the ASPM-related commits to drivers/net/ethernet/realtek/r8169_main.c from 2023-03-08 and 2023-03-20 reverted (https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/drivers/net/ethernet/realtek/r8169_main.c?h=linux-6.4.y), and confirmed that the "NETDEV WATCHDOG" error is gone; the OS is up and looks stable.
Comment 21 Heiner Kallweit 2023-07-13 05:44:19 UTC
(In reply to y_satou from comment #20)
> I hit the same issue on kernel 6.4.x with Dell OptiPlex3000 ThinClient, eth0
> NIC stop working with error: "NETDEV WATCHDOG: eth0 (r8169): transmit queue
> 0 timed out 6387 ms"
> 
> dmesg:
> 
> [    0.000000] DMI: Dell Inc. OptiPlex 3000 Thin Client/07Y42Y, BIOS 1.9.1
> 05/12/2023
> [    0.632236] r8169 0000:01:00.0 eth0: RTL8168h/8111h, 00:be:43:f1:a1:62,
> XID 541, IRQ 126
> [    0.632262] r8169 0000:01:00.0 eth0: jumbo features [frames: 9194 bytes,
> tx checksumming: ko]
> [    1.716836] Generic FE-GE Realtek PHY r8169-0-100:00: attached PHY driver
> (mii_bus:phy_addr=r8169-0-100:00, irq=MAC)
> 
> lspci:
> 
> 01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
>       Subsystem: Dell Device 0ae8
>       Flags: bus master, fast devsel, latency 0, IRQ 18
>       I/O ports at 3000 [size=256]
>       Memory at 7f504000 (64-bit, non-prefetchable) [size=4K]
>       Memory at 7f500000 (64-bit, non-prefetchable) [size=16K]
>       Capabilities: [40] Power Management version 3
>       Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
>       Capabilities: [70] Express Endpoint, MSI 01
>       Capabilities: [b0] MSI-X: Enable+ Count=4 Masked-
>       Capabilities: [100] Advanced Error Reporting
>       Capabilities: [140] Virtual Channel
>       Capabilities: [160] Device Serial Number 01-00-00-00-68-4c-e0-00
>       Capabilities: [170] Latency Tolerance Reporting
>       Capabilities: [178] L1 PM Substates
>       Kernel driver in use: r8169
> 
> so I tested 6.4.3 with reverting commits for ASPM related to
> drivers/net/ethernet/realtek/r8169_main.c applied on 2023-03-08 and
> 2023-03-20
> (https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/
> drivers/net/ethernet/realtek/r8169_main.c?h=linux-6.4.y), and confirmed
> "NETDEV WATCHDOG" is gone, the OS is up and looks stable.

Please also provide a full dmesg log and the lspci -vv output.
(-vv to get ASPM L1 substate information)
Comment 22 y_satou 2023-07-13 08:19:21 UTC
Created attachment 304623 [details]
dmesg output from Dell OptiPlex3000 ThinClient

Attached the dmesg output from the Dell OptiPlex3000 ThinClient, as requested.
Comment 23 y_satou 2023-07-13 08:19:50 UTC
Created attachment 304624 [details]
lspci -vv output from Dell OptiPlex3000 ThinClient

Attached the lspci -vv output from the Dell OptiPlex3000 ThinClient, as requested.
Comment 24 Heiner Kallweit 2023-07-13 10:31:40 UTC
(In reply to y_satou from comment #23)
> Created attachment 304624 [details]
> lspci -vv output from Dell OptiPlex3000 ThinClient
> 
> attached lspci -vv output from Dell OptiPlex3000 ThinClient as requested.

Please repeat with root permissions, otherwise a lot of details are missing.
Comment 25 Demitrius Belai 2023-07-13 14:08:21 UTC
Created attachment 304627 [details]
dmesg from OptiPlex 3080
Comment 26 Demitrius Belai 2023-07-13 14:08:50 UTC
Created attachment 304628 [details]
lspci from OptiPlex 3080
Comment 27 Demitrius Belai 2023-07-13 14:25:15 UTC
I was facing the same issue when I was using Google Meet. Reverting 2ab19de62d67e403105ba860971e5ff0d511ad15 fixed the issue.
Comment 28 Adilson Dantas 2023-07-13 15:03:14 UTC
Created attachment 304629 [details]
dmesg.txt

Here are the dmesg and lspci -vv outputs from my machine with an unpatched 6.4.0 kernel.

On Thu, 13 Jul 2023 at 11:25, <bugzilla-daemon@kernel.org> wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=217596
>
> --- Comment #27 from Demitrius Belai (demitriusbelai@gmail.com) ---
> I was facing the same issue when I was using Google Meet. Reverting
> 2ab19de62d67e403105ba860971e5ff0d511ad15 fixed the issue.
Comment 29 Adilson Dantas 2023-07-13 15:03:14 UTC
Created attachment 304630 [details]
lspci.txt
Comment 30 y_satou 2023-07-13 22:33:30 UTC
Created attachment 304631 [details]
lspci -vv (by root) output from Dell OptiPlex3000 ThinClient

Attached the lspci -vv output from the Dell OptiPlex3000 ThinClient, run as root, as requested.
Comment 31 Heiner Kallweit 2023-07-14 05:47:13 UTC
Apparently on all affected systems the BIOS claims exclusive access to ASPM settings.
Could you please boot with cmd line parameter pcie_aspm=force (to overrule this BIOS setting) and see whether this makes any difference?
Comment 32 y_satou 2023-07-14 10:04:21 UTC
> Could you please boot with cmd line parameter pcie_aspm=force (to overrule
> this BIOS setting) and see whether this makes any difference?

I rebuilt the 6.4.3 kernel without changes (the ASPM-related commits in drivers/net/ethernet/realtek/r8169_main.c were not reverted) and booted it with the "pcie_aspm=force" option.

With that option the host seems to survive without the "NETDEV WATCHDOG - transmit queue timeout" error, but the performance (throughput) appears to degrade significantly.

Attached are the dmesg output (where you can confirm the boot option pcie_aspm=force) and the lspci -vv output.
Comment 33 y_satou 2023-07-14 10:05:24 UTC
Created attachment 304632 [details]
dmesg output from Dell OptiPlex3000 ThinClient with pcie_aspm=force
Comment 34 y_satou 2023-07-14 10:05:48 UTC
Created attachment 304633 [details]
lspci -vv (by root) output from Dell OptiPlex3000 ThinClient with pcie_aspm=force
Comment 35 Heiner Kallweit 2023-07-14 10:26:46 UTC
(In reply to y_satou from comment #32)
> > Could you please boot with cmd line parameter pcie_aspm=force (to overrule
> > this BIOS setting) and see whether this makes any difference?
> 
> I rebuilt 6.4.3 kernel without change (not rollback ASPM related commits in
> drivers/net/ethernet/realtek/r8169_main.c), and boot it with
> "pcie_aspm=force" option.
> 
> It looks enable the host survive without "NETDEV WATCHDOG - transmit queue
> timeout" error, but it looks the performance (throughput) got significant
> degradation.
> 
> Attached dmesg (you can confirm boot option pcie_aspm=force) and lspci -vv
> output.

Thanks for testing. Regarding the performance degradation:
- How was it measured?
- If not iperf, could you please test with iperf?
- Please provide the ethtool -S <if> output. I'd like to know whether missed rx 
  packets are the cause.
Comment 36 Adilson Dantas 2023-07-14 11:23:30 UTC
Created attachment 304634 [details]
dmesg-01.txt

I still get the same error with unpatched 6.4.0 and the pcie_aspm=force
parameter.

The dmesg and lspci outputs are attached.

On Fri, 14 Jul 2023 at 02:47, <bugzilla-daemon@kernel.org> wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=217596
>
> --- Comment #31 from Heiner Kallweit (hkallweit1@gmail.com) ---
> Apparently on all affected systems the BIOS claims exclusive access to ASPM
> settings.
> Could you please boot with cmd line parameter pcie_aspm=force (to overrule
> this
> BIOS setting) and see whether this makes any difference?
Comment 37 Adilson Dantas 2023-07-14 11:23:31 UTC
Created attachment 304635 [details]
lspci-01.txt
Comment 38 Mauro Orlić 2023-07-14 11:49:19 UTC
Created attachment 304636 [details]
journalctl-root-optiplex3060

Troubleshooting an issue I've had recently led me here, hopefully my journalctl output helps in some way. I appear to have the same network adapter, machine is a Dell Optiplex 3060. I can provide additional logs if necessary.
Comment 39 y_satou 2023-07-14 12:20:48 UTC
(In reply to Heiner Kallweit from comment #35)
> 
> Regarding the performance degradation:
> - How was it measured?

My image boots over the network (using a ramdisk): it first downloads vmlinuz and the base packages via iPXE over HTTP, then the rest of the packages are downloaded via HTTP, roughly 400MB in total.

The "rest of the packages" download showed a significant difference, as you can see in the dmesg output:
* kernel with the ASPM-related commits removed = 397 sec.
* genuine kernel with the pcie_aspm=force option = 2059 sec.


> - If not iperf, could you please test with iperf?
> - Please provide the ethtool -S <if> output. I'd like to know whether missed
> rx packets are the cause.

I'm away for the next few days; once I return, I will prepare these tests.
Comment 40 oyvinds 2023-07-18 00:30:38 UTC
> I'd like to avoid increasing power consumption for 99.9% of the users just
> because of very few systems with broken ASPM.

You would more realistically disable ASPM for the 40% of users who have systems it works on in order to make the kernel actually work as it should for the 60% of users who have systems that are broken - if you were to disable it completely. That scenario is irrelevant anyway, the code that used to be in the kernel - when it wasn't broken and useless - didn't disable ASPM across the board. It only disabled it in cases where it's needed - as the kernel should:

+       /* Disable ASPM L1 as that cause random device stop working
+        * problems as well as full system hangs for some PCIe devices users.
+        * Chips from RTL8168h partially have issues with L1.2, but seem
+        * to work fine with L1 and L1.1.
+        */
+       if (rtl_aspm_is_safe(tp))
+               rc = 0;
+       else if (tp->mac_version >= RTL_GIGA_MAC_VER_46)
+               rc = pci_disable_link_state(pdev, PCIE_LINK_STATE_L1_2);
+       else
+               rc = pci_disable_link_state(pdev, PCIE_LINK_STATE_L1);
+       tp->aspm_manageable = !rc;

Having to encounter this silly bug & visiting this bug tracker & writing here is a complete waste of my time. This has apparently been broken since 6.4 and we're on 6.4.3 now. The right thing to do when the kernel has worked fine for years and one change to one kernel version breaks the kernel for a large portion of users is to revert the change that caused it. Just do it already.
Comment 41 Jay Mann 2023-07-18 03:45:33 UTC
(In reply to oyvinds from comment #40)
> > I'd like to avoid increasing power consumption for 99.9% of the users just
> > because of very few systems with broken ASPM.
> 
> You would more realistically disable ASPM for the 40% of users who have
> systems it works on in order to make the kernel actually work as it should
> for the 60% of users who have systems that are broken - if you were to
> disable it completely. That scenario is irrelevant anyway, the code that
> used to be in the kernel - when it wasn't broken and useless - didn't
> disable ASPM across the board. It only disabled it in cases where it's
> needed - as the kernel should:
> 
> +       /* Disable ASPM L1 as that cause random device stop working
> +        * problems as well as full system hangs for some PCIe devices users.
> +        * Chips from RTL8168h partially have issues with L1.2, but seem
> +        * to work fine with L1 and L1.1.
> +        */
> +       if (rtl_aspm_is_safe(tp))
> +               rc = 0;
> +       else if (tp->mac_version >= RTL_GIGA_MAC_VER_46)
> +               rc = pci_disable_link_state(pdev, PCIE_LINK_STATE_L1_2);
> +       else
> +               rc = pci_disable_link_state(pdev, PCIE_LINK_STATE_L1);
> +       tp->aspm_manageable = !rc;
> 
> Having to encounter this silly bug & visiting this bug tracker & writing
> here is a complete waste of my time. This has apparently been broken since
> 6.4 and we're on 6.4.3 now. The right thing to do when the kernel has worked
> fine for years and one change to one kernel version breaks the kernel for a
> large portion of users it to revert the change that caused it. Just do it
> already.

I agree with this statement 100%.  The longer you wait the more systems you break.
Comment 42 y_satou 2023-07-18 07:56:23 UTC
Created attachment 304650 [details]
iperf test result on Dell OptiPlex3000 ThinClient kernel 6.4.3

(In reply to y_satou from comment #39)
> > 
> > Regarding the performance degradation:
> > - If not iperf, could you please test with iperf?
> > - Please provide the ethtool -S <if> output. I'd like to know whether
> missed rx packets are the cause.
> 
> I'm away for next few days, once I return then I'm going to prepare them for
> test.

Attached is the iperf test result for the genuine 6.4.3 kernel code with the "pcie_aspm=force" option.

tx vs rx showed a significant performance difference.

upload   (=mainly tx) shows  13389 KBytes/sec
download (=mainly rx) shows    207 KBytes/sec
Comment 43 y_satou 2023-07-19 05:57:03 UTC
Created attachment 304662 [details]
iperf test result on Dell OptiPlex3000 ThinClient kernel 6.4.3 with removing ASPM related commits

Performed another iperf test with the ASPM-related commits reverted on kernel 6.4.3, for comparison (using the same hardware, a Dell OptiPlex3000 ThinClient).

genuine 6.4.3 kernel (keeps ASPM related commits) with pcie_aspm=force option:

  upload   (=mainly tx) shows  13389 KBytes/sec
  download (=mainly rx) shows    207 KBytes/sec
  rx_misses count continuously increased

6.4.3 kernel with removing ASPM related commits on r8169_main.c, without pcie_aspm=force option:

  upload   (=mainly tx) shows  13210 KBytes/sec
  download (=mainly rx) shows   7533 KBytes/sec
  rx_misses count is not increased
Comment 44 Takashi Iwai 2023-07-20 15:37:14 UTC
The same issue was reported on openSUSE Bugzilla for 6.4.2 and 6.4.3 kernels, too:
  https://bugzilla.suse.com/show_bug.cgi?id=1213491

Interestingly, the pcie_aspm=force option didn't help for that reporter, while reverting the commit seems to work.
Comment 45 Milan Oravec 2023-07-25 11:43:13 UTC
Created attachment 304692 [details]
lspci -vv

Attached lspci -vv output
Comment 46 Milan Oravec 2023-07-25 11:44:23 UTC
I'm sorry to report the same bug on a Dell G3 laptop; losing the only NIC makes this machine unusable. :(

[Št júl 20 18:33:12 2023] ------------[ cut here ]------------
[Št júl 20 18:33:12 2023] NETDEV WATCHDOG: enp60s0 (r8169): transmit queue 0 timed out 7734 ms
[Št júl 20 18:33:12 2023] WARNING: CPU: 1 PID: 668 at net/sched/sch_generic.c:525 dev_watchdog+0x232/0x240
[Št júl 20 18:33:12 2023] Modules linked in: xt_LOG nf_log_syslog xt_limit xt_pkttype 8021q garp mrp rfcomm nvidia(POE) snd_ctl_led snd_hda_codec_realtek snd_hda_codec_generic xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter bridge stp llc cmac algif_hash algif_skcipher af_alg bnep snd_sof_pci_intel_cnl snd_sof_intel_hda_common soundwire_intel soundwire_cadence snd_sof_intel_hda_mlink snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils soundwire_generic_allocation soundwire_bus snd_soc_skl snd_soc_hdac_hda intel_tcc_cooling snd_hda_ext_core x86_pkg_temp_thermal snd_soc_sst_ipc intel_powerclamp coretemp snd_soc_sst_dsp snd_soc_acpi_intel_match snd_soc_acpi kvm_intel hid_multitouch i915 snd_soc_core kvm snd_compress ac97_bus irqbypass iwlmvm crct10dif_pclmul crc32_pclmul snd_pcm_dmaengine polyval_clmulni snd_hda_codec_hdmi
[Št júl 20 18:33:12 2023]  polyval_generic drm_buddy gf128mul mac80211 ghash_clmulni_intel sha512_ssse3 iTCO_wdt dell_laptop aesni_intel snd_hda_intel btusb intel_pmc_bxt ee1004 iTCO_vendor_support btrtl uvc i2c_algo_bit snd_intel_dspcfg crypto_simd libarc4 mei_hdcp mei_pxp snd_intel_sdw_acpi cryptd dell_wmi iwlwifi btbcm r8169 snd_hda_codec ttm i2c_i801 videobuf2_v4l2 dell_smbios rtsx_usb_ms btintel dell_wmi_sysman intel_rapl_msr realtek snd_hda_core rapl spi_nor memstick firmware_attributes_class mdio_devres ledtrig_audio dell_wmi_descriptor wmi_bmof videodev btmtk snd_hwdep intel_cstate dcdbas intel_wmi_thunderbolt mxm_wmi drm_display_helper processor_thermal_device_pci_legacy cfg80211 mtd libphy i2c_smbus intel_uncore psmouse bluetooth pcspkr videobuf2_common cec snd_pcm mei_me intel_lpss_pci cdc_acm mc processor_thermal_device intel_gtt ucsi_acpi intel_lpss mei processor_thermal_rfim idma64 video typec_ucsi snd_timer processor_thermal_mbox processor_thermal_rapl ecdh_generic pl2303 snd intel_rapl_common typec rfkill soundcore
[Št júl 20 18:33:12 2023]  i2c_hid_acpi intel_pch_thermal intel_soc_dts_iosf roles i2c_hid int3403_thermal int340x_thermal_zone intel_hid int3400_thermal wmi mousedev acpi_thermal_rel sparse_keymap joydev acpi_pad mac_hid dm_multipath sg crypto_user fuse loop dm_mod bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 rtsx_usb_sdmmc mmc_core rtsx_usb usbhid nvme serio_raw nvme_core atkbd spi_intel_pci xhci_pci crc32c_intel libps2 spi_intel nvme_common xhci_pci_renesas vivaldi_fmap i8042 serio bbswitch(OE) [last unloaded: videobuf2_memops]
[Št júl 20 18:33:12 2023] CPU: 1 PID: 668 Comm: thermald Tainted: P  OE      6.4.4-arch1-1 #1 655744e6f70dbd2f57b072f7158d7c5b4468b4ff
[Št júl 20 18:33:12 2023] Hardware name: Dell Inc. G3 3779/04R93M, BIOS 1.9.0 03/15/2019
[Št júl 20 18:33:12 2023] RIP: 0010:dev_watchdog+0x232/0x240
[Št júl 20 18:33:12 2023] Code: ff ff ff 48 89 df c6 05 a7 a5 45 01 01 e8 f6 26 fa ff 45 89 f8 44 89 f1 48 89 de 48 89 c2 48 c7 c7 78 8c ec aa e8 ee 5a 54 ff <0f> 0b e9 2d ff ff ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90
[Št júl 20 18:33:12 2023] RSP: 0018:ffffa148c01b4e78 EFLAGS: 00010286
[Št júl 20 18:33:12 2023] RAX: 0000000000000000 RBX: ffff915b0c738000 RCX: 0000000000000027
[Št júl 20 18:33:12 2023] RDX: ffff91625e2616c8 RSI: 0000000000000001 RDI: ffff91625e2616c0
[Št júl 20 18:33:12 2023] RBP: ffff915b0c7384c8 R08: 0000000000000000 R09: ffffa148c01b4d08
[Št júl 20 18:33:12 2023] R10: 0000000000000003 R11: ffffffffab6ca868 R12: ffff915b05b77000
[Št júl 20 18:33:12 2023] R13: ffff915b0c73841c R14: 0000000000000000 R15: 0000000000001e36
[Št júl 20 18:33:12 2023] FS:  00007f325edfc6c0(0000) GS:ffff91625e240000(0000) knlGS:0000000000000000
[Št júl 20 18:33:12 2023] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Št júl 20 18:33:12 2023] CR2: 0000329601744010 CR3: 00000001176b0005 CR4: 00000000003706e0
[Št júl 20 18:33:12 2023] Call Trace:
[Št júl 20 18:33:12 2023]  <IRQ>
[Št júl 20 18:33:12 2023]  ? dev_watchdog+0x232/0x240
[Št júl 20 18:33:12 2023]  ? __warn+0x81/0x130
[Št júl 20 18:33:12 2023]  ? dev_watchdog+0x232/0x240
[Št júl 20 18:33:12 2023]  ? report_bug+0x171/0x1a0
[Št júl 20 18:33:12 2023]  ? prb_read_valid+0x1b/0x30
[Št júl 20 18:33:12 2023]  ? handle_bug+0x3c/0x80
[Št júl 20 18:33:12 2023]  ? exc_invalid_op+0x17/0x70
[Št júl 20 18:33:12 2023]  ? asm_exc_invalid_op+0x1a/0x20
[Št júl 20 18:33:12 2023]  ? dev_watchdog+0x232/0x240
[Št júl 20 18:33:12 2023]  ? dev_watchdog+0x232/0x240
[Št júl 20 18:33:12 2023]  ? __pfx_dev_watchdog+0x10/0x10
[Št júl 20 18:33:12 2023]  call_timer_fn+0x24/0x130
[Št júl 20 18:33:12 2023]  ? __pfx_dev_watchdog+0x10/0x10
[Št júl 20 18:33:12 2023]  __run_timers+0x222/0x2c0
[Št júl 20 18:33:12 2023]  run_timer_softirq+0x1d/0x40
[Št júl 20 18:33:12 2023]  __do_softirq+0xd1/0x2c8
[Št júl 20 18:33:12 2023]  __irq_exit_rcu+0xbb/0xf0
[Št júl 20 18:33:12 2023]  sysvec_apic_timer_interrupt+0x72/0x90
[Št júl 20 18:33:12 2023]  </IRQ>
[Št júl 20 18:33:12 2023]  <TASK>
[Št júl 20 18:33:12 2023]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[Št júl 20 18:33:12 2023] RIP: 0010:acpi_ns_search_one_scope+0x6f/0x250
[Št júl 20 18:33:12 2023] Code: 04 0f 85 84 00 00 00 4c 8b 65 18 4d 85 e4 0f 84 c0 00 00 00 8b 44 24 04 4c 89 e3 eb 0d 48 8b 5b 20 48 85 db 0f 84 94 00 00 00 <39> 43 0c 75 ee 48 89 df e8 24 08 00 00 83 f8 16 75 03 48 8b 1b f6
[Št júl 20 18:33:12 2023] RSP: 0018:ffffa148c4b0f9e8 EFLAGS: 00000286
[Št júl 20 18:33:12 2023] RAX: 00000000584d4345 RBX: ffff915b01acb750 RCX: 0000000000000010
[Št júl 20 18:33:12 2023] RDX: ffffffffaa94ef50 RSI: ffffffffaa94ef30 RDI: ffffa148c4b0f9d0
[Št júl 20 18:33:12 2023] RBP: ffffffffac161140 R08: 0000000000000005 R09: 0000000000000003
[Št júl 20 18:33:12 2023] R10: 0000000000000042 R11: ffffffffac161140 R12: ffff915b001ee660
[Št júl 20 18:33:12 2023] R13: 0000000000000000 R14: ffffa148c4b0fac0 R15: 0000000000000005
[Št júl 20 18:33:12 2023]  ? acpi_ns_search_one_scope+0x3f/0x250
[Št júl 20 18:33:12 2023]  acpi_ns_search_and_enter+0x332/0x570
[Št júl 20 18:33:12 2023]  acpi_ns_lookup+0x49a/0xa70
[Št júl 20 18:33:12 2023]  acpi_ps_get_next_namepath+0x9d/0x390
[Št júl 20 18:33:12 2023]  acpi_ps_get_next_arg+0xd7/0x910
[Št júl 20 18:33:12 2023]  acpi_ps_parse_loop+0x45e/0xa30
[Št júl 20 18:33:12 2023]  acpi_ps_parse_aml+0x221/0x5e0
[Št júl 20 18:33:12 2023]  acpi_ps_execute_method+0x171/0x3e0
[Št júl 20 18:33:12 2023]  acpi_ns_evaluate+0x174/0x5d0
[Št júl 20 18:33:12 2023]  acpi_evaluate_object+0x16f/0x450
[Št júl 20 18:33:12 2023]  acpi_evaluate_integer+0x6f/0x130
[Št júl 20 18:33:12 2023]  int340x_thermal_get_zone_temp+0x4a/0xb0 [int340x_thermal_zone c9ebf538f873cd311f4997ede84c93646b5df8e3]
[Št júl 20 18:33:12 2023]  __thermal_zone_get_temp+0x1e/0x60
[Št júl 20 18:33:12 2023]  ? __kmem_cache_alloc_node+0x18d/0x310
[Št júl 20 18:33:12 2023]  thermal_zone_get_temp+0x6d/0x90
[Št júl 20 18:33:12 2023]  temp_show+0x37/0x70
[Št júl 20 18:33:12 2023]  dev_attr_show+0x19/0x60
[Št júl 20 18:33:12 2023]  sysfs_kf_seq_show+0xa8/0x100
[Št júl 20 18:33:12 2023]  seq_read_iter+0x120/0x480
[Št júl 20 18:33:12 2023]  vfs_read+0x1f3/0x320
[Št júl 20 18:33:12 2023]  ksys_read+0x6f/0xf0
[Št júl 20 18:33:12 2023]  do_syscall_64+0x5d/0x90
[Št júl 20 18:33:12 2023]  ? syscall_exit_to_user_mode+0x1b/0x40
[Št júl 20 18:33:12 2023]  ? do_syscall_64+0x6c/0x90
[Št júl 20 18:33:12 2023]  ? do_syscall_64+0x6c/0x90
[Št júl 20 18:33:12 2023]  ? do_syscall_64+0x6c/0x90
[Št júl 20 18:33:12 2023]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[Št júl 20 18:33:12 2023] RIP: 0033:0x7f326330fb5c
[Št júl 20 18:33:12 2023] Code: ec 28 48 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 89 9c f8 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 34 44 89 c7 48 89 44 24 08 e8 df 9c f8 ff 48
[Št júl 20 18:33:12 2023] RSP: 002b:00007f325edfa4c0 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[Št júl 20 18:33:12 2023] RAX: ffffffffffffffda RBX: 00007f325edfa660 RCX: 00007f326330fb5c
[Št júl 20 18:33:12 2023] RDX: 0000000000001fff RSI: 00007f3248001500 RDI: 000000000000000d
[Št júl 20 18:33:12 2023] RBP: 0000000000001fff R08: 0000000000000000 R09: 0000000000000001
[Št júl 20 18:33:12 2023] R10: 0000000000000004 R11: 0000000000000246 R12: 00007f3248001500
[Št júl 20 18:33:12 2023] R13: 00007f325edfa6c8 R14: 00007f325edfa650 R15: 00005647dbed2950
[Št júl 20 18:33:12 2023]  </TASK>
[Št júl 20 18:33:12 2023] ---[ end trace 0000000000000000 ]---
[Št júl 20 18:33:14 2023] pcieport 0000:00:1d.5: Data Link Layer Link Active not set in 1000 msec
[Št júl 20 18:33:14 2023] r8169 0000:3c:00.0 enp60s0: Can't reset secondary PCI bus, detach NIC
[Pi júl 21 14:35:57 2023] perf: interrupt took too long (3143 > 3128), lowering kernel.perf_event_max_sample_rate to 63600
Comment 47 qwe2968 2023-07-25 22:46:15 UTC
(In reply to Milan Oravec from comment #46)
> I'm sorry to report the same bug on a Dell G3 laptop; losing its only NIC
> makes this machine unusable. :(

It may be fixed in kernel 6.5; I saw that the relevant patches were reverted.

https://github.com/torvalds/linux/commits/master/drivers/net/ethernet/realtek/r8169_main.c

If you use Arch, I think you can downgrade the kernel to 6.3.9 and not update it until kernel 6.5 is released. :)
Comment 48 Milan Oravec 2023-07-26 08:45:25 UTC
Thank you for the information. I've tried the pcie_aspm=force option, but no luck either: the NIC was lost 6 hours after boot. :(

Yes, I'm on Arch and will switch back to the 6.3 tree; this is the only known workaround so far.
Comment 49 Edwin 2023-07-27 13:48:45 UTC
Looks like kernel 6.4.7 will include both bug patches that will be in 6.5 :) 

https://lore.kernel.org/all/20230725104514.821564989@linuxfoundation.org/
Comment 50 Adilson Dantas 2023-07-27 14:02:45 UTC
(In reply to edwin.frank.loeffler from comment #49)
> Looks like kernel 6.4.7 will include both bug patches that will be in 6.5 :) 
> 
> https://lore.kernel.org/all/20230725104514.821564989@linuxfoundation.org/

That doesn't fix it. I built 6.4.7 and got the same issue. I had to go back to a patched 6.4.6 to get my offboard NIC working.
Comment 51 Edwin 2023-07-27 14:14:56 UTC
(In reply to Adilson Dantas from comment #50)
> (In reply to edwin.frank.loeffler from comment #49)
> > Looks like kernel 6.4.7 will include both bug patches that will be in 6.5 :)
> > 
> > https://lore.kernel.org/all/20230725104514.821564989@linuxfoundation.org/
> 
> That doesn't fix it. I built 6.4.7 and got the same issue. I had to go back
> to a patched 6.4.6 to get my offboard NIC working.

Ah yes, apologies. 6.4.7 seems to still be missing commit cf2ffde
Comment 52 Adilson Dantas 2023-07-27 14:38:33 UTC
(In reply to Edwin from comment #51)
> (In reply to Adilson Dantas from comment #50)
> > (In reply to edwin.frank.loeffler from comment #49)
> > > Looks like kernel 6.4.7 will include both bug patches that will be in 6.5 :)
> > > 
> > > https://lore.kernel.org/all/20230725104514.821564989@linuxfoundation.org/
> > 
> > That doesn't fix it. I built 6.4.7 and got the same issue. I had to go back
> > to a patched 6.4.6 to get my offboard NIC working.
> 
> Ah yes, apologies. 6.4.7 seems to still be missing commit cf2ffde

I can confirm that. I applied this commit and now I can use 6.4.7 without any issues.
Comment 53 Adilson Dantas 2023-08-03 12:22:13 UTC
6.4.8 was released today and it really fixes this issue for me.

From the changelog:

   r8169: revert 2ab19de62d67 ("r8169: remove ASPM restrictions now that ASPM is disabled during NAPI poll")
    
    commit cf2ffdea0839398cb0551762af7f5efb0a6e0fea upstream.
    
    There have been reports that on a number of systems this change breaks
    network connectivity. Therefore effectively revert it. Mainly affected
    seem to be systems where BIOS denies ASPM access to OS.
    Due to later changes we can't do a direct revert.


Could anyone who was affected by this bug test whether the same errors still occur, before this bug report is closed?
Comment 54 Jay Mann 2023-08-07 02:04:54 UTC
(In reply to Adilson Dantas from comment #53)
> 6.4.8 was released today and it really fixes this issue for me.
> 
> From the changelog:
> 
>    r8169: revert 2ab19de62d67 ("r8169: remove ASPM restrictions now that
> ASPM is disabled during NAPI poll")
>     
>     commit cf2ffdea0839398cb0551762af7f5efb0a6e0fea upstream.
>     
>     There have been reports that on a number of systems this change breaks
>     network connectivity. Therefore effectively revert it. Mainly affected
>     seem to be systems where BIOS denies ASPM access to OS.
>     Due to later changes we can't do a direct revert.
> 
> 
> Could anyone who was affected by this bug test whether the same errors still
> occur, before this bug report is closed?

Yes it looks to be fixed.  Thanks.
Comment 55 Adilson Dantas 2023-08-11 20:50:20 UTC
Since there are no more replies, I will consider this bug fixed as of 6.4.8.

So I'm closing this bug report.
Comment 56 Oliver Freyermuth 2023-12-16 23:44:50 UTC
It seems that (expectedly) this is causing higher power consumption on some systems that handled ASPM well before. See e.g. https://forum.odroid.com/viewtopic.php?p=377001&sid=03fc5da54400c7a76a88003cd7212117#p377001 for the odroid H3+, which draws ~2 W at idle: with ASPM disabled for the Realtek cards, power consumption seems to increase by about 3 W, i.e. idle power draw more than doubles after the kernel upgrade. Apparently, users suggest kernel downgrades and have not investigated the issue further.

Sadly, it seems that this change cannot even be overridden with pcie_aspm=force on systems that worked fine with ASPM for the NICs before. Is this expected?
Comment 57 Heiner Kallweit 2023-12-17 00:04:29 UTC
As long as the BIOS allows it, you can use the standard sysfs attributes to enable/disable each individual ASPM state.
Please also verify that EEE is enabled.
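
For anyone who wants to try the sysfs approach, here is a minimal sketch in Python. It assumes the per-device attributes under `link/` (`l0s_aspm`, `l1_aspm`, `l1_1_aspm`, `l1_2_aspm`, `l1_1_pcipm`, `l1_2_pcipm`, `clkpm`) as described in the kernel's sysfs-bus-pci ABI documentation; some or all of them may be absent if the BIOS does not hand ASPM control to the OS. The helper names `read_aspm_states` and `set_aspm_state` and the `sysfs_root` parameter are made up for this illustration, and the PCI address `0000:3c:00.0` is only the one that appears in the log from comment 46; substitute the address of your own NIC.

```python
from pathlib import Path

# ASPM-related attribute names expected under <device>/link/ (assumption:
# taken from the kernel's sysfs-bus-pci ABI description; not every system
# exposes all of them).
ASPM_ATTRS = (
    "l0s_aspm", "l1_aspm", "l1_1_aspm", "l1_2_aspm",
    "l1_1_pcipm", "l1_2_pcipm", "clkpm",
)


def read_aspm_states(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return {attribute: enabled} for every ASPM attribute that exists."""
    link_dir = Path(sysfs_root) / pci_addr / "link"
    states = {}
    for name in ASPM_ATTRS:
        attr = link_dir / name
        if attr.is_file():
            # The attributes read back as "0" or "1".
            states[name] = attr.read_text().strip() == "1"
    return states


def set_aspm_state(pci_addr, name, enabled, sysfs_root="/sys/bus/pci/devices"):
    """Enable or disable one ASPM state (needs root on a real system)."""
    if name not in ASPM_ATTRS:
        raise ValueError(f"unknown ASPM attribute: {name}")
    attr = Path(sysfs_root) / pci_addr / "link" / name
    attr.write_text("1" if enabled else "0")


if __name__ == "__main__":
    # Example address only; find yours with lspci.
    for attr_name, enabled in read_aspm_states("0000:3c:00.0").items():
        print(f"{attr_name}: {'enabled' if enabled else 'disabled'}")
```

The script only reads by default; calling `set_aspm_state("0000:3c:00.0", "l1_aspm", True)` as root would attempt to enable L1 on that device. EEE can typically be checked with `ethtool --show-eee <interface>` and enabled with `ethtool --set-eee <interface> eee on`.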
Comment 58 Oliver Freyermuth 2023-12-17 01:39:14 UTC
Thanks, confirmed to work, both with older kernels (e.g. 6.1) and the most recent ones. I was completely unaware of the sysfs override until now. I'll let the odroid community know, too, and indeed enabling EEE saves even more. Thanks!