Bug 107421
| Summary: | r8169 - rtl_counters_cond == 1 (loop: 1000, delay: 10). (klog spam) | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | Sverd Johnsen (sverd.johnsen) |
| Component: | Network | Assignee: | drivers_network (drivers_network) |
| Status: | NEW --- | | |
| Severity: | normal | CC: | abhigenie92, arbitraryadirc, eharastasan, esesmu, groeger, hkallweit1, jernej.jakob, joey.corleone, jonathan.p.schuster, khillman, larry_chiang, lopeonline+kernelbugzilla, mike, mirage, nailzuk, stephen, vmxevilstar, wangjiezhe, wfkernel |
| Priority: | P1 | | |
| Hardware: | x86-64 | | |
| OS: | Linux | | |
| See Also: | https://bugzilla.kernel.org/show_bug.cgi?id=104351 | | |
| Kernel Version: | 4.3.0 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
Description
Sverd Johnsen
2015-11-07 04:02:43 UTC
I notice the same error in dmesg after booting without any network connector attached, after cold boot and warm boot.

Arch Linux kernel: 4.3.3-2-ARCH

[ 6.221104] r8169 0000:04:00.1 enp4s0f1: rtl_counters_cond == 1 (loop: 1000, delay: 10).
[ 6.233220] r8169 0000:04:00.1 enp4s0f1: rtl_counters_cond == 1 (loop: 1000, delay: 10).
[ 6.386644] r8169 0000:04:00.1 enp4s0f1: rtl_counters_cond == 1 (loop: 1000, delay: 10).
[ 6.664475] r8169 0000:04:00.1 enp4s0f1: rtl_counters_cond == 1 (loop: 1000, delay: 10).
[ 10.977972] r8169 0000:04:00.1 enp4s0f1: link down

Log spam, without any eth connection
-----------
Linux ArchPC 4.3.3-2-ARCH #1 SMP PREEMPT Wed Dec 23 20:09:18 CET 2015 x86_64 GNU/Linux
09:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 03)
-----------
jan 17 17:20:47 ArchPC kernel: r8169 0000:09:00.0 enp9s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
jan 17 17:20:47 ArchPC kernel: r8169 0000:09:00.0 enp9s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
jan 17 17:20:57 ArchPC kernel: r8169 0000:09:00.0 enp9s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
jan 17 17:20:57 ArchPC kernel: r8169 0000:09:00.0 enp9s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
jan 17 17:21:07 ArchPC kernel: r8169 0000:09:00.0 enp9s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
jan 17 17:21:07 ArchPC kernel: r8169 0000:09:00.0 enp9s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
jan 17 17:21:17 ArchPC kernel: r8169 0000:09:00.0 enp9s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
jan 17 17:21:17 ArchPC kernel: r8169 0000:09:00.0 enp9s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
jan 17 17:21:27 ArchPC kernel: r8169 0000:09:00.0 enp9s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
jan 17 17:21:27 ArchPC kernel: r8169 0000:09:00.0 enp9s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
jan 17 17:21:37 ArchPC kernel: r8169 0000:09:00.0 enp9s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
jan 17 17:21:37 ArchPC kernel: r8169 0000:09:00.0 enp9s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
jan 17 17:21:47 ArchPC kernel: r8169 0000:09:00.0 enp9s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
jan 17 17:21:47 ArchPC kernel: r8169 0000:09:00.0 enp9s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
jan 17 17:21:57 ArchPC kernel: r8169 0000:09:00.0 enp9s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
jan 17 17:21:57 ArchPC kernel: r8169 0000:09:00.0 enp9s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
jan 17 17:22:07 ArchPC kernel: r8169 0000:09:00.0 enp9s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
jan 17 17:22:07 ArchPC kernel: r8169 0000:09:00.0 enp9s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
jan 17 17:22:17 ArchPC kernel: r8169 0000:09:00.0 enp9s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
jan 17 17:22:17 ArchPC kernel: r8169 0000:09:00.0 enp9s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
jan 17 17:22:27 ArchPC kernel: r8169 0000:09:00.0 enp9s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
jan 17 17:22:27 ArchPC kernel: r8169 0000:09:00.0 enp9s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).

As above, on 4.3.3-3-ARCH.

$ uname -a
Linux h4lfwit 4.3.3-3-ARCH #1 SMP PREEMPT Wed Jan 20 08:12:23 CET 2016 x86_64 GNU/Linux

$ dmesg | tail -4
[ 1932.335123] r8169 0000:01:00.0 enp1s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
[ 1932.364492] r8169 0000:01:00.0 enp1s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
[ 1933.489744] r8169 0000:01:00.0 enp1s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
[ 1933.519148] r8169 0000:01:00.0 enp1s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).

# lspci -vvv
01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
RTL8101E/RTL8102E PCI Express Fast Ethernet controller (rev 05) Subsystem: Toshiba America Info Systems Device fb30 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 29 Region 0: I/O ports at 2000 [size=256] Region 2: Memory at c0104000 (64-bit, prefetchable) [size=4K] Region 4: Memory at c0100000 (64-bit, prefetchable) [size=16K] Capabilities: [40] Power Management version 3 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+) Status: D3 NoSoftRst+ PME-Enable+ DSel=0 DScale=0 PME- Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+ Address: 0000000000000000 Data: 0000 Capabilities: [70] Express (v2) Endpoint, MSI 01 DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- MaxPayload 128 bytes, MaxReadReq 512 bytes DevSta: CorrErr+ UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend- LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s unlimited, L1 <64us ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp- LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR-, OBFF Not Supported DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1- EqualizationPhase2-, 
EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
Vector table: BAR=4 offset=00000000
PBA: BAR=4 offset=00000800
Capabilities: [d0] Vital Product Data
Unknown small resource type 00, will not decode more.
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr+ BadTLP- BadDLLP+ Rollover- Timeout- NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 14, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [140 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
Status: NegoPending- InProgress-
Capabilities: [160 v1] Device Serial Number 2f-01-00-00-36-4c-e0-00
Kernel driver in use: r8169
Kernel modules: r8169

Same for me

09:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8101E/RTL8102E PCI Express Fast Ethernet controller (rev 0a)
Linux hp 4.4.0-gentoo-r1 #11 SMP Sat Jan 30 22:51:38 EET 2016 x86_64 Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz GenuineIntel GNU/Linux
[ 413.406739] r8169 0000:09:00.0 eno1: rtl_counters_cond == 1 (loop: 1000, delay: 10).

The card seems to work fine; the messages are just annoying.

It happens for me on a laptop when nothing is connected.

I notice the same issue with:

abhishek log $ uname -a
Linux hp 4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24 10:09:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

It floods my syslog. Any fixes?

I have noticed it happens after I disconnected my LAN cable.
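For readers wondering what the message itself encodes: the format string in the debugging patch quoted later in this thread ("%s == %d (loop: %d, delay: %d)") suggests a bounded polling loop that gives up after a fixed number of attempts. What follows is a minimal userspace sketch in C of such a loop, not the driver's code. The names fake_nic, fake_read32, fake_counters_cond, poll_wait and FAKE_BUSY_BIT are all hypothetical, and the delay is deliberately not performed. It illustrates one way the message could arise: if every register read comes back as all-ones, a "busy" bit never appears to clear and the loop times out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define FAKE_BUSY_BIT 0x08u   /* hypothetical "counter dump in progress" bit */

/* Simulated device: either answers normally or reads back as all-ones. */
struct fake_nic {
    bool unreachable;      /* true: every read returns 0xffffffff */
    unsigned busy_polls;   /* reads left before the busy bit clears */
};

/* Hypothetical register read; an unreachable device yields all-ones. */
static uint32_t fake_read32(struct fake_nic *nic)
{
    if (nic->unreachable)
        return 0xffffffffu;
    if (nic->busy_polls > 0) {
        nic->busy_polls--;
        return FAKE_BUSY_BIT;
    }
    return 0;
}

/* Condition being polled: is the busy bit still set? */
static bool fake_counters_cond(struct fake_nic *nic)
{
    return (fake_read32(nic) & FAKE_BUSY_BIT) != 0;
}

/*
 * Poll up to `loops` times until the condition equals `want`.
 * On timeout, print a message shaped like the ones in the logs.
 * `delay_us` is only reported, not slept, to keep the sketch portable.
 */
static bool poll_wait(struct fake_nic *nic, const char *name,
                      unsigned delay_us, int loops, bool want)
{
    assert(nic != NULL && name != NULL);
    for (int i = 0; i < loops; i++) {
        if (fake_counters_cond(nic) == want)
            return true;
    }
    printf("%s == %d (loop: %d, delay: %u).\n", name, !want, loops, delay_us);
    return false;
}
```

With a fake_nic whose unreachable field is true, poll_wait(&nic, "fake_counters_cond", 10, 1000, false) returns false and prints a line in the same shape as the log entries above. With a responsive device that clears the bit after a few polls, it returns true without printing anything.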
#Archlinux x86_64
>uname -a
Linux hawker64 4.6.4-1-ck #1 SMP PREEMPT Mon Jul 11 17:37:05 EDT 2016 x86_64 GNU/Linux
>inxi -M
Machine: Mobo: ASUSTeK model: P6T SE v: Rev 1.xx Bios: American Megatrends v: 0908 date: 09/21/2010

I too have been experiencing this, but I put it down to the Intel 55x0 chipset errata: the interrupt remapping issue. (Intel 5500/5520/X58 chipset revisions 0x13 and 0x22 have errata (#47 and #53) which make the IOMMU interrupt remapping unit unreliable. This erratum causes interruptions, and the interrupt remapping invalidations become unresponsive.)
https://forums.gentoo.org/viewtopic-t-1030102-start-0.html?sid=59c8eddb43e0553296f93355ea10b42d

Below are some snippets from logs that I *think* may be relevant. Occasionally I get hard lockups where only a hard reboot will suffice; on other occasions I just lose network connectivity. Mostly X freezes and I get dropped to a tty with errors about radeon fence/ring. This started happening for me with the 4.* kernels; if I roll back to, say, kernel 3.19-1, all is well. So again I *guess* it's a kernel regression, either that or my issue is a mixture of this and the IOMMU thing. (It happens with both linux-ck and stock Arch Linux kernels.)

perf: interrupt took too long (2711 > 2500), lowering kernel.perf_event_max_sample_rate to 73000
perf: interrupt took too long (3512 > 3388), lowering kernel.perf_event_max_sample_rate to 56000
perf: interrupt took too long (4459 > 4390), lowering kernel.perf_event_max_sample_rate to 44000
perf: interrupt took too long (5613 > 5573), lowering kernel.perf_event_max_sample_rate to 35000
hawker64 kernel: [drm:radeon_cs_ioctl [radeon]] *ERROR* Failed to schedule IB !
hawker64 kernel: [drm:radeon_cs_ioctl [radeon]] *ERROR* Failed to schedule IB !
hawker64 kernel: radeon 0000:02:00.0: scheduling IB failed (-2).
hawker64 kernel: [drm:radeon_cs_ioctl [radeon]] *ERROR* Failed to schedule IB !
hawker64 kernel: r8169 0000:05:00.0 enp5s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
hawker64 kernel: r8169 0000:05:00.0 enp5s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
hawker64 kernel: INFO: rcu_preempt detected stalls on CPUs/tasks:
hawker64 kernel: 1-...: (0 ticks this GP) idle=9b2/0/0 softirq=4017531/4017531 fqs=0
hawker64 kernel: 2-...: (6 GPs behind) idle=dd2/0/0 softirq=2426637/2426637 fqs=0
hawker64 kernel: 3-...: (3 GPs behind) idle=e78/0/0 softirq=1361775/1361777 fqs=0
hawker64 kernel: 4-...: (38 GPs behind) idle=3e0/0/0 softirq=440832/440833 fqs=0
hawker64 kernel: 6-...: (1 GPs behind) idle=efe/0/0 softirq=305520/305520 fqs=0
hawker64 kernel: 7-...: (1 GPs behind) idle=a2a/0/0 softirq=204822/204822 fqs=0
hawker64 kernel: (detected by 0, t=127647 jiffies, g=2627574, c=2627573, q=9503)
hawker64 kernel: rcu_preempt kthread starved for 127647 jiffies! g2627574 c2627573 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0
Jul 01 14:13:57 hawker64 login[541]: pam_systemd(login:session): Failed to release session: Connection reset by peer
Jul 01 14:13:57 hawker64 systemd-logind[507]: Failed to abandon session scope: Transport endpoint is not connected
-- Reboot --

I am also getting this with 4.17.0-0.rc7.

lspci says my card is:
Realtek Semiconductor Co., Ltd. RTL8169 PCI Gigabit Ethernet Controller (rev 10)

r8169 0000:04:06.0 enp4s6: rtl_counters_cond == 1 (loop: 1000, delay: 10)
...(spam)...

I compiled this kernel myself since I saw that this was supposed to have been fixed by previous commits:
f51d4a10ac39ecf06b25e7a79121b06f7ed59928
and
e06362369ae1e5b0ba70b66f8703ff08bcb63b23
...however it persists in 4.17.0.
Another interesting fact is that ethtool cannot disable the wol features:

# ethtool enp4s6
Settings for enp4s6:
Supported ports: [ TP MII ]
Supported link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Half 1000baseT/Full
Supported pause frame use: No
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Half 1000baseT/Full
Advertised pause frame use: Symmetric Receive-only
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Link partner advertised link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Half 1000baseT/Full
Link partner advertised pause frame use: Symmetric Receive-only
Link partner advertised auto-negotiation: Yes
Link partner advertised FEC modes: Not reported
Speed: 1000Mb/s
Duplex: Full
Port: MII
PHYAD: 0
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: pumbg
Wake-on: pumbg
Current message level: 0x00000033 (51)
drv probe ifdown ifup
Link detected: yes

# ethtool -s enp4s6 wol d
... and
# ethtool enp4s6
will report same information, no change.
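As a reading aid for the Wake-on fields above: ethtool prints one letter per wake-up option (p = PHY activity, u = unicast, m = multicast, b = broadcast, a = ARP, g = MagicPacket, s = SecureOn password, d = disabled). The C sketch below decodes a bitmask into that letter string. The SK_WAKE_* constants are local copies assumed to mirror the WAKE_* values in linux/ethtool.h, and wol_to_letters is a hypothetical helper, not part of ethtool or the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies, assumed to match WAKE_* in <linux/ethtool.h>. */
#define SK_WAKE_PHY         (1u << 0)
#define SK_WAKE_UCAST       (1u << 1)
#define SK_WAKE_MCAST       (1u << 2)
#define SK_WAKE_BCAST       (1u << 3)
#define SK_WAKE_ARP         (1u << 4)
#define SK_WAKE_MAGIC       (1u << 5)
#define SK_WAKE_MAGICSECURE (1u << 6)

/* Bit-to-letter table in the order the letters are printed. */
static const struct {
    uint32_t bit;
    char letter;
} wol_map[] = {
    { SK_WAKE_PHY, 'p' },   { SK_WAKE_UCAST, 'u' },
    { SK_WAKE_MCAST, 'm' }, { SK_WAKE_BCAST, 'b' },
    { SK_WAKE_ARP, 'a' },   { SK_WAKE_MAGIC, 'g' },
    { SK_WAKE_MAGICSECURE, 's' },
};

/*
 * Hypothetical helper: write the letter string for `wolopts` into `out`.
 * `outlen` must be at least 8 (seven letters plus the terminating NUL).
 * An empty mask is rendered as "d" (disabled).
 */
static void wol_to_letters(uint32_t wolopts, char *out, size_t outlen)
{
    size_t n = 0;

    assert(out != NULL && outlen >= 8);
    for (size_t i = 0; i < sizeof(wol_map) / sizeof(wol_map[0]); i++) {
        if (wolopts & wol_map[i].bit)
            out[n++] = wol_map[i].letter;
    }
    if (n == 0)
        out[n++] = 'd';
    out[n] = '\0';
}
```

Under these assumptions, a mask with the PHY, unicast, multicast, broadcast and magic bits set decodes to "pumbg", and an empty mask decodes to "d", which is what `ethtool -s enp4s6 wol d` would be expected to produce if the setting took effect.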
Still present in $ uname -a Linux vyos 4.19.4-amd64-vyos #1 SMP Thu Dec 13 10:10:42 CET 2018 x86_64 GNU/Linux $ dmesg |grep r8169 [ 1.700147] libphy: r8169: probed [ 1.700762] r8169 0000:01:00.0 eth0: RTL8168e/8111e, aa:aa:aa:00:1e:00, XID 2c200000, IRQ 24 [ 1.700767] r8169 0000:01:00.0 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko] [ 1.705981] libphy: r8169: probed [ 1.706263] r8169 0000:02:00.0 eth1: RTL8101e, 00:19:21:df:fc:1e, XID 34000000, IRQ 25 [ 1.709367] libphy: r8169: probed [ 1.709994] r8169 0000:03:00.0 eth2: RTL8168e/8111e, aa:aa:aa:00:1d:50, XID 2c200000, IRQ 26 [ 1.710000] r8169 0000:03:00.0 eth2: jumbo features [frames: 9200 bytes, tx checksumming: ko] [ 1.710214] r8169 0000:04:02.0: not PCI Express [ 1.713563] libphy: r8169: probed [ 1.713839] r8169 0000:04:02.0 eth3: RTL8169sb/8110sb, c8:3a:35:dd:e3:b8, XID 10000000, IRQ 19 [ 1.713844] r8169 0000:04:02.0 eth3: jumbo features [frames: 7152 bytes, tx checksumming: ok] [ 25.660949] r8169 0000:02:00.0 rename3: renamed from eth1 [ 26.797862] r8169 0000:01:00.0: invalid short VPD tag 05 at offset 2 [ 26.798967] r8169 0000:02:00.0: invalid short VPD tag 05 at offset 2 [ 26.802305] r8169 0000:03:00.0: invalid short VPD tag 05 at offset 2 [ 26.820900] r8169 0000:03:00.0 rename4: renamed from eth2 [ 26.865677] r8169 0000:02:00.0 eth2: renamed from rename3 [ 27.912200] r8169 0000:01:00.0 eth1: renamed from eth0 [ 27.950468] r8169 0000:03:00.0 eth0: renamed from rename4 [ 49.346688] RTL8211C Gigabit Ethernet r8169-410:00: attached PHY driver [RTL8211C Gigabit Ethernet] (mii_bus:phy_addr=r8169-410:00, irq=IGNORE) [ 49.447626] r8169 0000:04:02.0 eth3: Link is Down [ 51.929646] RTL8201CP Ethernet r8169-200:00: attached PHY driver [RTL8201CP Ethernet] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE) [ 52.696796] r8169 0000:04:02.0 eth3: Link is Up - 1Gbps/Full - flow control rx/tx [ 53.593780] RTL8211DN Gigabit Ethernet r8169-100:00: attached PHY driver [RTL8211DN Gigabit Ethernet] 
(mii_bus:phy_addr=r8169-100:00, irq=IGNORE) [ 53.856398] r8169 0000:01:00.0 eth1: No native access to PCI extended config space, falling back to CSI [ 55.400747] RTL8211DN Gigabit Ethernet r8169-300:00: attached PHY driver [RTL8211DN Gigabit Ethernet] (mii_bus:phy_addr=r8169-300:00, irq=IGNORE) [ 55.784729] r8169 0000:01:00.0 eth1: Link is Up - 1Gbps/Full - flow control rx/tx [ 57.276860] r8169 0000:03:00.0 eth0: Link is Up - 100Mbps/Full - flow control rx/tx [ 1455.954345] r8169 0000:02:00.0 eth2: rtl_counters_cond == 1 (loop: 1000, delay: 10). [ 1455.965944] r8169 0000:02:00.0 eth2: rtl_counters_cond == 1 (loop: 1000, delay: 10). [ 1456.475123] r8169 0000:02:00.0 eth2: rtl_counters_cond == 1 (loop: 1000, delay: 10). [ 1456.486725] r8169 0000:02:00.0 eth2: rtl_counters_cond == 1 (loop: 1000, delay: 10). [ 1458.145224] r8169 0000:02:00.0 eth2: rtl_counters_cond == 1 (loop: 1000, delay: 10). [ 1458.156894] r8169 0000:02:00.0 eth2: rtl_counters_cond == 1 (loop: 1000, delay: 10). [ 1458.348005] r8169 0000:02:00.0 eth2: rtl_counters_cond == 1 (loop: 1000, delay: 10). [ 1458.359591] r8169 0000:02:00.0 eth2: rtl_counters_cond == 1 (loop: 1000, delay: 10). [ 1460.720713] r8169 0000:02:00.0 eth2: rtl_counters_cond == 1 (loop: 1000, delay: 10). [ 1460.732311] r8169 0000:02:00.0 eth2: rtl_counters_cond == 1 (loop: 1000, delay: 10). [ 1827.272274] r8169 0000:02:00.0 eth2: rtl_counters_cond == 1 (loop: 1000, delay: 10). [ 2236.992186] r8169 0000:02:00.0 eth2: rtl_counters_cond == 1 (loop: 1000, delay: 10). [ 2237.003751] r8169 0000:02:00.0 eth2: rtl_counters_cond == 1 (loop: 1000, delay: 10). [ 2239.731677] r8169 0000:02:00.0 eth2: rtl_counters_cond == 1 (loop: 1000, delay: 10). [ 2239.743239] r8169 0000:02:00.0 eth2: rtl_counters_cond == 1 (loop: 1000, delay: 10). [ 2267.371138] r8169 0000:02:00.0 eth2: rtl_counters_cond == 1 (loop: 1000, delay: 10). [ 2267.382701] r8169 0000:02:00.0 eth2: rtl_counters_cond == 1 (loop: 1000, delay: 10). 
[ 2427.267260] r8169 0000:02:00.0 eth2: rtl_counters_cond == 1 (loop: 1000, delay: 10). $ lspci 00:00.0 Host bridge: Intel Corporation 82945G/GZ/P/PL Memory Controller Hub (rev 02) 00:01.0 PCI bridge: Intel Corporation 82945G/GZ/P/PL PCI Express Root Port (rev 02) 00:02.0 VGA compatible controller: Intel Corporation 82945G/GZ Integrated Graphics Controller (rev 02) 00:1c.0 PCI bridge: Intel Corporation NM10/ICH7 Family PCI Express Port 1 (rev 01) 00:1c.1 PCI bridge: Intel Corporation NM10/ICH7 Family PCI Express Port 2 (rev 01) 00:1d.0 USB controller: Intel Corporation NM10/ICH7 Family USB UHCI Controller #1 (rev 01) 00:1d.1 USB controller: Intel Corporation NM10/ICH7 Family USB UHCI Controller #2 (rev 01) 00:1d.2 USB controller: Intel Corporation NM10/ICH7 Family USB UHCI Controller #3 (rev 01) 00:1d.3 USB controller: Intel Corporation NM10/ICH7 Family USB UHCI Controller #4 (rev 01) 00:1d.7 USB controller: Intel Corporation NM10/ICH7 Family USB2 EHCI Controller (rev 01) 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1) 00:1f.0 ISA bridge: Intel Corporation 82801GB/GR (ICH7 Family) LPC Interface Bridge (rev 01) 00:1f.3 SMBus: Intel Corporation NM10/ICH7 Family SMBus Controller (rev 01) 01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06) 02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8101E/RTL8102E PCI Express Fast Ethernet controller (rev 01) 03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06) 04:02.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8169 PCI Gigabit Ethernet Controller (rev 10) https://github.com/vyos/vyos-kernel It seems that only RTL8101e has the described problem. Or did anybody face this problem with another chip version too? (In reply to Heiner Kallweit from comment #10) > It seems that only RTL8101e has the described problem. 
Or did anybody face > this problem with another chip version too? I see a variety of chips in this bug report, including RTL8111/8168/8411, RTL8101E/RTL8102E and RTL8169. The 8101e is most likely mentioned more times because it is only fast ethernet (100M), not gigabit, so people are more likely to have it unplugged and using a gigabit card instead. Unfortunately the lspci output doesn't really help. What I would need is the dmesg line with chip name and XID (like in yesterdays report). And it would be good if affected people could re-test with a recent kernel version. I won't pick up 3 years old reports, for certain chip versions this may have been fixed in the meantime. I'll check with Realtek whether there are any known issues with the statistics counters on specific chip versions. (In reply to Jernej Jakob from comment #9) > Still present in > > $ uname -a > > Linux vyos 4.19.4-amd64-vyos #1 SMP Thu Dec 13 10:10:42 CET 2018 x86_64 > GNU/Linux > > $ dmesg |grep r8169 > > [ 1.700147] libphy: r8169: probed > [ 1.700762] r8169 0000:01:00.0 eth0: RTL8168e/8111e, aa:aa:aa:00:1e:00, > XID 2c200000, IRQ 24 > [ 1.700767] r8169 0000:01:00.0 eth0: jumbo features [frames: 9200 bytes, > tx checksumming: ko] > [ 1.705981] libphy: r8169: probed > [ 1.706263] r8169 0000:02:00.0 eth1: RTL8101e, 00:19:21:df:fc:1e, XID > 34000000, IRQ 25 > [ 1.709367] libphy: r8169: probed > [ 1.709994] r8169 0000:03:00.0 eth2: RTL8168e/8111e, aa:aa:aa:00:1d:50, > XID 2c200000, IRQ 26 > [ 1.710000] r8169 0000:03:00.0 eth2: jumbo features [frames: 9200 bytes, > tx checksumming: ko] > [ 1.710214] r8169 0000:04:02.0: not PCI Express > [ 1.713563] libphy: r8169: probed > [ 1.713839] r8169 0000:04:02.0 eth3: RTL8169sb/8110sb, c8:3a:35:dd:e3:b8, > XID 10000000, IRQ 19 > [ 1.713844] r8169 0000:04:02.0 eth3: jumbo features [frames: 7152 bytes, > tx checksumming: ok] > [ 25.660949] r8169 0000:02:00.0 rename3: renamed from eth1 > [ 26.797862] r8169 0000:01:00.0: invalid short VPD tag 05 at offset 2 > [ 
26.798967] r8169 0000:02:00.0: invalid short VPD tag 05 at offset 2 > [ 26.802305] r8169 0000:03:00.0: invalid short VPD tag 05 at offset 2 > [ 26.820900] r8169 0000:03:00.0 rename4: renamed from eth2 > [ 26.865677] r8169 0000:02:00.0 eth2: renamed from rename3 > [ 27.912200] r8169 0000:01:00.0 eth1: renamed from eth0 > [ 27.950468] r8169 0000:03:00.0 eth0: renamed from rename4 > [ 49.346688] RTL8211C Gigabit Ethernet r8169-410:00: attached PHY driver > [RTL8211C Gigabit Ethernet] (mii_bus:phy_addr=r8169-410:00, irq=IGNORE) > [ 49.447626] r8169 0000:04:02.0 eth3: Link is Down > [ 51.929646] RTL8201CP Ethernet r8169-200:00: attached PHY driver > [RTL8201CP Ethernet] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE) > [ 52.696796] r8169 0000:04:02.0 eth3: Link is Up - 1Gbps/Full - flow > control rx/tx > [ 53.593780] RTL8211DN Gigabit Ethernet r8169-100:00: attached PHY driver > [RTL8211DN Gigabit Ethernet] (mii_bus:phy_addr=r8169-100:00, irq=IGNORE) > [ 53.856398] r8169 0000:01:00.0 eth1: No native access to PCI extended > config space, falling back to CSI > [ 55.400747] RTL8211DN Gigabit Ethernet r8169-300:00: attached PHY driver > [RTL8211DN Gigabit Ethernet] (mii_bus:phy_addr=r8169-300:00, irq=IGNORE) > [ 55.784729] r8169 0000:01:00.0 eth1: Link is Up - 1Gbps/Full - flow > control rx/tx > [ 57.276860] r8169 0000:03:00.0 eth0: Link is Up - 100Mbps/Full - flow > control rx/tx > [ 1455.954345] r8169 0000:02:00.0 eth2: rtl_counters_cond == 1 (loop: 1000, > delay: 10). > [ 1455.965944] r8169 0000:02:00.0 eth2: rtl_counters_cond == 1 (loop: 1000, > delay: 10). > [ 1456.475123] r8169 0000:02:00.0 eth2: rtl_counters_cond == 1 (loop: 1000, > delay: 10). The issue starts from second 1455 after boot. This leaves few questions: - Was interface eth2 brought up at boot? Looks like it's brought up at second 51, because the PHY driver is attached on device open. - Is something connected to eth2, IOW should a link have been established? 
- Any action on second 1455 which could be related to the start of the issue?

(In reply to Jernej Jakob from comment #9)
> Still present in
>
> $ uname -a
>
> Linux vyos 4.19.4-amd64-vyos #1 SMP Thu Dec 13 10:10:42 CET 2018 x86_64
> GNU/Linux
>
> $ dmesg |grep r8169
> [..]
> [ 1455.954345] r8169 0000:02:00.0 eth2: rtl_counters_cond == 1 (loop: 1000,
> delay: 10).
> [ 1455.965944] r8169 0000:02:00.0 eth2: rtl_counters_cond == 1 (loop: 1000,
> delay: 10).
[..]

Seems like the chip isn't reachable, e.g. because it's in a PCI powersave state. Because all PCI reads return 0xff, this would also explain why in comment 8 all WoL flags seem to be set:

Supports Wake-on: pumbg
Wake-on: pumbg

I would need to know the call trace of the attempt to access the sleeping chip.
Could you apply the following and post the resulting warning?

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 7c252e198..cb2aab4e6 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -768,6 +768,7 @@ static bool rtl_loop_wait(struct rtl8169_private *tp, const struct rtl_cond *c,
 	}
 	netif_err(tp, drv, tp->dev, "%s == %d (loop: %d, delay: %d).\n",
 		  c->msg, !high, n, d);
+	WARN_ON_ONCE(1);
 	return false;
 }

--
2.20.1

With the following commit the log spam shouldn't occur any longer (even though it's not clear how the chip can be in a PCI power-save state if it's not runtime-suspended). However it will take some time until it gets applied to stable.

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=10262b0b53666cbc506989b17a3ead1e9c3b43b4

(In reply to Heiner Kallweit from comment #14)
...
> Could you apply the following and post the resulting warning?
>
>
> diff --git a/drivers/net/ethernet/realtek/r8169.c
> b/drivers/net/ethernet/realtek/r8169.c
> index 7c252e198..cb2aab4e6 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -768,6 +768,7 @@ static bool rtl_loop_wait(struct rtl8169_private *tp,
> const struct rtl_cond *c,
> }
> netif_err(tp, drv, tp->dev, "%s == %d (loop: %d, delay: %d).\n",
> c->msg, !high, n, d);
> + WARN_ON_ONCE(1);
> return false;
> }
>
> --
> 2.20.1

Applied and tested, but no change - not getting any extra output. Do I need to change any log level settings?
My rtl_loop_wait is on line 772 (kernel 4.19.4-amd64-vyos)

When I run "ip link" I get rtl_counters_cond logged on the console 4 times before the command gives any output. Same with "ethtool eth2", I get two rtl_counters_cond on the console.

WARN_ON_ONCE prints a stack trace plus some additional info to the syslog (dmesg).
Could you please re-test with 4.19.19 or 4.20.6? They include the patch referenced in comment 15.

(In reply to Heiner Kallweit from comment #17)
> WARN_ON_ONCE prints a stack trace plus some additional info to the syslog
> (dmesg).
> Could you please re-test with 4.19.19 or 4.20.6? They include the patch
> referenced in comment 15.

I'm testing that patch right now on the same kernel version. I can't easily upgrade the kernel as this is an image-based system (vyos) that's running on that particular hardware. I can however quickly and easily recompile the module and swap out the .ko (with rebooting) as I have everything set up from the first recompilation.

(In reply to Heiner Kallweit from comment #15)
> With the following commit the log spam shouldn't occur any longer (even
> though it's not clear how the chip can be in a PCI power-save state if it's
> not runtime-suspended). However it will take some time until it gets applied
> to stable.
> > https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/ > ?id=10262b0b53666cbc506989b17a3ead1e9c3b43b4 No difference with this patch either. Interestingly, ethtool seems to show the WOL state fine: Settings for eth2: Supported ports: [ TP MII ] Supported link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full Supported pause frame use: Symmetric Receive-only Supports auto-negotiation: Yes Advertised link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full Advertised pause frame use: Symmetric Receive-only Advertised auto-negotiation: Yes Speed: Unknown! Duplex: Unknown! (255) Port: MII PHYAD: 0 Transceiver: external Auto-negotiation: on Supports Wake-on: pumbg Wake-on: ug Current message level: 0x00000033 (51) drv probe ifdown ifup Link detected: no Also there seems to be no difference between ethtool output when it works normally (without rtl_counters_cond logged) and when it doesn't. The logs always start a certain time after booting, not immediately. And there is exactly one log every 600 seconds (something polling the NIC?). lspci -vvvnn: 02:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. 
RTL8101E/RTL8102E PCI Express Fast Ethernet controller [10ec:8136] (rev 01) Subsystem: Hewlett-Packard Company Device [103c:2a57] Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Interrupt: pin A routed to IRQ 25 Region 0: I/O ports at c800 [disabled] [size=256] Region 2: Memory at fe9ff000 (64-bit, non-prefetchable) [disabled] [size=4K] [virtual] Expansion ROM at fe9c0000 [disabled] [size=128K] Capabilities: [40] Power Management version 2 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0-,D1+,D2+,D3hot+,D3cold+) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Capabilities: [48] Vital Product Data pcilib: sysfs_read_vpd: read failed: Input/output error Not readable Capabilities: [50] MSI: Enable- Count=1/2 Maskable- 64bit+ Address: 0000000000000000 Data: 0000 Capabilities: [60] Express (v1) Endpoint, MSI 00 DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <128ns, L1 unlimited ExtTag+ AttnBtn+ AttnInd+ PwrInd+ RBE- FLReset- DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ MaxPayload 128 bytes, MaxReadReq 512 bytes DevSta: CorrErr+ UncorrErr+ FatalErr+ UnsuppReq+ AuxPwr+ TransPend- LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit Latency L0s unlimited, L1 unlimited ClockPM- Surprise- LLActRep- BwNot- LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk- ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- Capabilities: [84] Vendor Specific Information: Len=4c <?> Kernel driver in use: r8169 The VPD read error is present at all three PCI-e cards in this system (the one in question is onboard, the other two are RTL8111/8168/8411 add-in cards) but the rtl_counters_cond happens only on this one with no link. One RTL8168 PCI card shows no such error. 
There are a total of 4 cards, 3 with link up and 1 down.

Quite some things look weird here:

(In reply to Jernej Jakob from comment #20)
> lspci -vvvnn:
>
> 02:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd.
> RTL8101E/RTL8102E PCI Express Fast Ethernet controller [10ec:8136] (rev 01)
> Subsystem: Hewlett-Packard Company Device [103c:2a57]
> Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B- DisINTx-
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
> <MAbort- >SERR- <PERR- INTx-
> Interrupt: pin A routed to IRQ 25
> Region 0: I/O ports at c800 [disabled] [size=256]
> Region 2: Memory at fe9ff000 (64-bit, non-prefetchable) [disabled]
> [size=4K]

Region is disabled, this shouldn't be the case.

> [virtual] Expansion ROM at fe9c0000 [disabled] [size=128K]
> Capabilities: [40] Power Management version 2
> Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA
> PME(D0-,D1+,D2+,D3hot+,D3cold+)
> Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [48] Vital Product Data
> pcilib: sysfs_read_vpd: read failed: Input/output error
> Not readable

The VPD error is normal.

> Capabilities: [50] MSI: Enable- Count=1/2 Maskable- 64bit+
> Address: 0000000000000000 Data: 0000
> Capabilities: [60] Express (v1) Endpoint, MSI 00
> DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <128ns,
> L1 unlimited
> ExtTag+ AttnBtn+ AttnInd+ PwrInd+ RBE- FLReset-
> DevCtl: Report errors: Correctable- Non-Fatal- Fatal-
> Unsupported-
> RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
> MaxPayload 128 bytes, MaxReadReq 512 bytes
> DevSta: CorrErr+ UncorrErr+ FatalErr+ UnsuppReq+ AuxPwr+
> TransPend-
> LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit
> Latency L0s
> unlimited, L1 unlimited
> ClockPM- Surprise- LLActRep- BwNot-
> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+
> DLActive- BWMgmt-
> ABWMgmt-
> Capabilities: [84] Vendor Specific Information: Len=4c <?>

This also doesn't look good. Here it breaks.

> Kernel driver in use: r8169
>
> The VPD read error is present at all three PCI-e cards in this system (the
> one in question is onboard, the other two are RTL8111/8168/8411 add-in
> cards) but the rtl_counters_cond happens only on this one with no link. One
> RTL8168 PCI card shows no such error. There are a total of 4 cards, 3 with
> link up and 1 down.

Was there ever a kernel version that didn't show these errors? IOW, is it a regression? Else support for this very old chip version may always have been broken.

(In reply to Heiner Kallweit from comment #21)
> Was there ever a kernel version that didn't show these errors? IOW, is it a
> regression? Else support for this very old chip version may always have been
> broken.

I'm not sure, I think it was previously on 3.13, definitely no errors then.
Now I tried plugging a cable in and the link LED came on, but the system didn't detect the link up change (ip link and ethtool both showed link down), then I rebooted it and the link came up, now it works normally so it was likely in some kind of weird powered down state. lspci now shows: 02:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8101E/RTL8102E PCI Express Fast Ethernet controller [10ec:8136] (rev 01) Subsystem: Hewlett-Packard Company Device [103c:2a57] Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 32 bytes Interrupt: pin A routed to IRQ 25 Region 0: I/O ports at c800 [size=256] Region 2: Memory at fe9ff000 (64-bit, non-prefetchable) [size=4K] Expansion ROM at fe9c0000 [disabled] [size=128K] Capabilities: [40] Power Management version 2 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0-,D1+,D2+,D3hot+,D3cold+) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME+ Capabilities: [48] Vital Product Data pcilib: sysfs_read_vpd: read failed: Input/output error Not readable Capabilities: [50] MSI: Enable+ Count=1/2 Maskable- 64bit+ Address: 00000000fee01004 Data: 4026 Capabilities: [60] Express (v1) Endpoint, MSI 00 DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <128ns, L1 unlimited ExtTag+ AttnBtn+ AttnInd+ PwrInd+ RBE- FLReset- DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ MaxPayload 128 bytes, MaxReadReq 512 bytes DevSta: CorrErr+ UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend- LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit Latency L0s unlimited, L1 unlimited ClockPM- Surprise- LLActRep- BwNot- LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- 
ABWMgmt- Capabilities: [84] Vendor Specific Information: Len=4c <?> Kernel driver in use: r8169

Region 2 is no longer disabled. Also the RTL8111 and RTL8169 that are now unplugged show no errors.

(In reply to Jernej Jakob from comment #22)
> (In reply to Heiner Kallweit from comment #21)
> > Was there ever a kernel version that didn't show these errors? IOW, is it a
> > regression? Else support for this very old chip version may always have been
> > broken.
>
> I'm not sure, I think it was previously on 3.13, definitely no errors then.
>
> Now I tried plugging a cable in and the link LED came on, but the system
> didn't detect the link up change (ip link and ethtool both showed link
> down), then I rebooted it and the link came up, now it works normally so it
> was likely in some kind of weird powered down state.

Currently there's an issue with wakeup if a cable is plugged in. It's an issue in the PCIe subsystem; the fix is waiting to be applied:
https://patchwork.ozlabs.org/patch/1034385/

Neither of those 2 patches will apply on my 4.19 branch. Seems like those functions were added in later commits. Do I need to update and rebuild my whole kernel or can I somehow find which commits I need to backport?

Right, maybe the change reverted by this patch was added after 4.19. When checking the vendor r8101 driver, I saw that they disable MSI for this chip version and use a legacy interrupt. But I don't know whether disabling MSI could help in your case. This old RTL8101e chip version seems to be somewhat quirky. Your log says subsystem HP; what kind of device is it, an old HP laptop?

It's an old HP Livermore 945GCT-HM mobo that's used as a network router, old and obsolete but otherwise super reliable other than this bug.
Yeah, the removed functions in https://patchwork.ozlabs.org/patch/1034385/ don't exist, but neither do the two in https://patchwork.ozlabs.org/patch/1034384/ that are the supposed replacement.
I'm not sure if I should try searching for all the required commits to apply the second patch or if this is a totally unrelated issue.

I experienced a variation of this issue now.
I rename my r8169 ethernet adapter to primary-eth using udev rules. It's been working reliably, however now I booted up and primary-eth was there as usual, but it was not part of primary-bridge, as it is supposed to be.

The kernel had somehow created an additional bridge called "rename3" and added the renamed r8169 adapter primary-eth to the "rename3" bridge.

================================== details
Below I rename my primary ethernet adapter to primary-eth, catering for booting my hard drive in either of my 2 different computers
===========================================================
/etc/udev/rules.d/70-mainnet-setup-link.rules
#config for when I boot hard drive in computer 1
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="aa:aa:aa:aa:aa:aa", NAME="primary-eth"
#config for when I boot hard drive in computer 2
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="bb:bb:bb:bb:bb:bb", NAME="primary-eth"
===========================================================
I use "computer 1" 99% of the time and haven't used "computer 2" in weeks.
Then primary-eth is part of a bridge: primary-bridge.
However I booted up computer 1 now from a cold boot, and I found primary-eth was part of a BRIDGE called rename3.
I deleted the rename3 bridge.
`brctl delbr rename3`
Then I saw primary-eth automatically got added to primary-bridge.
Then I just had to run
`ip link set primary-bridge up state up`
and all was well.

Is there a way I can prevent this rename3 BRIDGE from appearing?

primary-eth on "Computer 1" is:
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 11)

I found these bugs, in which an ETHERNET ADAPTER gets renamed to rename3
https://bugzilla.kernel.org/show_bug.cgi?id=107421
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1578141
But in my case it's not an ethernet adapter that's getting renamed. It's a BRIDGE called rename3 being CREATED.
Any ideas?

(In reply to lopeonline+kernelbugzilla from comment #27)
> I experienced a variation of this issue now.
> I rename me r8169 ethernet adapter to primary-eth using udev rules. It's
> been working reliably, however now I booted up and primary-eth was there as
> usual, but it was not part of primary-bridge, as it is supposed to be.
>
> The kernel had somehow created an additional bridge called "rename3" and
> added the renamed r8169 adapter primary-eth to the "rename3" bridge.
>

This is a totally different question and not related to this issue at all.

My issue was likely caused by a race condition where udev probably tried to rename my eth adapter and br0 (because it takes on the MAC of the eth) to primary-bridge. So I've cleaned up my udev rules and hopefully it won't happen again.

@Heiner
Yes, you're right. It's a different issue. Mods, feel free to delete my 2 posts in this particular bug.
The reason I assumed my issue is related is because of some vague similarities:
* My eth uses the same kernel driver as in this bug
* I've had issues with this eth driver for suspend/resume
* An unwanted bridge got created for me, named rename3; in this bug the eth is named rename3. Different but vaguely similar.
Please accept my apology and remove my 2 posts here if possible.

*lspci|grep -i eth
*02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
RTL8125 2.5GbE Controller (rev 05)
*uname -a
*Linux ghost 6.2.0-rc3 #1 SMP PREEMPT_DYNAMIC Tue Jan 10 13:21:10 CET 2023 x86_64 GNU/Linux
*dmesg -T
*[Tue Jan 17 09:02:35 2023] r8169 0000:02:00.0 enp2s0: rtl_chipcmd_cond == 1 (loop: 100, delay: 100).
*[Tue Jan 17 09:02:35 2023] r8169 0000:02:00.0 enp2s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
*[Tue Jan 17 09:02:35 2023] r8169 0000:02:00.0 enp2s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
*[Tue Jan 17 09:02:35 2023] r8169 0000:02:00.0 enp2s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
*[Tue Jan 17 09:02:35 2023] r8169 0000:02:00.0 enp2s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
*[Tue Jan 17 09:02:35 2023] r8169 0000:02:00.0 enp2s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
*[Tue Jan 17 09:02:35 2023] r8169 0000:02:00.0 enp2s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
*[Tue Jan 17 09:02:35 2023] r8169 0000:02:00.0 enp2s0: rtl_mac_ocp_e00e_cond == 1 (loop: 10, delay: 1000).
.... etc

Dear Maintainers,

I am having an issue with my ethernet card. It works when the system boots, but after around a couple of hours it disconnects. I tried different ways to get it working again without having to reboot, but nothing worked. Even rebooting doesn't solve the problem, since after a couple of hours it stops working again. I have googled around and found that some people had this same problem on older kernels, but no solution seemed to apply to this rc or to the latest stable kernel versions. I am probably missing something here. The issue also happened with the recent stable 6.1.7 and with rc kernel versions. I am currently testing the latest 6.2-rc4 version.
Below is some data I think might be useful, but if you feel I did not give enough information and you need more, please just ask me.
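For anyone comparing logs across reports, the repeated timeout lines quoted above can be tallied per interface and condition name. The following is only a minimal sketch: the regular expression is derived from the message format visible in this report, the helper name `summarize_timeouts` is made up for illustration, and the sample lines are copied from the dmesg output above.

```python
import re
from collections import Counter

# Message format as it appears in this report, e.g.
# "r8169 0000:02:00.0 enp2s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10)."
# (Assumption: other kernel versions may format the line differently.)
TIMEOUT_RE = re.compile(
    r"r8169 (?P<pci>[0-9a-f:.]+) (?P<ifname>\S+): "
    r"(?P<cond>rtl_\w+_cond) == (?P<value>\d) "
    r"\(loop: (?P<loop>\d+), delay: (?P<delay>\d+)\)"
)


def summarize_timeouts(lines):
    """Count timeout messages per (interface, condition) pair.

    Hypothetical helper for illustration; not part of any kernel tooling.
    """
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match["ifname"], match["cond"])] += 1
    return counts


if __name__ == "__main__":
    # Sample lines copied from the report above.
    sample = [
        "[Tue Jan 17 09:02:35 2023] r8169 0000:02:00.0 enp2s0: rtl_chipcmd_cond == 1 (loop: 100, delay: 100).",
        "[Tue Jan 17 09:02:35 2023] r8169 0000:02:00.0 enp2s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).",
        "[Tue Jan 17 09:02:35 2023] r8169 0000:02:00.0 enp2s0: rtl_mac_ocp_e00e_cond == 1 (loop: 10, delay: 1000).",
    ]
    for (ifname, cond), count in sorted(summarize_timeouts(sample).items()):
        print(f"{ifname} {cond}: {count}")
```

To use it on a real log, pass the lines of a saved `dmesg -T` output to `summarize_timeouts` instead of the inline sample.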
Here some informations about my system : uname -a Linux ghost 6.2.0-rc4 #2 SMP PREEMPT_DYNAMIC Tue Jan 17 13:35:46 CET 2023 x86_64 GNU/Linux gcc --version gcc (Debian 12.2.0-14) 12.2.0 Copyright (C) 2022 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. /usr/src# lspci|grep -i net 02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller (rev 05) description: Ethernet interface product: RTL8125 2.5GbE Controller vendor: Realtek Semiconductor Co., Ltd. physical id: 0 bus info: pci@0000:02:00.0 logical name: enp2s0 version: ff serial: b0:25:aa:49:a5:3a size: 1Gbit/s capacity: 1Gbit/s width: 32 bits clock: 66MHz capabilities: bus_master vga_palette cap_list ethernet physical tp mii 10bt 10bt-fd 100bt 100bt-fd 1000bt-fd autonegotiation configuration: autonegotiation=on broadcast=yes driver=r8169 driverversion=6.2.0-rc4 duplex=full firmware=rtl8125b-2_0.0.2 07/13/20 latency=255 link=yes maxlatency=255 mingnt=255 multicast=yes port=twisted pair speed=1Gbit/s lsmod|grep r8169 r8169 110592 0 mdio_devres 16384 1 r8169 libphy 200704 3 r8169,mdio_devres,realtek the firmware version I am using is linux-firmware-20221214.tar.gz Here you can find what happens (dmesg -wT) [Fri Jan 20 11:04:32 2023] userif-3: sent link up event. [Fri Jan 20 13:19:41 2023] r8169 0000:02:00.0 enp2s0: rtl_chipcmd_cond == 1 (loop: 100, delay: 100). [Fri Jan 20 13:19:41 2023] r8169 0000:02:00.0 enp2s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10). [Fri Jan 20 13:19:41 2023] r8169 0000:02:00.0 enp2s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10). [Fri Jan 20 13:19:41 2023] r8169 0000:02:00.0 enp2s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10). [Fri Jan 20 13:19:41 2023] r8169 0000:02:00.0 enp2s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10). [Fri Jan 20 13:19:41 2023] r8169 0000:02:00.0 enp2s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10). 
[Fri Jan 20 13:19:41 2023] r8169 0000:02:00.0 enp2s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10). [Fri Jan 20 13:19:41 2023] r8169 0000:02:00.0 enp2s0: rtl_mac_ocp_e00e_cond == 1 (loop: 10, delay: 1000). [Fri Jan 20 13:20:17 2023] r8169 0000:02:00.0 enp2s0: rtl_chipcmd_cond == 1 (loop: 100, delay: 100). [Fri Jan 20 13:20:17 2023] r8169 0000:02:00.0 enp2s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10). [Fri Jan 20 13:20:17 2023] r8169 0000:02:00.0 enp2s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10). [Fri Jan 20 13:20:17 2023] r8169 0000:02:00.0 enp2s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10). [Fri Jan 20 13:20:17 2023] r8169 0000:02:00.0 enp2s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10). [Fri Jan 20 13:20:17 2023] r8169 0000:02:00.0 enp2s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10). [Fri Jan 20 13:20:17 2023] r8169 0000:02:00.0 enp2s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10). [Fri Jan 20 13:20:18 2023] r8169 0000:02:00.0 enp2s0: rtl_mac_ocp_e00e_cond == 1 (loop: 10, delay: 1000). I would love to provide a patch of any kind but I am afraid I don't have enough programming skills. Thanks in advance for your time. There have been reports of issues on the Realtek chip on some platforms due to buggy implementation of power saving. Look at netdev mailing list history for many threads on r8169. (In reply to Renato Gallo from comment #31) > Dear Mantainers, > > I am having an issue with my ethernet card. You received an answer to the same question on the netdev mailing list already. https://lore.kernel.org/netdev/37b1001d-688c-fa35-0d8a-cbbbae5e6fa8@gmail.com/T/ Why duplicate the communication? There is an issue with power saving for r8169 as Stephen Hemminger mentioned before. A quick fix can be to disable the power saving mode from "/etc/default/grub" by adding "igb.EEE=0" to the variable "GRUB_CMDLINE_LINUX_DEFAULT" as follows: ... GRUB_CMDLINE_LINUX_DEFAULT="quiet splash igb.EEE=0" ... 
(In reply to https://github.com/emidev98 from comment #34)
> There is an issue with power saving for r8169 as Stephen Hemminger mentioned
> before.
>
> A quick fix can be to disable the power saving mode from "/etc/default/grub"
> by adding "igb.EEE=0" to the variable "GRUB_CMDLINE_LINUX_DEFAULT" as
> follows:
>
> ...
> GRUB_CMDLINE_LINUX_DEFAULT="quiet splash igb.EEE=0"
> ...

This doesn't make sense. As the name states, the igb.EEE parameter is for the igb driver. If this helps for you, then the issue is not with r8169.

(In reply to eharastasan from comment #34)
> There is an issue with power saving for r8169 as Stephen Hemminger mentioned
> before.
>
> A quick fix can be to disable the power saving mode from "/etc/default/grub"
> by adding "igb.EEE=0" to the variable "GRUB_CMDLINE_LINUX_DEFAULT" as
> follows:
>
> ...
> GRUB_CMDLINE_LINUX_DEFAULT="quiet splash igb.EEE=0"
> ...

I had the same issue (network driver crashes after a while) with kernel 5.x and 6.x on Debian 11 with Proxmox, but I could resolve the problem by disabling ASPM. Edit /etc/default/grub to set:
GRUB_CMDLINE_LINUX="pcie_aspm=off pcie_port_pm=off"
then run update-grub and reboot. After this it works fine for me.

That's something you can also do at runtime without disabling ASPM for all devices. Just use the standard sysfs attributes under /sys/class/net/<if>/device/link.

(In reply to Heiner Kallweit from comment #37)
> That's something you can do also during runtime w/o disabling ASPM for all
> devices. Just use the standard sysfs attributes under
> /sys/class/net/<if>/device/link.

Could you walk a total beginner through how to do this?

(In reply to arbitraryadirc from comment #38)
> (In reply to Heiner Kallweit from comment #37)
> > That's something you can do also during runtime w/o disabling ASPM for all
> > devices. Just use the standard sysfs attributes under
> > /sys/class/net/<if>/device/link.
>
> Could you walk a total beginner through how to do this?
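As a rough illustration of the sysfs approach from comment #37, here is a minimal sketch that reads and toggles the per-link power saving attributes. It assumes the attribute names listed in the kernel's sysfs-bus-pci ABI documentation (`clkpm`, `l0s_aspm`, `l1_aspm`, `l1_1_aspm`, `l1_2_aspm`, `l1_1_pcipm`, `l1_2_pcipm`); the function names and the `sysfs_root` parameter are invented for this example, and the files are only present when the kernel exposes them for the device.

```python
from pathlib import Path

# Attribute names as listed in Documentation/ABI/testing/sysfs-bus-pci.
# Which of them exist depends on the device and the kernel (assumption).
ASPM_ATTRS = (
    "clkpm", "l0s_aspm", "l1_aspm",
    "l1_1_aspm", "l1_2_aspm", "l1_1_pcipm", "l1_2_pcipm",
)


def link_dir(ifname, sysfs_root="/sys"):
    """Path of the 'link' directory for a network interface (hypothetical helper)."""
    return Path(sysfs_root) / "class" / "net" / ifname / "device" / "link"


def read_link_states(ifname, sysfs_root="/sys"):
    """Return {attribute: value} for every known attribute that exists."""
    states = {}
    base = link_dir(ifname, sysfs_root)
    for name in ASPM_ATTRS:
        attr = base / name
        if attr.is_file():
            states[name] = attr.read_text().strip()
    return states


def set_link_state(ifname, attr, enabled, sysfs_root="/sys"):
    """Write 1 or 0 to one attribute; on a real system this needs root."""
    if attr not in ASPM_ATTRS:
        raise ValueError(f"unknown attribute: {attr}")
    path = link_dir(ifname, sysfs_root) / attr
    if not path.is_file():
        # An empty 'link' directory means the kernel exposes no such control here.
        raise FileNotFoundError(f"{path} is not available for {ifname}")
    path.write_text("1" if enabled else "0")
```

Run as root, something like `read_link_states("enp2s0")` followed by `set_link_state("enp2s0", "l1_2_aspm", False)` would correspond to disabling L1.2 first; `enp2s0` is only an example interface name.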
See documentation:
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/Documentation/ABI/testing/sysfs-bus-pci?h=next-20230725

Start with disabling the deepest power saving state: L1.2

Hello together,

I guess I stumbled over the same problem: one of the two Realtek 8169 NICs in my system often goes down after some time, and the only possible fix is a reboot. It is not always the same one of the two NICs. This happens especially since upgrading from Proxmox 7 to Proxmox 8 (which means an upgrade from Debian bullseye to Debian bookworm); before that I did not really notice this behavior. It might have been there before, but it definitely occurred a lot less often than now (now at least daily or every second day). It happens especially under somewhat longer and higher (compared to idle) network load (~9-10 MByte/s, but far away from the maximum theoretical or practical link speed of ~115 MByte/s).

@Heiner Kallweit regarding

"Start with disabling the deepest power saving state: L1.2"

I wanted to try that, but the corresponding paths, e.g. /sys/bus/pci/devices/0000\:00\:1d.0/link/, are empty, so I currently do not know whether I can disable that. I read that it might not be possible because of the BIOS settings, but when I look at lspci -vvv it seems to be available?!

02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
Subsystem: Realtek Semiconductor Co., Ltd.
RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 19 Region 0: I/O ports at 3000 [size=256] Region 2: Memory at 9c204000 (64-bit, non-prefetchable) [size=4K] Region 4: Memory at 9c200000 (64-bit, non-prefetchable) [size=16K] Capabilities: [40] Power Management version 3 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+ Address: 0000000000000000 Data: 0000 Capabilities: [70] Express (v2) Endpoint, MSI 01 DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 10W DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+ RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- MaxPayload 128 bytes, MaxReadReq 4096 bytes DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend- LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s unlimited, L1 <64us ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s, Width x1 TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+ 10BitTagComp- 10BitTagReq- OBFF Via message/WAKE#, ExtFmt- EETLPPrefix- EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit- FRS- TPHComp- ExtTPHComp- AtomicOpsCap: 32bit- 64bit- 128bitCAS- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq- OBFF Disabled, AtomicOpsCtl: ReqEn- LnkCap2: Supported Link Speeds: 2.5GT/s, Crosslink- Retimer- 2Retimers- DRS- LnkCtl2: Target Link Speed: 2.5GT/s, 
EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1- EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest- Retimer- 2Retimers- CrosslinkRes: unsupported Capabilities: [b0] MSI-X: Enable+ Count=4 Masked- Vector table: BAR=4 offset=00000000 PBA: BAR=4 offset=00000800 Capabilities: [100 v2] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn- MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap- HeaderLog: 00000000 00000000 00000000 00000000 Capabilities: [140 v1] Virtual Channel Caps: LPEVC=0 RefClk=100ns PATEntryBits=1 Arb: Fixed- WRR32- WRR64- WRR128- Ctrl: ArbSelect=Fixed Status: InProgress- VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans- Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256- Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff Status: NegoPending- InProgress- Capabilities: [160 v1] Device Serial Number redacted serial no Capabilities: [170 v1] Latency Tolerance Reporting Max snoop latency: 3145728ns Max no snoop latency: 3145728ns Capabilities: [178 v1] L1 PM Substates L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+ PortCommonModeRestoreTime=150us PortTPowerOnTime=150us L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1- T_CommonMode=0us LTR1.2_Threshold=81920ns L1SubCtl2: T_PwrOn=150us Kernel driver in use: r8169 Kernel modules: r8169 I 
experimented a bit with echo "performance" > /sys/module/pcie_aspm/parameters/policy which seems to improve the situation a bit (but also with negative measurable power consumption effects), but I'm not yet sure whether this really solves it. dmesg | grep r8169: [ 1.398724] r8169 0000:01:00.0 eth0: RTL8168h/8111h, redacted mac, XID 541, IRQ 127 [ 1.398728] r8169 0000:01:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko] [ 1.418757] r8169 0000:02:00.0 eth1: RTL8168h/8111h, redacted mac, XID 541, IRQ 128 [ 1.418761] r8169 0000:02:00.0 eth1: jumbo features [frames: 9194 bytes, tx checksumming: ko] [ 1.490168] r8169 0000:02:00.0 enp2s0: renamed from eth1 [ 1.773791] r8169 0000:01:00.0 enp1s0: renamed from eth0 [ 5.497016] Generic FE-GE Realtek PHY r8169-0-100:00: attached PHY driver (mii_bus:phy_addr=r8169-0-100:00, irq=MAC) [ 5.701229] r8169 0000:01:00.0 enp1s0: Link is Down [ 5.761156] Generic FE-GE Realtek PHY r8169-0-200:00: attached PHY driver (mii_bus:phy_addr=r8169-0-200:00, irq=MAC) [ 5.937429] r8169 0000:02:00.0 enp2s0: Link is Down [ 8.774445] r8169 0000:01:00.0 enp1s0: Link is Up - 1Gbps/Full - flow control off [ 9.358071] r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control off lspci -t -v (stripped down to realtek + parent) -[0000:00]-+-00.0 Intel Corporation Comet Lake-U v1 4c Host Bridge/DRAM Controller +-1d.0-[01]----00.0 Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller +-1d.3-[02]----00.0 Realtek Semiconductor Co., Ltd. Actually for power consumption reasons I did not test the suggestion: GRUB_CMDLINE_LINUX="pcie_aspm=off pcie_port_pm=off" as of now. Is there something that has changed between kernel 5.15 (which has been used in proxmox 7) and kernel 6.2 in proxmox 8 which could explain it? Any other ideas what could fix the situation other than going back to older kernel and/or completely disable the aspm with above kernel cmdline? 
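To double-check which ASPM policy is actually active after the `echo` above, the policy file can be read back; the kernel marks the active entry with square brackets. This is a small sketch under that assumption, and the helper names are invented for illustration.

```python
import re
from pathlib import Path

# Same file that the echo command above writes to.
POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def active_aspm_policy(text):
    """Return the bracketed (active) policy name, or None if none is marked.

    Assumes content such as: "default [performance] powersave powersupersave".
    """
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def read_active_policy(path=POLICY_PATH):
    """Read the policy file and return the active policy (hypothetical helper)."""
    return active_aspm_policy(Path(path).read_text())
```

Calling `read_active_policy()` on the affected machine should return `performance` if the write took effect.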
I also stumbled over the following dmesg line:

[ 0.403207] MMIO Stale Data CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/processor_mmio_stale_data.html for more details.

Could this influence the r8169 driver problem here?

Please let me know whether I can help by providing more information to find/resolve this mysterious issue.

(In reply to Kai Hillmann from comment #40)

I also found a kernel trace in my logs right before the typical
2023-08-01T23:10:18.072470+02:00 pve kernel: [199380.810315] r8169 0000:02:00.0 enp2s0: rtl_chipcmd_cond == 1 (loop: 100, delay: 100).
lines start:

2023-08-01T23:10:18.036574+02:00 pve kernel: [199380.778029] ------------[ cut here ]------------
2023-08-01T23:10:18.036609+02:00 pve kernel: [199380.778046] NETDEV WATCHDOG: enp2s0 (r8169): transmit queue 0 timed out
2023-08-01T23:10:18.036614+02:00 pve kernel: [199380.778082] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x23a/0x250
2023-08-01T23:10:18.036940+02:00 pve kernel: [199380.778103] Modules linked in: tcp_diag inet_diag veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables scsi_transport_iscsi msr iptable_filter bpfilter softdog bonding tls sunrpc nfnetlink_log nfnetlink binfmt_misc snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_sof_pci_intel_cnl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match intel_rapl_msr snd_soc_acpi intel_rapl_common intel_tcc_cooling
soundwire_bus iwlmvm x86_pkg_temp_thermal intel_powerclamp coretemp snd_soc_core snd_compress mac80211 kvm_intel ac97_bus i915 snd_pcm_dmaengine libarc4 kvm drm_buddy snd_hda_intel ttm irqbypass snd_intel_dspcfg snd_intel_sdw_acpi drm_display_helper crct10dif_pclmul snd_hda_codec polyval_clmulni polyval_generic ghash_clmulni_intel cec sha512_ssse3 snd_hda_core btusb 2023-08-01T23:10:18.037392+02:00 pve kernel: [199380.778501] rc_core snd_hwdep aesni_intel btrtl mei_pxp mei_hdcp btbcm snd_pcm crypto_simd drm_kms_helper btintel cryptd cmdlinepart btmtk iwlwifi i2c_algo_bit spi_nor rapl mei_me snd_timer syscopyarea sysfillrect snd bluetooth ecdh_generic intel_cstate soundcore sysimgblt wmi_bmof pcspkr ecc mtd mei ee1004 cfg80211 intel_pch_thermal acpi_pad acpi_tad mac_hid zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap efi_pstore drm dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq simplefb ums_realtek dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c uas usb_storage spi_intel_pci crc32_pclmul xhci_pci spi_intel xhci_pci_renesas i2c_i801 i2c_smbus ahci r8169 realtek xhci_hcd libahci video wmi pinctrl_cannonlake 2023-08-01T23:10:18.037408+02:00 pve kernel: [199380.778968] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P O 6.2.16-5-pve #1 2023-08-01T23:10:18.037410+02:00 pve kernel: [199380.778978] Hardware name: ZOTAC ZBOX-CI622/CI642/CI662NANO/ZBOX-CI622/CI642/CI662NANO, BIOS B418P108 07/08/2021 2023-08-01T23:10:18.037429+02:00 pve kernel: [199380.778984] RIP: 0010:dev_watchdog+0x23a/0x250 2023-08-01T23:10:18.037431+02:00 pve kernel: [199380.779002] Code: 00 e9 2b ff ff ff 48 89 df c6 05 8b 6c 7d 01 01 e8 6b 08 f8 ff 44 89 f1 48 89 de 48 c7 c7 98 65 80 a0 48 89 c2 e8 86 a6 30 ff <0f> 0b e9 1c ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 2023-08-01T23:10:18.037452+02:00 pve kernel: [199380.779011] RSP: 0018:ffffacafc0003e38 EFLAGS: 00010246 
2023-08-01T23:10:18.037453+02:00 pve kernel: [199380.779025] RAX: 0000000000000000 RBX: ffff9f3f9261c000 RCX: 0000000000000000 2023-08-01T23:10:18.037459+02:00 pve kernel: [199380.779032] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 2023-08-01T23:10:18.037465+02:00 pve kernel: [199380.779038] RBP: ffffacafc0003e68 R08: 0000000000000000 R09: 0000000000000000 2023-08-01T23:10:18.037492+02:00 pve kernel: [199380.779045] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9f3f9261c4c8 2023-08-01T23:10:18.037495+02:00 pve kernel: [199380.779052] R13: ffff9f3f9261c41c R14: 0000000000000000 R15: 0000000000000000 2023-08-01T23:10:18.037496+02:00 pve kernel: [199380.779058] FS: 0000000000000000(0000) GS:ffff9f4ea2e00000(0000) knlGS:0000000000000000 2023-08-01T23:10:18.037498+02:00 pve kernel: [199380.779065] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 2023-08-01T23:10:18.037505+02:00 pve kernel: [199380.779073] CR2: 00007fe339a08ed7 CR3: 0000000d20610002 CR4: 00000000003726f0 2023-08-01T23:10:18.037506+02:00 pve kernel: [199380.779080] Call Trace: 2023-08-01T23:10:18.037512+02:00 pve kernel: [199380.779086] <IRQ> 2023-08-01T23:10:18.037537+02:00 pve kernel: [199380.779096] ? __pfx_dev_watchdog+0x10/0x10 2023-08-01T23:10:18.037543+02:00 pve kernel: [199380.779112] call_timer_fn+0x29/0x160 2023-08-01T23:10:18.037569+02:00 pve kernel: [199380.779125] ? __pfx_dev_watchdog+0x10/0x10 2023-08-01T23:10:18.037584+02:00 pve kernel: [199380.779138] __run_timers+0x259/0x310 2023-08-01T23:10:18.037591+02:00 pve kernel: [199380.779152] run_timer_softirq+0x1d/0x40 2023-08-01T23:10:18.037594+02:00 pve kernel: [199380.779161] __do_softirq+0xd6/0x346 2023-08-01T23:10:18.037609+02:00 pve kernel: [199380.779170] ? 
hrtimer_interrupt+0x11f/0x250 2023-08-01T23:10:18.037630+02:00 pve kernel: [199380.779187] __irq_exit_rcu+0xa2/0xd0 2023-08-01T23:10:18.037634+02:00 pve kernel: [199380.779198] irq_exit_rcu+0xe/0x20 2023-08-01T23:10:18.037641+02:00 pve kernel: [199380.779207] sysvec_apic_timer_interrupt+0x92/0xd0 2023-08-01T23:10:18.037650+02:00 pve kernel: [199380.779220] </IRQ> 2023-08-01T23:10:18.037651+02:00 pve kernel: [199380.779226] <TASK> 2023-08-01T23:10:18.037671+02:00 pve kernel: [199380.779233] asm_sysvec_apic_timer_interrupt+0x1b/0x20 2023-08-01T23:10:18.037679+02:00 pve kernel: [199380.779244] RIP: 0010:cpuidle_enter_state+0xde/0x6f0 2023-08-01T23:10:18.037686+02:00 pve kernel: [199380.779256] Code: 27 57 60 e8 d4 79 4a ff 8b 53 04 49 89 c7 0f 1f 44 00 00 31 ff e8 02 82 49 ff 80 7d d0 00 0f 85 eb 00 00 00 fb 0f 1f 44 00 00 <45> 85 f6 0f 88 12 02 00 00 4d 63 ee 49 83 fd 09 0f 87 c7 04 00 00 2023-08-01T23:10:18.037692+02:00 pve kernel: [199380.779264] RSP: 0018:ffffffffa1003da8 EFLAGS: 00000246 2023-08-01T23:10:18.037709+02:00 pve kernel: [199380.779275] RAX: 0000000000000000 RBX: ffffccafbfc00000 RCX: 0000000000000000 2023-08-01T23:10:18.037711+02:00 pve kernel: [199380.779283] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 2023-08-01T23:10:18.037719+02:00 pve kernel: [199380.779288] RBP: ffffffffa1003df8 R08: 0000000000000000 R09: 0000000000000000 2023-08-01T23:10:18.037720+02:00 pve kernel: [199380.779294] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa12c33e0 2023-08-01T23:10:18.037725+02:00 pve kernel: [199380.779300] R13: 0000000000000008 R14: 0000000000000008 R15: 0000b555f472d0f6 2023-08-01T23:10:18.037748+02:00 pve kernel: [199380.779312] ? 
cpuidle_enter_state+0xce/0x6f0 2023-08-01T23:10:18.037750+02:00 pve kernel: [199380.779321] cpuidle_enter+0x2e/0x50 2023-08-01T23:10:18.037768+02:00 pve kernel: [199380.779329] do_idle+0x216/0x2a0 2023-08-01T23:10:18.037781+02:00 pve kernel: [199380.779341] cpu_startup_entry+0x1d/0x20 2023-08-01T23:10:18.037790+02:00 pve kernel: [199380.779350] rest_init+0xdc/0x100 2023-08-01T23:10:18.037792+02:00 pve kernel: [199380.779361] ? acpi_enable_subsystem+0xe6/0x2a0 2023-08-01T23:10:18.037810+02:00 pve kernel: [199380.779372] arch_call_rest_init+0xe/0x30 2023-08-01T23:10:18.037817+02:00 pve kernel: [199380.779385] start_kernel+0x6b0/0xb80 2023-08-01T23:10:18.037824+02:00 pve kernel: [199380.779395] ? load_ucode_intel_bsp+0x3d/0x80 2023-08-01T23:10:18.037845+02:00 pve kernel: [199380.779408] x86_64_start_kernel+0x102/0x180 2023-08-01T23:10:18.037863+02:00 pve kernel: [199380.779419] secondary_startup_64_no_verify+0xe5/0xeb 2023-08-01T23:10:18.037877+02:00 pve kernel: [199380.779435] </TASK> 2023-08-01T23:10:18.037879+02:00 pve kernel: [199380.779440] ---[ end trace 0000000000000000 ]--- I hope this helps to find the cause of this behavior. Please note that vendor kernels aren't supported here. Always test with a (best self-compiled) mainline kernel. Your report doesn't even mentioned the affected kernel version. If the ASPM sysfs attributes aren't visible, most likely reason is that BIOS claims exclusive access to ASPM settings. You can override this with pcie_aspm=force. However, according to the lspci output, ASPM is disabled anyway. If the issue doesn't occur with a previous kernel version, please bisect. (In reply to Heiner Kallweit from comment #42) > Please note that vendor kernels aren't supported here. Always test with a > (best self-compiled) mainline kernel. Your report doesn't even mentioned the > affected kernel version. > > If the ASPM sysfs attributes aren't visible, most likely reason is that BIOS > claims exclusive access to ASPM settings. 
> You can override this with
> pcie_aspm=force. However, according to the lspci output, ASPM is disabled
> anyway.
>
> If the issue doesn't occur with a previous kernel version, please bisect.

Full ack, you're right. Sorry for pasting it here; I should have looked directly within Proxmox first. Just for reference, the thread regarding this problem is here:
https://forum.proxmox.com/threads/system-hanging-after-upgrade-nic-driver.129366

I'll try to test against a mainline kernel instead of my current problematic Proxmox one:
proxmox-kernel-6.2.16-6-pve

I'm not very familiar with bisecting yet, but I'll have a look whether I can trace the behaviour change down to a change between pve-kernel-5.15.108-1-pve and the current proxmox-kernel-6.2.16-6-pve.

Dear all,

I found that the cause might be the Realtek driver entering power-saving mode.
My Ubuntu version is 22.04 with kernel 6.2.

My solution is to download the driver from the Realtek website:
https://www.realtek.com/zh-tw/component/zoo/category/pci-8169-8110
https://www.realtek.com/zh-tw/directly-download?downloadid=28907f3d6ddbf32a2041c1ceadcace96

Please follow the "readme". A gentle reminder: it is only suitable for kernel 5. If you're using kernel 6, please change netif_napi_add to netif_napi_add_weight; for details, please refer to
https://lore.kernel.org/netdev/20221002175650.1491124-1-kuba@kernel.org/t/

Thank you.

Best Regards,
Larry from Arcadyan
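
For anyone who wants to script the netif_napi_add rename mentioned above rather than editing by hand, here is a minimal sketch. It is not taken from the Realtek readme. It assumes every netif_napi_add call in the vendor source still passes the old fourth (weight) argument, which is the form that netif_napi_add_weight accepts on recent kernels. The helper name port_napi_calls and the sample driver line are hypothetical and only serve as illustration.

```python
import re

# Match calls to netif_napi_add( only. The word boundary and the required
# "(" leave already-renamed netif_napi_add_weight( calls untouched.
NAPI_CALL = re.compile(r"\bnetif_napi_add(\s*\()")


def port_napi_calls(source: str) -> str:
    """Rename every netif_napi_add( call to netif_napi_add_weight(.

    Assumption: all calls in the given source still pass the weight
    argument, so the renamed call keeps a valid argument list.
    """
    return NAPI_CALL.sub(r"netif_napi_add_weight\1", source)


# Hypothetical driver line, for illustration only.
old_line = "netif_napi_add(dev, &tp->napi, rtl8169_poll, R8169_NAPI_WEIGHT);"
new_line = port_napi_calls(old_line)
print(new_line)
```

Running the helper a second time on its own output changes nothing, so it should be safe to apply to a file that was already partly converted. Review the resulting diff before building the module.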