Bug 219388 - RTL8168fp PCI error after resuming from suspend
Summary: RTL8168fp PCI error after resuming from suspend
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network (show other bugs)
Hardware: All Linux
: P3 normal
Assignee: drivers_network@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-10-15 03:24 UTC by Pengyu Ma
Modified: 2024-10-19 04:20 UTC (History)
2 users (show)

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg when r8169 failed. (534.32 KB, text/plain)
2024-10-15 03:24 UTC, Pengyu Ma
Details

Description Pengyu Ma 2024-10-15 03:24:57 UTC
Created attachment 307007 [details]
dmesg when r8169 failed.

After resuming, r8169 shows:
kernel: r8169 0000:02:00.1 enp2s0f1: PCI error (cmd = 0x0407, status_errs = 0x0000)

Hardware: The HP ZBook Power 15.6 inch G10

Ethernet:
kernel: r8169 0000:02:00.1 eth0: RTL8168fp/RTL8117, 7c:4d:8f:a4:02:31, XID 54b, IRQ 71
kernel: r8169 0000:02:00.1 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]
kernel: r8169 0000:02:00.1 eth0: DASH disabled

The issue is only reproduced on 1 of 4 HP G10.
Failrate 1/1.

The first regression patch:
5e864d90b2080 r8169: skip DASH fw status checks when DASH is disabled

Before this patch, 4 HP G10 works fine.
After this patch, only 1 HP G10 failed.
Comment 1 Heiner Kallweit 2024-10-16 13:08:10 UTC
Apart from the error message, is there any actual issue with connectivity?
I think the error message is a red herring, because in status_errs not a single error bit is set.
Comment 2 Pengyu Ma 2024-10-16 13:33:51 UTC
After the error, the ethernet can not link up.
Comment 3 Heiner Kallweit 2024-10-16 18:54:07 UTC
Thanks. Please note that 6.8 isn't an LTS kernel and therefore won't receive fixes. Best re-test with latest mainline kernel 6.11.

If the issue persists, please re-test with the following applied:

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 1a4d834c2..322a1e930 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -4805,7 +4805,11 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
 	struct rtl8169_private *tp = dev_instance;
 	u32 status = rtl_get_events(tp);
 
-	if ((status & 0xffff) == 0xffff || !(status & tp->irq_mask))
+	if ((status & 0xffff) == 0xffff)
+		return IRQ_NONE;
+
+	status &= tp->irq_mask | RxFIFOOver;
+	if (!status)
 		return IRQ_NONE;
 
 	if (unlikely(status & SYSErr)) {
--
Comment 4 Pengyu Ma 2024-10-17 04:18:30 UTC
Thanks you very much,  Kallweit.

Your change do fix the issue.

It's reproduced on 6.11 upsteam kernel.

Do you have plan to make the change upsteam?
Comment 5 Atlas Yu 2024-10-17 04:40:36 UTC
Thanks, this patch works out great.
https://github.com/torvalds/linux/compare/master...pseudocc:linux:r8169
Comment 6 Heiner Kallweit 2024-10-17 05:32:00 UTC
(In reply to Pengyu Ma from comment #4)
> Thanks you very much,  Kallweit.
> 
> Your change do fix the issue.
> 
> It's reproduced on 6.11 upsteam kernel.
> 
> Do you have plan to make the change upsteam?

Yes, this change will be applied as a fix in mainline. Means it will also be rolled out to LTS kernel versions.
Comment 7 Pengyu Ma 2024-10-17 07:23:43 UTC
@Kallweit

Thank you very much.
Comment 8 Heiner Kallweit 2024-10-17 21:25:46 UTC
I found that the proposed patch causes issues under heavy load. Reason isn't clear yet. Could you please test whether the following fixes the issue for you as well?
This change should have no side effects.

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 4166f1ab8..56ff450ad 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -4749,7 +4749,8 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
 	if ((status & 0xffff) == 0xffff || !(status & tp->irq_mask))
 		return IRQ_NONE;
 
-	if (unlikely(status & SYSErr)) {
+	if (unlikely(status & SYSErr &&
+	    tp->mac_version <= RTL_GIGA_MAC_VER_06)) {
 		rtl8169_pcierr_interrupt(tp->dev);
 		goto out;
 	}
-- 
2.47.0
Comment 9 Atlas Yu 2024-10-18 06:16:17 UTC
(In reply to Heiner Kallweit from comment #8)
> I found that the proposed patch causes issues under heavy load. Reason isn't
> clear yet. Could you please test whether the following fixes the issue for
> you as well?
> This change should have no side effects.
> 
> diff --git a/drivers/net/ethernet/realtek/r8169_main.c
> b/drivers/net/ethernet/realtek/r8169_main.c
> index 4166f1ab8..56ff450ad 100644
> --- a/drivers/net/ethernet/realtek/r8169_main.c
> +++ b/drivers/net/ethernet/realtek/r8169_main.c
> @@ -4749,7 +4749,8 @@ static irqreturn_t rtl8169_interrupt(int irq, void
> *dev_instance)
>       if ((status & 0xffff) == 0xffff || !(status & tp->irq_mask))
>               return IRQ_NONE;
>  
> -     if (unlikely(status & SYSErr)) {
> +     if (unlikely(status & SYSErr &&
> +         tp->mac_version <= RTL_GIGA_MAC_VER_06)) {
>               rtl8169_pcierr_interrupt(tp->dev);
>               goto out;
>       }
> -- 
> 2.47.0

Much appreciate it, this patch also works.
Comment 10 Pengyu Ma 2024-10-19 04:20:16 UTC
Thanks, Kallweit,

The patch fixes the issue.

Note You need to log in before you can comment on or make changes to this bug.