Bug 202945

Summary: Realtek rt8169/rt81xx performance issue in kernel 4.19
Product: Drivers Reporter: Matt Devo (matt.devillier)
Component: NetworkAssignee: drivers_network (drivers_network)
Status: NEW ---    
Severity: high CC: brunomanuelsantos, hkallweit1, matt.devillier, mwg1812, rm+bko, tseewald
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.19-RC1+ Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg, 4.19-rc1, normal boot
dmesg, 4.19-rc1, with pci=nomsi
dmesg, 4.18.20, normal boot
output of 'sudo lspci -vv -s 1:00.0'
output of 'sudo lspci -vv -s 1:00.0' for BDW

Description Matt Devo 2019-03-15 23:27:07 UTC
Created attachment 281857 [details]
dmesg, 4.19-rc1, normal boot

kernel 4.19 brought several changes to the rt8169 driver, most notably the enabling of ASPM, which had previously been disabled. 
Potentially Relevant commits:
671646c151d492c3846e6e6797e72ff757b5d65e - r8169: Don't disable ASPM in the driver
and 
a99790bf5c7f3d68d8b01e015d3212a98ee7bd57 - r8169: Reinstate ASPM Support

On certain platforms/rt81xx variants, this results in Ethernet performance being throttled to ~1MBps, and the kernel log flooded with AER errors.  Disabling AER via 'pci=noaer' and/or disabling ASPM via 'pcie_aspm=off' eliminates the noise, but not the issue. The use of 'pci=nomsi' mostly mitigates the performance issue, bringing it back close to 4.18.x.

The most significant impact is seen on Haswell/Broadwell platform devices; Skylake/Kabylake devices are also affected, but the performance impact is much less severe (~20% reduction vs 95%). An example of this is a large network file copy over a gigabit Ethernet link from a SMB share on a NAS:

- 4.18.20: ~52MB/s
- 4.19-RC1: ~600KB/s (with very slow directory navigation)
- 4.19-RC1 w/pci=nomsi: ~35MB/s
Comment 1 Matt Devo 2019-03-15 23:27:40 UTC
Created attachment 281859 [details]
dmesg, 4.19-rc1, with pci=nomsi
Comment 2 Matt Devo 2019-03-15 23:28:05 UTC
Created attachment 281861 [details]
dmesg, 4.18.20, normal boot
Comment 3 Heiner Kallweit 2019-03-16 15:00:53 UTC
Could you please provide the lspci -vv output for the network chip?
And how is it if you set pcie_aspm.policy=performance?
Comment 4 Matt Devo 2019-03-16 15:53:34 UTC
lspci output:

01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)
	Subsystem: Realtek Semiconductor Co., Ltd. RTL8111/8168 PCI Express Gigabit Ethernet controller
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 18
	Region 0: I/O ports at 2000 [size=256]
	Region 2: Memory at e0500000 (64-bit, non-prefetchable) [size=4K]
	Region 4: Memory at e0400000 (64-bit, prefetchable) [size=16K]
	Capabilities: <access denied>
	Kernel driver in use: r8169
	Kernel modules: r8169


using 'pcie_aspm.policy=performance' resolves the issue and provides better performance than 'pci=nomsi' (~50MB/s vs ~35MB/s) for SMB file copy, no errors in dmesg.
Comment 5 Heiner Kallweit 2019-03-16 15:57:06 UTC
Issue sounds somewhat familiar, maybe it's again about ASPM L1 sub-states. Therefore it would be helpful to see the more detailed lspci output (sudo lspci -vv).
Comment 6 Matt Devo 2019-03-16 16:08:03 UTC
(In reply to Heiner Kallweit from comment #5)
> Issue sounds somewhat familiar, maybe it's again about ASPM L1 sub-states.
> Therefore it would be helpful to see the more detailed lspci output (sudo
> lspci -vv).

whoops, forgot the sudo - attachment added. 

ASPM L1 substate is enabled, but toggling it makes no difference. Device is running my own custom coreboot firmware (I'm a coreboot dev), but other non-coreboot devices are showing the same issue.  I'm happy to try any adjustments at the firmware level; so far the only one I've found that works is to disable ASPM.
Comment 7 Matt Devo 2019-03-16 16:08:40 UTC
Created attachment 281863 [details]
output of 'sudo lspci -vv -s 1:00.0'
Comment 8 Matt Devo 2019-03-16 16:29:33 UTC
also adding output for a similarly affected Broadwell device
Comment 9 Matt Devo 2019-03-16 16:30:08 UTC
Created attachment 281865 [details]
output of 'sudo lspci -vv -s 1:00.0' for BDW
Comment 10 Heiner Kallweit 2019-03-16 16:31:34 UTC
Thanks. But two things are strange:

- Devices uses MSI instead of MSI-X. Usually it selects the best option, means MSI-X.
- ASPM is disabled (LnkCtl:	ASPM Disabled)

Regarding the L1 sub-states see also here (comment 65):
https://bugzilla.redhat.com/show_bug.cgi?id=1671958

In comment 69 you find an experimental patch you could try.
Comment 11 Matt Devo 2019-03-17 19:43:19 UTC
(In reply to Heiner Kallweit from comment #10)
> Thanks. But two things are strange:
> 
> - Devices uses MSI instead of MSI-X. Usually it selects the best option,
> means MSI-X.

not sure what would cause this

> - ASPM is disabled (LnkCtl:   ASPM Disabled)

I'll look into that. The Broadwell box has it properly enabled (see previously attached lspci output) and still has the same issue, so I'm not sure it's a factor here (though something that should be fixed regardless)

> 
> Regarding the L1 sub-states see also here (comment 65):
> https://bugzilla.redhat.com/show_bug.cgi?id=1671958
> 
> In comment 69 you find an experimental patch you could try.

I built a 4.19.28 kernel with the patch in #69, and there was no change in performance (ie, still <1MB/s xfer rate, log flooded with AER errors)
Comment 12 Heiner Kallweit 2019-03-17 19:52:19 UTC
At one point we had a misunderstanding: You talked about the L1 sub-state whilst I talked about L1 sub-states(!). L1 isn't a sub-state, what I mean is L1.1 and L1.2. 
These states aren't supported in your case, therefore the patch is a no-op.
Comment 13 Matt Devo 2019-03-17 23:12:28 UTC
(In reply to Heiner Kallweit from comment #12)
> At one point we had a misunderstanding: You talked about the L1 sub-state
> whilst I talked about L1 sub-states(!). L1 isn't a sub-state, what I mean is
> L1.1 and L1.2. 
> These states aren't supported in your case, therefore the patch is a no-op.

understood. Any suggestions then?
Comment 14 Heiner Kallweit 2019-03-18 16:10:44 UTC
For now you have the workaround to use "pcie_aspm.policy=performance".

I checked with Realtek recently and they mentioned that their Windows driver uses some heuristic to disable ASPM under load. Such a hack I'd like to avoid.
And disabling ASPM (again) would hurt all the unaffected notebook users (I saw reports that ASPM can save 1W). What doesn't make the situation better is that Realtek refuses to provide access to datasheets and errata info.
To cut a long story short: I don't have a good solution yet.
Comment 15 Heiner Kallweit 2019-03-25 21:14:27 UTC
This is an experimental patch mimicking what Realtek does in the Windows driver.
If load exceeds a threshold (for now I just say 5%) then ASPM gets temporarily disabled. Worst case, due to the issue, even this threshold isn't reached.
You can play with the threshold here:
		/* Threshold for disabling ASPM: 5% */
		threshold /= 20;
For the beginning I included a debug message being printed whenever ASPM gets enabled or disabled. Would be great if you could give it a try.
Patches is based on latest linux-next, it may apply cleanly also on 5.1-rc2.

---
 drivers/net/ethernet/realtek/r8169.c | 83 +++++++++++++++++++++++++++-
 1 file changed, 82 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index a8ca26c2a..12326d010 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -625,6 +625,7 @@ struct rtl8169_tc_offsets {
 enum rtl_flag {
 	RTL_FLAG_TASK_ENABLED = 0,
 	RTL_FLAG_TASK_RESET_PENDING,
+	RTL_FLAG_TASK_DISABLE_ASPM,
 	RTL_FLAG_MAX
 };
 
@@ -678,6 +679,14 @@ struct rtl8169_private {
 		struct work_struct work;
 	} wk;
 
+	struct {
+		struct delayed_work work;
+		atomic_t pkt_counter;
+		atomic_t idle_counter;
+		unsigned supported:1;
+		unsigned enabled:1;
+	} aspm;
+
 	unsigned irq_enabled:1;
 	unsigned supports_gmii:1;
 	dma_addr_t counters_phys_addr;
@@ -4835,14 +4844,62 @@ static void rtl_hw_aspm_clkreq_enable(struct rtl8169_private *tp, bool enable)
 	if (enable) {
 		RTL_W8(tp, Config5, RTL_R8(tp, Config5) | ASPM_en);
 		RTL_W8(tp, Config2, RTL_R8(tp, Config2) | ClkReqEn);
+		tp->aspm.supported = 1;
+		tp->aspm.enabled = 1;
 	} else {
 		RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~ClkReqEn);
 		RTL_W8(tp, Config5, RTL_R8(tp, Config5) & ~ASPM_en);
+		tp->aspm.enabled = 0;
 	}
 
 	udelay(10);
 }
 
+static void rtl_aspm_set_unlock(struct rtl8169_private *tp, bool enable)
+{
+	pr_info("r8169: set ASPM to: %d\n", enable);
+	rtl_unlock_config_regs(tp);
+	rtl_hw_aspm_clkreq_enable(tp, enable);
+	rtl_lock_config_regs(tp);
+}
+
+static void rtl_sched_aspm_work(struct rtl8169_private *tp)
+{
+	mod_delayed_work(system_power_efficient_wq, &tp->aspm.work, HZ);
+}
+
+static void rtl_reset_aspm_pkt_counter(struct rtl8169_private *tp)
+{
+	atomic_set(&tp->aspm.pkt_counter, 0);
+}
+
+static void rtl_reset_aspm_idle_counter(struct rtl8169_private *tp)
+{
+	atomic_set(&tp->aspm.idle_counter, 0);
+}
+
+static void rtl_aspm_work(struct work_struct *work)
+{
+	struct delayed_work *dwork = to_delayed_work(work);
+	struct rtl8169_private *tp = container_of(dwork, struct rtl8169_private,
+						  aspm.work);
+
+	if (tp->aspm.supported && !tp->aspm.enabled)
+		if (atomic_inc_return(&tp->aspm.idle_counter) == 5) {
+			rtl_lock_work(tp);
+			rtl_aspm_set_unlock(tp, true);
+			rtl_unlock_work(tp);
+		}
+
+	rtl_reset_aspm_pkt_counter(tp);
+	rtl_sched_aspm_work(tp);
+}
+
+static void rtl_disable_aspm(struct rtl8169_private *tp)
+{
+	rtl_aspm_set_unlock(tp, false);
+}
+
 static void rtl_hw_start_8168bb(struct rtl8169_private *tp)
 {
 	RTL_W8(tp, Config3, RTL_R8(tp, Config3) & ~Beacon_en);
@@ -6559,6 +6616,7 @@ static void rtl_task(struct work_struct *work)
 		void (*action)(struct rtl8169_private *);
 	} rtl_work[] = {
 		{ RTL_FLAG_TASK_RESET_PENDING,	rtl_reset_work },
+		{ RTL_FLAG_TASK_DISABLE_ASPM,	rtl_disable_aspm },
 	};
 	struct rtl8169_private *tp =
 		container_of(work, struct rtl8169_private, wk.work);
@@ -6587,7 +6645,7 @@ static int rtl8169_poll(struct napi_struct *napi, int budget)
 {
 	struct rtl8169_private *tp = container_of(napi, struct rtl8169_private, napi);
 	struct net_device *dev = tp->dev;
-	int work_done;
+	int work_done, pkt_cnt, threshold;
 
 	work_done = rtl_rx(dev, tp, (u32) budget);
 
@@ -6598,6 +6656,23 @@ static int rtl8169_poll(struct napi_struct *napi, int budget)
 		rtl_irq_enable(tp);
 	}
 
+	if (tp->aspm.supported) {
+		/* packets per sec */
+		threshold = tp->phydev->speed * 125000 / dev->mtu;
+		/* Threshold for disabling ASPM: 5% */
+		threshold /= 20;
+		pkt_cnt = atomic_read(&tp->aspm.pkt_counter);
+		if (pkt_cnt > threshold) {
+			rtl_reset_aspm_idle_counter(tp);
+			rtl_reset_aspm_pkt_counter(tp);
+			if (tp->aspm.enabled)
+				rtl_schedule_task(tp,
+					RTL_FLAG_TASK_DISABLE_ASPM);
+		} else {
+			atomic_add(work_done, &tp->aspm.pkt_counter);
+		}
+	}
+
 	return work_done;
 }
 
@@ -6697,6 +6772,9 @@ static int rtl8169_close(struct net_device *dev)
 	rtl_unlock_work(tp);
 
 	cancel_work_sync(&tp->wk.work);
+	cancel_delayed_work_sync(&tp->aspm.work);
+	if (tp->aspm.supported)
+		rtl_aspm_set_unlock(tp, true);
 
 	phy_disconnect(tp->phydev);
 
@@ -6778,6 +6856,8 @@ static int rtl_open(struct net_device *dev)
 	phy_start(tp->phydev);
 	netif_start_queue(dev);
 
+	rtl_sched_aspm_work(tp);
+
 	rtl_unlock_work(tp);
 
 	pm_runtime_put_sync(&pdev->dev);
@@ -7429,6 +7509,7 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 
 	mutex_init(&tp->wk.mutex);
 	INIT_WORK(&tp->wk.work, rtl_task);
+	INIT_DELAYED_WORK(&tp->aspm.work, rtl_aspm_work);
 	u64_stats_init(&tp->rx_stats.syncp);
 	u64_stats_init(&tp->tx_stats.syncp);
 
-- 
2.21.0
Comment 16 Matt Devo 2019-03-27 03:35:57 UTC
I applied the above patch to 5.1-rc2, but no change in behavior on either device (Haswell or Broadwell) nor any enable/disable debug output in dmesg. Did I miss something?
Comment 17 Heiner Kallweit 2019-03-27 06:52:58 UTC
Maybe it's related to what I wrote:
"Worst case, due to the issue, even this threshold isn't reached.
You can play with the threshold here:
		/* Threshold for disabling ASPM: 5% */
		threshold /= 20;"

Actually yours is the first report where AER is triggered. So there seems to be some bigger incompatability with the board chipset.
Can you post the "ethtool -S" statistics after a bigger transfer? I'd be interested to see whether there's a significant number of missed rx packets.
Comment 18 Heiner Kallweit 2019-03-27 18:40:08 UTC
The ASPM-related issues seem to occur in different flavors. In a parallel thread a user reported that the following fixed it for him. Would be curious whether it makes a difference on your system.

ethtool -C <iface> rx-frames 0
Comment 19 Tom Seewald 2019-04-20 19:34:18 UTC
Given that these r8169 ASPM issues seem to affect a small percentage of devices, and that disabling ASPM for r8169 across the board would negatively impact every other user, would it be feasible to eventually have an ASPM blacklist for the problematic device/chip/firmware combinations?

Perhaps something like libata's "ATA_HORKAGE_NOLPM"?
Comment 20 Matt Devo 2019-04-20 19:37:40 UTC
that sounds reasonable to me, I'd be happy to provide a list of IDs
Comment 21 Michael G 2019-08-05 20:51:15 UTC
Is anyone working on this?  I, for one, will not progress past the 4.18 kernel/driver due to the severity of the impact on the 8168 adapter's performance.
Comment 22 Heiner Kallweit 2019-08-05 20:58:49 UTC
ASPM was still enabled in the chip in case the BIOS forbids the OS to control ASPM. Solving this needed a PCI core extension. With the following change 
62b1b3b3b6d3 ("r8169: don't activate ASPM in chip if OS can't control ASPM")
in 5.3 the ASPM should be solved on all systems. Please check 5.3-rc3 or linux-next.
Comment 23 Tom Seewald 2019-08-05 21:06:38 UTC
Is there hope of eventually having an ASPM white-list for known-working chip/firmware/system combinations?  I'm afraid the vast majority of r8169 users will never know about this kernel command line option and will therefore see a permanent decrease in battery life.

I am aware this would likely be a lot of work to maintain, so I understand if this is not currently feasible.
Comment 24 Heiner Kallweit 2019-08-05 21:44:44 UTC
Realtek chips of the RTL8168/RTL8101 family have been used on hundreds of mainboard types for more than 15 yrs now. As you say, maintaining such a list would be quite some effort. Pending with the PCI maintainers is a patch set that allows to control ASPM per device via sysfs. This would give users the option to re-enable ASPM completely or just certain sub-modes.
Distributions then could make this sysfs setting the default and provide users with a hint to disable it in case they are affected by the ASPM issue.