Bug 218127 - The PCIe AER error flood between PCIe bridge and Realtek's RTL8723BE makes system hang
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless
Hardware: All Linux
Importance: P3 normal
Assignee: drivers_network-wireless@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-11-10 08:30 UTC by Jian-Hong Pan
Modified: 2024-03-30 13:52 UTC

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments
The kernel log shows the AER error flood (1.12 MB, text/plain)
2023-11-10 08:30 UTC, Jian-Hong Pan
The PCI bridge's detail information (4.45 KB, text/plain)
2023-11-10 08:35 UTC, Jian-Hong Pan
The PCI RTL8723BE's detail information (3.61 KB, text/plain)
2023-11-10 08:38 UTC, Jian-Hong Pan

Description Jian-Hong Pan 2023-11-10 08:30:29 UTC
Created attachment 305388 [details]
The kernel log shows the AER error flood

We have an ASUS X555UQ laptop equipped with an Intel i7-6500U CPU and a Realtek RTL8723BE PCIe wireless adapter.

We tested it with kernel 6.6. The system keeps flooding the log with AER error messages, and can even hang, until ASPM is disabled for the rtl8723be.

kernel: pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
kernel: pcieport 0000:00:1c.5:   device [8086:9d15] error status/mask=00000001/00002000
kernel: pcieport 0000:00:1c.5:    [ 0] RxErr                  (First)
kernel: pcieport 0000:00:1c.5: AER: Corrected error received: 0000:00:1c.5
kernel: pcieport 0000:00:1c.5: AER: can't find device of ID00e5
kernel: pcieport 0000:00:1c.5: AER: Corrected error received: 0000:00:1c.5
kernel: pcieport 0000:00:1c.5: AER: can't find device of ID00e5
kernel: pcieport 0000:00:1c.5: AER: Multiple Corrected error received: 0000:00:1c.5
kernel: pcieport 0000:00:1c.5: AER: can't find device of ID00e5
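
To make the log above easier to read, here is a minimal sketch in C that decodes the two numeric fields. It assumes that "ID00e5" is a standard PCI requester ID (bus in bits 15:8, device in bits 7:3, function in bits 2:0) and that "status/mask" are the AER Correctable Error Status and Mask registers. The bit values match PCI_ERR_COR_RCVR and PCI_ERR_COR_ADV_NFAT in the kernel headers; the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* AER Correctable Error Status bits (same values as the kernel's
 * PCI_ERR_COR_RCVR and PCI_ERR_COR_ADV_NFAT). */
#define COR_STATUS_RCVR     0x00000001u  /* bit 0: Receiver Error */
#define COR_STATUS_ADV_NFAT 0x00002000u  /* bit 13: Advisory Non-Fatal Error */

struct bdf {
    uint8_t bus, dev, fn;
};

/* Hypothetical helper: split a 16-bit requester ID into bus/device/function. */
static struct bdf decode_requester_id(uint16_t id)
{
    struct bdf out;

    out.bus = (uint8_t)(id >> 8);
    out.dev = (uint8_t)((id >> 3) & 0x1f);
    out.fn  = (uint8_t)(id & 0x07);
    return out;
}

/* Hypothetical helper: format the ID the way lspci prints it, e.g. "00:1c.5". */
static void format_bdf(uint16_t id, char *buf, size_t len)
{
    struct bdf b = decode_requester_id(id);

    assert(buf != 0 && len >= 8);  /* "bb:dd.f" plus the terminating NUL */
    snprintf(buf, len, "%02x:%02x.%x", b.bus, b.dev, b.fn);
}

/* Non-zero when a Receiver Error is latched and not masked. */
static int is_unmasked_rx_error(uint32_t status, uint32_t mask)
{
    return (status & ~mask & COR_STATUS_RCVR) != 0;
}
```

Under these assumptions, format_bdf(0x00e5, ...) yields "00:1c.5", which is the root port itself rather than the RTL8723BE at 03:00.0, and is_unmasked_rx_error(0x00000001, 0x00002000) is true: the mask only covers the Advisory Non-Fatal bit, so the Receiver Error is reported.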

Here is the PCI tree:
$ lspci -tv
-[0000:00]-+-00.0  Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers
           +-02.0  Intel Corporation Skylake GT2 [HD Graphics 520]
           +-04.0  Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem
           +-14.0  Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller
           +-14.2  Intel Corporation Sunrise Point-LP Thermal subsystem
           +-15.0  Intel Corporation Sunrise Point-LP Serial IO I2C Controller #0
           +-15.1  Intel Corporation Sunrise Point-LP Serial IO I2C Controller #1
           +-16.0  Intel Corporation Sunrise Point-LP CSME HECI #1
           +-17.0  Intel Corporation Sunrise Point-LP SATA Controller [AHCI mode]
           +-1c.0-[01]----00.0  NVIDIA Corporation GM108M [GeForce 940MX]
           +-1c.4-[02]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
           +-1c.5-[03]----00.0  Realtek Semiconductor Co., Ltd. RTL8723BE PCIe Wireless Network Adapter
           +-1f.0  Intel Corporation Sunrise Point-LP LPC Controller
           +-1f.2  Intel Corporation Sunrise Point-LP PMC
           +-1f.3  Intel Corporation Sunrise Point-LP HD Audio
           \-1f.4  Intel Corporation Sunrise Point-LP SMBus
Comment 1 Jian-Hong Pan 2023-11-10 08:35:33 UTC
Created attachment 305389 [details]
The PCI bridge's detail information
Comment 2 Jian-Hong Pan 2023-11-10 08:38:15 UTC
Created attachment 305390 [details]
The PCI RTL8723BE's detail information
Comment 3 Jian-Hong Pan 2023-11-10 08:41:05 UTC
I noticed an old mailing-list discussion: Dmesg filled with "AER: Corrected error received" [1]

So I forced a write of 1 to clear the Receiver Error Status bit of the Correctable Error Status Register, like this:

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9c8fd69ae5ad..39faedd2ec8e 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1141,8 +1160,9 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
                        e_info.multi_error_valid = 0;
                aer_print_port_info(pdev, &e_info);
 
-               if (find_source_device(pdev, &e_info))
-                       aer_process_err_devices(&e_info);
+               //if (find_source_device(pdev, &e_info))
+               //      aer_process_err_devices(&e_info);
+               pci_write_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_STATUS, 0x1);
        }
 
        if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {

With this change, the system should clear the error right away. However, the system still gets the AER flood ...

It seems that we still have to disable ASPM for the rtl8723be.
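
One plausible reading of this result, not confirmed by the logs here, is that clearing the status bit only acknowledges an error that already happened, and the link keeps producing new ones. The toy model below sketches that idea in C: the register is write-1-to-clear, and a "noisy" link latches the Receiver Error bit again on every step. The struct, the function names and the link_noisy flag are all made up for illustration; this is not kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define COR_STATUS_RCVR 0x00000001u  /* bit 0: Receiver Error */

/* Toy model of a root port (hypothetical, for illustration only). */
struct toy_port {
    uint32_t cor_status;  /* models the Correctable Error Status register */
    int link_noisy;       /* non-zero: the link keeps producing RxErr */
};

/* Write-1-to-clear: every bit set in val is cleared in the register. */
static void rw1c_write(struct toy_port *p, uint32_t val)
{
    p->cor_status &= ~val;
}

/* One step of "hardware": a noisy link latches the error bit again. */
static void hw_tick(struct toy_port *p)
{
    if (p->link_noisy)
        p->cor_status |= COR_STATUS_RCVR;
}

/* Count how many times an error is seen and cleared over some rounds. */
static unsigned count_reports(struct toy_port *p, unsigned rounds)
{
    unsigned reports = 0;

    assert(p != 0);
    for (unsigned i = 0; i < rounds; i++) {
        hw_tick(p);
        if (p->cor_status & COR_STATUS_RCVR) {
            reports++;
            rw1c_write(p, COR_STATUS_RCVR);  /* same idea as the hack above */
        }
    }
    return reports;
}
```

In this model a noisy port reports an error in every round no matter how quickly the bit is cleared, while a quiet port with one stale error reports it once and then stays silent. That would match the observation that only removing the error source (here, disabling ASPM) stops the flood.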

[1]: https://lore.kernel.org/all/20151229155822.GA17321@localhost/T/#r7ca71d16bb63a651b456fd14bbbd889aa97b8ba4
Comment 4 Jian-Hong Pan 2023-11-13 06:08:46 UTC
As a workaround, I sent a patch that disables the rtl8723be's ASPM when the PCI bridge is one of certain Intel devices: https://lore.kernel.org/lkml/05390e0b-27fd-4190-971e-e70a498c8221@lwfinger.net/T/
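
For readers unfamiliar with what "disabling ASPM" means at the register level, here is a minimal sketch in C. It only shows the bit arithmetic on the PCIe Link Control register, whose two lowest bits are the ASPM Control field (0x1 for L0s, 0x2 for L1, the same values as PCI_EXP_LNKCTL_ASPM_L0S and PCI_EXP_LNKCTL_ASPM_L1 in the kernel headers). It is not taken from the linked patch, which may use a different mechanism, and the function names and the sample value 0x0142 are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* ASPM Control field of the PCIe Link Control register. */
#define LNKCTL_ASPM_L0S  0x0001u
#define LNKCTL_ASPM_L1   0x0002u
#define LNKCTL_ASPM_MASK 0x0003u

/* Hypothetical helper: return the register value with ASPM turned off,
 * leaving every other Link Control bit untouched. */
static uint16_t lnkctl_disable_aspm(uint16_t lnkctl)
{
    return (uint16_t)(lnkctl & ~LNKCTL_ASPM_MASK);
}

/* Hypothetical helper: describe the ASPM Control field as text. */
static const char *lnkctl_aspm_name(uint16_t lnkctl)
{
    switch (lnkctl & LNKCTL_ASPM_MASK) {
    case 0:
        return "Disabled";
    case LNKCTL_ASPM_L0S:
        return "L0s";
    case LNKCTL_ASPM_L1:
        return "L1";
    default:
        assert((lnkctl & LNKCTL_ASPM_MASK) == LNKCTL_ASPM_MASK);
        return "L0s L1";
    }
}
```

With a made-up Link Control value of 0x0142 (L1 enabled plus two unrelated bits), lnkctl_disable_aspm() returns 0x0140 and lnkctl_aspm_name() changes from "L1" to "Disabled".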
