Bug 216863 - ThinkPad X1 Extreme Gen 5: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID) after resuming from sleep
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: Intel Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-12-29 13:10 UTC by Frederick Zhang
Modified: 2023-11-12 12:58 UTC
CC List: 4 users

See Also:
Kernel Version: 6.1.1
Subsystem:
Regression: No
Bisected commit-id:


Attachments
lspci -vv output without pcie_aspm=off (112.67 KB, text/plain)
2022-12-29 13:10 UTC, Frederick Zhang
lspci -vv output with pcie_aspm=off (112.63 KB, text/plain)
2022-12-29 13:10 UTC, Frederick Zhang

Description Frederick Zhang 2022-12-29 13:10:11 UTC
Created attachment 303500
lspci -vv output without pcie_aspm=off

I recently purchased a Thunderbolt 4 dock (CalDigit TS4) and since then
I have been getting millions of the warnings below in my logs after
resuming from sleep. I had no Thunderbolt peripherals before this. The
laptop is a ThinkPad X1 Extreme Gen 5 (BIOS 1.12 N3JET28W, EC 1.08
N3JHT21W).

Dec 29 18:51:05 FredArch systemd[1]: Starting System Suspend...
Dec 29 18:51:05 FredArch systemd-sleep[31007]: Entering sleep state 'suspend'...
Dec 29 18:51:05 FredArch kernel: PM: suspend entry (s2idle)
Dec 29 18:51:07 FredArch kernel: Filesystems sync: 1.566 seconds
Dec 29 18:52:30 FredArch kernel: Freezing user space processes ... (elapsed 0.001 seconds) done.
Dec 29 18:52:30 FredArch kernel: OOM killer disabled.
Dec 29 18:52:30 FredArch kernel: Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
Dec 29 18:52:30 FredArch kernel: printk: Suspending console(s) (use no_console_suspend to debug)
Dec 29 18:52:30 FredArch kernel: ACPI: EC: interrupt blocked
Dec 29 18:52:30 FredArch kernel: ACPI: EC: interrupt unblocked
Dec 29 18:52:30 FredArch kernel: pcieport 0000:00:1d.0: AER: Multiple Corrected error received: 0000:21:01.0
Dec 29 18:52:30 FredArch kernel: pcieport 0000:21:01.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Dec 29 18:52:30 FredArch kernel: pcieport 0000:21:01.0:   device [8086:1136] error status/mask=00001100/00002000
Dec 29 18:52:30 FredArch kernel: pcieport 0000:21:01.0:    [ 8] Rollover
Dec 29 18:52:30 FredArch kernel: pcieport 0000:21:01.0:    [12] Timeout
Dec 29 18:52:30 FredArch kernel: pcieport 0000:21:01.0: AER:   Error of this Agent is reported first
Dec 29 18:52:30 FredArch kernel: pcieport 0000:23:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Dec 29 18:52:30 FredArch kernel: pcieport 0000:23:00.0:   device [8086:0b26] error status/mask=00001000/00002000
Dec 29 18:52:30 FredArch kernel: pcieport 0000:23:00.0:    [12] Timeout
Dec 29 18:52:30 FredArch kernel: pcieport 0000:00:1d.0: AER: Corrected error received: 0000:21:01.0
Dec 29 18:52:30 FredArch kernel: pcieport 0000:21:01.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Dec 29 18:52:30 FredArch kernel: pcieport 0000:21:01.0:   device [8086:1136] error status/mask=00001100/00002000
Dec 29 18:52:30 FredArch kernel: pcieport 0000:21:01.0:    [ 8] Rollover
Dec 29 18:52:30 FredArch kernel: pcieport 0000:21:01.0:    [12] Timeout
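
For reference, the status/mask values in the log above can be decoded by
hand. The following is a minimal sketch, assuming bash and the standard
bit layout of the PCIe AER Correctable Error Status register;
`decode_aer_cor` is a hypothetical helper name, not a kernel or
pciutils tool.

```bash
#!/usr/bin/env bash
# Decode an AER correctable error status (or mask) value into bit names.
# Bit positions follow the PCIe AER Correctable Error Status register.
decode_aer_cor() {
    local value=$((16#$1))    # the kernel prints bare hex, e.g. 00001100
    local bit name
    for bit in 0 6 7 8 12 13 14 15; do
        case $bit in
            0)  name="Receiver Error" ;;
            6)  name="Bad TLP" ;;
            7)  name="Bad DLLP" ;;
            8)  name="Replay Number Rollover" ;;
            12) name="Replay Timer Timeout" ;;
            13) name="Advisory Non-Fatal Error" ;;
            14) name="Corrected Internal Error" ;;
            15) name="Header Log Overflow" ;;
        esac
        # Print the bit only if it is set in the value.
        if (( value & (1 << bit) )); then
            printf '[%2d] %s\n' "$bit" "$name"
        fi
    done
}

decode_aer_cor 00001100    # status reported for 0000:21:01.0
decode_aer_cor 00002000    # mask reported for both devices
```

The first call reports bits 8 and 12, which correspond to the "Rollover"
and "Timeout" lines the kernel prints. The second shows that the only
masked bit is 13 (Advisory Non-Fatal Error), so neither reported error
is masked.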

$ cat /proc/version
Linux version 6.1.1-arch1-1 (linux@archlinux) (gcc (GCC) 12.2.0, GNU ld (GNU Binutils) 2.39.0) #1 SMP PREEMPT_DYNAMIC Wed, 21 Dec 2022 22:27:55 +0000

$ lspci -nn
00:00.0 Host bridge [0600]: Intel Corporation 12th Gen Core Processor Host Bridge/DRAM Registers [8086:4641] (rev 02)
00:01.0 PCI bridge [0604]: Intel Corporation 12th Gen Core Processor PCI Express x16 Controller #1 [8086:460d] (rev 02)
00:04.0 Signal processing controller [1180]: Intel Corporation Alder Lake Innovation Platform Framework Processor Participant [8086:461d] (rev 02)
00:06.0 PCI bridge [0604]: Intel Corporation 12th Gen Core Processor PCI Express x4 Controller #0 [8086:464d] (rev 02)
00:08.0 System peripheral [0880]: Intel Corporation 12th Gen Core Processor Gaussian & Neural Accelerator [8086:464f] (rev 02)
00:0a.0 Signal processing controller [1180]: Intel Corporation Platform Monitoring Technology [8086:467d] (rev 01)
00:14.0 USB controller [0c03]: Intel Corporation Alder Lake PCH USB 3.2 xHCI Host Controller [8086:51ed] (rev 01)
00:14.2 RAM memory [0500]: Intel Corporation Alder Lake PCH Shared SRAM [8086:51ef] (rev 01)
00:14.3 Network controller [0280]: Intel Corporation Alder Lake-P PCH CNVi WiFi [8086:51f0] (rev 01)
00:15.0 Serial bus controller [0c80]: Intel Corporation Alder Lake PCH Serial IO I2C Controller #0 [8086:51e8] (rev 01)
00:16.0 Communication controller [0780]: Intel Corporation Alder Lake PCH HECI Controller [8086:51e0] (rev 01)
00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:51b8] (rev 01)
00:1c.7 PCI bridge [0604]: Intel Corporation Alder Lake PCH-P PCI Express Root Port #9 [8086:51bf] (rev 01)
00:1d.0 PCI bridge [0604]: Intel Corporation Device [8086:51b0] (rev 01)
00:1f.0 ISA bridge [0601]: Intel Corporation Alder Lake PCH eSPI Controller [8086:5182] (rev 01)
00:1f.3 Multimedia audio controller [0401]: Intel Corporation Alder Lake PCH-P High Definition Audio Controller [8086:51c8] (rev 01)
00:1f.4 SMBus [0c05]: Intel Corporation Alder Lake PCH-P SMBus Host Controller [8086:51a3] (rev 01)
00:1f.5 Serial bus controller [0c80]: Intel Corporation Alder Lake-P PCH SPI Controller [8086:51a4] (rev 01)
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA103M [GeForce RTX 3080 Ti Mobile] [10de:2420] (rev a1)
01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:2288] (rev a1)
04:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd Device [144d:a80c]
0a:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS5261 PCI Express Card Reader [10ec:5261] (rev 01)
20:00.0 PCI bridge [0604]: Intel Corporation Thunderbolt 4 Bridge [Maple Ridge 4C 2020] [8086:1136] (rev 02)
21:00.0 PCI bridge [0604]: Intel Corporation Thunderbolt 4 Bridge [Maple Ridge 4C 2020] [8086:1136] (rev 02)
21:01.0 PCI bridge [0604]: Intel Corporation Thunderbolt 4 Bridge [Maple Ridge 4C 2020] [8086:1136] (rev 02)
21:02.0 PCI bridge [0604]: Intel Corporation Thunderbolt 4 Bridge [Maple Ridge 4C 2020] [8086:1136] (rev 02)
21:03.0 PCI bridge [0604]: Intel Corporation Thunderbolt 4 Bridge [Maple Ridge 4C 2020] [8086:1136] (rev 02)
22:00.0 USB controller [0c03]: Intel Corporation Thunderbolt 4 NHI [Maple Ridge 4C 2020] [8086:1137]
23:00.0 PCI bridge [0604]: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] [8086:0b26] (rev 03)
24:00.0 PCI bridge [0604]: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] [8086:0b26] (rev 03)
24:01.0 PCI bridge [0604]: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] [8086:0b26] (rev 03)
24:02.0 PCI bridge [0604]: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] [8086:0b26] (rev 03)
24:03.0 PCI bridge [0604]: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] [8086:0b26] (rev 03)
24:04.0 PCI bridge [0604]: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] [8086:0b26] (rev 03)
55:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller (2) I225-LMvP [8086:5502] (rev 03)
56:00.0 USB controller [0c03]: Intel Corporation Thunderbolt 4 USB Controller [Maple Ridge 4C 2020] [8086:1138]


This happens every time after resuming from sleep. Booting with
pcie_aspm=off made the warnings go away for me. Some related posts I
found: [1][2].

Maybe a quirk patch like [3] is needed?

[1] https://bbs.archlinux.org/viewtopic.php?id=274935
[2] https://askubuntu.com/questions/1394924/35-gb-day-of-pcie-bus-error-severity-corrected-type-data-link-layer-in-sy
[3] https://lkml.iu.edu/hypermail/linux/kernel/2008.0/01418.html

Comment 1 Frederick Zhang 2022-12-29 13:10:43 UTC
Created attachment 303501
lspci -vv output with pcie_aspm=off

Comment 2 Frederick Zhang 2022-12-30 18:08:04 UTC
I just realised that pcie_aspm=off broke most of my dock's functions.
Ethernet still worked, but Wake-on-LAN stopped working. The dock's
Thunderbolt ports, USB Type-A/C data ports and SD card slots all
stopped working too (nothing was logged when I plugged things in).

Then I tested pcie_aspm.policy=performance. The dock started working
again but the warning logs were also back.
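
To confirm which ASPM policy is actually in effect after booting with a
pcie_aspm.policy= parameter, the kernel exposes the policy list in
/sys/module/pcie_aspm/parameters/policy with the active entry in square
brackets. A small sketch, assuming that bracketed format;
`active_aspm_policy` is a hypothetical helper name.

```bash
# Read a policy line such as
#   default [performance] powersave powersupersave
# from stdin and print only the bracketed (active) entry.
active_aspm_policy() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

policy_file=/sys/module/pcie_aspm/parameters/policy
if [ -r "$policy_file" ]; then
    active_aspm_policy < "$policy_file"
fi
```

Keeping the parsing in a function that reads stdin makes it easy to try
against a saved copy of the file from another machine.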

I also tried applying quirk_disable_aspm_l0s_l1 to the Thunderbolt
bridges (patch below), but unfortunately the warnings persisted.

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 285acc4aaccc..495e976606b6 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -2393,8 +2393,11 @@ static void quirk_disable_aspm_l0s_l1(struct pci_dev *dev)
  * disable both L0s and L1 for now to be safe.
  */
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASMEDIA, 0x1080, quirk_disable_aspm_l0s_l1);
 
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x1136, quirk_disable_aspm_l0s_l1);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0b26, quirk_disable_aspm_l0s_l1);
+
 /*
  * Some Pericom PCIe-to-PCI bridges in reverse mode need the PCIe Retrain
  * Link bit cleared after starting the link retrain process to allow this
  * process to finish.
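
One way to check whether a quirk like this changed anything is to
compare the ASPM state that `lspci -vv` reports per device before and
after. This is a rough sketch, assuming the usual `LnkCtl: ASPM ...;`
line format of `lspci -vv`; `aspm_by_device` is a hypothetical helper
name.

```bash
# Summarise "lspci -vv" output (read from stdin) as one line per device:
#   <address> <ASPM state from the LnkCtl line>
aspm_by_device() {
    awk '
        # Device header lines start in column 0, e.g. "21:01.0 PCI bridge: ..."
        /^[0-9a-f]/ { dev = $1 }
        # LnkCtl lines look like "LnkCtl: ASPM L1 Enabled; RCB 64 bytes, ..."
        /LnkCtl:/ {
            sub(/.*LnkCtl:[ \t]*/, "")   # drop everything up to the ASPM field
            sub(/;.*/, "")               # drop the rest of the line
            print dev, $0
        }
    '
}

# Example: sudo lspci -vv | aspm_by_device
```

The same function can be run on the two attached lspci -vv dumps and the
results diffed.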

I also noticed that the warnings stopped once I plugged something in
(NVMe enclosure or SD card), and started again once I ran `udisksctl
power-off`. This was without any kernel parameters or patches.
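
To see when the flood starts and stops (for example around plugging in
a device or running `udisksctl power-off`), the kernel log can be
bucketed per minute and device. A sketch, assuming the default
journalctl line format shown in the description; `aer_per_minute` is a
hypothetical helper name.

```bash
# Count corrected AER messages per minute and device from journal-style
# lines read on stdin, e.g.
#   Dec 29 18:52:30 host kernel: pcieport 0000:21:01.0: PCIe Bus Error: ...
aer_per_minute() {
    awk '/PCIe Bus Error: severity=Corrected/ {
             minute = substr($3, 1, 5)       # "18:52:30" -> "18:52"
             dev = $7; sub(/:$/, "", dev)    # "0000:21:01.0:" -> "0000:21:01.0"
             count[$1 " " $2 " " minute " " dev]++
         }
         END { for (k in count) print k, count[k] }' | sort
}

# Example: journalctl -k -b | aer_per_minute
```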

Comment 3 Frederick Zhang 2023-01-03 12:32:03 UTC
I found that disabling ACPI wakeup for RP09 (0000:00:1d.0, the root port
above the Thunderbolt bridges) stops the flood of warnings.

$ lspci -t
-[0000:00]-+-00.0
           +-01.0-[01]--+-00.0
           |            \-00.1
           +-04.0
           +-06.0-[04]----00.0
           +-08.0
           +-0a.0
           +-14.0
           +-14.2
           +-14.3
           +-15.0
           +-16.0
           +-1c.0-[08]--
           +-1c.7-[0a]----00.0
           +-1d.0-[20-89]----00.0-[21-89]--+-00.0-[22]----00.0
           |                               +-01.0-[23-55]----00.0-[24-55]--+-00.0-[25]--
           |                               |                               +-01.0-[26-34]--
           |                               |                               +-02.0-[35-43]--
           |                               |                               +-03.0-[44-54]--
           |                               |                               \-04.0-[55]----00.0
           |                               +-02.0-[56]----00.0
           |                               \-03.0-[57-89]--
           +-1f.0
           +-1f.3
           +-1f.4
           \-1f.5
$ cat /proc/acpi/wakeup
Device	S-state	  Status   Sysfs node
PEG0	  S4	*enabled   pci:0000:00:06.0
PEGP	  S4	*disabled  pci:0000:04:00.0
PEG1	  S4	*enabled   pci:0000:00:01.0
PEGP	  S4	*disabled  pci:0000:01:00.0
PEG2	  S4	*disabled
PEGP	  S4	*disabled
XHCI	  S3	*enabled   pci:0000:00:14.0
XDCI	  S4	*disabled
HDAS	  S4	*disabled  pci:0000:00:1f.3
CNVW	  S4	*disabled  pci:0000:00:14.3
RP01	  S4	*enabled   pci:0000:00:1c.0
PXSX	  S4	*disabled
RP02	  S4	*disabled
PXSX	  S4	*disabled
RP03	  S4	*disabled
PXSX	  S4	*disabled
RP04	  S4	*disabled
PXSX	  S4	*disabled
PXSX	  S4	*disabled
RP06	  S4	*disabled
PXSX	  S4	*disabled
RP07	  S4	*disabled
PXSX	  S4	*disabled
RP08	  S4	*enabled   pci:0000:00:1c.7
PXSX	  S4	*disabled  pci:0000:0a:00.0
		*disabled  platform:rtsx_pci_sdmmc.0
RP09	  S4	*enabled   pci:0000:00:1d.0
PXSX	  S4	*enabled   pci:0000:20:00.0
RP10	  S4	*disabled
PXSX	  S4	*disabled
RP11	  S4	*disabled
PXSX	  S4	*disabled
RP12	  S4	*disabled
PXSX	  S4	*disabled
RP13	  S4	*disabled
PXSX	  S4	*disabled
RP14	  S4	*disabled
PXSX	  S4	*disabled
RP15	  S4	*disabled
PXSX	  S4	*disabled
RP16	  S4	*disabled
PXSX	  S4	*disabled
RP17	  S4	*disabled
PXSX	  S4	*disabled
RP18	  S4	*disabled
PXSX	  S4	*disabled
RP19	  S4	*disabled
PXSX	  S4	*disabled
RP20	  S4	*disabled
PXSX	  S4	*disabled
RP21	  S4	*disabled
PXSX	  S4	*disabled
RP22	  S4	*disabled
PXSX	  S4	*disabled
RP23	  S4	*disabled
PXSX	  S4	*disabled
RP24	  S4	*disabled
PXSX	  S4	*disabled
RP25	  S4	*disabled
PXSX	  S4	*disabled
RP26	  S4	*disabled
PXSX	  S4	*disabled
RP27	  S4	*disabled
PXSX	  S4	*disabled
RP28	  S4	*disabled
PXSX	  S4	*disabled
AWAC	  S4	*enabled   platform:ACPI000E:00
SLPB	  S3	*enabled   platform:PNP0C0E:00
LID	  S4	*enabled   platform:PNP0C0D:00
$ echo RP09 | sudo tee /proc/acpi/wakeup
RP09
$ grep RP09 /proc/acpi/wakeup
RP09      S4    *disabled  pci:0000:00:1d.0


Wake-on-LAN from S3 stopped working (as expected) though.
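
Note that writing a device name to /proc/acpi/wakeup toggles its state
rather than setting it, so running the echo above a second time
re-enables RP09, and the setting does not survive a reboot. A guarded
sketch, assuming the table layout shown above; `wakeup_status` and
`disable_wakeup` are hypothetical helper names.

```bash
# Print the status ("enabled" or "disabled") of one device from a
# /proc/acpi/wakeup-style table read on stdin.
wakeup_status() {
    awk -v dev="$1" '$1 == dev { sub(/^\*/, "", $3); print $3; exit }'
}

# Disable wakeup for a device only if it is currently enabled, so that
# repeated runs do not flip it back on.
disable_wakeup() {
    if [ "$(wakeup_status "$1" < /proc/acpi/wakeup)" = "enabled" ]; then
        echo "$1" | sudo tee /proc/acpi/wakeup > /dev/null
    fi
}

# Example: disable_wakeup RP09
```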

Comment 4 Paul Menzel 2023-11-12 12:58:36 UTC
Frederick, do you think it is an adapter problem? Are you able to test with another device, maybe one borrowed from the IT group or a friend?