Bug 218556

Summary: high number of messages "PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)"
Product: Drivers Reporter: Harald Dunkel (harri)
Component: PCI Assignee: drivers_pci (drivers_pci)
Status: RESOLVED ANSWERED    
Severity: normal    
Priority: P3    
Hardware: AMD   
OS: Linux   
Kernel Version: Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg -T of a few days
lspci -vxxxxxx

Description Harald Dunkel 2024-03-04 07:17:07 UTC
Created attachment 305954 [details]
dmesg -T of a few days

I get a pretty high number of messages like these:

[Mon Mar  4 00:00:58 2024] pcieport 0000:00:06.0: AER: Corrected error message received from 0000:00:06.0
[Mon Mar  4 00:00:58 2024] pcieport 0000:00:06.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[Mon Mar  4 00:00:58 2024] pcieport 0000:00:06.0:   device [8086:a74d] error status/mask=00000001/00002000
[Mon Mar  4 00:00:58 2024] pcieport 0000:00:06.0:    [ 0] RxErr                  (First)

dmesg and lspci are attached. Platform is Debian 12 amd64, self-built kernel 6.7.6. I have seen these messages using Debian's backports kernel 6.5.10 and the default kernel 6.1.76 as well.

Between Jan 11th and Mar 4th I got >2000 messages about this, all for "device [8086:a74d]".
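
For anyone who wants to reproduce the count from their own log, here is a minimal Python sketch. It assumes `dmesg -T` output in the format quoted above (kernel 6.7 wording, "severity=Corrected"). The function names are made up for illustration. The bit names are the ones the kernel's AER driver prints for the Correctable Error Status register, so status 00000001 decodes to RxErr and mask 00002000 to NonFatalErr.

```python
import re
from collections import Counter

# Correctable Error Status bits, named as the kernel AER driver prints them.
CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}

# Matches "<driver> 0000:00:06.0: PCIe Bus Error: severity=Corrected, ..."
CORRECTED_RE = re.compile(
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=Corrected"
)

# Matches "device [8086:a74d] error status/mask=00000001/00002000"
STATUS_RE = re.compile(
    r"device \[(?P<id>[0-9a-f]{4}:[0-9a-f]{4})\] "
    r"error status/mask=(?P<status>[0-9a-f]{8})/(?P<mask>[0-9a-f]{8})"
)


def count_corrected_errors(log_text):
    """Count corrected-error reports per reporting PCI address."""
    return Counter(m.group("bdf") for m in CORRECTED_RE.finditer(log_text))


def parse_status(log_text):
    """Return (vendor:device, status, mask) for every status/mask line."""
    return [
        (m.group("id"), int(m.group("status"), 16), int(m.group("mask"), 16))
        for m in STATUS_RE.finditer(log_text)
    ]


def decode_bits(value):
    """Translate a status or mask value into the kernel's bit names."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if value & (1 << bit)]
```

Feeding it the text of the attached dmesg (for example via `open("dmesg.txt").read()`, file name hypothetical) gives one counter entry per reporting port. Note that `parse_status` does not check severity, so in a log that also contains uncorrectable errors its results would need to be filtered before decoding.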
Comment 1 Harald Dunkel 2024-03-04 07:18:13 UTC
Created attachment 305955 [details]
lspci -vxxxxxx
Comment 2 Artem S. Tashkinov 2024-03-04 08:58:20 UTC
Try booting with pci=noaer
Comment 3 Artem S. Tashkinov 2024-03-04 09:00:12 UTC
Or turn off ASPM in BIOS.

This is not limited to Linux:

https://www.reddit.com/r/intel/comments/17qftj1/whea_corrected_errors_event_id_17_every_once_in_a/
Comment 4 Artem S. Tashkinov 2024-03-04 09:02:38 UTC
You could also try disabling ASPM for just this device alone:

https://bbs.archlinux.org/viewtopic.php?id=264364

Anyways, it's a HW issue which needs to be reported to Intel.
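
Before changing anything per device, it can help to see which ASPM states are currently enabled on a link. The sketch below is not necessarily the method from the linked thread. It assumes a kernel that exposes the per-device `link/` attributes in sysfs (`l0s_aspm`, `l1_aspm` and so on), which only appear when the OS controls ASPM and the device supports it. The function name and the example address are hypothetical.

```python
from pathlib import Path

# Per-link ASPM attributes documented in the kernel's sysfs-bus-pci ABI.
ASPM_ATTRS = (
    "l0s_aspm", "l1_aspm",
    "l1_1_aspm", "l1_2_aspm",
    "l1_1_pcipm", "l1_2_pcipm",
    "clkpm",
)


def read_link_aspm(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return {attribute: enabled} for the ASPM files that exist for bdf.

    An empty dict means the device has no link/ attributes, e.g. because
    the firmware kept control of ASPM or the device is not an endpoint.
    """
    link_dir = Path(sysfs_root) / bdf / "link"
    states = {}
    for attr in ASPM_ATTRS:
        attr_file = link_dir / attr
        if attr_file.is_file():
            states[attr] = attr_file.read_text().strip() == "1"
    return states
```

For example, `read_link_aspm("0000:01:00.0")` (address made up, substitute the device behind the reporting port) lists the states for that link. Writing `0` to one of these files as root is the documented way to disable that state for the device.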
Comment 5 Harald Dunkel 2024-03-12 07:33:42 UTC
There is no BIOS option to turn ASPM off for this host, but I can move control over ASPM from BIOS to the operating system and boot Linux with pcie_aspm=off. This seems to have worked. The warning is gone.
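
To confirm that the running kernel was really booted with the option, the command line can be checked. This is a minimal Python sketch. The helper names are made up, and the parser is simplified: it splits on whitespace and does not handle quoted values.

```python
def kernel_params(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def aspm_disabled(cmdline):
    """True if pcie_aspm=off is on the command line."""
    return kernel_params(cmdline).get("pcie_aspm") == "off"


def aer_disabled(cmdline):
    """True if noaer appears in the comma-separated pci= option list."""
    pci_opts = kernel_params(cmdline).get("pci") or ""
    return "noaer" in pci_opts.split(",")


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel."""
    with open(path) as f:
        return f.read()
```

On the affected host, `aspm_disabled(read_cmdline())` should return True after the reboot. `aer_disabled` covers the `pci=noaer` suggestion from comment 2, which only hides the reports rather than changing link power management.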