Bug 218556 - high number of messages "PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)"
Status: RESOLVED ANSWERED
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: AMD Linux
Importance: P3 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-03-04 07:17 UTC by Harald Dunkel
Modified: 2024-03-12 07:33 UTC (History)

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg -T of a few days (21.48 KB, application/gzip)
2024-03-04 07:17 UTC, Harald Dunkel
lspci -vxxxxxx (20.97 KB, application/gzip)
2024-03-04 07:18 UTC, Harald Dunkel

Description Harald Dunkel 2024-03-04 07:17:07 UTC
Created attachment 305954
dmesg -T of a few days

I get a pretty high number of messages like these:

[Mon Mar  4 00:00:58 2024] pcieport 0000:00:06.0: AER: Corrected error message received from 0000:00:06.0
[Mon Mar  4 00:00:58 2024] pcieport 0000:00:06.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[Mon Mar  4 00:00:58 2024] pcieport 0000:00:06.0:   device [8086:a74d] error status/mask=00000001/00002000
[Mon Mar  4 00:00:58 2024] pcieport 0000:00:06.0:    [ 0] RxErr                  (First)
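
For readers unfamiliar with the "error status/mask" line, the two hex values can be decoded bit by bit. The following is a minimal sketch, not a kernel tool: the bit positions are those of the PCIe AER Corrected Error Status register, the short names are the ones the kernel prints in dmesg, and the function names are made up for illustration.

```python
# Bit positions of the AER Corrected Error Status / Mask registers,
# labelled with the short names the Linux kernel prints in dmesg.
AER_CORRECTABLE_BITS = {
    0: "RxErr",         # Receiver Error
    6: "BadTLP",        # Bad TLP
    7: "BadDLLP",       # Bad DLLP
    8: "Rollover",      # REPLAY_NUM Rollover
    12: "Timeout",      # Replay Timer Timeout
    13: "NonFatalErr",  # Advisory Non-Fatal Error
    14: "CorrIntErr",   # Corrected Internal Error
    15: "HeaderOF",     # Header Log Overflow
}


def decode_correctable(value):
    """Return the names of all known bits set in a register value."""
    return [
        name
        for bit, name in sorted(AER_CORRECTABLE_BITS.items())
        if value & (1 << bit)
    ]


def decode_status_mask(field):
    """Decode a 'status/mask' field such as '00000001/00002000'."""
    status_hex, mask_hex = field.split("/")
    return (
        decode_correctable(int(status_hex, 16)),
        decode_correctable(int(mask_hex, 16)),
    )


if __name__ == "__main__":
    status, mask = decode_status_mask("00000001/00002000")
    print("reported:", status)  # ['RxErr']
    print("masked:  ", mask)    # ['NonFatalErr']
```

For the values in this report, only the Receiver Error bit is set in the status, which matches the "[ 0] RxErr" line, and the mask covers only Advisory Non-Fatal errors.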

dmesg and lspci are attached. Platform is Debian 12 amd64, self-built kernel 6.7.6. I have seen these messages using Debian's backports kernel 6.5.10 and the default kernel 6.1.76 as well.

Between Jan 11th and Mar 4th I got >2000 messages about this, all for "device [8086:a74d]".
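
A count like this can be reproduced from saved dmesg output with a short script. This is only a sketch under assumptions: it expects lines in the format quoted above, reads from standard input, and the function name is hypothetical.

```python
import re
import sys
from collections import Counter

# Matches the AER summary line, e.g.
# "pcieport 0000:00:06.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, ..."
SUMMARY = re.compile(r"(\S+): PCIe Bus Error: severity=(\w+), type=([^,]+),")


def count_aer_errors(lines):
    """Count AER summary lines per (device address, severity, layer)."""
    counts = Counter()
    for line in lines:
        match = SUMMARY.search(line)
        if match:
            counts[match.groups()] += 1
    return counts


if __name__ == "__main__":
    # Example use: dmesg -T | python3 count_aer.py
    for (device, severity, layer), n in count_aer_errors(sys.stdin).most_common():
        print(f"{n:6d}  {device}  {severity}  {layer}")
```

Only the "PCIe Bus Error" summary line is counted, so each reported error contributes one to the total rather than four.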
Comment 1 Harald Dunkel 2024-03-04 07:18:13 UTC
Created attachment 305955
lspci -vxxxxxx
Comment 2 Artem S. Tashkinov 2024-03-04 08:58:20 UTC
Try booting with pci=noaer
Comment 3 Artem S. Tashkinov 2024-03-04 09:00:12 UTC
Or turn off ASPM in BIOS.

This is not limited to Linux:

https://www.reddit.com/r/intel/comments/17qftj1/whea_corrected_errors_event_id_17_every_once_in_a/
Comment 4 Artem S. Tashkinov 2024-03-04 09:02:38 UTC
You could also try disabling ASPM for just this device alone:

https://bbs.archlinux.org/viewtopic.php?id=264364

Anyway, this is a hardware issue that needs to be reported to Intel.
Comment 5 Harald Dunkel 2024-03-12 07:33:42 UTC
There is no BIOS option to turn ASPM off for this host, but I can move control over ASPM from BIOS to the operating system and boot Linux with pcie_aspm=off. This seems to have worked. The warning is gone.
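
To confirm after a reboot that the parameter was actually picked up, the kernel command line can be inspected. A minimal sketch follows, assuming a Linux system with /proc/cmdline; the sysfs policy file is read only if it exists, and the helper names are hypothetical.

```python
from pathlib import Path


def aspm_cmdline_setting(cmdline):
    """Return the value of the last pcie_aspm= argument, or None if absent."""
    value = None
    for arg in cmdline.split():
        if arg.startswith("pcie_aspm="):
            value = arg.split("=", 1)[1]
    return value


def active_aspm_policy(policy_text):
    """Pick the bracketed entry from text like '[default] performance powersave'."""
    for word in policy_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


if __name__ == "__main__":
    cmdline_file = Path("/proc/cmdline")
    if cmdline_file.exists():
        setting = aspm_cmdline_setting(cmdline_file.read_text())
        print("pcie_aspm on kernel command line:", setting)

    # Present only when the kernel exposes the pcie_aspm module parameters.
    policy_file = Path("/sys/module/pcie_aspm/parameters/policy")
    if policy_file.exists():
        print("ASPM policy:", active_aspm_policy(policy_file.read_text()))
```

On the host described here, the first line would be expected to report "off" once the system has been booted with pcie_aspm=off.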
