Bug 196183

Summary: AER: Corrected error received: id=00e8
Product: Drivers Reporter: Sven Köhler (sven.koehler)
Component: PCI Assignee: drivers_pci (drivers_pci)
Status: NEW ---    
Severity: normal CC: jcubic, jean-louis, kernel.org, sgasgar, tadej.j
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.11.7 Subsystem:
Regression: No Bisected commit-id:
Attachments: kernel config

Description Sven Köhler 2017-06-25 01:14:14 UTC
Created attachment 257169
kernel config

In dmesg, I see tons of these messages:

[  261.754916] pcieport 0000:00:1d.0: AER: Corrected error received: id=00e8
[  261.754927] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=00e8(Transmitter ID)
[  261.754935] pcieport 0000:00:1d.0:   device [8086:a118] error status/mask=00001000/00002000
[  261.754940] pcieport 0000:00:1d.0:    [12] Replay Timer Timeout  
[  285.146589] pcieport 0000:00:1d.0: AER: Corrected error received: id=00e8
[  285.146604] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=00e8(Transmitter ID)
[  285.146616] pcieport 0000:00:1d.0:   device [8086:a118] error status/mask=00001000/00002000
[  285.146623] pcieport 0000:00:1d.0:    [12] Replay Timer Timeout  
...

They are repeated many, many times, particularly when disk I/O occurs.
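
For readers unfamiliar with the "error status/mask" field: for a corrected error, it shows the AER Correctable Error Status register followed by the Correctable Error Mask register, both as hex bitmaps. The following is a minimal Python sketch for decoding them, assuming the bit names from the PCIe specification; the helper name decode_correctable is made up for illustration and is not part of any kernel tool.

```python
# Bit positions in the PCIe AER Correctable Error Status/Mask registers
# (names as given in the PCIe specification).
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_correctable(value_hex):
    """Return the names of all bits set in a hex register value."""
    value = int(value_hex, 16)
    names = []
    for bit in range(32):
        if value & (1 << bit):
            names.append(CORRECTABLE_BITS.get(bit, f"Reserved/unknown bit {bit}"))
    return names


# Values copied from the "error status/mask=00001000/00002000" log line.
status, mask = "00001000/00002000".split("/")
print(decode_correctable(status))  # ['Replay Timer Timeout']
print(decode_correctable(mask))    # ['Advisory Non-Fatal Error']
```

The status value has only bit 12 set, which matches the "[12] Replay Timer Timeout" line in the log. The mask value has bit 13 set, meaning Advisory Non-Fatal errors are masked from reporting.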

According to lspci -tv, port 0000:00:1d.0 is connected to my SSD:

-[0000:00]-+-00.0  Intel Corporation Device 5910
           +-01.0-[01]----00.0  NVIDIA Corporation GP107M [GeForce GTX 1050 Mobile]
           +-02.0  Intel Corporation Device 591b
           +-04.0  Intel Corporation Skylake Processor Thermal Subsystem
           +-14.0  Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller
           +-14.2  Intel Corporation Sunrise Point-H Thermal subsystem
           +-15.0  Intel Corporation Sunrise Point-H Serial IO I2C Controller #0
           +-15.1  Intel Corporation Sunrise Point-H Serial IO I2C Controller #1
           +-16.0  Intel Corporation Sunrise Point-H CSME HECI #1
           +-17.0  Intel Corporation Sunrise Point-H SATA Controller [AHCI mode]
           +-1c.0-[02]----00.0  Intel Corporation Wireless 8265 / 8275
           +-1c.1-[03]----00.0  Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader
           +-1d.0-[04]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961
           +-1d.4-[05]--
           +-1d.6-[06-3e]--
           +-1f.0  Intel Corporation Sunrise Point-H LPC Controller
           +-1f.2  Intel Corporation Sunrise Point-H PMC
           +-1f.3  Intel Corporation Device a171
           \-1f.4  Intel Corporation Sunrise Point-H SMBus
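
The "id=00e8" in the AER messages can be cross-checked against the tree above. It is a 16-bit PCIe requester ID: 8 bits of bus number, 5 bits of device number, 3 bits of function number. Here is a small Python sketch of that decoding. The function names are hypothetical, the PCI domain is assumed to be 0000 because it is not part of the ID, and ARI-style IDs (8-bit function, no device field) are not handled.

```python
def decode_requester_id(id_hex):
    """Split a 16-bit PCIe requester ID into (bus, device, function)."""
    rid = int(id_hex, 16)
    bus = (rid >> 8) & 0xFF      # upper 8 bits
    device = (rid >> 3) & 0x1F   # next 5 bits
    function = rid & 0x07        # lowest 3 bits
    return bus, device, function


def format_bdf(id_hex, domain=0):
    """Format a requester ID the way lspci and dmesg print addresses.

    The domain is an assumption (default 0); it is not encoded in the ID.
    """
    bus, device, function = decode_requester_id(id_hex)
    return f"{domain:04x}:{bus:02x}:{device:02x}.{function:x}"


print(format_bdf("00e8"))  # 0000:00:1d.0
```

So id=00e8 is the root port 0000:00:1d.0 itself, the port that the NVMe SSD on bus 04 hangs off.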


I don't know much about these errors, or whether I should worry about them.
I do not know what this replay timer means. Might it have something to do with putting the SSD into sleep mode? Can the replay timer timeout be increased somehow? Or is my SSD overheating?

The system is a Dell XPS 9560 (Kaby Lake). The kernel config is attached.

I have the following kernel parameters set, to be able to use bumblebee: pcie_port_pm=off acpi_rev_override
Comment 1 Sven Köhler 2017-06-25 01:14:57 UTC
The obvious workaround is to be ignorant and disable AER altogether via pci=noaer
Comment 2 Jean-Louis Dupond 2017-12-22 10:48:47 UTC
Hi All,

I have the same problem on (almost) the same laptop (Dell Precision 5520).

We could indeed disable AER, but that's not a fix, of course :)

I receive the following messages on a 4.15-rc4 (drm-tip) kernel:
[  136.777360] pcieport 0000:00:1d.0: AER: Multiple Corrected error received: id=00e8
[  136.777392] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=00e8(Transmitter ID)
[  136.777396] pcieport 0000:00:1d.0:   device [8086:a118] error status/mask=00001000/00002000
[  136.777398] pcieport 0000:00:1d.0:    [12] Replay Timer Timeout  
[  136.777399] pcieport 0000:00:1d.0:   Error of this Agent(00e8) is reported first
[  136.777403] nvme 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0400(Receiver ID)
[  136.777404] nvme 0000:04:00.0:   device [1c5c:1284] error status/mask=00000081/0000e000
[  136.777405] nvme 0000:04:00.0:    [ 0] Receiver Error         (First)
[  136.777406] nvme 0000:04:00.0:    [ 7] Bad DLLP              
[  228.424502] pcieport 0000:00:1d.0: AER: Corrected error received: id=00e8
[  228.424514] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=00e8(Transmitter ID)
[  228.424519] pcieport 0000:00:1d.0:   device [8086:a118] error status/mask=00001000/00002000
[  228.424522] pcieport 0000:00:1d.0:    [12] Replay Timer Timeout  
[  901.426214] pcieport 0000:00:1d.0: AER: Corrected error received: id=00e8
[  901.426227] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=00e8(Transmitter ID)
[  901.426234] pcieport 0000:00:1d.0:   device [8086:a118] error status/mask=00001000/00002000
[  901.426238] pcieport 0000:00:1d.0:    [12] Replay Timer Timeout  
[ 1378.607349] pcieport 0000:00:1d.0: AER: Corrected error received: id=00e8
[ 1378.607356] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=00e8(Transmitter ID)
[ 1378.607360] pcieport 0000:00:1d.0:   device [8086:a118] error status/mask=00001000/00002000
[ 1378.607362] pcieport 0000:00:1d.0:    [12] Replay Timer Timeout  
[ 1391.919513] pcieport 0000:00:1d.0: AER: Corrected error received: id=00e8
[ 1391.919526] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=00e8(Transmitter ID)
[ 1391.919532] pcieport 0000:00:1d.0:   device [8086:a118] error status/mask=00001000/00002000
[ 1391.919537] pcieport 0000:00:1d.0:    [12] Replay Timer Timeout  
[ 1604.654186] pcieport 0000:00:1d.0: AER: Corrected error received: id=00e8
[ 1604.654200] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=00e8(Transmitter ID)
[ 1604.654209] pcieport 0000:00:1d.0:   device [8086:a118] error status/mask=00001000/00002000
[ 1604.654215] pcieport 0000:00:1d.0:    [12] Replay Timer Timeout  
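
To get a feel for how often each error occurs, a log like the one above can be tallied per device and error bit. This is a rough Python sketch, assuming the dmesg line layout shown in this report (it may differ on other kernel versions); the function name count_aer_bits and the regular expression are my own and not part of any existing tool.

```python
import re
from collections import Counter

# Matches detail lines such as:
#   [  136.777405] nvme 0000:04:00.0:    [ 0] Receiver Error         (First)
# Assumption: the layout is the one seen in this report.
AER_BIT_LINE = re.compile(
    r"^(?:\[\s*\d+\.\d+\]\s+)?"                           # optional timestamp
    r"\S+ ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):"  # driver + address
    r"\s+\[\s*(\d+)\]\s+(.*?)\s*(?:\(First\))?\s*$"       # [bit] error name
)


def count_aer_bits(lines):
    """Count occurrences of each (device address, error name) pair."""
    counts = Counter()
    for line in lines:
        match = AER_BIT_LINE.match(line)
        if match:
            address, _bit, name = match.groups()
            counts[(address, name)] += 1
    return counts


# Sample lines copied from the log above.
sample = """\
[  136.777396] pcieport 0000:00:1d.0:   device [8086:a118] error status/mask=00001000/00002000
[  136.777398] pcieport 0000:00:1d.0:    [12] Replay Timer Timeout  
[  136.777405] nvme 0000:04:00.0:    [ 0] Receiver Error         (First)
[  136.777406] nvme 0000:04:00.0:    [ 7] Bad DLLP              
[  228.424522] pcieport 0000:00:1d.0:    [12] Replay Timer Timeout  
""".splitlines()

for (address, name), n in sorted(count_aer_bits(sample).items()):
    print(f"{address}  {name}: {n}")
```

On a real system, the lines could come from saved dmesg output instead of the sample string.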


I hope this can be pinpointed and fixed!
Please let me know if additional information is needed!

Thanks
Jean-Louis
Comment 3 Sven Köhler 2017-12-22 12:01:29 UTC
The Arch Linux Wiki suggests the following workaround:
pci=nommconf

I haven't seen the error since I started using this option.

https://wiki.archlinux.org/index.php/Dell_XPS_15_9560
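
To verify which of these workarounds is actually in effect after a reboot, the kernel command line can be inspected. Below is a small Python sketch, assuming the usual Linux location /proc/cmdline and the kernel's comma-separated "pci=option[,option...]" syntax; the helper names are hypothetical.

```python
def pci_options(cmdline):
    """Collect the comma-separated values of every pci= parameter."""
    options = []
    for token in cmdline.split():
        if token.startswith("pci="):
            options.extend(opt for opt in token[len("pci="):].split(",") if opt)
    return options


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read()


# Example string built from the parameters mentioned in this report;
# on a real system, pass read_cmdline() instead.
example = "pcie_port_pm=off acpi_rev_override pci=nommconf"
opts = pci_options(example)
print("nommconf" in opts)  # True
print("noaer" in opts)     # False
```

Both workarounds can be combined as pci=noaer,nommconf, which the sketch reports as two separate options.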
Comment 4 Sven Köhler 2020-03-17 20:38:35 UTC
Same laptop (Dell XPS 9560) now running Linux 5.5.9. Unless I am using pci=nommconf, I still get the AER errors.

What is going on? Shouldn't the Dell XPS 9560 be added to some sort of blacklist? Do I really need to use pci=nommconf, or is there some other workaround that has fewer side effects?

Any help appreciated.
Comment 5 Tadej Janež 2020-10-16 19:33:09 UTC
Hi Sven,

> Do I really need to use pci=nommconf, or is there some other workaround that
> has fewer side effects?

According to https://bbs.archlinux.org/viewtopic.php?id=233230,
one could use pci=noaer to just disable AER:

"it seems disabling AER should be safe, and you are still left with basic PCIe error reporting capabilities. AER is just for "advanced" error reporting."
Comment 6 Jakub Jankiewicz 2022-03-12 21:29:24 UTC
Those errors may be related to Dell laptops because I have the same error with:

Laptop Dell Inspiron 15 5570 i7-8550U
Comment 7 Jakub Jankiewicz 2022-03-12 21:30:21 UTC
Kernel 5.16.11 with Fedora 35.