Bug 201517 - pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
Summary: pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-10-25 05:45 UTC by Mikhail
Modified: 2023-12-02 04:27 UTC

See Also:
Kernel Version: 4.19
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg (132.38 KB, text/plain)
2018-10-25 05:45 UTC, Mikhail
Network issue with latest kernel (29 bytes, text/plain)
2018-12-03 08:16 UTC, emteeelp
dmesg with 4.19 Kernel, network issue (42.92 KB, text/plain)
2018-12-05 13:25 UTC, emteeelp
dmesg with 4.18 Kernel, no issues (44.29 KB, text/plain)
2018-12-05 13:26 UTC, emteeelp
dmesg (733.09 KB, text/plain)
2020-09-15 23:34 UTC, twinshadows404
dmesg showing AER error. (77.05 KB, text/plain)
2020-11-21 18:20 UTC, raul
dmesg AER ryzen 2700x msi x470 gaming pro max (97.46 KB, text/plain)
2020-11-21 22:24 UTC, raul
dmesg 2023-05-15 from 6.1.19-gentoo-x86_64 (127.59 KB, text/plain)
2023-05-15 21:58 UTC, Eduardo Santiago
lspci -vv - 2023-05-15 from 6.1.19-gentoo (125.35 KB, text/plain)
2023-05-15 22:00 UTC, Eduardo Santiago

Description Mikhail 2018-10-25 05:45:00 UTC
Created attachment 279149 [details]
dmesg

I often get a strange error in the kernel log:

[ 8885.590311] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[ 8885.590320] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 8885.590324] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00001000/00006000
[ 8885.590328] pcieport 0000:00:03.1:    [12] Timeout

It does not happen on every boot, though: if the message starts to appear after a reboot, it keeps appearing again and again; if it does not appear, it does not appear at all until the next reboot.

# lspci -nn
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex [1022:1450]
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit [1022:1451]
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge [1022:1452]
00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
00:01.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge [1022:1452]
00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge [1022:1452]
00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge [1022:1452]
00:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge [1022:1452]
00:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge [1022:1452]
00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 59)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0 [1022:1460]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1 [1022:1461]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2 [1022:1462]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3 [1022:1463]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4 [1022:1464]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5 [1022:1465]
00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6 [1022:1466]
00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7 [1022:1467]
01:00.0 Non-Volatile memory controller [0108]: Intel Corporation Optane SSD 900P Series [8086:2700]
02:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Device [1022:43d0] (rev 01)
02:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] Device [1022:43c8] (rev 01)

# uname -r
4.19.0-0.rc8.git4.1.fc30.x86_64
Comment 1 Mikhail 2018-10-25 05:45:32 UTC
What does this message mean?
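
As a rough aid for reading these messages: the "error status/mask" pair is a bitmask, and the bracketed lines such as "[12] Timeout" name the bits that are set in the status value. The following is a minimal Python sketch, assuming the Correctable Error Status bit layout from the PCIe specification and the short labels the kernel prints; the function name `decode_correctable` is made up for illustration.

```python
# Correctable error bits, keyed by bit position.
# Labels follow the short names the kernel prints (e.g. "[12] Timeout").
AER_CORRECTABLE_BITS = {
    0: "RxErr",         # Receiver Error
    6: "BadTLP",        # Bad TLP
    7: "BadDLLP",       # Bad DLLP
    8: "Rollover",      # REPLAY_NUM Rollover
    12: "Timeout",      # Replay Timer Timeout
    13: "NonFatalErr",  # Advisory Non-Fatal Error
    14: "CorrIntErr",   # Corrected Internal Error
    15: "HeaderOF",     # Header Log Overflow
}


def decode_correctable(status_hex, mask_hex="0"):
    """Return (set_bits, masked_bits) as lists of (bit, label) tuples."""
    status = int(status_hex, 16)
    mask = int(mask_hex, 16)
    set_bits, masked_bits = [], []
    for bit, label in sorted(AER_CORRECTABLE_BITS.items()):
        if status & (1 << bit):
            set_bits.append((bit, label))
        if mask & (1 << bit):
            masked_bits.append((bit, label))
    return set_bits, masked_bits


# Values taken from the log in the description: status/mask=00001000/00006000
set_bits, masked_bits = decode_correctable("00001000", "00006000")
for bit, label in set_bits:
    print(f"[{bit:2d}] {label}")
```

For the values in the description this reports bit 12 (Replay Timer Timeout) as set, which matches the "[12] Timeout" line in the log; the mask 00006000 covers bits 13 and 14.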
Comment 2 emteeelp 2018-12-03 08:16:36 UTC
Created attachment 279815 [details]
Network issue with latest kernel
Comment 3 emteeelp 2018-12-05 13:25:44 UTC
Created attachment 279867 [details]
dmesg with 4.19 Kernel, network issue
Comment 4 emteeelp 2018-12-05 13:26:24 UTC
Created attachment 279869 [details]
dmesg with 4.18 Kernel, no issues
Comment 5 emteeelp 2018-12-05 23:05:59 UTC
The issue started with 4.19-RC1

https://www.lkml.org/lkml/2018/8/26/199
Comment 6 phitchen90 2018-12-06 22:33:15 UTC
I also have network issues with 4.19.6, which began with 4.19-RC1.

I had no issues with 4.18; dmesg here: http://ix.io/1vcF

Network issues present in 4.19.6; dmesg here: http://ix.io/1vay
Comment 7 marshalled 2019-10-14 23:12:59 UTC
I'm getting this same issue on kernel 5.3.6 (Unraid), on a Threadripper 1950X / Asus X3990-A motherboard. It is repeated over and over in the logs.

Oct 14 15:37:07 OBI-WAN kernel: pcieport 0000:40:03.1: AER: Multiple Corrected error received: 0000:00:00.0
Oct 14 15:37:07 OBI-WAN kernel: pcieport 0000:40:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Oct 14 15:37:07 OBI-WAN kernel: pcieport 0000:40:03.1: AER:   device [1022:1453] error status/mask=00001180/00006000

[1022:1453] 40:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
Comment 8 Andreas 2020-01-06 07:57:38 UTC
I can confirm this error on kernel 5.4.8-arch1-1 as well.

pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Trans>
Jan 06 06:28:52 ajmind1 kernel: pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00001100/00006000
Jan 06 06:28:52 ajmind1 kernel: pcieport 0000:00:03.1: AER:    [ 8] Rollover              
Jan 06 06:28:52 ajmind1 kernel: pcieport 0000:00:03.1: AER:    [12] Timeout               
Jan 06 06:30:21 ajmind1 kernel: pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Trans>
Jan 06 06:30:21 ajmind1 kernel: pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00001100/00006000
Jan 06 06:30:21 ajmind1 kernel: pcieport 0000:00:03.1: AER:    [ 8] Rollover              
Jan 06 06:30:21 ajmind1 kernel: pcieport 0000:00:03.1: AER:    [12] Timeout               
Jan 06 06:31:16 ajmind1 kernel: pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Trans>
Jan 06 06:31:16 ajmind1 kernel: pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00001000/00006000

00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453] (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 26
	Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
	I/O behind bridge: None
	Memory behind bridge: fcf00000-fcffffff [size=1M]
	Prefetchable memory behind bridge: None
	Capabilities: [50] Power Management version 3
	Capabilities: [58] Express Root Port (Slot+), MSI 00
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [c0] Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
	Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150] Advanced Error Reporting
	Capabilities: [270] Secondary PCI Express <?>
	Capabilities: [2a0] Access Control Services
	Capabilities: [370] L1 PM Substates
	Capabilities: [3c4] Designated Vendor-Specific <?>
	Kernel driver in use: pcieport
Comment 9 Mikhail 2020-01-06 08:02:36 UTC
Everyone who has experienced this issue, please try the workaround "pci=noaer" in the kernel options.

I have tested the system for several months with the kernel command line "pci=noaer", and the error "pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0" did not occur once during that time.
Comment 10 Matt Jolly 2020-06-18 10:03:39 UTC
(In reply to Mikhail from comment #9)
> Everyone who experienced this issue please try workaround "pci=noaer" in
> kernel options.
> 
> I have tested the system several months with the kernel command line
> "pci=noaer". And the error "pcieport 0000:00:03.1: AER: Corrected error
> received: 0000:00:00.0" not happened all testing time.

That's because you've disabled AER. AER is just PCI-e Advanced Error Reporting. 

According to the documentation (https://www.kernel.org/doc/Documentation/PCI/pcieaer-howto.txt), Corrected errors should be printed with a severity of "warning".

>When a PCIe AER error is captured, an error message will be output to
>console. If it's a correctable error, it is output as a warning.
>Otherwise, it is printed as an error. So users could choose different
>log level to filter out correctable error messages.

A quick browse through aer.c shows that it only calls the pci_err function, which unsurprisingly sends an error, rather than a warning. 

I can also confirm that setting dmesg to 'dmesg -n err' still shows this text.

This warning can be safely ignored - it's reporting a corrected error. 

If there's a performance impact a separate bug should be filed. 

I'll submit a patch in the near future to address this behaviour.
Comment 11 Clay Stan 2020-09-03 09:07:06 UTC
I ran into a similar problem on Debian unstable.
Kernel version: 5.8

When the system boots, the kernel log shows:

-- Logs begin at Sat 2020-08-29 23:13:02 CST. --
Sep 03 16:46:17 stan kernel: nvme 0000:01:00.0: AER:   device [1e0f:0009] error status/mask=00001000/00006000
Sep 03 16:46:17 stan kernel: nvme 0000:01:00.0: AER:    [12] Timeout
Sep 03 16:46:17 stan kernel: pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
Sep 03 16:46:17 stan kernel: nvme 0000:01:00.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Sep 03 16:46:17 stan kernel: nvme 0000:01:00.0: AER:   device [1e0f:0009] error status/mask=00001000/00006000
Sep 03 16:46:17 stan kernel: nvme 0000:01:00.0: AER:    [12] Timeout
Sep 03 16:46:48 stan kernel: pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
Sep 03 16:46:48 stan kernel: nvme 0000:01:00.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Sep 03 16:46:48 stan kernel: nvme 0000:01:00.0: AER:   device [1e0f:0009] error status/mask=00001000/00006000
Sep 03 16:46:48 stan kernel: nvme 0000:01:00.0: AER:    [12] Timeout
Comment 12 twinshadows404 2020-09-15 23:34:19 UTC
Created attachment 292511 [details]
dmesg

I doubt this is (only) a network issue, as I found a way to reproduce this 100% of the time.

Change the BCLK from 100 to 105. A few observations:
When the mouse button is pressed and dragged over some text, it will randomly stop highlighting the text and start dragging it. NVMe prints these errors as well.
When the GPU is at its lowest power state (300 MHz), it won't stay fixed but rather fluctuates a couple of MHz above. On Windows, after every shutdown, WerFault.exe will prevent the shutdown for a short time.
From that it seems that the issue is with the USB connections and that it does not only affect Linux.
Comment 13 raul 2020-11-21 18:20:42 UTC
Created attachment 293759 [details]
dmesg showing AER error.

After a few days with the system online, dmesg showed an AER error:
> pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
> pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data
> Link Layer, (Transmitter ID)
> pcieport 0000:00:03.1: AER:   device [1022:1453] error
> status/mask=00001000/00006000
> pcieport 0000:00:03.1: AER:    [12] Timeout

Afterwards, amdgpu printed this error:
> [drm] amdgpu_dm_irq_schedule_work FAILED src 12
> [drm] amdgpu_dm_irq_schedule_work FAILED src 12
> [drm] amdgpu_dm_irq_schedule_work FAILED src 12
> [drm] amdgpu_dm_irq_schedule_work FAILED src 12
> [drm] amdgpu_dm_irq_schedule_work FAILED src 12
> [drm] amdgpu_dm_irq_schedule_work FAILED src 12
> [drm] amdgpu_dm_irq_schedule_work FAILED src 12

The graphics card is attached to the PCIe bridge corresponding to device [1022:1453].

Linux version 5.8.16-050816-generic. Ubuntu 20.04 LTS.

I have appended the output of lspci -vnnt and lspci -nn.

This might be a CPU-SoC problem, related to the idle soft lockups. Poor CPU binning might be at play.
Comment 14 raul 2020-11-21 18:22:36 UTC
(In reply to raul from comment #13)
The errors showed up a few hours apart.
[de nov.20 23:52] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[  +0,000006] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[  +0,000004] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00001000/00006000
[  +0,000002] pcieport 0000:00:03.1: AER:    [12] Timeout               
[de nov.21 15:00] [drm] amdgpu_dm_irq_schedule_work FAILED src 12
[  +0,001199] [drm] amdgpu_dm_irq_schedule_work FAILED src 12
[de nov.21 15:01] [drm] amdgpu_dm_irq_schedule_work FAILED src 12
[  +0,003955] [drm] amdgpu_dm_irq_schedule_work FAILED src 12
[  +0,002941] [drm] amdgpu_dm_irq_schedule_work FAILED src 12
[  +0,001758] [drm] amdgpu_dm_irq_schedule_work FAILED src 12
[de nov.21 15:08] [drm] amdgpu_dm_irq_schedule_work FAILED src 12
Comment 15 raul 2020-11-21 20:18:10 UTC
(In reply to raul from comment #14)
The amdgpu_dm_irq_schedule_work failure occurred today while changing the monitor. Then, while I was changing resolutions within the GNOME settings panel, it crashed while enabling/disabling FreeSync in the monitor. Dmesg shows these errors (some of the previous ones are included to show that nothing else happened):

> [de nov.21 15:01] [drm] amdgpu_dm_irq_schedule_work FAILED src 12
> [  +0,003955] [drm] amdgpu_dm_irq_schedule_work FAILED src 12
> [  +0,002941] [drm] amdgpu_dm_irq_schedule_work FAILED src 12
> [  +0,001758] [drm] amdgpu_dm_irq_schedule_work FAILED src 12
> [de nov.21 15:08] [drm] amdgpu_dm_irq_schedule_work FAILED src 12
> [de nov.21 20:02] show_signal_msg: 15 callbacks suppressed
> [  +0,000001] gnome-control-c[1019094]: segfault at 7ffcd97dbf88 ip
> 00007f26251bfa80 sp 00007ffcd97dbf90 error 6 in
> libc-2.31.so[7f262516b000+178000]
> [  +0,000008] Code: 89 fb 83 e3 02 0f 85 2f 04 00 00 f3 41 0f 6f 0e 49 8b 46
> 10 be 25 00 00 00 4c 89 ef 48 89 85 c8 fb ff ff 0f 11 8d b8 fb ff ff <e8> 0b
> b9 fa ff 41 81 e7 00 80 00 00 c7 85 44 fb ff ff 00 00 00 00
> [  +0,473765] [drm:dm_force_atomic_commit.isra.0 [amdgpu]] *ERROR* Restoring
> old state failed with -12
> [de nov.21 20:03] [drm] amdgpu_dm_irq_schedule_work FAILED src 12
> [de nov.21 20:52] device-mapper: table: 253:0: linear: Device lookup failed
> [  +0,000003] device-mapper: ioctl: error adding target to table

Please note that the AER error did not happen while changing or using a new screen. I think I was watching DVB-T with Kaffeine, which I believe takes advantage of hardware acceleration to decode the video stream.

I have had these AER errors with other motherboards, not related to the GPU but to bridges connecting USB ports. The DVB-T dongle happens to be connected through USB.
The system currently has a Ryzen 7 2700X installed. The other errors happened with a Ryzen 7 1800X and an ASUS Crosshair VII motherboard.

The error is quoted in https://bugzilla.kernel.org/show_bug.cgi?id=208201#c12 and https://bugzilla.kernel.org/show_bug.cgi?id=208201#c13, from another bug in which I report Ryzen 7 1800X soft lockups with different motherboards, which persisted until I found a motherboard with a working "Typical power idle" setting.
Comment 16 raul 2020-11-21 22:24:46 UTC
Created attachment 293761 [details]
dmesg AER ryzen 2700x msi x470 gaming pro max

Got another one. See the attached dmesg, which includes all the previous ones. I was using the DVB-T USB dongle with Kaffeine.
I appended the lsusb -vt output. You can ignore the device-mapper errors; they are just kernel upgrades scanning disks and failing to access bcache.
Comment 17 Oliver Schwabedissen 2021-01-14 06:00:03 UTC
Same problem here after switching from kernel 5.9.12 to 5.10.4 (SuSE Tumbleweed).

Since then I get the following errors on a daily basis:

kernel: [41896.255169] [drm:dm_restore_drm_connector_state [amdgpu]] *ERROR* Restoring old state failed with -12

kernel: [ 6948.820626] pcieport 0000:00:03.1: AER: Multiple Corrected error received: 0000:09:00.0
kernel: [ 6948.820634] amdgpu 0000:09:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
kernel: [ 6948.820636] amdgpu 0000:09:00.0:   device [1002:67df] error status/mask=000000c0/00002000
kernel: [ 6948.820638] amdgpu 0000:09:00.0:    [ 6] BadTLP                
kernel: [ 6948.820639] amdgpu 0000:09:00.0:    [ 7] BadDLLP               


Linux frodo 5.10.5-1-default #1 SMP Wed Jan 6 18:20:57 UTC 2021 (19815f3) x86_64 x86_64 x86_64 GNU/Linux

01: None 00.0: 10103 CPU                                        
  [Created at cpu.462]
  Unique ID: rdCR.j8NaKXDZtZ6
  Hardware Class: cpu
  Arch: X86-64
  Vendor: "AuthenticAMD"
  Model: 23.8.2 "AMD Ryzen 7 2700X Eight-Core Processor"
  Features: fpu,vme,de,pse,tsc,msr,pae,mce,cx8,apic,sep,mtrr,pge,mca,cmov,pat,pse36,clflush,mmx,fxsr,sse,sse2,ht,syscall,nx,mmxext,fxsr_opt,pdpe1gb,rdtscp,lm,constant_tsc,rep_good,nopl,nonstop_tsc,cpuid,extd_apicid,aperfmperf,pni,pclmulqdq,monitor,ssse3,fma,cx16,sse4_1,sse4_2,movbe,popcnt,aes,xsave,avx,f16c,rdrand,lahf_lm,cmp_legacy,svm,extapic,cr8_legacy,abm,sse4a,misalignsse,3dnowprefetch,osvw,skinit,wdt,tce,topoext,perfctr_core,perfctr_nb,bpext,perfctr_llc,mwaitx,cpb,hw_pstate,sme,ssbd,sev,ibpb,vmmcall,sev_es,fsgsbase,bmi1,avx2,smep,bmi2,rdseed,adx,smap,clflushopt,sha_ni,xsaveopt,xsavec,xgetbv1,xsaves,clzero,irperf,xsaveerptr,arat,npt,lbrv,svm_lock,nrip_save,tsc_scale,vmcb_clean,flushbyasid,decodeassists,pausefilter,pfthreshold,avic,v_vmsave_vmload,vgif,overflow_recov,succor,smca
  Clock: 1836 MHz
  BogoMips: 7384.98
  Cache: 512 kb
  Units/Processor: 16
  Config Status: cfg=no, avail=yes, need=no, active=unknown

lspci -nn
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex [1022:1450]
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit [1022:1451]
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
00:01.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 59)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0 [1022:1460]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1 [1022:1461]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2 [1022:1462]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3 [1022:1463]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4 [1022:1464]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5 [1022:1465]
00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6 [1022:1466]
00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7 [1022:1467]
01:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
02:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Device [1022:43d0] (rev 01)
02:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller [1022:43c8] (rev 01)
02:00.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Bridge [1022:43c6] (rev 01)
03:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
03:04.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
03:06.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
03:07.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
03:09.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
05:00.0 USB controller [0c03]: ASMedia Technology Inc. ASM1142 USB 3.1 Host Controller [1b21:1242]
07:00.0 Ethernet controller [0200]: Intel Corporation I211 Gigabit Network Connection [8086:1539] (rev 03)
09:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] [1002:67df] (rev ef)
09:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590] [1002:aaf0]
0a:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function [1022:145a]
0a:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor [1022:1456]
0a:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller [1022:145f]
0b:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function [1022:1455]
0b:00.2 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
0b:00.3 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller [1022:1457]
Comment 18 Jens 2022-03-16 17:27:36 UTC
Hello Matt and all,

Using Fedora 35, for some weeks now I have been getting these corrected AER "errors" from my AMD GPU multiple times a day.

Most of the time the (two) monitors get blanked (it might well be related to the screensaver and power-settings, but that is not 100% clear) and (after I move the mouse) the windows from left screen are shifted to the right/main screen. (As if the left was disconnected for a short time.)

Current kernel: 5.16.14-200.fc35.x86_64+debug
AMD-GPU: lspci says "[AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X] [1002:699f] (rev c7)"

Manually increasing the GPU fan seems to help a little bit, but not much.

(In reply to Matt Jolly from comment #10)
> I'll submit a patch in the near future to address this behaviour.

@Matt, did you manage to submit a patch already?

Cheers,
Jens
Comment 19 Matt Jolly 2022-04-21 06:34:34 UTC
@Jens

A patch was merged back in 5.9 to address all AER logs being sent as 'ERROR' and being unable to filter those.

The root cause of this issue still needs to be addressed, but 'corrected' errors can be filtered out of 'dmesg' now.
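
To illustrate what filtering by log level means here: each kernel log record carries a priority whose low three bits are the level (3 = err, 4 = warning), so a reader that keeps only "err" and more severe records drops messages logged as warnings. The following is a minimal Python sketch, assuming records in the "priority,sequence,timestamp,flags;message" layout used by /dev/kmsg; the helper names and the sample records are made up for illustration.

```python
# Kernel log levels by numeric value (0 is the most severe).
KERN_LEVELS = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]


def parse_kmsg_record(record):
    """Split a /dev/kmsg-style record into (level_name, message)."""
    prefix, _, message = record.partition(";")
    priority = int(prefix.split(",")[0])
    level = priority & 7  # the low three bits carry the log level
    return KERN_LEVELS[level], message


def at_least_as_severe(records, threshold="err"):
    """Yield (level, message) for records at or above the given severity."""
    limit = KERN_LEVELS.index(threshold)
    for record in records:
        level, message = parse_kmsg_record(record)
        if KERN_LEVELS.index(level) <= limit:
            yield level, message


# Hypothetical sample records: one warning-level AER message, one real error.
sample = [
    "4,1201,8885590311,-;pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0",
    "3,1202,8885600000,-;example: something actually failed",
]
for level, message in at_least_as_severe(sample):
    print(level, message)
```

With the sample above, only the second record survives the filter, since the corrected-error message is at warning level.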
Comment 20 Jens 2022-05-23 09:25:28 UTC
Hello Matt,

(In reply to Matt Jolly from comment #19)
> A patch was merged back in 5.9 to address all AER logs being sent as 'ERROR'
> and being unable to filter those.
> 
> The root cause of this issue still needs to be addressed, but 'corrected'
> errors can be filtered out of 'dmesg' now.

Great, thank you for this feedback.
I use Fedora (currently with kernel 5.17.8-300.fc36.x86_64), and for a few weeks now the sudden crashes (and the AER errors in dmesg) have no longer appeared.

Cheers!
Jens
Comment 21 Hans-Peter Jansen 2022-06-16 08:16:33 UTC
Hi,

FYI, I reported this issue about two years ago: 
https://bugzilla.kernel.org/show_bug.cgi?id=209331

In contrast to the issue reported here, I wasn't able to correlate these correctable errors with any crashes or the like, but I have been getting them since installing this server. pci=noaer does not suppress them.

Currently, the server is on 5.17.5, while I prepare to switch to 5.18.3+ soon.

Since a reboot 2.5 days ago, 7732 correctable errors were reported.

In my case, they are always reported on this device:

20:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge

and I believe this is a bridge to a physical PCIe device (an 8-channel DVB-S receiver). Changing the slot of this device changes the PCI address of the GPP Bridge (as shown in the error messages). Removing it makes these errors go away, but one of the primary purposes of this server is to drive exactly this device.
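
A tally like the one above can be reproduced from saved log text. The following is a minimal Python sketch that counts "severity=Corrected" reports per PCI address; the regular expression is an assumption based on the message formats pasted in this bug (with and without the "AER: " prefix), and `count_corrected` is a made-up helper name.

```python
import re
from collections import Counter

# Matches the summary line printed for each corrected-error report, e.g.
# "pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=..."
# The "AER: " prefix is optional because the oldest logs in this bug omit it.
CORRECTED_RE = re.compile(
    r"(?P<driver>\S+) "
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"(?:AER: )?PCIe Bus Error: severity=Corrected"
)


def count_corrected(lines):
    """Count corrected-error reports per PCI address (domain:bus:dev.fn)."""
    counts = Counter()
    for line in lines:
        match = CORRECTED_RE.search(line)
        if match:
            counts[match.group("bdf")] += 1
    return counts


# Lines copied from earlier comments in this bug.
log = [
    "[ 8885.590320] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)",
    "Oct 14 15:37:07 OBI-WAN kernel: pcieport 0000:40:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)",
    "kernel: [ 6948.820634] amdgpu 0000:09:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)",
    "kernel: [ 6948.820638] amdgpu 0000:09:00.0:    [ 6] BadTLP",
]
for bdf, count in count_corrected(log).most_common():
    print(bdf, count)
```

Only the "PCIe Bus Error" summary lines are counted, so the per-bit detail lines that follow each report do not inflate the totals.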

I agree that the root cause of this issue needs to be addressed, but what about a new option, pci=noaerwarn, that simply suppresses these correctable error messages for such pathological cases?
Comment 22 Bjorn Helgaas 2023-05-15 21:24:47 UTC
If this is still a problem, can someone please attach a complete dmesg log from a recent upstream kernel and the output of "sudo lspci -vv"?
Comment 23 Eduardo Santiago 2023-05-15 21:58:54 UTC
Created attachment 304273 [details]
dmesg 2023-05-15 from 6.1.19-gentoo-x86_64
Comment 24 Eduardo Santiago 2023-05-15 22:00:04 UTC
Created attachment 304274 [details]
lspci -vv - 2023-05-15 from 6.1.19-gentoo
Comment 25 Eduardo Santiago 2023-05-15 22:01:17 UTC
Not 100% upstream (Gentoo), and possibly not as recent as you'd like; and this isn't really a problem for me, just something to ignore in the logs. I hope the above attachments are helpful.
Comment 26 Bjorn Helgaas 2023-05-16 23:28:10 UTC
Thanks, Eduardo!  Your comment #23 dmesg shows corrected errors detected by three devices:

  02:05.0 logged Receiver Error and Bad TLP errors for TLPs from the Intel 3168NGW NIC at 05:00.0

  41:00.0 and .1 logged Bad DLLP, Bad TLP, and Replay Timer Timeout errors for TLPs received from the AMD GPP bridge at 40:01.3.

These all look like checksum errors (PCIe r6.0, sec 3.6.2.2, 3.6.3.1).  It's odd that the NIC errors are all within 30 seconds of boot, and the GPU errors are all 80+ hours later.  And it's odd that you have errors on multiple paths.  But my guess (and I'm not a hardware engineer) is that these would be signal integrity issues, not something caused by a software problem.

For posterity, in case we can pin this on the hardware involved, here is the hardware and error info from your dmesg.  If you have better platform details than "To Be Filled by O.E.M.", that would be interesting, too.

DMI: To Be Filled By O.E.M. To Be Filled By O.E.M./X399 Taichi, BIOS P3.90 12/04/2019

pci 0000:02:05.0: [1022:43b4] type 01 class 0x060400
pci 0000:05:00.0: [8086:24fb] type 00 class 0x028000
pci 0000:40:01.3: [1022:1453] type 01 class 0x060400
pci 0000:41:00.0: [1002:67ff] type 00 class 0x030000
pci 0000:41:00.1: [1002:aae0] type 00 class 0x040300

pcieport 0000:02:05.0:    [ 0] RxErr [ 6] BadTLP
amdgpu 0000:41:00.0:    [ 6] BadTLP [ 7] BadDLLP [12] Timeout
snd_hda_intel 0000:41:00.1:    [ 6] BadTLP [ 7] BadDLLP [12] Timeout
Comment 27 Matt Jolly 2023-05-17 00:42:54 UTC
Hi Bjorn,

I no longer have access to any devices that show this error, however my understanding of (and involvement with) this issue is that initially all PCI-e AER messages were sent as KERN_ERR even though Intel's docs from AER's inclusion in the kernel specify that corrected errors should be sent as KERN_WARNING. I submitted a patch (that you significantly improved and merged - https://github.com/torvalds/linux/commit/e83e2ca3c39553a9d0fd497d9c839b341e38c742) that actually sends these as warnings to enable users to filter them out.

As identified in the commit message:

> PCIe correctable errors are recovered by hardware with no need for software
> intervention (PCIe r5.0, sec 6.2.2.1).

It's quite unlikely that any of these errors have anything to do with the kernel itself - my suspicion is that they have to do with poor board firmware, electrical layout, or signal propagation/integrity issues. 

Interestingly, I seem to recall some reports (possibly in other tickets? I can't dig any out) that performance issues may be resolved by setting `pci=noaer` in the cmdline when these errors occur on a device; this may indicate that there's an overhead when processing a constant influx of AER events.

We can probably close this bug off with instructions to either:

1. Identify the specific hardware item and submit a bug for that device.
2. Suppress the warnings by either: setting `pci=noaer` (in the event of performance issues) or only logging errors (if the influx of warnings bothers the user)
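
To check whether a system is already booted with this workaround, the kernel command line can be inspected. The following is a minimal Python sketch, assuming the command line has been read as a string (for example from /proc/cmdline); `pci_options` and `aer_disabled` are made-up helper names, and the naive whitespace split ignores quoted parameter values.

```python
def pci_options(cmdline):
    """Collect every option passed via "pci=" on a kernel command line.

    "pci=" takes a comma-separated list, so "noaer" may appear next to
    other options, and the parameter may be given more than once.
    """
    options = set()
    for token in cmdline.split():
        if token.startswith("pci="):
            options.update(opt for opt in token[len("pci="):].split(",") if opt)
    return options


def aer_disabled(cmdline):
    """Return True if "pci=noaer" is present on the command line."""
    return "noaer" in pci_options(cmdline)


# Hypothetical command line used only as an example.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet pci=noaer"
print(aer_disabled(example))
```

On a running system, the string would come from reading /proc/cmdline instead of the hard-coded example.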

As an alternative, perhaps we could raise a new bug that describes the issue and workarounds for users that stumble across it while googling with the instructions to raise a hardware-specific bug?


If there's anything that you think I can help out with please reach out!

Cheers,

Matt
Comment 28 Bjorn Helgaas 2023-05-17 16:15:13 UTC
Yes, I think I'm going to close this.

PCIe Correctable Errors have already been corrected by hardware and there is no functional impact.  Generally they are caused by things like checksum errors that are probably related to signal integrity issues.  Clean and reseat the device, etc.  All the kernel can do is log these.

My goal is to remove the need to use "pci=noaer".  Users should never have to use that parameter.

Even though we've reduced the log level, I think people still use "pci=noaer" because the messages still clog up the dmesg log, so we need some kind of rate-limiting.  There's work in progress [1] to do this, and I hope to merge it for v6.5.

So I'll close this, and rephrase my advice as "please open a new bug if you need 'pci=noaer' on v6.5-rc1 (or any kernel containing the patches at [1])".


[1] https://lore.kernel.org/all/20230317175109.3859943-1-grundler@chromium.org/
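
For readers unfamiliar with the rate-limiting idea mentioned above, the following is a minimal Python sketch of an interval/burst limiter: at most `burst` messages are let through per `interval` seconds, and the rest are only counted. This is an illustration of the general technique, not the implementation in the patches at [1]; the class name and the numbers are made up.

```python
class RateLimiter:
    """Allow at most `burst` events per `interval` seconds."""

    def __init__(self, interval, burst):
        self.interval = interval
        self.burst = burst
        self.window_start = None  # start time of the current window
        self.printed = 0          # events allowed in the current window
        self.suppressed = 0       # events dropped so far, across all windows

    def allow(self, now):
        # Open a new window on the first event or once the old one expires.
        if self.window_start is None or now - self.window_start >= self.interval:
            self.window_start = now
            self.printed = 0
        if self.printed < self.burst:
            self.printed += 1
            return True
        self.suppressed += 1
        return False


# Hypothetical settings: at most 2 messages every 5 seconds.
limiter = RateLimiter(interval=5.0, burst=2)
for timestamp in (0.0, 0.1, 0.2, 5.5, 5.6, 5.7):
    if limiter.allow(timestamp):
        print(f"{timestamp:.1f}: AER corrected error logged")
print(f"{limiter.suppressed} messages suppressed")
```

Keeping a count of suppressed messages means the information that errors are still occurring is not lost, even when most of the individual lines are dropped.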
Comment 29 Paul Dufresne 2023-12-02 04:00:17 UTC
Try: "pcie_aspm=off"
taken from:
https://forums.unraid.net/topic/118286-nvme-drives-throwing-errors-filling-logs-instantly-how-to-resolve/?do=findComment&comment=1165004

I believe it works for three related problems:
SSDs, VGA cards like the GTX 660, and Realtek USB Wi-Fi devices (I am less sure about the last one).

The basic belief is that the motherboard firmware says it does not support ASPM, but the device asks for it and the request is accepted.
Comment 30 Paul Dufresne 2023-12-02 04:19:01 UTC
A new patch might be related.
I said Realtek Wi-Fi devices, but it seems to be MediaTek Wi-Fi devices:
https://lkml.kernel.org/linux-wireless/aa3386a4-c22d-6d5d-112d-f36b22cda6d3@linux.intel.com/T/

I don't know which version of Linux it will land in.
Comment 31 Paul Dufresne 2023-12-02 04:27:53 UTC
I am using pcie_aspm=off here on Linux 6.6.3 because of errors caused by a GTX 660 (a Kepler card; I believe Maxwell is affected too).
So the previous patch is probably not in 6.6.3, but frankly I am not sure.
