Bug 201517
Summary: | pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0 | ||
---|---|---|---|
Product: | Drivers | Reporter: | Mikhail (mikhail.v.gavrilov) |
Component: | PCI | Assignee: | drivers_pci (drivers_pci) |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | andreasjantz, bjorn, claystan97, dufresnep, emteeelp, hpj, jens.rosenboom, kernelbugs, Matt.Jolly, neil, phitchen90, quentinj, raulvior.bcn, twinshadows404 |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 4.19 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
dmesg
Network issue with latest kernel
dmesg with 4.19 Kernel, network issue
dmesg with 4.18 Kernel, no issues
dmesg
dmesg showing AER error.
dmesg AER ryzen 2700x msi x470 gaming pro max
dmesg 2023-05-15 from 6.1.19-gentoo-x86_64
lspci -vv - 2023-05-15 from 6.1.19-gentoo |
What does this message mean? Created attachment 279815 [details]
Network issue with latest kernel
Created attachment 279867 [details]
dmesg with 4.19 Kernel, network issue
Created attachment 279869 [details]
dmesg with 4.18 Kernel, no issues
The issue started with 4.19-RC1 https://www.lkml.org/lkml/2018/8/26/199 I also have network issues with 4.19.6, which began with 4.19-RC1. I had no issues with 4.18; dmesg here: http://ix.io/1vcF Network issues present in 4.19.6; dmesg here: http://ix.io/1vay I'm getting this same issue on kernel 5.3.6 (Unraid). On a threadripper 1950x / Asus X3990-A Motherboard. Repeated over and over in the logs. Oct 14 15:37:07 OBI-WAN kernel: pcieport 0000:40:03.1: AER: Multiple Corrected error received: 0000:00:00.0 Oct 14 15:37:07 OBI-WAN kernel: pcieport 0000:40:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID) Oct 14 15:37:07 OBI-WAN kernel: pcieport 0000:40:03.1: AER: device [1022:1453] error status/mask=00001180/00006000 [1022:1453] 40:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge I Could confirm this error also on kernel 5.4.8-arch1-1 pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Trans> Jan 06 06:28:52 ajmind1 kernel: pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00001100/00006000 Jan 06 06:28:52 ajmind1 kernel: pcieport 0000:00:03.1: AER: [ 8] Rollover Jan 06 06:28:52 ajmind1 kernel: pcieport 0000:00:03.1: AER: [12] Timeout Jan 06 06:30:21 ajmind1 kernel: pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Trans> Jan 06 06:30:21 ajmind1 kernel: pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00001100/00006000 Jan 06 06:30:21 ajmind1 kernel: pcieport 0000:00:03.1: AER: [ 8] Rollover Jan 06 06:30:21 ajmind1 kernel: pcieport 0000:00:03.1: AER: [12] Timeout Jan 06 06:31:16 ajmind1 kernel: pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Trans> Jan 06 06:31:16 ajmind1 kernel: pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00001000/00006000 00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. 
[AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453] (prog-if 00 [Normal decode]) Flags: bus master, fast devsel, latency 0, IRQ 26 Bus: primary=00, secondary=01, subordinate=01, sec-latency=0 I/O behind bridge: None Memory behind bridge: fcf00000-fcffffff [size=1M] Prefetchable memory behind bridge: None Capabilities: [50] Power Management version 3 Capabilities: [58] Express Root Port (Slot+), MSI 00 Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+ Capabilities: [c0] Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453] Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+ Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?> Capabilities: [150] Advanced Error Reporting Capabilities: [270] Secondary PCI Express <?> Capabilities: [2a0] Access Control Services Capabilities: [370] L1 PM Substates Capabilities: [3c4] Designated Vendor-Specific <?> Kernel driver in use: pcieport Everyone who experienced this issue please try workaround "pci=noaer" in kernel options. I have tested the system several months with the kernel command line "pci=noaer". And the error "pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0" not happened all testing time. (In reply to Mikhail from comment #9) > Everyone who experienced this issue please try workaround "pci=noaer" in > kernel options. > > I have tested the system several months with the kernel command line > "pci=noaer". And the error "pcieport 0000:00:03.1: AER: Corrected error > received: 0000:00:00.0" not happened all testing time. That's because you've disabled AER. AER is just PCI-e Advanced Error Reporting. According to the documentation (https://www.kernel.org/doc/Documentation/PCI/pcieaer-howto.txt), Corrected errors should be printed with a severity of "warning". >When a PCIe AER error is captured, an error message will be output to >console. If it's a correctable error, it is output as a warning. 
>Otherwise, it is printed as an error. So users could choose different >log level to filter out correctable error messages. A quick browse through aer.c shows that it only calls the pci_err function, which unsurprisingly sends an error, rather than a warning. I can also confirm that setting dmesg to 'dmesg -n err' still shows this text. This warning can be safely ignored - it's reporting a corrected error. If there's a performance impact a separate bug should be filed. I'll submit a patch in the near future to address this behaviour. I met the familiar problem in debian unstable kernel version : 5.8 when boot system the kernel log : -- Logs begin at Sat 2020-08-29 23:13:02 CST. -- 9月 03 16:46:17 stan kernel: nvme 0000:01:00.0: AER: device [1e0f:0009] error status/mask=00001000/00006000 9月 03 16:46:17 stan kernel: nvme 0000:01:00.0: AER: [12] Timeout 9月 03 16:46:17 stan kernel: pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0 9月 03 16:46:17 stan kernel: nvme 0000:01:00.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID) 9月 03 16:46:17 stan kernel: nvme 0000:01:00.0: AER: device [1e0f:0009] error status/mask=00001000/00006000 9月 03 16:46:17 stan kernel: nvme 0000:01:00.0: AER: [12] Timeout 9月 03 16:46:48 stan kernel: pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0 9月 03 16:46:48 stan kernel: nvme 0000:01:00.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID) 9月 03 16:46:48 stan kernel: nvme 0000:01:00.0: AER: device [1e0f:0009] error status/mask=00001000/00006000 9月 03 16:46:48 stan kernel: nvme 0000:01:00.0: AER: [12] Timeout Created attachment 292511 [details]
dmesg
I doubt this is (only) a network issue, as I found a way to reproduce this 100% of the time:
change the BCLK from 100 to 105.
A few observations:
When the mouse button is pressed and dragged over some text, it will randomly stop highlighting the text and start dragging it.
NVMe prints these errors as well.
When the GPU is at its lowest power state (300 MHz), it won't stay fixed but rather fluctuates a couple of MHz above.
On Windows, after every shutdown WerFault.exe will prevent the shutdown for a short time.
From that it seems that the issue is with the USB connections and that it's not only affecting Linux.
Created attachment 293759 [details] dmesg showing AER error. After a few days with the system online, dmesg showed an AER: > pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0 > pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data > Link Layer, (Transmitter ID) > pcieport 0000:00:03.1: AER: device [1022:1453] error > status/mask=00001000/00006000 > pcieport 0000:00:03.1: AER: [12] Timeout Afterwards, amdgpu printed this error: > [drm] amdgpu_dm_irq_schedule_work FAILED src 12 > [drm] amdgpu_dm_irq_schedule_work FAILED src 12 > [drm] amdgpu_dm_irq_schedule_work FAILED src 12 > [drm] amdgpu_dm_irq_schedule_work FAILED src 12 > [drm] amdgpu_dm_irq_schedule_work FAILED src 12 > [drm] amdgpu_dm_irq_schedule_work FAILED src 12 > [drm] amdgpu_dm_irq_schedule_work FAILED src 12 The graphics card is hanging from the PCIe bridge corresponding to device [1022:1453]. Linux version 5.8.16-050816-generic. Ubuntu 20.04 LTS. I have appended the output of lspci -vnnt and lspci -nn. This might be a CPU-SoC problem, related to the idle soft lockups. Poor CPU binning might be at play. (In reply to raul from comment #13) The errors showed up a few hours far from each other. 
[de nov.20 23:52] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0 [ +0,000006] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID) [ +0,000004] pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00001000/00006000 [ +0,000002] pcieport 0000:00:03.1: AER: [12] Timeout [de nov.21 15:00] [drm] amdgpu_dm_irq_schedule_work FAILED src 12 [ +0,001199] [drm] amdgpu_dm_irq_schedule_work FAILED src 12 [de nov.21 15:01] [drm] amdgpu_dm_irq_schedule_work FAILED src 12 [ +0,003955] [drm] amdgpu_dm_irq_schedule_work FAILED src 12 [ +0,002941] [drm] amdgpu_dm_irq_schedule_work FAILED src 12 [ +0,001758] [drm] amdgpu_dm_irq_schedule_work FAILED src 12 [de nov.21 15:08] [drm] amdgpu_dm_irq_schedule_work FAILED src 12 (In reply to raul from comment #14) The amdgpu_dm_irq_schedule_work occurred today while changing the monitor. Then, while I was changing resolutions within the GNOME settings panel it crashed while enabling/disabling FreeSync in the monitor. 
Dmesg shows this errors (some of the previous included to show nothing else happened): > [de nov.21 15:01] [drm] amdgpu_dm_irq_schedule_work FAILED src 12 > [ +0,003955] [drm] amdgpu_dm_irq_schedule_work FAILED src 12 > [ +0,002941] [drm] amdgpu_dm_irq_schedule_work FAILED src 12 > [ +0,001758] [drm] amdgpu_dm_irq_schedule_work FAILED src 12 > [de nov.21 15:08] [drm] amdgpu_dm_irq_schedule_work FAILED src 12 > [de nov.21 20:02] show_signal_msg: 15 callbacks suppressed > [ +0,000001] gnome-control-c[1019094]: segfault at 7ffcd97dbf88 ip > 00007f26251bfa80 sp 00007ffcd97dbf90 error 6 in > libc-2.31.so[7f262516b000+178000] > [ +0,000008] Code: 89 fb 83 e3 02 0f 85 2f 04 00 00 f3 41 0f 6f 0e 49 8b 46 > 10 be 25 00 00 00 4c 89 ef 48 89 85 c8 fb ff ff 0f 11 8d b8 fb ff ff <e8> 0b > b9 fa ff 41 81 e7 00 80 00 00 c7 85 44 fb ff ff 00 00 00 00 > [ +0,473765] [drm:dm_force_atomic_commit.isra.0 [amdgpu]] *ERROR* Restoring > old state failed with -12 > [de nov.21 20:03] [drm] amdgpu_dm_irq_schedule_work FAILED src 12 > [de nov.21 20:52] device-mapper: table: 253:0: linear: Device lookup failed > [ +0,000003] device-mapper: ioctl: error adding target to table Please note that the AER error did not happen while changing or using a new screen. I think I was watching DVB-T with Kaffeine, which I believe takes advantage of hardware acceleration to decode the video stream. I have had these AER errors with other motherboards, not related with the GPU but with bridges connecting USB ports. The DVB-T dongle happens to be connected through USB. The system currently has a Ryzen 7 2700X installed. The other errors happened with a Ryzen 7 1800X and ASUS Crosshair VII motherboard. Error quoted in comment https://bugzilla.kernel.org/show_bug.cgi?id=208201#c12 and https://bugzilla.kernel.org/show_bug.cgi?id=208201#c13 from another bug in which I'm showing ryzen 7 1800X soft lockups with different motherboards, until I found a motherboard which has a working "Typical power idle" setting. 
Created attachment 293761 [details]
dmesg AER ryzen 2700x msi x470 gaming pro max
Got another one. See the attached dmesg, which includes all the previous ones. I was using the DVB-T USB dongle with Kaffeine.
I appended the lsusb -vt output. You can ignore the device-mapper errors; they are just kernel upgrades scanning disks and failing to access bcache.
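To see how often each port reports a corrected error in a log like the ones attached above, the port-level lines can be tallied. The following is a minimal Python sketch, assuming the log has been saved as text and that its lines follow the "AER: Corrected error received" format quoted in this thread; the function name `tally_corrected` is made up for illustration.

```python
import re
from collections import Counter

# Matches the port-level line quoted in this thread, for example:
#   pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
# and the "Multiple Corrected error received" variant.
RECEIVED = re.compile(
    r"(?P<port>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): AER: "
    r"(?:Multiple )?Corrected error received: "
    r"(?P<source>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"
)


def tally_corrected(lines):
    """Count corrected-error reports per (reporting port, error source) pair.

    `lines` is any iterable of log lines; lines that do not match the
    assumed format are ignored.
    """
    counts = Counter()
    for line in lines:
        match = RECEIVED.search(line)
        if match:
            counts[(match.group("port"), match.group("source"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[de nov.20 23:52] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0",
        "[de nov.21 15:00] [drm] amdgpu_dm_irq_schedule_work FAILED src 12",
    ]
    for (port, source), n in tally_corrected(sample).items():
        print(f"{port} reported {n} corrected error(s) from {source}")
```

A tally like this only shows which link is noisy; it says nothing about the cause.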
Same problem here after switching from kernel 5.9.12 to 5.10.4 (SuSE Tumbleweed). Since then I get the following errors on a daily basis: kernel: [41896.255169] [drm:dm_restore_drm_connector_state [amdgpu]] *ERROR* Restoring old state failed with -12 kernel: [ 6948.820626] pcieport 0000:00:03.1: AER: Multiple Corrected error received: 0000:09:00.0 kernel: [ 6948.820634] amdgpu 0000:09:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID) kernel: [ 6948.820636] amdgpu 0000:09:00.0: device [1002:67df] error status/mask=000000c0/00002000 kernel: [ 6948.820638] amdgpu 0000:09:00.0: [ 6] BadTLP kernel: [ 6948.820639] amdgpu 0000:09:00.0: [ 7] BadDLLP Linux frodo 5.10.5-1-default #1 SMP Wed Jan 6 18:20:57 UTC 2021 (19815f3) x86_64 x86_64 x86_64 GNU/Linux 01: None 00.0: 10103 CPU [Created at cpu.462] Unique ID: rdCR.j8NaKXDZtZ6 Hardware Class: cpu Arch: X86-64 Vendor: "AuthenticAMD" Model: 23.8.2 "AMD Ryzen 7 2700X Eight-Core Processor" Features: fpu,vme,de,pse,tsc,msr,pae,mce,cx8,apic,sep,mtrr,pge,mca,cmov,pat,pse36,clflush,mmx,fxsr,sse,sse2,ht,syscall,nx,mmxext,fxsr_opt,pdpe1gb,rdtscp,lm,constant_tsc,rep_good,nopl,nonstop_tsc,cpuid,extd_apicid,aperfmperf,pni,pclmulqdq,monitor,ssse3,fma,cx16,sse4_1,sse4_2,movbe,popcnt,aes,xsave,avx,f16c,rdrand,lahf_lm,cmp_legacy,svm,extapic,cr8_legacy,abm,sse4a,misalignsse,3dnowprefetch,osvw,skinit,wdt,tce,topoext,perfctr_core,perfctr_nb,bpext,perfctr_llc,mwaitx,cpb,hw_pstate,sme,ssbd,sev,ibpb,vmmcall,sev_es,fsgsbase,bmi1,avx2,smep,bmi2,rdseed,adx,smap,clflushopt,sha_ni,xsaveopt,xsavec,xgetbv1,xsaves,clzero,irperf,xsaveerptr,arat,npt,lbrv,svm_lock,nrip_save,tsc_scale,vmcb_clean,flushbyasid,decodeassists,pausefilter,pfthreshold,avic,v_vmsave_vmload,vgif,overflow_recov,succor,smca Clock: 1836 MHz BogoMips: 7384.98 Cache: 512 kb Units/Processor: 16 Config Status: cfg=no, avail=yes, need=no, active=unknown lspci -nn 00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. 
[AMD] Family 17h (Models 00h-0fh) Root Complex [1022:1450] 00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit [1022:1451] 00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452] 00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453] 00:01.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453] 00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452] 00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452] 00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453] 00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452] 00:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452] 00:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454] 00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452] 00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454] 00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 59) 00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51) 00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0 [1022:1460] 00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. 
[AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1 [1022:1461] 00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2 [1022:1462] 00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3 [1022:1463] 00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4 [1022:1464] 00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5 [1022:1465] 00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6 [1022:1466] 00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7 [1022:1467] 01:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808] 02:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Device [1022:43d0] (rev 01) 02:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller [1022:43c8] (rev 01) 02:00.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Bridge [1022:43c6] (rev 01) 03:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01) 03:04.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01) 03:06.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01) 03:07.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01) 03:09.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01) 05:00.0 USB controller [0c03]: ASMedia Technology Inc. 
ASM1142 USB 3.1 Host Controller [1b21:1242] 07:00.0 Ethernet controller [0200]: Intel Corporation I211 Gigabit Network Connection [8086:1539] (rev 03) 09:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] [1002:67df] (rev ef) 09:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590] [1002:aaf0] 0a:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function [1022:145a] 0a:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor [1022:1456] 0a:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller [1022:145f] 0b:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function [1022:1455] 0b:00.2 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51) 0b:00.3 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller [1022:1457] Hello Matt and all, Using Fedora 35, since some weeks I get these corrected AER "errors" from my AMD-GPU multiple times a day. Most of the time the (two) monitors get blanked (it might well be related to the screensaver and power-settings, but that is not 100% clear) and (after I move the mouse) the windows from left screen are shifted to the right/main screen. (As if the left was disconnected for a short time.) Current kernel: 5.16.14-200.fc35.x86_64+debug AMD-GPU: lspci says "[AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X] [1002:699f] (rev c7)" Manually increasing the GPU fan seems to help a little bit, but not much. (In reply to Matt Jolly from comment #10) > I'll submit a patch in the near future to address this behaviour. 
@Matt, did you manage to submit a patch already? Cheers, Jens
@Jens A patch was merged back in 5.9 to address all AER logs being sent as 'ERROR', which made it impossible to filter them. The root cause of this issue still needs to be addressed, but 'corrected' errors can be filtered out of 'dmesg' now.
Hello Matt, (In reply to Matt Jolly from comment #19) > A patch was merged back in 5.9 to address all AER logs being sent as 'ERROR' > and being unable to filter those. > > The root cause of this issue still needs to be addressed, but 'corrected' > errors can be filtered out of 'dmesg' now. Great, thank you for this feedback. I use Fedora (with kernel 5.17.8-300.fc36.x86_64 currently) and for a few weeks now the sudden crashes (and AER errors in dmesg) have not appeared any longer. Cheers! Jens
Hi, FYI, I reported this issue about two years ago: https://bugzilla.kernel.org/show_bug.cgi?id=209331
In contrast to the issue reported here, I wasn't able to correlate these correctable errors with any crashes or the like, but I have been getting them since installing this server. pci=noaer does not suppress them. Currently, the server is on 5.17.5, and I am preparing to switch to 5.18.3+ soon. Since a reboot 2.5 days ago, 7732 correctable errors were reported.
In my case, they are always reported on this device: 20:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge and I believe this is a bridge to a physical PCIe device (8 channel DVB-S receiver). Changing the slot of this device changes the PCI address of the GPP Bridge (as shown in the error messages). Removing it makes these errors go away, but one of the primary purposes of this server is exactly driving this device.
I agree that the root cause of this issue needs to be addressed, but what about a new option pci=noaerwarn that simply suppresses these correctable error messages for such pathological cases?
If this is still a problem, can someone please attach a complete dmesg log from a recent upstream kernel and the output of "sudo lspci -vv"? Created attachment 304273 [details]
dmesg 2023-05-15 from 6.1.19-gentoo-x86_64
Created attachment 304274 [details]
lspci -vv - 2023-05-15 from 6.1.19-gentoo
Not 100% upstream (Gentoo), and possibly not as recent as you'd like; and this isn't really a problem for me, just something to ignore in the logs. I hope the above attachments are helpful.
Thanks, Eduardo! Your comment #23 dmesg shows corrected errors detected by three devices:
02:05.0 logged Receiver Error and Bad TLP errors for TLPs from the Intel 3168NGW NIC at 05:00.0
41:00.0 and .1 logged Bad DLLP, Bad TLP, and Replay Timer Timeout errors for TLPs received from the AMD GPP bridge at 40:01.3.
These all look like checksum errors (PCIe r6.0, sec 3.6.2.2, 3.6.3.1).
It's odd that the NIC errors are all within 30 seconds of boot, and the GPU errors are all 80+ hours later. And it's odd that you have errors on multiple paths. But my guess (and I'm not a hardware engineer) is that these would be signal integrity issues, not something caused by a software problem.
For posterity, in case we can pin this on the hardware involved, here is the hardware and error info from your dmesg. If you have better platform details than "To Be Filled by O.E.M.", that would be interesting, too.
DMI: To Be Filled By O.E.M. To Be Filled By O.E.M./X399 Taichi, BIOS P3.90 12/04/2019
pci 0000:02:05.0: [1022:43b4] type 01 class 0x060400
pci 0000:05:00.0: [8086:24fb] type 00 class 0x028000
pci 0000:40:01.3: [1022:1453] type 01 class 0x060400
pci 0000:41:00.0: [1002:67ff] type 00 class 0x030000
pci 0000:41:00.1: [1002:aae0] type 00 class 0x040300
pcieport 0000:02:05.0: [ 0] RxErr [ 6] BadTLP
amdgpu 0000:41:00.0: [ 6] BadTLP [ 7] BadDLLP [12] Timeout
snd_hda_intel 0000:41:00.1: [ 6] BadTLP [ 7] BadDLLP [12] Timeout
Hi Bjorn, I no longer have access to any devices that show this error, however my understanding of (and involvement with) this issue is that initially all PCI-e AER messages were sent as KERN_ERR even though Intel's docs from AER's inclusion in the kernel specify that corrected errors should be sent as KERN_WARNING.
I submitted a patch (that you significantly improved and merged - https://github.com/torvalds/linux/commit/e83e2ca3c39553a9d0fd497d9c839b341e38c742) that actually sends these as warnings to enable users to filter them out. As identified in the commit message:
> PCIe correctable errors are recovered by hardware with no need for software
> intervention (PCIe r5.0, sec 6.2.2.1).
It's quite unlikely that any of these errors have anything to do with the kernel itself - my suspicion is that they have to do with poor board firmware, electrical layout, or signal propagation/integrity issues.
Interestingly, I seem to recall some reports (possibly in other tickets? I can't dig any out) that performance issues may be resolved by setting `pci=noaer` in the cmdline when these errors occur on a device; this may indicate that there's an overhead when processing a constant influx of AER events.
We can probably close this bug off with instructions to either:
1. Identify the specific hardware item and submit a bug for that device.
2. Suppress the warnings by either setting `pci=noaer` (in the event of performance issues) or only logging errors (if the influx of warnings bothers the user).
As an alternative, perhaps we could raise a new bug that describes the issue and workarounds for users that stumble across it while googling with the instructions to raise a hardware-specific bug?
If there's anything that you think I can help out with please reach out!
Cheers, Matt
Yes, I think I'm going to close this. PCIe Correctable Errors have already been corrected by hardware and there is no functional impact. Generally they are caused by things like checksum errors that are probably related to signal integrity issues. Clean and reseat the device, etc. All the kernel can do is log these. My goal is to remove the need to use "pci=noaer". Users should never have to use that parameter.
Even though we've reduced the log level, I think people still use "pci=noaer" because the messages still clog up the dmesg log, so we need some kind of rate-limiting. There's work in progress [1] to do this, and I hope to merge it for v6.5. So I'll close this, and rephrase my advice as "please open a new bug if you need 'pci=noaer' on v6.5-rc1 (or any kernel containing the patches at [1])".
[1] https://lore.kernel.org/all/20230317175109.3859943-1-grundler@chromium.org/
Try: "pcie_aspm=off", taken from: https://forums.unraid.net/topic/118286-nvme-drives-throwing-errors-filling-logs-instantly-how-to-resolve/?do=findComment&comment=1165004
I believe it works for the 3 related problems: SSDs, VGA cards like the GTX 660, and Realtek USB wifi devices (I am less sure about this last one). The basic belief is that the motherboard firmware says it does not support ASPM, but the device asks for it, and the request is accepted.
Some new patch might be related: I said Realtek wifi devices, but it seems to be MediaTek wifi devices: https://lkml.kernel.org/linux-wireless/aa3386a4-c22d-6d5d-112d-f36b22cda6d3@linux.intel.com/T/
I don't know which version of Linux it should land in. I am using pcie_aspm=off here on Linux 6.6.3 because of a GTX 660 (Kepler card... I believe it affects Maxwell too). So the previous patch is probably not in 6.6.3... not sure, frankly. |
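The point made in the comments above, that corrected errors logged at warning level can be filtered out while real errors stay visible, can be illustrated with a small Python sketch. It assumes each input line carries a syslog-style `<N>` priority prefix (the form that `dmesg --raw` is understood to print), where the low three bits of N are the severity; the helper name `keep_at_or_above` and the sample lines are hypothetical.

```python
import re

# Syslog severity levels as used by the kernel (lower number = more severe).
LEVELS = {
    "emerg": 0, "alert": 1, "crit": 2, "err": 3,
    "warning": 4, "notice": 5, "info": 6, "debug": 7,
}

# Assumed line shape: "<N>[timestamp] message", N being the priority value.
PREFIX = re.compile(r"^<(\d+)>")


def keep_at_or_above(lines, threshold="err"):
    """Return the lines whose severity is at least as severe as `threshold`.

    Lines without a "<N>" prefix are dropped, because their level is unknown.
    """
    limit = LEVELS[threshold]
    kept = []
    for line in lines:
        match = PREFIX.match(line)
        if not match:
            continue
        severity = int(match.group(1)) & 7  # low three bits hold the level
        if severity <= limit:
            kept.append(line)
    return kept


if __name__ == "__main__":
    sample = [
        "<4>[ 8885.590311] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0",
        "<3>[ 8890.000000] example driver: hypothetical uncorrected failure",
    ]
    for line in keep_at_or_above(sample, "err"):
        print(line)
```

With the threshold at "err", the warning-level AER line is dropped and only the error-level line remains, which is the filtering the 5.9 change was meant to allow.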
Created attachment 279149 [details] dmesg I often get a strange error in the kernel log: [ 8885.590311] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0 [ 8885.590320] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID) [ 8885.590324] pcieport 0000:00:03.1: device [1022:1453] error status/mask=00001000/00006000 [ 8885.590328] pcieport 0000:00:03.1: [12] Timeout But not always, it means that if this message starts to appear after a reboot, then it will appear again and again, and if it does not appear, it does not appear at all. # lspci -nn 00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex [1022:1450] 00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit [1022:1451] 00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge [1022:1452] 00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453] 00:01.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453] 00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge [1022:1452] 00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge [1022:1452] 00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453] 00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge [1022:1452] 00:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge [1022:1452] 00:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. 
[AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454] 00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge [1022:1452] 00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454] 00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 59) 00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51) 00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0 [1022:1460] 00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1 [1022:1461] 00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2 [1022:1462] 00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3 [1022:1463] 00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4 [1022:1464] 00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5 [1022:1465] 00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6 [1022:1466] 00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7 [1022:1467] 01:00.0 Non-Volatile memory controller [0108]: Intel Corporation Optane SSD 900P Series [8086:2700] 02:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Device [1022:43d0] (rev 01) 02:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] Device [1022:43c8] (rev 01) # uname -r 4.19.0-0.rc8.git4.1.fc30.x86_64
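As for what the message means: the `error status/mask=00001000/00006000` field is a pair of hexadecimal bitmasks, and the bracketed number in lines such as `[12] Timeout` is the index of a set status bit. The Python sketch below decodes such a field. The bit positions are those of the PCIe AER Correctable Error Status register; the short names for bits 0, 6, 7, 8 and 12 are the ones that appear in the logs in this thread, while the names for bits 13 to 15 are an assumption about how the kernel labels them. The function names are made up for illustration.

```python
# Bit index -> short name for the AER Correctable Error Status register.
# Names for bits 13-15 do not appear in this thread and are assumed.
CORRECTABLE_BITS = {
    0: "RxErr",         # Receiver Error
    6: "BadTLP",        # Bad TLP
    7: "BadDLLP",       # Bad DLLP
    8: "Rollover",      # REPLAY_NUM Rollover
    12: "Timeout",      # Replay Timer Timeout
    13: "NonFatalErr",  # Advisory Non-Fatal Error (assumed label)
    14: "CorrIntErr",   # Corrected Internal Error (assumed label)
    15: "HeaderOF",     # Header Log Overflow (assumed label)
}


def decode_bits(value):
    """Return (bit index, name) for every set bit in a 32-bit value."""
    return [
        (bit, CORRECTABLE_BITS.get(bit, "unknown"))
        for bit in range(32)
        if value & (1 << bit)
    ]


def decode_status_mask(field):
    """Decode a 'status/mask' field such as '00001000/00006000'.

    Returns a pair: the decoded status bits and the decoded mask bits.
    """
    status_hex, mask_hex = field.split("/")
    return decode_bits(int(status_hex, 16)), decode_bits(int(mask_hex, 16))


if __name__ == "__main__":
    status, mask = decode_status_mask("00001000/00006000")
    print("status:", status)
    print("mask:  ", mask)
```

For the value in the original report, only bit 12 is set in the status, which matches the `[12] Timeout` line that follows it in the log.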