Created attachment 292547 [details]
errors, device details and initialization sequence

Hi,

I'm seeing ~2400 reproducible hardware errors per day on an ASUS KNPA-U16 server motherboard with an AMD EPYC 7261, 2x32 GB Kingston KSM26RD4/32MEI, and a Digital Devices Max S8 DVB card.

Here's a typical one:

2020-09-16T16:50:09.985156+02:00 server kernel: [12494.804769] {401}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
2020-09-16T16:50:09.985210+02:00 server kernel: [12494.804773] {401}[Hardware Error]: It has been corrected by h/w and requires no further action
2020-09-16T16:50:09.985216+02:00 server kernel: [12494.804774] {401}[Hardware Error]: event severity: corrected
2020-09-16T16:50:09.985219+02:00 server kernel: [12494.804777] {401}[Hardware Error]:  Error 0, type: corrected
2020-09-16T16:50:09.985222+02:00 server kernel: [12494.804778] {401}[Hardware Error]:  fru_text: PcieError
2020-09-16T16:50:09.985277+02:00 server kernel: [12494.804779] {401}[Hardware Error]:   section_type: PCIe error
2020-09-16T16:50:09.985279+02:00 server kernel: [12494.804781] {401}[Hardware Error]:   port_type: 4, root port
2020-09-16T16:50:09.985281+02:00 server kernel: [12494.804782] {401}[Hardware Error]:   version: 0.2
2020-09-16T16:50:09.985284+02:00 server kernel: [12494.804784] {401}[Hardware Error]:   command: 0x0407, status: 0x0010
2020-09-16T16:50:09.985285+02:00 server kernel: [12494.804785] {401}[Hardware Error]:   device_id: 0000:40:03.1
2020-09-16T16:50:09.985287+02:00 server kernel: [12494.804787] {401}[Hardware Error]:   slot: 16
2020-09-16T16:50:09.985289+02:00 server kernel: [12494.804787] {401}[Hardware Error]:   secondary_bus: 0x41
2020-09-16T16:50:09.985290+02:00 server kernel: [12494.804789] {401}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x1453
2020-09-16T16:50:09.985292+02:00 server kernel: [12494.804790] {401}[Hardware Error]:   class_code: 060400
2020-09-16T16:50:09.985294+02:00 server kernel: [12494.804791] {401}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0012
2020-09-16T16:50:09.985310+02:00 server kernel: [12494.804904] pcieport 0000:40:03.1: AER: aer_status: 0x00001000, aer_mask: 0x00006000
2020-09-16T16:50:09.985312+02:00 server kernel: [12494.804910] pcieport 0000:40:03.1: AER:    [12] Timeout
2020-09-16T16:50:09.985314+02:00 server kernel: [12494.804914] pcieport 0000:40:03.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
2020-09-16T16:50:15.621082+02:00 server kernel: [12500.436583] pcieport 0000:40:03.1: AER: aer_status: 0x00001000, aer_mask: 0x00006000
2020-09-16T16:50:15.621124+02:00 server kernel: [12500.436588] pcieport 0000:40:03.1: AER:    [12] Timeout
2020-09-16T16:50:15.621128+02:00 server kernel: [12500.436592] pcieport 0000:40:03.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID

Without the Max S8, no such errors appear.

What's really nagging is that pci=noaer and a lot of other experiments with different options don't stop this behavior. These are the options I tried:

pci=nomsi
pci=noaer
pcie_aspm=off
pci=nomsi,noaer,ioapicreroute
pci=nommconf

The error is reported by a bridge device that the kernel creates when the Max S8 is plugged in:

40:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
    Flags: fast devsel, NUMA node 2

40:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge (prog-if 00 [Normal decode])
    Flags: bus master, fast devsel, latency 0, IRQ 39, NUMA node 2
    Bus: primary=40, secondary=41, subordinate=41, sec-latency=0
    I/O behind bridge: None
    Memory behind bridge: e5d00000-e5dfffff [size=1M]
    Prefetchable memory behind bridge: None
    Capabilities: [50] Power Management version 3
    Capabilities: [58] Express Root Port (Slot+), MSI 00
    Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Capabilities: [c0] Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
    Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+
    Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
    Capabilities: [150] Advanced Error Reporting
    Capabilities: [270] #19
    Capabilities: [2a0] Access Control Services
    Capabilities: [370] L1 PM Substates
    Capabilities: [380] Downstream Port Containment
    Capabilities: [3c4] #23
    Kernel driver in use: pcieport

That bridges to:

41:00.0 Multimedia controller: Digital Devices GmbH Max
    Subsystem: Digital Devices GmbH Max S8 4/8
    Flags: bus master, fast devsel, latency 0, IRQ 165, NUMA node 2
    Memory at e5d00000 (64-bit, non-prefetchable) [size=64K]
    Capabilities: [50] Power Management version 3
    Capabilities: [70] MSI: Enable+ Count=2/2 Maskable- 64bit+
    Capabilities: [90] Express Endpoint, MSI 00
    Capabilities: [100] Vendor Specific Information: ID=0000 Rev=0 Len=00c <?>
    Kernel driver in use: ddbridge
    Kernel modules: ddbridge

Here's the initialization sequence of these devices:

Sep 16 13:22:07 server kernel: pci 0000:40:03.0: [1022:1452] type 00 class 0x060000
Sep 16 13:22:07 server kernel: pci 0000:40:03.1: [1022:1453] type 01 class 0x060400
Sep 16 13:22:07 server kernel: pci 0000:40:03.1: PME# supported from D0 D3hot D3cold
Sep 16 13:22:07 server kernel: pci 0000:41:00.0: [dd01:0007] type 00 class 0x048000
Sep 16 13:22:07 server kernel: pci 0000:41:00.0: reg 0x10: [mem 0xe5d00000-0xe5d0ffff 64bit]
Sep 16 13:22:07 server kernel: pci 0000:40:03.1: PCI bridge to [bus 41]
Sep 16 13:22:07 server kernel: pci 0000:40:03.1:   bridge window [mem 0xe5d00000-0xe5dfffff]
Sep 16 13:22:07 server kernel: pci 0000:40:03.1: PCI bridge to [bus 41]
Sep 16 13:22:07 server kernel: pci 0000:40:03.1:   bridge window [mem 0xe5d00000-0xe5dfffff]
Sep 16 13:22:07 server kernel: pci 0000:40:03.0: Adding to iommu group 40
Sep 16 13:22:07 server kernel: pci 0000:40:03.1: Adding to iommu group 41
Sep 16 13:22:07 server kernel: pci 0000:41:00.0: Adding to iommu group 47
Sep 16 13:22:07 server kernel: pcieport 0000:40:03.1: PME: Signaling with IRQ 39
Sep 16 13:22:07 server kernel: pcieport 0000:40:03.1: AER: enabled with IRQ 39
Sep 16 13:22:07 server kernel: pcieport 0000:40:03.1: DPC: enabled with IRQ 39
Sep 16 13:22:07 server kernel: pcieport 0000:40:03.1: DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 6, DL_ActiveErr+
Sep 16 13:22:15 server kernel: ddbridge: Digital Devices PCIE bridge driver 0.9.33-integrated, Copyright (C) 2010-17 Digital Devices GmbH
Sep 16 13:22:15 server kernel: ddbridge 0000:41:00.0: detected Digital Devices MAX S8 4/8
Sep 16 13:22:15 server kernel: ddbridge 0000:41:00.0: HW 0101000f REGMAP 00010002
Sep 16 13:22:15 server kernel: ddbridge 0000:41:00.0: using 2 MSI interrupt(s)
Sep 16 13:22:15 server kernel: ddbridge 0000:41:00.0: Port 0: Link 0, Link Port 0 (TAB 1): DUAL DVB-S2 MAX
Sep 16 13:22:15 server kernel: ddbridge 0000:41:00.0: Port 1: Link 0, Link Port 1 (TAB 2): DUAL DVB-S2 MAX
Sep 16 13:22:15 server kernel: ddbridge 0000:41:00.0: Port 2: Link 0, Link Port 2 (TAB 3): DUAL DVB-S2 MAX
Sep 16 13:22:15 server kernel: ddbridge 0000:41:00.0: Port 3: Link 0, Link Port 3 (TAB 4): DUAL DVB-S2 MAX
Sep 16 13:22:15 server kernel: ddbridge 0000:41:00.0: Read mxl_fw from link 0
Sep 16 13:22:21 server kernel: ddbridge 0000:41:00.0: Set fmode link 0 = 1
Sep 16 13:22:21 server kernel: ddbridge 0000:41:00.0: DVB: registering adapter 0 frontend 0 (MaxLinear MxL5xx DVB-S/S2 tuner-demodulator)...
Sep 16 13:22:21 server kernel: ddbridge 0000:41:00.0: DVB: registering adapter 1 frontend 0 (MaxLinear MxL5xx DVB-S/S2 tuner-demodulator)...
Sep 16 13:22:21 server kernel: ddbridge 0000:41:00.0: DVB: registering adapter 2 frontend 0 (MaxLinear MxL5xx DVB-S/S2 tuner-demodulator)...
Sep 16 13:22:21 server kernel: ddbridge 0000:41:00.0: DVB: registering adapter 3 frontend 0 (MaxLinear MxL5xx DVB-S/S2 tuner-demodulator)...
Sep 16 13:22:21 server kernel: ddbridge 0000:41:00.0: DVB: registering adapter 4 frontend 0 (MaxLinear MxL5xx DVB-S/S2 tuner-demodulator)...
Sep 16 13:22:21 server kernel: ddbridge 0000:41:00.0: DVB: registering adapter 5 frontend 0 (MaxLinear MxL5xx DVB-S/S2 tuner-demodulator)...
Sep 16 13:22:21 server kernel: ddbridge 0000:41:00.0: DVB: registering adapter 6 frontend 0 (MaxLinear MxL5xx DVB-S/S2 tuner-demodulator)...
Sep 16 13:22:21 server kernel: ddbridge 0000:41:00.0: DVB: registering adapter 7 frontend 0 (MaxLinear MxL5xx DVB-S/S2 tuner-demodulator)...

openSUSE 15.2, Kernel 5.8.10 (reproduced with the original 15.2 kernel 5.3.18 and with kernels up to 5.8.10)

Swapping the PCIe slot shifts the PCI addresses (on another port, the bridge was at 60:03.1, while the S8 was at 62:00.0), but the problem persisted.

Here's the related kernel thread (without any responses!):
https://marc.info/?l=linux-kernel&m=159328222416136&w=2

The Max S8 is updated to the current firmware, and has been tested with the manufacturer's drivers as well. It worked in another, Intel-based motherboard for a few years without any issues, and is still working well.

The 8 DVB tuners work fine. VDR is serving a few set-top boxes with TV and recordings. No issues. It just swamps the log for no good reason (from my POV).

Neither ASUS nor Digital Devices seems to be able to explain this behavior.

I've attached the technical details and a couple of such errors for better investigation.
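
For readers decoding the aer_status and aer_mask values in the log above, here is a minimal Python sketch. It assumes the bit layout of the PCIe AER Correctable Error Status register and the short names the Linux AER driver prints (the "[12] Timeout" line in the log matches bit 12). The table and the helper name decode_correctable are illustrative, not taken from the kernel source.

```python
# Assumed bit layout of the PCIe AER Correctable Error Status register,
# labelled with the short names the Linux AER driver prints in its logs.
AER_CORRECTABLE_BITS = {
    0: "RxErr",         # Receiver Error
    6: "BadTLP",        # Bad TLP
    7: "BadDLLP",       # Bad DLLP
    8: "Rollover",      # REPLAY_NUM Rollover
    12: "Timeout",      # Replay Timer Timeout
    13: "NonFatalErr",  # Advisory Non-Fatal Error
    14: "CorrIntErr",   # Corrected Internal Error
    15: "HeaderOF",     # Header Log Overflow
}


def decode_correctable(value):
    """Return the names of all bits set in a correctable status/mask value."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            # Unknown bits are reported by position instead of being dropped.
            names.append(AER_CORRECTABLE_BITS.get(bit, f"bit{bit}"))
    return names


print(decode_correctable(0x00001000))  # aer_status from the report
print(decode_correctable(0x00006000))  # aer_mask from the report
```

Under these assumptions, the reported status is a Replay Timer Timeout on the root port's transmitter side, and the mask covers only the Advisory Non-Fatal and Corrected Internal Error bits, so the timeout bit itself is not masked.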
Created attachment 292551 [details]
Here's a little broader boot log
Meanwhile, I tried to get rid of the errors the hard way, but failed. I tried a couple of combinations of:

pci=nomsi
pci=noaer
pcie_aspm=off
pci=nomsi,noaer,ioapicreroute
pci=nommconf

Interestingly, *none* of them turned off these errors.
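
To compare boots with different kernel parameters, one option is to count the AER status lines per device in a saved log. The Python sketch below assumes log lines in the "pcieport <address>: AER: aer_status: <hex>" shape shown in the original report. The function name count_aer_events is hypothetical, and the sample lines are copied from that report.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... pcieport 0000:40:03.1: AER: aer_status: 0x00001000, aer_mask: 0x00006000
AER_LINE = re.compile(r"pcieport (\S+): AER: aer_status: (0x[0-9a-fA-F]+)")


def count_aer_events(lines):
    """Count AER status reports, keyed by (PCI address, aer_status value)."""
    counts = Counter()
    for line in lines:
        match = AER_LINE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


sample = [
    "[12494.804904] pcieport 0000:40:03.1: AER: aer_status: 0x00001000, aer_mask: 0x00006000",
    "[12494.804910] pcieport 0000:40:03.1: AER:    [12] Timeout",
    "[12500.436583] pcieport 0000:40:03.1: AER: aer_status: 0x00001000, aer_mask: 0x00006000",
]

for (device, status), n in count_aer_events(sample).items():
    print(device, status, n)
```

Passing an open log file instead of the sample list works the same way, since the function only iterates over lines.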
My server is experiencing this issue as well. It has an AMD 7742, and both the NVIDIA A30 GPU and the Mellanox network card are reporting errors:

aer_layer=Physical Layer, aer_agent=Receiver ID
0000:81:00.0: AER: 259.4775061 nvidia
How can this problem be solved?
My Ubuntu kernel (5.15.0-25-generic) has been crashing lately. I'm not sure what is causing it, but my computer keeps restarting randomly.
There is nothing anybody here can do: as the log itself says, this is a hardware issue, not a kernel problem. It means your device is sending bad data and may be slowly dying. AER can usually be disabled in the BIOS settings, but then you are basically ignoring all such errors, and the result can be dramatic (the error can come from your SSD or any other important PCIe device).