Bug 209331 - AER: Hardware error from APEI Generic Hardware Error with EPYC and DD Max S8 DVB
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
Reported: 2020-09-19 09:49 UTC by Hans-Peter Jansen
Modified: 2023-09-22 04:15 UTC
CC List: 3 users

See Also:
Kernel Version: 5.3.18, 5.7.11, 5.8.7, 5.8.9, 5.8.10, 5.8.11, 5.9.1, 5.10.13, 5.12.2, 5.17.5
Subsystem:
Regression: No
Bisected commit-id:


Attachments
errors, device details and initialization sequence (18.93 KB, text/plain)
2020-09-19 09:49 UTC, Hans-Peter Jansen
Here's a little broader boot log (12.60 KB, text/plain)
2020-09-21 08:00 UTC, Hans-Peter Jansen

Description Hans-Peter Jansen 2020-09-19 09:49:01 UTC
Created attachment 292547
errors, device details and initialization sequence

Hi,

I'm seeing ~2400 reproducible hardware errors per day on an ASUS KNPA-U16 server motherboard with an AMD EPYC 7261, 2x32 GB Kingston KSM26RD4/32MEI, and a Digital Devices Max S8 DVB card.

Here's a typical one:

2020-09-16T16:50:09.985156+02:00 server kernel: [12494.804769] {401}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
2020-09-16T16:50:09.985210+02:00 server kernel: [12494.804773] {401}[Hardware Error]: It has been corrected by h/w and requires no further action
2020-09-16T16:50:09.985216+02:00 server kernel: [12494.804774] {401}[Hardware Error]: event severity: corrected
2020-09-16T16:50:09.985219+02:00 server kernel: [12494.804777] {401}[Hardware Error]:  Error 0, type: corrected
2020-09-16T16:50:09.985222+02:00 server kernel: [12494.804778] {401}[Hardware Error]:  fru_text: PcieError
2020-09-16T16:50:09.985277+02:00 server kernel: [12494.804779] {401}[Hardware Error]:   section_type: PCIe error
2020-09-16T16:50:09.985279+02:00 server kernel: [12494.804781] {401}[Hardware Error]:   port_type: 4, root port
2020-09-16T16:50:09.985281+02:00 server kernel: [12494.804782] {401}[Hardware Error]:   version: 0.2
2020-09-16T16:50:09.985284+02:00 server kernel: [12494.804784] {401}[Hardware Error]:   command: 0x0407, status: 0x0010
2020-09-16T16:50:09.985285+02:00 server kernel: [12494.804785] {401}[Hardware Error]:   device_id: 0000:40:03.1
2020-09-16T16:50:09.985287+02:00 server kernel: [12494.804787] {401}[Hardware Error]:   slot: 16
2020-09-16T16:50:09.985289+02:00 server kernel: [12494.804787] {401}[Hardware Error]:   secondary_bus: 0x41
2020-09-16T16:50:09.985290+02:00 server kernel: [12494.804789] {401}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x1453
2020-09-16T16:50:09.985292+02:00 server kernel: [12494.804790] {401}[Hardware Error]:   class_code: 060400
2020-09-16T16:50:09.985294+02:00 server kernel: [12494.804791] {401}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0012
2020-09-16T16:50:09.985310+02:00 server kernel: [12494.804904] pcieport 0000:40:03.1: AER: aer_status: 0x00001000, aer_mask: 0x00006000
2020-09-16T16:50:09.985312+02:00 server kernel: [12494.804910] pcieport 0000:40:03.1: AER:    [12] Timeout               
2020-09-16T16:50:09.985314+02:00 server kernel: [12494.804914] pcieport 0000:40:03.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
2020-09-16T16:50:15.621082+02:00 server kernel: [12500.436583] pcieport 0000:40:03.1: AER: aer_status: 0x00001000, aer_mask: 0x00006000
2020-09-16T16:50:15.621124+02:00 server kernel: [12500.436588] pcieport 0000:40:03.1: AER:    [12] Timeout               
2020-09-16T16:50:15.621128+02:00 server kernel: [12500.436592] pcieport 0000:40:03.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
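
For readers who want to decode the aer_status and aer_mask values in the log above, here is a minimal sketch. It assumes the standard bit layout of the AER Correctable Error Status register and uses the short names the Linux AER code prints, to the best of my knowledge; only "Timeout" for bit 12 is confirmed by the log itself. The function name decode_correctable is made up for this example.

```python
# Bit positions of the AER Correctable Error Status/Mask registers.
# Names are assumed to match the kernel's short strings; the log above
# confirms only bit 12 ("Timeout").
AER_CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}


def decode_correctable(value):
    """Return the names of all bits set in a correctable-error register value."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            # Unknown or reserved bits are reported by number.
            names.append(AER_CORRECTABLE_BITS.get(bit, f"bit{bit}"))
    return names


print("status:", decode_correctable(0x00001000))
print("mask:  ", decode_correctable(0x00006000))
```

Under these assumptions, the status value has only bit 12 set (the replay timer timeout shown in the log), while the mask covers bits 13 and 14, so the timeout itself is not masked.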

Without the Max S8, no such errors appear. What's really nagging is that neither pci=noaer nor a lot of other experiments with different options stop this behavior.

These, I tried:
pci=nomsi 
pci=noaer
pcie_aspm=off
pci=nomsi,noaer,ioapicreroute
pci=nommconf
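
When testing options like these, it can help to confirm that they actually reached the running kernel. The following is a minimal sketch that parses a kernel command line string; the helper name pci_options and the sample string are made up for illustration. On a live system the string would come from /proc/cmdline.

```python
def pci_options(cmdline):
    """Collect the values of every pci= parameter and the pcie_aspm= setting."""
    result = {"pci": [], "pcie_aspm": None}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "pci" and sep:
            # pci= takes a comma-separated list, e.g. pci=nomsi,noaer
            result["pci"].extend(value.split(","))
        elif key == "pcie_aspm" and sep:
            result["pcie_aspm"] = value
    return result


# Hypothetical sample; on a real machine use open("/proc/cmdline").read()
sample = "BOOT_IMAGE=/boot/vmlinuz root=/dev/sda2 pci=nomsi,noaer,ioapicreroute pcie_aspm=off"
opts = pci_options(sample)
print(opts)
```

This only shows what was passed on the command line, not whether the kernel honored each option.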

The error is reported by a bridge device that the kernel creates only when the Max S8 is plugged in:

40:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
        Flags: fast devsel, NUMA node 2

40:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0, IRQ 39, NUMA node 2
        Bus: primary=40, secondary=41, subordinate=41, sec-latency=0
        I/O behind bridge: None
        Memory behind bridge: e5d00000-e5dfffff [size=1M]
        Prefetchable memory behind bridge: None
        Capabilities: [50] Power Management version 3
        Capabilities: [58] Express Root Port (Slot+), MSI 00
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [c0] Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
        Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [150] Advanced Error Reporting
        Capabilities: [270] #19
        Capabilities: [2a0] Access Control Services
        Capabilities: [370] L1 PM Substates
        Capabilities: [380] Downstream Port Containment
        Capabilities: [3c4] #23
        Kernel driver in use: pcieport

That bridges to:

41:00.0 Multimedia controller: Digital Devices GmbH Max
        Subsystem: Digital Devices GmbH Max S8 4/8
        Flags: bus master, fast devsel, latency 0, IRQ 165, NUMA node 2
        Memory at e5d00000 (64-bit, non-prefetchable) [size=64K]
        Capabilities: [50] Power Management version 3
        Capabilities: [70] MSI: Enable+ Count=2/2 Maskable- 64bit+
        Capabilities: [90] Express Endpoint, MSI 00
        Capabilities: [100] Vendor Specific Information: ID=0000 Rev=0 Len=00c <?>
        Kernel driver in use: ddbridge
        Kernel modules: ddbridge

Here's the initialization sequence of these devices:

Sep 16 13:22:07 server kernel: pci 0000:40:03.0: [1022:1452] type 00 class 0x060000
Sep 16 13:22:07 server kernel: pci 0000:40:03.1: [1022:1453] type 01 class 0x060400
Sep 16 13:22:07 server kernel: pci 0000:40:03.1: PME# supported from D0 D3hot D3cold
Sep 16 13:22:07 server kernel: pci 0000:41:00.0: [dd01:0007] type 00 class 0x048000
Sep 16 13:22:07 server kernel: pci 0000:41:00.0: reg 0x10: [mem 0xe5d00000-0xe5d0ffff 64bit]
Sep 16 13:22:07 server kernel: pci 0000:40:03.1: PCI bridge to [bus 41]
Sep 16 13:22:07 server kernel: pci 0000:40:03.1:   bridge window [mem 0xe5d00000-0xe5dfffff]
Sep 16 13:22:07 server kernel: pci 0000:40:03.1: PCI bridge to [bus 41]
Sep 16 13:22:07 server kernel: pci 0000:40:03.1:   bridge window [mem 0xe5d00000-0xe5dfffff]
Sep 16 13:22:07 server kernel: pci 0000:40:03.0: Adding to iommu group 40
Sep 16 13:22:07 server kernel: pci 0000:40:03.1: Adding to iommu group 41
Sep 16 13:22:07 server kernel: pci 0000:41:00.0: Adding to iommu group 47
Sep 16 13:22:07 server kernel: pcieport 0000:40:03.1: PME: Signaling with IRQ 39
Sep 16 13:22:07 server kernel: pcieport 0000:40:03.1: AER: enabled with IRQ 39
Sep 16 13:22:07 server kernel: pcieport 0000:40:03.1: DPC: enabled with IRQ 39
Sep 16 13:22:07 server kernel: pcieport 0000:40:03.1: DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 6, DL_ActiveErr+
Sep 16 13:22:15 server kernel: ddbridge: Digital Devices PCIE bridge driver 0.9.33-integrated, Copyright (C) 2010-17 Digital Devices GmbH
Sep 16 13:22:15 server kernel: ddbridge 0000:41:00.0: detected Digital Devices MAX S8 4/8
Sep 16 13:22:15 server kernel: ddbridge 0000:41:00.0: HW 0101000f REGMAP 00010002
Sep 16 13:22:15 server kernel: ddbridge 0000:41:00.0: using 2 MSI interrupt(s)
Sep 16 13:22:15 server kernel: ddbridge 0000:41:00.0: Port 0: Link 0, Link Port 0 (TAB 1): DUAL DVB-S2 MAX
Sep 16 13:22:15 server kernel: ddbridge 0000:41:00.0: Port 1: Link 0, Link Port 1 (TAB 2): DUAL DVB-S2 MAX
Sep 16 13:22:15 server kernel: ddbridge 0000:41:00.0: Port 2: Link 0, Link Port 2 (TAB 3): DUAL DVB-S2 MAX
Sep 16 13:22:15 server kernel: ddbridge 0000:41:00.0: Port 3: Link 0, Link Port 3 (TAB 4): DUAL DVB-S2 MAX
Sep 16 13:22:15 server kernel: ddbridge 0000:41:00.0: Read mxl_fw from link 0
Sep 16 13:22:21 server kernel: ddbridge 0000:41:00.0: Set fmode link 0 = 1
Sep 16 13:22:21 server kernel: ddbridge 0000:41:00.0: DVB: registering adapter 0 frontend 0 (MaxLinear MxL5xx DVB-S/S2 tuner-demodulator)...
Sep 16 13:22:21 server kernel: ddbridge 0000:41:00.0: DVB: registering adapter 1 frontend 0 (MaxLinear MxL5xx DVB-S/S2 tuner-demodulator)...
Sep 16 13:22:21 server kernel: ddbridge 0000:41:00.0: DVB: registering adapter 2 frontend 0 (MaxLinear MxL5xx DVB-S/S2 tuner-demodulator)...
Sep 16 13:22:21 server kernel: ddbridge 0000:41:00.0: DVB: registering adapter 3 frontend 0 (MaxLinear MxL5xx DVB-S/S2 tuner-demodulator)...
Sep 16 13:22:21 server kernel: ddbridge 0000:41:00.0: DVB: registering adapter 4 frontend 0 (MaxLinear MxL5xx DVB-S/S2 tuner-demodulator)...
Sep 16 13:22:21 server kernel: ddbridge 0000:41:00.0: DVB: registering adapter 5 frontend 0 (MaxLinear MxL5xx DVB-S/S2 tuner-demodulator)...
Sep 16 13:22:21 server kernel: ddbridge 0000:41:00.0: DVB: registering adapter 6 frontend 0 (MaxLinear MxL5xx DVB-S/S2 tuner-demodulator)...
Sep 16 13:22:21 server kernel: ddbridge 0000:41:00.0: DVB: registering adapter 7 frontend 0 (MaxLinear MxL5xx DVB-S/S2 tuner-demodulator)...

openSUSE 15.2, Kernel 5.8.10
(reproduced with original 15.2 kernel 5.3.18 and up to 5.8.10)

Moving the card to another PCIe slot shifts the PCI addresses (in another slot,
the bridge was at 60:03.1, while the S8 was at 62:00.0), but the problem
persisted.

Here's the related kernel thread (without any responses!):
https://marc.info/?l=linux-kernel&m=159328222416136&w=2

The Max S8 has been updated to the current firmware and has been tested with the manufacturer's drivers as well. It worked in an Intel-based motherboard for a few years without any issues, and is still working well.

The 8 DVB tuners work fine. VDR is serving a few set-top boxes with TV and recordings. No issues.

It just swamps the log for no good reason (from my POV).
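
To put a number on how much the log is swamped, one could count the GHES header lines per day. This is a rough sketch that assumes the ISO-timestamp syslog format shown in the error excerpt above; the function name errors_per_day is made up, and the sample lines are copied from that excerpt. On a real system the lines would be read from wherever the kernel log is stored, which depends on the distribution.

```python
import re
from collections import Counter

# Matches only the first line of each error record, so every event is
# counted once regardless of how many detail lines follow it.
HEADER = re.compile(
    r"^(\d{4}-\d{2}-\d{2})T\S+ \S+ kernel: "
    r".*Hardware error from APEI Generic Hardware Error Source"
)


def errors_per_day(lines):
    """Count GHES error records per calendar day."""
    counts = Counter()
    for line in lines:
        match = HEADER.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "2020-09-16T16:50:09.985156+02:00 server kernel: [12494.804769] {401}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4",
    "2020-09-16T16:50:09.985210+02:00 server kernel: [12494.804773] {401}[Hardware Error]: It has been corrected by h/w and requires no further action",
]
print(dict(errors_per_day(sample)))
```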

Neither ASUS nor Digital Devices seems to be able to explain this behavior.

I've attached the technical details and a couple of such errors for better investigation.
Comment 1 Hans-Peter Jansen 2020-09-21 08:00:22 UTC
Created attachment 292551
Here's a little broader boot log
Comment 2 Hans-Peter Jansen 2021-02-06 17:37:53 UTC
Meanwhile, I tried to get rid of the errors the hard way, but failed.

I tried a couple of combinations of:

pci=nomsi 
pci=noaer
pcie_aspm=off
pci=nomsi,noaer,ioapicreroute
pci=nommconf

Interestingly, *none* of them turned off these errors.
Comment 3 kenli 2023-08-11 09:45:53 UTC
My server is experiencing this issue too: with an AMD 7742, the NVIDIA A30 GPU and the Mellanox network card are reporting errors.
aer_layer=Physical Layer, aer_agent=Receiver ID
0000:81:00.0:
AER:
259.4775061
nvidia
Comment 4 kenli 2023-08-11 09:46:33 UTC
How can I solve this problem?
Comment 5 kenli 2023-08-11 09:47:07 UTC
My Ubuntu kernel (5.15.0-25-generic) has been crashing lately. I'm not sure what is causing it, but my computer keeps restarting randomly.
Comment 6 Alexis G. 2023-09-22 04:15:21 UTC
There is nothing anybody can do here; as the log states, it's a hardware issue, not a kernel problem.
It means your device is sending bad data and may be slowly dying.

AER can usually be disabled in the BIOS settings, but then you are basically ignoring all errors, and the result can be dramatic (the error could come from your SSD or any other important PCIe device).
