Bug 208667

Summary: PCIe AER errors on ASMedia ASM1083/1085 PCIe to PCI bridge with ASPM enabled
Product: Platform Specific/Hardware
Reporter: Robert Hancock (hancockrwd)
Component: x86-64
Assignee: platform_x86_64 (platform_x86_64)
Status: RESOLVED CODE_FIX    
Severity: normal    
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 5.7.9
Subsystem:
Regression: No
Bisected commit-id:
Attachments: dmesg output (with pci=noaer set)
lspci -vvnnxxx output
Patch to disable ASPM on ASM1083/1085
lspci -vvnnxxx output under Windows 10 build 2004
Windows system error log entry

Description Robert Hancock 2020-07-23 00:39:15 UTC
Created attachment 290463 [details]
dmesg output (with pci=noaer set)

With an Asus PRIME H270-PRO motherboard running kernel 5.7.9, PCIe AER errors are periodically spewed into dmesg. This also seems to cause suspend to fail: the system wakes back up right away, I assume because AER errors interrupt the process. 5.6 kernels didn't have this problem.

A sample of the errors:

[   12.909890] pcieport 0000:00:1c.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[   12.909890] pcieport 0000:00:1c.0: AER:   device [8086:a292] error status/mask=00003000/00002000
[   12.909891] pcieport 0000:00:1c.0: AER:    [12] Timeout               
[   12.909896] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
[   12.909899] pcieport 0000:00:1c.0: AER: can't find device of ID00e0
[   12.909900] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
[   12.909902] pcieport 0000:00:1c.0: AER: can't find device of ID00e0
[   12.909903] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
[   12.909906] pcieport 0000:00:1c.0: AER: can't find device of ID00e0
[   12.910012] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
[   12.910015] pcieport 0000:00:1c.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[   12.910015] pcieport 0000:00:1c.0: AER:   device [8086:a292] error status/mask=00001000/00002000
[   12.910016] pcieport 0000:00:1c.0: AER:    [12] Timeout               
[   12.910020] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
[   12.910023] pcieport 0000:00:1c.0: AER: can't find device of ID00e0
[   12.910157] pcieport 0000:00:1c.0: AER: Multiple Corrected error received: 0000:00:1c.0
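For readers unfamiliar with the status/mask pairs in the log above, here is a minimal sketch of how they can be decoded. It assumes the standard PCIe AER Correctable Error Status bit layout, and that only unmasked bits are reported by name; the function and table names are made up for illustration and are not kernel code.

```python
# Correctable Error Status bit positions, as laid out in the PCIe AER
# capability. The dictionary and function names are illustrative only.
AER_COR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_correctable(status: int, mask: int) -> list[str]:
    """Return the names of status bits that are set and not masked."""
    reported = status & ~mask
    return [
        name
        for bit, name in sorted(AER_COR_BITS.items())
        if reported & (1 << bit)
    ]


# The two status/mask pairs from the dmesg sample.
print(decode_correctable(0x00003000, 0x00002000))
print(decode_correctable(0x00001000, 0x00002000))
```

In both samples the only unmasked bit is bit 12, which matches the "[12] Timeout" line. Bit 13 (Advisory Non-Fatal Error) is set in the first sample but is hidden by the mask value 0x00002000.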

0000:00:1c.0 is a PCI Express root port:

00:1c.0 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #3 [8086:a292] (rev f0)

which is connected to this device:

02:00.0 PCI bridge [0604]: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge [1b21:1080] (rev 04)
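The "ID00e0" in the "can't find device" lines can be read as a 16-bit bus/device/function identifier. A small sketch of the decoding, assuming the usual 8-bit bus, 5-bit device, 3-bit function packing (the helper name is hypothetical):

```python
def decode_requester_id(rid: int) -> str:
    """Split a 16-bit PCI requester ID into bus:device.function."""
    bus = (rid >> 8) & 0xFF   # upper 8 bits
    dev = (rid >> 3) & 0x1F   # next 5 bits
    fn = rid & 0x7            # lowest 3 bits
    return f"{bus:02x}:{dev:02x}.{fn:x}"


print(decode_requester_id(0x00E0))  # 00:1c.0
```

Under that assumption, 0x00e0 decodes to 00:1c.0, i.e. the root port itself rather than the ASMedia bridge at 02:00.0.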

Setting "pci=noaer" on the kernel command line avoids the errors and suspend/resume problems, but presumably is just masking the real issue.

From some more experimentation, it appears that disabling PCIe ASPM with setpci on both the ASMedia PCIe-PCI bridge and the PCIe root port it is connected to silences the AER errors and allows suspend/resume to work again:

setpci -s 00:1c.0 0x50.B=0x00
setpci -s 02:00.0 0x90.B=0x00
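The two commands above write zero to an entire byte. As a sketch of what that does to the ASPM setting, the snippet below assumes that 0x50 and 0x90 are the low byte of each device's PCIe Link Control register (capability base + 0x10), which should be checked against the attached lspci output, and that bits 1:0 of that register are the ASPM Control field. The example register value is made up.

```python
ASPM_CTL_MASK = 0x03  # Link Control bits 1:0: ASPM Control
ASPM_STATES = {
    0b00: "disabled",
    0b01: "L0s",
    0b10: "L1",
    0b11: "L0s and L1",
}


def aspm_state(link_control: int) -> str:
    """Name the ASPM setting encoded in a Link Control value."""
    return ASPM_STATES[link_control & ASPM_CTL_MASK]


def clear_aspm(link_control: int) -> int:
    """Clear only the ASPM Control bits, keeping the rest of the low byte."""
    return link_control & ~ASPM_CTL_MASK & 0xFF


# 0x43 is a made-up example: bit 6 set, with L0s and L1 both enabled.
lc = 0x43
print(aspm_state(lc))              # L0s and L1
print(hex(clear_aspm(lc)))         # 0x40
print(aspm_state(clear_aspm(lc)))  # disabled
```

Writing 0x00 to the whole byte also clears the other bits in it. If that matters, setpci accepts a value:mask form (for example 0x00:0x03) that, per its man page, changes only the masked bits.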

It appears the behavior changed as a result of this patch (which went
into the stable tree for 5.7.6 and so affects 5.7 kernels as well):

commit 66ff14e59e8a30690755b08bc3042359703fb07a
Author: Kai-Heng Feng <kai.heng.feng@canonical.com>
Date:   Wed May 6 01:34:21 2020 +0800

    PCI/ASPM: Allow ASPM on links to PCIe-to-PCI/PCI-X Bridges

    7d715a6c1ae5 ("PCI: add PCI Express ASPM support") added the ability for
    Linux to enable ASPM, but for some undocumented reason, it didn't enable
    ASPM on links where the downstream component is a PCIe-to-PCI/PCI-X Bridge.

    Remove this exclusion so we can enable ASPM on these links.

    The Dell OptiPlex 7080 mentioned in the bugzilla has a TI XIO2001
    PCIe-to-PCI Bridge.  Enabling ASPM on the link leading to it allows the
    Intel SoC to enter deeper Package C-states, which is a significant power
    savings.

    [bhelgaas: commit log]
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=207571
    Link: https://lore.kernel.org/r/20200505173423.26968-1-kai.heng.feng@canonical.com
    Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
    Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>

Comment 1 Robert Hancock 2020-07-23 00:39:48 UTC
Created attachment 290465 [details]
lspci -vvnnxxx output

Comment 2 Robert Hancock 2020-07-23 00:42:23 UTC
Created attachment 290467 [details]
Patch to disable ASPM on ASM1083/1085

Here is a patch I wrote up which seems to fix the issue by disabling ASPM on these devices.

I came across this page on ASMedia's site: https://www.asmedia.com.tw/eng/e_show_products.php?cate_index=169&item=114

which indicates this device has "No PCIe ASPM support". It's not clear why this problem isn't occurring on Windows, however: either Windows is not enabling ASPM, or ASPM somehow doesn't cause issues with the PCIe link there, or it is causing issues and Windows just doesn't notify the user in any way.

Comment 3 Robert Hancock 2020-07-23 01:37:09 UTC
Created attachment 290471 [details]
lspci -vvnnxxx output under Windows 10 build 2004

According to this lspci output, Windows 10 (build 2004) appears to have ASPM L0s enabled for this device. In the Power Options window, the Balanced plan is selected, and under PCI Express, Link State Power Management, the setting is "Moderate power savings", which the tooltip says enables L0s.

Comment 4 Robert Hancock 2020-07-23 01:38:31 UTC
Created attachment 290473 [details]
Windows system error log entry

However, Windows seems to have the exact same issue with AER errors occurring on this device. There are WHEA correctable hardware error event entries being logged for the PCI Express Root Port in the system event log (see screenshot).

Comment 5 Robert Hancock 2020-07-31 01:24:15 UTC
The commit below has now been merged into mainline and should be in the next release after 5.8-rc7.

commit b361663c5a40c8bc758b7f7f2239f7a192180e7c
Author: Robert Hancock <hancockrwd@gmail.com>
Date:   Tue Jul 21 20:18:03 2020 -0600

    PCI/ASPM: Disable ASPM on ASMedia ASM1083/1085 PCIe-to-PCI bridge
    
    Recently ASPM handling was changed to allow ASPM on PCIe-to-PCI/PCI-X
    bridges.  Unfortunately the ASMedia ASM1083/1085 PCIe to PCI bridge device
    doesn't seem to function properly with ASPM enabled.  On an Asus PRIME
    H270-PRO motherboard, it causes errors like these:
    
      pcieport 0000:00:1c.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
      pcieport 0000:00:1c.0: AER:   device [8086:a292] error status/mask=00003000/00002000
      pcieport 0000:00:1c.0: AER:    [12] Timeout
      pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
      pcieport 0000:00:1c.0: AER: can't find device of ID00e0
    
    In addition to flooding the kernel log, this also causes the machine to
    wake up immediately after suspend is initiated.
    
    The device advertises ASPM L0s and L1 support in the Link Capabilities
    register, but the ASMedia web page for ASM1083 [1] claims "No PCIe ASPM
    support".
    
    Windows 10 (build 2004) enables L0s, but it also logs correctable PCIe
    errors.
    
    Add a quirk to disable ASPM for this device.
    
    [1] https://www.asmedia.com.tw/eng/e_show_products.php?cate_index=169&item=114
    
    [bhelgaas: commit log]
    Fixes: 66ff14e59e8a ("PCI/ASPM: Allow ASPM on links to PCIe-to-PCI/PCI-X Bridges")
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208667
    Link: https://lore.kernel.org/r/20200722021803.17958-1-hancockrwd@gmail.com
    Signed-off-by: Robert Hancock <hancockrwd@gmail.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
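
The idea behind the quirk described in the commit message can be modelled roughly as follows. This is only an illustrative model, not the kernel code (the actual change is in the attached patch and the commit above): the flag values, table, and function names are invented for the sketch, and the second device ID is a hypothetical placeholder. Only the 1b21:1080 ID comes from the lspci line in this report.

```python
L0S, L1 = 0x1, 0x2  # flag values chosen for this sketch only

# (vendor, device) -> ASPM states to keep disabled.
# 1b21:1080 is the ASM1083/1085 ID from the lspci output in this report.
ASPM_QUIRKS = {(0x1B21, 0x1080): L0S | L1}


def allowed_aspm(vendor: int, device: int, advertised: int) -> int:
    """ASPM states still permitted after applying any matching quirk."""
    return advertised & ~ASPM_QUIRKS.get((vendor, device), 0)


def names(states: int) -> str:
    """Render a state bitmask as text."""
    parts = [n for flag, n in ((L0S, "L0s"), (L1, "L1")) if states & flag]
    return "+".join(parts) or "none"


# The ASM1083/1085 advertises L0s and L1, but the quirk removes both.
print(names(allowed_aspm(0x1B21, 0x1080, L0S | L1)))  # none
# A hypothetical bridge with no quirk keeps what it advertises.
print(names(allowed_aspm(0x1234, 0x5678, L0S | L1)))  # L0s+L1
```

Keying on the vendor/device ID keeps ASPM available for other PCIe-to-PCI bridges, such as the one that motivated commit 66ff14e59e8a, while excluding only the device that misbehaves.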