Created attachment 290463 [details]
dmesg output (with pci=noaer set)

On an Asus PRIME H270-PRO motherboard running kernel 5.7.9, PCIe AER errors are periodically spewed into dmesg. This also seems to cause suspend to fail: the system wakes back up again right away, presumably because AER errors interrupt the process. 5.6 kernels didn't have this problem.

A sample of the errors:

[ 12.909890] pcieport 0000:00:1c.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 12.909890] pcieport 0000:00:1c.0: AER: device [8086:a292] error status/mask=00003000/00002000
[ 12.909891] pcieport 0000:00:1c.0: AER: [12] Timeout
[ 12.909896] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
[ 12.909899] pcieport 0000:00:1c.0: AER: can't find device of ID00e0
[ 12.909900] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
[ 12.909902] pcieport 0000:00:1c.0: AER: can't find device of ID00e0
[ 12.909903] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
[ 12.909906] pcieport 0000:00:1c.0: AER: can't find device of ID00e0
[ 12.910012] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
[ 12.910015] pcieport 0000:00:1c.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 12.910015] pcieport 0000:00:1c.0: AER: device [8086:a292] error status/mask=00001000/00002000
[ 12.910016] pcieport 0000:00:1c.0: AER: [12] Timeout
[ 12.910020] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
[ 12.910023] pcieport 0000:00:1c.0: AER: can't find device of ID00e0
[ 12.910157] pcieport 0000:00:1c.0: AER: Multiple Corrected error received: 0000:00:1c.0

0000:00:1c.0 is a PCI Express root port:

00:1c.0 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #3 [8086:a292] (rev f0)

which is connected to this device:

02:00.0 PCI bridge [0604]: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge [1b21:1080] (rev 04)

Setting "pci=noaer" on the kernel command line avoids the errors and the suspend/resume problems, but presumably it just masks the real issue.

After some more experimentation, disabling PCIe ASPM with setpci on both the ASMedia PCIe-PCI bridge and the PCIe root port it is connected to silences the AER errors and allows suspend/resume to work again:

setpci -s 00:1c.0 0x50.B=0x00
setpci -s 02:00.0 0x90.B=0x00

It appears the behavior changed as a result of this patch (which went into the stable tree for 5.7.6 and so affects 5.7 kernels as well):

commit 66ff14e59e8a30690755b08bc3042359703fb07a
Author: Kai-Heng Feng <kai.heng.feng@canonical.com>
Date: Wed May 6 01:34:21 2020 +0800

    PCI/ASPM: Allow ASPM on links to PCIe-to-PCI/PCI-X Bridges

    7d715a6c1ae5 ("PCI: add PCI Express ASPM support") added the ability for
    Linux to enable ASPM, but for some undocumented reason, it didn't enable
    ASPM on links where the downstream component is a PCIe-to-PCI/PCI-X Bridge.

    Remove this exclusion so we can enable ASPM on these links.

    The Dell OptiPlex 7080 mentioned in the bugzilla has a TI XIO2001
    PCIe-to-PCI Bridge. Enabling ASPM on the link leading to it allows the
    Intel SoC to enter deeper Package C-states, which is a significant power
    savings.

    [bhelgaas: commit log]
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=207571
    Link: https://lore.kernel.org/r/20200505173423.26968-1-kai.heng.feng@canonical.com
    Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
    Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
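
For anyone trying to read the log sample above, here is a minimal Python sketch (not part of the original report) that decodes the two numeric fields in it: the "error status/mask" pair and the "ID00e0" value. It assumes the standard bit layout of the AER Corrected Error Status register and the usual bus/device/function packing of a PCIe requester ID; the function names are made up for illustration.

```python
# Bit positions of the AER Corrected Error Status / Mask registers.
CORRECTED_ERROR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_corrected(status_mask):
    """Decode a 'status/mask' string as printed in the AER log line.

    Returns a list of (bit, name, masked) tuples for every status bit set.
    """
    status_hex, mask_hex = status_mask.split("/")
    status = int(status_hex, 16)
    mask = int(mask_hex, 16)
    decoded = []
    for bit in sorted(CORRECTED_ERROR_BITS):
        if status & (1 << bit):
            decoded.append((bit, CORRECTED_ERROR_BITS[bit], bool(mask & (1 << bit))))
    return decoded


def decode_requester_id(requester_id):
    """Split a 16-bit requester ID into bus:device.function notation."""
    bus = (requester_id >> 8) & 0xFF
    device = (requester_id >> 3) & 0x1F
    function = requester_id & 0x07
    return f"{bus:02x}:{device:02x}.{function}"
```

With the values from the log, `decode_corrected("00003000/00002000")` reports bit 12 (Replay Timer Timeout) as unmasked and bit 13 (Advisory Non-Fatal Error) as masked, which would be consistent with the log printing only "[12] Timeout". `decode_requester_id(0x00e0)` gives `00:1c.0`, i.e. the root port itself.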
Created attachment 290465 [details] lspci -vvnnxxx output
Created attachment 290467 [details]
Patch to disable ASPM on ASM1083/1085

Here is a patch I wrote up which seems to fix the issue by disabling ASPM on these devices.

I came across this page on ASMedia's site:

https://www.asmedia.com.tw/eng/e_show_products.php?cate_index=169&item=114

which indicates this device has "No PCIe ASPM support".

It's not clear why this problem isn't occurring on Windows, however: either Windows is not enabling ASPM, ASPM somehow doesn't cause issues with the PCIe link there, or it is causing issues and Windows just doesn't notify the user in any way.
Created attachment 290471 [details]
lspci -vvnnxxx output under Windows 10 build 2004

It appears that Windows 10 (build 2004) has ASPM L0s enabled for this device according to this lspci output. In the Power Options window, the Balanced plan is selected, and under PCI Express, Link State Power Management, it's set to "Moderate power savings", which the tooltip says enables L0s.
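
As a rough aid for reading the ASPM state out of a config-space dump, here is a small Python sketch (an illustration, not taken from the attachments). It assumes the standard PCIe Link Control layout, where bits 1:0 of the register are the ASPM Control field, and it assumes that offsets 0x50 and 0x90 used in the setpci commands earlier in this report are the Link Control registers of the root port and the bridge respectively. The sample byte values in the usage note are hypothetical.

```python
# ASPM Control field: bits 1:0 of the PCIe Link Control register.
ASPM_CONTROL_MASK = 0x03

ASPM_STATES = {
    0b00: "disabled",
    0b01: "L0s",
    0b10: "L1",
    0b11: "L0s and L1",
}


def aspm_control(lnkctl_low_byte):
    """Return the ASPM setting encoded in the low byte of Link Control."""
    return ASPM_STATES[lnkctl_low_byte & ASPM_CONTROL_MASK]


def clear_aspm(lnkctl_low_byte):
    """Clear only the ASPM Control bits, leaving the other bits untouched."""
    return lnkctl_low_byte & ~ASPM_CONTROL_MASK & 0xFF
```

For a hypothetical low byte of 0x41, `aspm_control(0x41)` returns "L0s", and `clear_aspm(0x43)` returns 0x40. Note that writing 0x00 to the whole byte, as the setpci commands above do, also clears the remaining bits in that byte (for example Common Clock Configuration at bit 6), so masking only bits 1:0 is the more conservative experiment.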
Created attachment 290473 [details]
Windows system error log entry

However, Windows seems to have the exact same issue with AER errors occurring on this device. There are WHEA correctable hardware error event entries being logged for the PCI Express Root Port in the system event log (see screenshot).
The commit below has now been merged into mainline and should be in the next release after 5.8-rc7.

commit b361663c5a40c8bc758b7f7f2239f7a192180e7c
Author: Robert Hancock <hancockrwd@gmail.com>
Date: Tue Jul 21 20:18:03 2020 -0600

    PCI/ASPM: Disable ASPM on ASMedia ASM1083/1085 PCIe-to-PCI bridge

    Recently ASPM handling was changed to allow ASPM on PCIe-to-PCI/PCI-X
    bridges. Unfortunately the ASMedia ASM1083/1085 PCIe to PCI bridge device
    doesn't seem to function properly with ASPM enabled. On an Asus PRIME
    H270-PRO motherboard, it causes errors like these:

      pcieport 0000:00:1c.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
      pcieport 0000:00:1c.0: AER: device [8086:a292] error status/mask=00003000/00002000
      pcieport 0000:00:1c.0: AER: [12] Timeout
      pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
      pcieport 0000:00:1c.0: AER: can't find device of ID00e0

    In addition to flooding the kernel log, this also causes the machine to
    wake up immediately after suspend is initiated.

    The device advertises ASPM L0s and L1 support in the Link Capabilities
    register, but the ASMedia web page for ASM1083 [1] claims "No PCIe ASPM
    support".

    Windows 10 (build 2004) enables L0s, but it also logs correctable PCIe
    errors.

    Add a quirk to disable ASPM for this device.

    [1] https://www.asmedia.com.tw/eng/e_show_products.php?cate_index=169&item=114

    [bhelgaas: commit log]
    Fixes: 66ff14e59e8a ("PCI/ASPM: Allow ASPM on links to PCIe-to-PCI/PCI-X Bridges")
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208667
    Link: https://lore.kernel.org/r/20200722021803.17958-1-hancockrwd@gmail.com
    Signed-off-by: Robert Hancock <hancockrwd@gmail.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
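
To check whether a given machine has the bridge this quirk targets, something like the following Python sketch could be used. It is not part of the patch; it simply assumes the usual Linux sysfs layout, where each entry under /sys/bus/pci/devices has "vendor" and "device" files containing hex IDs, and looks for the [1b21:1080] ID shown in the lspci output earlier in this report. The function name is hypothetical.

```python
from pathlib import Path


def find_pci_devices(vendor, device, sysfs_root="/sys/bus/pci/devices"):
    """Return the PCI addresses under sysfs_root matching vendor:device."""
    matches = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return matches
    for entry in sorted(root.iterdir()):
        try:
            # sysfs stores the IDs as text such as "0x1b21\n".
            found_vendor = int((entry / "vendor").read_text().strip(), 16)
            found_device = int((entry / "device").read_text().strip(), 16)
        except (OSError, ValueError):
            continue  # Skip entries that cannot be read or parsed.
        if (found_vendor, found_device) == (vendor, device):
            matches.append(entry.name)
    return matches
```

Calling `find_pci_devices(0x1b21, 0x1080)` on an affected system should list the address of the ASMedia bridge (02:00.0 on the board in this report); an empty list suggests the quirk is not relevant to that machine.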