Bug 217446 - PCIe AER error after S3 when enabled Intel IOMMU
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: IOMMU
Hardware: Intel Linux
Importance: P3 blocking
Assignee: drivers_iommu
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-05-15 11:26 UTC by Pengyu Ma
Modified: 2023-05-29 13:45 UTC

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments
iommu-pcie-aer-error.log (172.25 KB, text/plain)
2023-05-15 11:26 UTC, Pengyu Ma
lspci (94.81 KB, text/plain)
2023-05-15 11:27 UTC, Pengyu Ma
6.0-pcie-ptm-D3-suspend.dmesg (97.87 KB, text/plain)
2023-05-15 17:26 UTC, Pengyu Ma

Description Pengyu Ma 2023-05-15 11:26:52 UTC
Created attachment 304263 [details]
iommu-pcie-aer-error.log

CPU: i9-13900
IOMMU enabled.
Legacy S3.

Three PCIe buses report AER errors after resume:

1. Intel I350 [8086:1521] Ethernet card connected to 00:06.0
[   58.185251] pcieport 0000:00:06.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:06.0
[   58.185279] pcieport 0000:00:06.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[   58.185287] pcieport 0000:00:06.0:   device [8086:a74d] error status/mask=00200000/00010000
[   58.185296] pcieport 0000:00:06.0:    [21] ACSViol                (First)

2. Intel Thunderbolt [8086:7ab4]
[   58.187838] pcieport 0000:00:1d.0: AER: Multiple Corrected error received: 0000:00:1d.0
[   58.187899] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
[   58.187911] pcieport 0000:00:1d.0:   device [8086:7ab4] error status/mask=00008000/00002000
[   58.187923] pcieport 0000:00:1d.0:    [15] HeaderOF
[   58.187944] pcieport 0000:00:1d.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:1d.0
[   58.188003] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[   58.188014] pcieport 0000:00:1d.0:   device [8086:7ab4] error status/mask=00100000/00004000
[   58.188024] pcieport 0000:00:1d.0:    [20] UnsupReq               (First)
[   58.188032] pcieport 0000:00:1d.0: AER:   TLP Header: 34000000 3f000052 00000000 00000000

3. SanDisk Corp WD PC SN810
[   58.310272] pcieport 0000:00:1a.0: AER: Corrected error received: 0000:03:00.0
[   58.310294] nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[   58.310299] nvme 0000:03:00.0:   device [15b7:5011] error status/mask=00000001/0000e000
[   58.310304] nvme 0000:03:00.0:    [ 0] RxErr

With the IOMMU disabled, the issue is gone.

AER error handling and recovery can race with the resume from S3, which
caused igb to hang. The root cause is the AER error.
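
For readers cross-checking the "error status/mask" values in the log excerpts above, here is a minimal sketch that maps each set status bit to a name. The bit names are copied from the log lines themselves; any other bit is reported as "unknown". The function name decode_aer and the two tables are made up for this illustration and are not kernel interfaces.

```python
# Bit names exactly as printed in the log excerpts above.
# Bits that do not appear in those excerpts are deliberately left unnamed.
UNCORRECTABLE_BITS = {20: "UnsupReq", 21: "ACSViol"}
CORRECTABLE_BITS = {0: "RxErr", 15: "HeaderOF"}


def decode_aer(status_mask, names):
    """Decode a 'status/mask' pair such as '00200000/00010000'.

    Returns one string per set status bit, flagged as masked when the
    same bit is also set in the mask register.
    """
    status_hex, mask_hex = status_mask.split("/")
    status = int(status_hex, 16)
    mask = int(mask_hex, 16)
    decoded = []
    for bit in range(32):
        if (status >> bit) & 1:
            name = names.get(bit, "unknown")
            suffix = " (masked)" if (mask >> bit) & 1 else ""
            decoded.append("[{}] {}{}".format(bit, name, suffix))
    return decoded


if __name__ == "__main__":
    print(decode_aer("00200000/00010000", UNCORRECTABLE_BITS))
    print(decode_aer("00008000/00002000", CORRECTABLE_BITS))
```

None of the four status bits reported above is set in its corresponding mask value, so each error was reported unmasked.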
Comment 1 Pengyu Ma 2023-05-15 11:27:12 UTC
Created attachment 304264 [details]
lspci
Comment 2 Bagas Sanjaya 2023-05-15 13:03:40 UTC
(In reply to Pengyu Ma from comment #0)

On which kernel version did this issue occur?
Comment 3 Pengyu Ma 2023-05-15 13:14:55 UTC
Hi Sanjaya,

It's 6.4.0-rc1.
Comment 4 Bagas Sanjaya 2023-05-15 13:37:33 UTC
And what is the last known version that doesn't exhibit this issue?
Comment 5 Pengyu Ma 2023-05-15 17:26:49 UTC
Created attachment 304270 [details]
6.0-pcie-ptm-D3-suspend.dmesg

6.0 shows a different error after resume:

[   24.582897] pcieport 0000:06:00.0: Unable to change power state from D3cold to D0, device inaccessible
[   24.587098] pcieport 0000:07:00.0: Unable to change power state from D3cold to D0, device inaccessible
[   24.587135] pcieport 0000:07:03.0: Unable to change power state from D3cold to D0, device inaccessible
[   24.587153] pcieport 0000:07:01.0: Unable to change power state from D3cold to D0, device inaccessible
[   24.587173] pcieport 0000:07:02.0: Unable to change power state from D3cold to D0, device inaccessible
[   25.838839] thunderbolt 0000:08:00.0: Unable to change power state from D3cold to D0, device inaccessible
[   25.838902] xhci_hcd 0000:3c:00.0: Unable to change power state from D3cold to D0, device inaccessible
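
To pull the affected devices out of a full dmesg attachment, something like the following sketch could be used. It assumes the message format shown in the lines above (timestamp, driver name, PCI address, then the "Unable to change power state" text); the helper name inaccessible_devices is hypothetical.

```python
import re

# Matches lines of the form shown above, e.g.
# "[   24.582897] pcieport 0000:06:00.0: Unable to change power state from D3cold to D0, ..."
D3COLD_RE = re.compile(
    r"\] (?P<driver>\S+) "
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"Unable to change power state from D3cold to D0"
)


def inaccessible_devices(dmesg_text):
    """Return (driver, PCI address) pairs for devices that failed to leave D3cold."""
    return [(m.group("driver"), m.group("bdf")) for m in D3COLD_RE.finditer(dmesg_text)]


if __name__ == "__main__":
    sample = (
        "[   24.582897] pcieport 0000:06:00.0: Unable to change power state "
        "from D3cold to D0, device inaccessible\n"
        "[   25.838839] thunderbolt 0000:08:00.0: Unable to change power state "
        "from D3cold to D0, device inaccessible\n"
    )
    for driver, bdf in inaccessible_devices(sample):
        print(driver, bdf)
```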
Comment 6 Pengyu Ma 2023-05-15 17:27:40 UTC
After bisecting, the following commit changed the behavior:

commit c01163dbd1b8aa016c163ff4bf3a2e90311504f1 (HEAD, refs/bisect/bad)
Author: Bjorn Helgaas <bhelgaas@google.com>
Date:   Fri Sep 9 15:25:05 2022 -0500

    PCI/PM: Always disable PTM for all devices during suspend
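
As a cross-check against this PTM-related commit, the TLP header logged with the UnsupReq error on 00:1d.0 in the description (34000000 3f000052 00000000 00000000) can be decoded by hand. The sketch below assumes the standard PCIe message TLP header layout (Fmt and Type in the top byte of the first DWORD; Requester ID, Tag and Message Code in the second) and assumes that message code 0x52 is the PTM Request code defined in the PCIe Base Specification. The function name decode_msg_header is made up for this illustration.

```python
# Assumption: message code 0x52 is "PTM Request" per the PCIe Base Specification.
# Other message codes are intentionally not listed here.
MESSAGE_CODES = {0x52: "PTM Request"}


def decode_msg_header(header):
    """Decode the first two DWORDs of a logged TLP header string."""
    words = header.split()
    dw0 = int(words[0], 16)
    dw1 = int(words[1], 16)

    fmt = (dw0 >> 29) & 0x7        # 0b001 = 4DW header, no data
    tlp_type = (dw0 >> 24) & 0x1F  # 0b10rrr = message, rrr = routing
    req_id = (dw1 >> 16) & 0xFFFF  # bus[15:8], device[7:3], function[2:0]
    bus = req_id >> 8
    dev = (req_id >> 3) & 0x1F
    fn = req_id & 0x7
    code = dw1 & 0xFF

    return {
        "fmt": fmt,
        "is_message": (tlp_type & 0x18) == 0x10,
        "routing": tlp_type & 0x7,
        "requester": "{:02x}:{:02x}.{}".format(bus, dev, fn),
        "message_code": code,
        "message_name": MESSAGE_CODES.get(code, "unknown"),
    }


if __name__ == "__main__":
    print(decode_msg_header("34000000 3f000052 00000000 00000000"))
```

If that reading is right, the Unsupported Request was raised for a PTM Request message from requester 3f:00.0, which would be consistent with a PTM-related change altering the behavior.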
Comment 7 Bagas Sanjaya 2023-05-16 09:40:32 UTC
(In reply to Pengyu Ma from comment #6)
> After bisect, the following commit changed the behavior:
> 
> commit c01163dbd1b8aa016c163ff4bf3a2e90311504f1 (HEAD, refs/bisect/bad)
> Author: Bjorn Helgaas <bhelgaas@google.com>
> Date:   Fri Sep 9 15:25:05 2022 -0500
> 
>     PCI/PM: Always disable PTM for all devices during suspend

There is a fix at [1]. Can you apply it on 6.4-rc1 and test it to see if it solves your regression?

[1]: https://lore.kernel.org/all/20221226153048.1208359-1-kai.heng.feng@canonical.com/
Comment 8 Pengyu Ma 2023-05-16 10:29:11 UTC
@Sanjaya,

I applied the patch on 6.4-rc1, but it doesn't help: the three PCIe devices (TBT, NVMe, Ethernet) still report the same AER errors.
Comment 9 Pengyu Ma 2023-05-25 03:50:34 UTC
@Mika,

Do you have any suggestions?
Comment 10 Mika Westerberg 2023-05-26 13:22:06 UTC
There is something wrong with the Maple Ridge add-in card after S3, because it looks like the PCIe link does not come up. This happens in both of your logs, so it is probably not a regression. Is there a BIOS upgrade for your system? If yes, I would try that first.
Comment 11 Pengyu Ma 2023-05-26 16:35:16 UTC
@Mika,

I have already discussed this with the BIOS team, but they have no clue either.
Can we get any help from the Intel BIOS team?

Thanks.
Comment 13 Mika Westerberg 2023-05-29 13:39:46 UTC
Which BIOS team are you talking about? The OEM's? In that case they should have contacts in the Intel BIOS team.
Comment 14 Pengyu Ma 2023-05-29 13:45:37 UTC
@Mika,

Thanks, the OEM is working on it.
