Bug 217446

Summary:	PCIe AER error after S3 when Intel IOMMU is enabled
Product:	Drivers	Reporter:	Pengyu Ma (mapengyu)
Component:	IOMMU	Assignee:	drivers_iommu
Status:	NEW ---
Severity:	blocking	CC:	bagasdotme, mika.westerberg
Priority:	P3
Hardware:	Intel
OS:	Linux
Kernel Version:		Subsystem:
Regression:	No	Bisected commit-id:
Attachments:	iommu-pcie-aer-error.log lspci 6.0-pcie-ptm-D3-suspend.dmesg

Description Pengyu Ma 2023-05-15 11:26:52 UTC

Created attachment 304263 [details]
iommu-pcie-aer-error.log

CPU: i9-13900
IOMMU enabled.
Legacy S3.

Three PCIe devices report AER errors:

1. Intel I350 [8086:1521] Ethernet card connected to 00:06.0
[   58.185251] pcieport 0000:00:06.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:06.0
[   58.185279] pcieport 0000:00:06.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[   58.185287] pcieport 0000:00:06.0:   device [8086:a74d] error status/mask=00200000/00010000
[   58.185296] pcieport 0000:00:06.0:    [21] ACSViol                (First)

2. Intel Thunderbolt [8086:7ab4]
[   58.187838] pcieport 0000:00:1d.0: AER: Multiple Corrected error received: 0000:00:1d.0
[   58.187899] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
[   58.187911] pcieport 0000:00:1d.0:   device [8086:7ab4] error status/mask=00008000/00002000
[   58.187923] pcieport 0000:00:1d.0:    [15] HeaderOF
[   58.187944] pcieport 0000:00:1d.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:1d.0
[   58.188003] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[   58.188014] pcieport 0000:00:1d.0:   device [8086:7ab4] error status/mask=00100000/00004000
[   58.188024] pcieport 0000:00:1d.0:    [20] UnsupReq               (First)
[   58.188032] pcieport 0000:00:1d.0: AER:   TLP Header: 34000000 3f000052 00000000 00000000

3. Sandisk Corp WD PC SN810
[   58.310272] pcieport 0000:00:1a.0: AER: Corrected error received: 0000:03:00.0
[   58.310294] nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[   58.310299] nvme 0000:03:00.0:   device [15b7:5011] error status/mask=00000001/0000e000
[   58.310304] nvme 0000:03:00.0:    [ 0] RxErr
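
The status values in the three log excerpts above can be cross-checked against the bit names the kernel prints. The following is a minimal sketch, not kernel code: the function name `decode_status` and the two tables are illustrative, and the tables are deliberately partial (only bit names I am confident about are listed, anything else is shown as a bare bit number).

```python
# Partial bit-name tables using the short names seen in AER log lines.
# Assumption: these tables are incomplete; unknown bits print as "bit N".
CORRECTABLE = {
    0: "RxErr", 6: "BadTLP", 7: "BadDLLP", 8: "Rollover",
    12: "Timeout", 13: "NonFatalErr", 15: "HeaderOF",
}
UNCORRECTABLE = {
    4: "DLP", 12: "TLP", 13: "FCP", 14: "CmpltTO", 15: "CmpltAbrt",
    16: "UnxCmplt", 17: "RxOF", 18: "MalfTLP", 19: "ECRC",
    20: "UnsupReq", 21: "ACSViol",
}


def decode_status(status_hex, corrected):
    """Return the names of all bits set in an AER status (or mask) value."""
    status = int(status_hex, 16)
    names = CORRECTABLE if corrected else UNCORRECTABLE
    return [names.get(bit, f"bit {bit}") for bit in range(32) if status & (1 << bit)]


if __name__ == "__main__":
    # (device, status from the log, whether the log line said severity=Corrected)
    samples = [
        ("0000:00:06.0", "00200000", False),
        ("0000:00:1d.0", "00008000", True),
        ("0000:00:1d.0", "00100000", False),
        ("0000:03:00.0", "00000001", True),
    ]
    for dev, status, corrected in samples:
        print(dev, status, decode_status(status, corrected))
```

The same function can be applied to the mask half of `error status/mask` to see which errors are masked on each port.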

With the IOMMU disabled, the issue is gone.

AER error handling and recovery can race with resume from S3.
This caused the igb driver to hang. The root cause is the AER error.
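
The TLP header logged with the UnsupReq error on 0000:00:1d.0 can be unpacked to see what kind of request was rejected. This is a rough sketch that assumes the standard PCIe layout for a message TLP (Fmt and Type in the top byte of the first dword; Requester ID, Tag and Message Code in the second dword). The function name `decode_msg_tlp` is made up for illustration, and the second-dword fields are only meaningful when the Type field indicates a message.

```python
def decode_msg_tlp(header):
    """Decode the first two dwords of a logged TLP header string.

    Assumes the header describes a message TLP; for other TLP types
    the second dword carries different fields.
    """
    dw0, dw1 = (int(word, 16) for word in header.split()[:2])
    tlp_type = (dw0 >> 24) & 0x1F          # Type[4:0]
    req_id = (dw1 >> 16) & 0xFFFF          # bus[15:8] dev[7:3] fn[2:0]
    return {
        "fmt": (dw0 >> 29) & 0x7,          # Fmt[2:0]
        "type": tlp_type,
        "is_message": (tlp_type >> 3) == 0b10,  # message types are 10rrr
        "routing": tlp_type & 0x7,         # rrr routing sub-field
        "requester": f"{req_id >> 8:02x}:{(req_id >> 3) & 0x1f:02x}.{req_id & 0x7}",
        "tag": (dw1 >> 8) & 0xFF,
        "message_code": dw1 & 0xFF,
    }


if __name__ == "__main__":
    info = decode_msg_tlp("34000000 3f000052 00000000 00000000")
    for key, value in info.items():
        print(f"{key}: {value}")
```

For the header in this report the sketch yields a message TLP with requester 3f:00.0 and message code 0x52. To my reading of the PCIe specification, 0x52 is the PTM Request message code, but that mapping should be verified against the spec before drawing conclusions from it.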

Comment 1 Pengyu Ma 2023-05-15 11:27:12 UTC

Created attachment 304264 [details]
lspci

Comment 2 Bagas Sanjaya 2023-05-15 13:03:40 UTC

(In reply to Pengyu Ma from comment #0)

On which kernel version did this issue occur?

Comment 3 Pengyu Ma 2023-05-15 13:14:55 UTC

Hi Sanjaya,

It's 6.4.0-rc1.

Comment 4 Bagas Sanjaya 2023-05-15 13:37:33 UTC

And what is the last known version that doesn't exhibit this issue?

Comment 5 Pengyu Ma 2023-05-15 17:26:49 UTC

Created attachment 304270 [details]
6.0-pcie-ptm-D3-suspend.dmesg

6.0 shows a different error after resume:

[   24.582897] pcieport 0000:06:00.0: Unable to change power state from D3cold to D0, device inaccessible
[   24.587098] pcieport 0000:07:00.0: Unable to change power state from D3cold to D0, device inaccessible
[   24.587135] pcieport 0000:07:03.0: Unable to change power state from D3cold to D0, device inaccessible
[   24.587153] pcieport 0000:07:01.0: Unable to change power state from D3cold to D0, device inaccessible
[   24.587173] pcieport 0000:07:02.0: Unable to change power state from D3cold to D0, device inaccessible
[   25.838839] thunderbolt 0000:08:00.0: Unable to change power state from D3cold to D0, device inaccessible
[   25.838902] xhci_hcd 0000:3c:00.0: Unable to change power state from D3cold to D0, device inaccessible
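
To compare logs across kernel versions, it can help to pull out just the devices that failed to leave D3cold. The snippet below is a small sketch that matches only the message format shown above; the names `PATTERN` and `failed_transitions` are illustrative, and the sample text is copied from the lines in this comment.

```python
import re

# Matches lines such as:
# [   24.582897] pcieport 0000:06:00.0: Unable to change power state from D3cold to D0, device inaccessible
PATTERN = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<driver>\S+)\s+"
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+"
    r"Unable to change power state from (?P<src>[^\s,]+) to (?P<dst>[^\s,]+), "
    r"device inaccessible"
)


def failed_transitions(log_text):
    """Return (driver, device address, from-state, to-state) for each failure."""
    return [
        (m["driver"], m["bdf"], m["src"], m["dst"])
        for m in PATTERN.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "[   24.582897] pcieport 0000:06:00.0: Unable to change power state "
        "from D3cold to D0, device inaccessible\n"
        "[   25.838839] thunderbolt 0000:08:00.0: Unable to change power state "
        "from D3cold to D0, device inaccessible\n"
    )
    for driver, bdf, src, dst in failed_transitions(sample):
        print(f"{bdf} ({driver}): {src} -> {dst} failed")
```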

Comment 6 Pengyu Ma 2023-05-15 17:27:40 UTC

After bisecting, I found that the following commit changed the behavior:

commit c01163dbd1b8aa016c163ff4bf3a2e90311504f1 (HEAD, refs/bisect/bad)
Author: Bjorn Helgaas <bhelgaas@google.com>
Date:   Fri Sep 9 15:25:05 2022 -0500

    PCI/PM: Always disable PTM for all devices during suspend

Comment 7 Bagas Sanjaya 2023-05-16 09:40:32 UTC

(In reply to Pengyu Ma from comment #6)

There is a fix at [1]. Can you apply it on 6.4-rc1 and test it to see if it solves your regression?

[1]: https://lore.kernel.org/all/20221226153048.1208359-1-kai.heng.feng@canonical.com/

Comment 8 Pengyu Ma 2023-05-16 10:29:11 UTC

@Sanjaya,

I applied the patch on 6.4-rc1, but it doesn't help: the three PCIe devices (TBT, NVMe, Ethernet) still report the same AER errors.

Comment 9 Pengyu Ma 2023-05-25 03:50:34 UTC

@Mika,

Do you have any suggestions?

Comment 10 Mika Westerberg 2023-05-26 13:22:06 UTC

Something is wrong with the Maple Ridge add-in card after S3: it looks like the PCIe link does not come up. This happens in both of your logs, so it is probably not a regression. Is there a BIOS upgrade for your system? If so, I would try that first.

Comment 11 Pengyu Ma 2023-05-26 16:35:16 UTC

@Mika,

I have already discussed this with the BIOS team, but they have no clue either.
Can we get any help from the Intel BIOS team?

Thanks.

Comment 12 Pengyu Ma 2023-05-26 16:35:36 UTC

Fix for the I350 hang issue:
https://lore.kernel.org/lkml/20230526163001.67626-1-aaron.ma@canonical.com/T/#u

Comment 13 Mika Westerberg 2023-05-29 13:39:46 UTC

Which BIOS team are you talking about? The OEM's? In that case they should have contacts in the Intel BIOS team.

Comment 14 Pengyu Ma 2023-05-29 13:45:37 UTC

@Mika,

Thanks, the OEM is working on it.