Created attachment 304263 [details]
iommu-pcie-aer-error.log

CPU: i9-13900
IOMMU enabled.
Legacy S3.

3 PCIe buses report AER errors.

1, Intel I350 [8086:1521] ethernet card connected to 00:06.0
[ 58.185251] pcieport 0000:00:06.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:06.0
[ 58.185279] pcieport 0000:00:06.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[ 58.185287] pcieport 0000:00:06.0: device [8086:a74d] error status/mask=00200000/00010000
[ 58.185296] pcieport 0000:00:06.0: [21] ACSViol (First)

2, Intel thunderbolt [8086:7ab4]
[ 58.187838] pcieport 0000:00:1d.0: AER: Multiple Corrected error received: 0000:00:1d.0
[ 58.187899] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
[ 58.187911] pcieport 0000:00:1d.0: device [8086:7ab4] error status/mask=00008000/00002000
[ 58.187923] pcieport 0000:00:1d.0: [15] HeaderOF
[ 58.187944] pcieport 0000:00:1d.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:1d.0
[ 58.188003] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 58.188014] pcieport 0000:00:1d.0: device [8086:7ab4] error status/mask=00100000/00004000
[ 58.188024] pcieport 0000:00:1d.0: [20] UnsupReq (First)
[ 58.188032] pcieport 0000:00:1d.0: AER: TLP Header: 34000000 3f000052 00000000 00000000

3, Sandisk Corp WD PC SN810
[ 58.310272] pcieport 0000:00:1a.0: AER: Corrected error received: 0000:03:00.0
[ 58.310294] nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 58.310299] nvme 0000:03:00.0: device [15b7:5011] error status/mask=00000001/0000e000
[ 58.310304] nvme 0000:03:00.0: [ 0] RxErr

With the IOMMU disabled, the issue is gone.

The AER error and its recovery can race with the resume from S3, which caused igb to hang. The root cause is the AER error.
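For readers unfamiliar with the "error status/mask=..." lines above: each hex value is a bitfield, and the bracketed number the kernel prints on the following line (for example "[21] ACSViol") is the index of a set status bit. The sketch below decodes the four status/mask pairs from this log. The name tables are a partial subset written to match the short names in the log, and the helper names (decode, parse_status_mask) are made up for illustration; this is not kernel code.

```python
# Partial bit-name tables matching the short names printed in the AER log.
# Assumption: only commonly seen bits are listed; anything else is shown as "bit N".
UNCORRECTABLE = {
    4: "DLP", 5: "SDES", 12: "TLP", 13: "FCP", 14: "CmpltTO",
    15: "CmpltAbrt", 16: "UnxCmplt", 17: "RxOF", 18: "MalfTLP",
    19: "ECRC", 20: "UnsupReq", 21: "ACSViol",
}
CORRECTABLE = {
    0: "RxErr", 6: "BadTLP", 7: "BadDLLP", 8: "Rollover",
    12: "Timeout", 13: "NonFatalErr", 14: "CorrIntErr", 15: "HeaderOF",
}


def parse_status_mask(text):
    """Split a 'status/mask' string such as '00200000/00010000' into two ints."""
    status, mask = (int(part, 16) for part in text.split("/"))
    return status, mask


def decode(value, names):
    """Return (bit, name) for every bit set in a 32-bit register value."""
    return [
        (bit, names.get(bit, f"bit {bit}"))
        for bit in range(32)
        if (value >> bit) & 1
    ]


# The four status/mask pairs reported in this comment.
SAMPLES = [
    ("0000:00:06.0", "00200000/00010000", UNCORRECTABLE),
    ("0000:00:1d.0", "00008000/00002000", CORRECTABLE),
    ("0000:00:1d.0", "00100000/00004000", UNCORRECTABLE),
    ("0000:03:00.0", "00000001/0000e000", CORRECTABLE),
]

for device, pair, table in SAMPLES:
    status, mask = parse_status_mask(pair)
    print(device, "status:", decode(status, table), "masked:", decode(mask, table))
```

Decoding the status values reproduces the bit numbers in the log (21, 15, 20, 0). The mask values show which error bits are masked on each device and so would not be reported.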
Created attachment 304264 [details] lspci
(In reply to Pengyu Ma from comment #0)
> [...]
> The AER error and recovery could make race issue when resume from S3.
> It caused igb hang. The rootcause is from AER error.

Which kernel version did this issue occur on?
Hi Sanjaya, It's 6.4.0-rc1.
On 5/15/23 20:14, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=217446
>
> --- Comment #3 from Pengyu Ma (mapengyu@gmail.com) ---
> Hi Sanjaya,
>
> It's 6.4.0-rc1.
>

And what is the last known version that doesn't exhibit this issue?
Created attachment 304270 [details]
6.0-pcie-ptm-D3-suspend.dmesg

6.0 shows a different error after resume:

[ 24.582897] pcieport 0000:06:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 24.587098] pcieport 0000:07:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 24.587135] pcieport 0000:07:03.0: Unable to change power state from D3cold to D0, device inaccessible
[ 24.587153] pcieport 0000:07:01.0: Unable to change power state from D3cold to D0, device inaccessible
[ 24.587173] pcieport 0000:07:02.0: Unable to change power state from D3cold to D0, device inaccessible
[ 25.838839] thunderbolt 0000:08:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 25.838902] xhci_hcd 0000:3c:00.0: Unable to change power state from D3cold to D0, device inaccessible
After bisecting, the following commit changed the behavior:

commit c01163dbd1b8aa016c163ff4bf3a2e90311504f1 (HEAD, refs/bisect/bad)
Author: Bjorn Helgaas <bhelgaas@google.com>
Date:   Fri Sep 9 15:25:05 2022 -0500

    PCI/PM: Always disable PTM for all devices during suspend
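As a cross-check on the bisect result, the TLP header logged with the UnsupReq error on 0000:00:1d.0 in comment #0 (34000000 3f000052 00000000 00000000) can be decoded by hand. The sketch below assumes the standard PCIe header layout, with header byte 0 in the top byte of the first logged dword, and only extracts the fields needed here. The function name decode_tlp_header is made up for illustration.

```python
def decode_tlp_header(dw0, dw1):
    """Extract a few fields from the first two dwords of a logged TLP header.

    Assumes header byte 0 sits in bits 31:24 of dw0, as in the AER header log.
    """
    byte0 = (dw0 >> 24) & 0xFF
    fmt = byte0 >> 5            # Fmt[2:0]
    tlp_type = byte0 & 0x1F     # Type[4:0]
    requester = (dw1 >> 16) & 0xFFFF
    fields = {
        "fmt": fmt,
        "type": tlp_type,
        "length": dw0 & 0x3FF,
        "requester": "%02x:%02x.%d" % (
            requester >> 8, (requester >> 3) & 0x1F, requester & 0x7
        ),
    }
    # Message requests have Type = 0b10rrr, where rrr is the routing field.
    if (tlp_type >> 3) == 0b10:
        fields["routing"] = tlp_type & 0x7
        # For message requests, the last byte of dw1 is the message code.
        fields["message_code"] = dw1 & 0xFF
    return fields


header = decode_tlp_header(0x34000000, 0x3F000052)
print(header)
```

Under these assumptions the header is a message request without data (Fmt 1), routed as "local, terminate at receiver" (routing 4), from requester 3f:00.0, with message code 0x52. As far as I can tell from the PCIe Base Specification's message code table, 0x52 is the PTM Request message, which would be consistent with a PTM-related commit changing the behavior. Treat that reading as an interpretation and not a confirmed diagnosis.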
(In reply to Pengyu Ma from comment #6)
> After bisect, the following commit changed the behavior:
>
> commit c01163dbd1b8aa016c163ff4bf3a2e90311504f1 (HEAD, refs/bisect/bad)
> Author: Bjorn Helgaas <bhelgaas@google.com>
> Date: Fri Sep 9 15:25:05 2022 -0500
>
> PCI/PM: Always disable PTM for all devices during suspend

There is a fix at [1]. Can you apply it on 6.4-rc1 and test whether it solves your regression?

[1]: https://lore.kernel.org/all/20221226153048.1208359-1-kai.heng.feng@canonical.com/
@Sanjaya, I applied the patch on 6.4-rc1, but it doesn't help: the 3 PCIe devices (TBT, NVMe, Ethernet) still report the same AER errors.
@Mika, Do you have any suggestions?
There is something wrong with the Maple Ridge add-in card after S3, because it looks like the PCIe link does not come up. This happens in both of your logs, so it is probably not a regression. Is there a BIOS upgrade for your system? If so, I would try that first.
@Mika, I already discussed this with the BIOS team, but they have no clue either. Can we get any help from the Intel BIOS team? Thanks.
Fix for the I350 hang issue: https://lore.kernel.org/lkml/20230526163001.67626-1-aaron.ma@canonical.com/T/#u
Which BIOS team are you talking about? The OEM's? In that case they should have contacts on the Intel BIOS team.
@Mika, Thanks, OEM is working on it.