Bug 217446
| Summary: | PCIe AER error after S3 when enabled Intel IOMMU | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | Pengyu Ma (mapengyu) |
| Component: | IOMMU | Assignee: | drivers_iommu |
| Status: | NEW --- | | |
| Severity: | blocking | CC: | bagasdotme, mika.westerberg |
| Priority: | P3 | | |
| Hardware: | Intel | | |
| OS: | Linux | | |
| Kernel Version: | | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | iommu-pcie-aer-error.log, lspci, 6.0-pcie-ptm-D3-suspend.dmesg | | |
Created attachment 304264 [details]
lspci
(In reply to Pengyu Ma from comment #0)
> Created attachment 304263 [details]
> iommu-pcie-aer-error.log
>
> CPU: i9-13900
> IOMMU enabled.
> Legacy S3.
>
> 3 PCIE buses report AER error.
>
> 1, Intel I350 [8086:1521] ethernet card connected to 00:06.0
> [ 58.185251] pcieport 0000:00:06.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:06.0
> [ 58.185279] pcieport 0000:00:06.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
> [ 58.185287] pcieport 0000:00:06.0: device [8086:a74d] error status/mask=00200000/00010000
> [ 58.185296] pcieport 0000:00:06.0: [21] ACSViol (First)
>
> 2, Intel thunderbolt [8086:7ab4]
> [ 58.187838] pcieport 0000:00:1d.0: AER: Multiple Corrected error received: 0000:00:1d.0
> [ 58.187899] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
> [ 58.187911] pcieport 0000:00:1d.0: device [8086:7ab4] error status/mask=00008000/00002000
> [ 58.187923] pcieport 0000:00:1d.0: [15] HeaderOF
> [ 58.187944] pcieport 0000:00:1d.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:1d.0
> [ 58.188003] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> [ 58.188014] pcieport 0000:00:1d.0: device [8086:7ab4] error status/mask=00100000/00004000
> [ 58.188024] pcieport 0000:00:1d.0: [20] UnsupReq (First)
> [ 58.188032] pcieport 0000:00:1d.0: AER: TLP Header: 34000000 3f000052 00000000 00000000
>
> 3, Sandisk Corp WD PC SN810
> [ 58.310272] pcieport 0000:00:1a.0: AER: Corrected error received: 0000:03:00.0
> [ 58.310294] nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> [ 58.310299] nvme 0000:03:00.0: device [15b7:5011] error status/mask=00000001/0000e000
> [ 58.310304] nvme 0000:03:00.0: [ 0] RxErr
>
> Disable IOMMU, the issue is gone.
>
> The AER error and recovery could make race issue when resume from S3.
> It caused igb hang. The rootcause is from AER error.

What kernel version did this issue occur?

Hi Sanjaya,

It's 6.4.0-rc1.

On 5/15/23 20:14, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=217446
>
> --- Comment #3 from Pengyu Ma (mapengyu@gmail.com) ---
> Hi Sanjaya,
>
> It's 6.4.0-rc1.
>

And last known version that doesn't exhibit this issue?

Created attachment 304270 [details]
6.0-pcie-ptm-D3-suspend.dmesg
6.0 shows a different error after resume:
[ 24.582897] pcieport 0000:06:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 24.587098] pcieport 0000:07:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 24.587135] pcieport 0000:07:03.0: Unable to change power state from D3cold to D0, device inaccessible
[ 24.587153] pcieport 0000:07:01.0: Unable to change power state from D3cold to D0, device inaccessible
[ 24.587173] pcieport 0000:07:02.0: Unable to change power state from D3cold to D0, device inaccessible
[ 25.838839] thunderbolt 0000:08:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 25.838902] xhci_hcd 0000:3c:00.0: Unable to change power state from D3cold to D0, device inaccessible
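
For anyone triaging a similar dmesg, the failing devices can be pulled out of the log mechanically. This is only a minimal sketch: the helper names `failed_resumes` and `group_by_bus` are made up for illustration, and the regular expression assumes the exact message format shown in the lines above.

```python
import re

# Sample copied from the 6.0 dmesg lines quoted above.
SAMPLE = """\
[ 24.582897] pcieport 0000:06:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 24.587098] pcieport 0000:07:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 24.587135] pcieport 0000:07:03.0: Unable to change power state from D3cold to D0, device inaccessible
[ 24.587153] pcieport 0000:07:01.0: Unable to change power state from D3cold to D0, device inaccessible
[ 24.587173] pcieport 0000:07:02.0: Unable to change power state from D3cold to D0, device inaccessible
[ 25.838839] thunderbolt 0000:08:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 25.838902] xhci_hcd 0000:3c:00.0: Unable to change power state from D3cold to D0, device inaccessible
"""

# Assumed line layout: "[ <timestamp>] <driver> <domain:bus:dev.fn>: Unable to change ..."
D3COLD_FAIL = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<driver>\S+)\s+"
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+"
    r"Unable to change power state from D3cold to D0"
)


def failed_resumes(dmesg_text):
    """Return (timestamp, driver, BDF) for every device that could not leave D3cold."""
    return [
        (float(m["ts"]), m["driver"], m["bdf"])
        for m in D3COLD_FAIL.finditer(dmesg_text)
    ]


def group_by_bus(entries):
    """Group the failing devices by PCI bus number (hex string) to show which segment is affected."""
    buses = {}
    for _ts, driver, bdf in entries:
        bus = bdf.split(":")[1]
        buses.setdefault(bus, []).append((bdf, driver))
    return buses


if __name__ == "__main__":
    for bus, devices in sorted(group_by_bus(failed_resumes(SAMPLE)).items()):
        print(bus, devices)
```

Grouping by bus makes it easier to see whether the inaccessible devices all sit in one part of the topology; which bridge each bus hangs off would still have to be checked against the attached lspci output.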
After bisect, the following commit changed the behavior:

commit c01163dbd1b8aa016c163ff4bf3a2e90311504f1 (HEAD, refs/bisect/bad)
Author: Bjorn Helgaas <bhelgaas@google.com>
Date: Fri Sep 9 15:25:05 2022 -0500

PCI/PM: Always disable PTM for all devices during suspend

(In reply to Pengyu Ma from comment #6)
> After bisect, the following commit changed the behavior:
>
> commit c01163dbd1b8aa016c163ff4bf3a2e90311504f1 (HEAD, refs/bisect/bad)
> Author: Bjorn Helgaas <bhelgaas@google.com>
> Date: Fri Sep 9 15:25:05 2022 -0500
>
> PCI/PM: Always disable PTM for all devices during suspend

There is a fix at [1]. Can you apply it on 6.4-rc1 and test it to see if it solves your regression?

[1]: https://lore.kernel.org/all/20221226153048.1208359-1-kai.heng.feng@canonical.com/

@Sanjaya,
I applied the patch on 6.4-rc1 and it doesn't help; the 3 PCIe devices (TBT, NVMe, Ethernet) still report the same AER errors.

@Mika,
Do you have any suggestions?

There is something wrong with the Maple Ridge add-in card after S3, because it looks like the PCIe link does not come up. This happens in both of your logs, so it is probably not a regression.

Is there a BIOS upgrade for your system? If yes, I would first try with that.

@Mika,
I already discussed this with the BIOS team, but they have no clue either.
Can we get any help from the Intel BIOS team?
Thanks.

Fix for the I350 hang issue:
https://lore.kernel.org/lkml/20230526163001.67626-1-aaron.ma@canonical.com/T/#u

Which BIOS team are you talking about? The OEM? In that case they should have contacts to the Intel BIOS team.

@Mika,
Thanks, the OEM is working on it.
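
Since the bisect points at a PTM change, it may be worth decoding the TLP header that was logged with the UnsupReq error on 0000:00:1d.0 (`34000000 3f000052 00000000 00000000`). The sketch below is not from the report: the field layout is my reading of the PCIe message request header (Fmt and Type in the top byte of DW0; Requester ID, Tag and Message Code in DW1), and the message-code table is deliberately limited to the two PTM codes as I understand them from the PCIe specification. Treat both as assumptions to verify against the spec.

```python
# Assumption: message codes 0x52 / 0x53 are PTM Request / PTM Response.
# Any other code is reported as unknown rather than guessed.
MESSAGE_CODES = {0x52: "PTM Request", 0x53: "PTM Response"}


def decode_tlp_header(header):
    """Decode the first two DWORDs of a TLP header string as printed by the AER driver."""
    dws = [int(word, 16) for word in header.split()]
    dw0, dw1 = dws[0], dws[1]

    fmt = (dw0 >> 29) & 0x7        # 0b001: 4DW header, no data
    tlp_type = (dw0 >> 24) & 0x1F  # 0b10rrr: message, rrr = routing
    is_message = (tlp_type >> 3) == 0b10

    requester_id = (dw1 >> 16) & 0xFFFF
    bus = requester_id >> 8
    dev = (requester_id >> 3) & 0x1F
    fn = requester_id & 0x7

    result = {
        "fmt": fmt,
        "type": tlp_type,
        "is_message": is_message,
        "requester": f"{bus:02x}:{dev:02x}.{fn}",
    }
    if is_message:
        code = dw1 & 0xFF
        result["routing"] = tlp_type & 0x7
        result["message_code"] = code
        result["message_name"] = MESSAGE_CODES.get(code, "unknown")
    return result


if __name__ == "__main__":
    print(decode_tlp_header("34000000 3f000052 00000000 00000000"))
```

Under those assumptions the header decodes as a message TLP with code 0x52 from requester 3f:00.0, which would be a PTM Request that the port rejected as an Unsupported Request. That is consistent with PTM state being involved after resume, but it is a reading of one log line, not a confirmed root cause.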
Created attachment 304263 [details]
iommu-pcie-aer-error.log

CPU: i9-13900
IOMMU enabled.
Legacy S3.

3 PCIE buses report AER error.

1, Intel I350 [8086:1521] ethernet card connected to 00:06.0
[ 58.185251] pcieport 0000:00:06.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:06.0
[ 58.185279] pcieport 0000:00:06.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[ 58.185287] pcieport 0000:00:06.0: device [8086:a74d] error status/mask=00200000/00010000
[ 58.185296] pcieport 0000:00:06.0: [21] ACSViol (First)

2, Intel thunderbolt [8086:7ab4]
[ 58.187838] pcieport 0000:00:1d.0: AER: Multiple Corrected error received: 0000:00:1d.0
[ 58.187899] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
[ 58.187911] pcieport 0000:00:1d.0: device [8086:7ab4] error status/mask=00008000/00002000
[ 58.187923] pcieport 0000:00:1d.0: [15] HeaderOF
[ 58.187944] pcieport 0000:00:1d.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:1d.0
[ 58.188003] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 58.188014] pcieport 0000:00:1d.0: device [8086:7ab4] error status/mask=00100000/00004000
[ 58.188024] pcieport 0000:00:1d.0: [20] UnsupReq (First)
[ 58.188032] pcieport 0000:00:1d.0: AER: TLP Header: 34000000 3f000052 00000000 00000000

3, Sandisk Corp WD PC SN810
[ 58.310272] pcieport 0000:00:1a.0: AER: Corrected error received: 0000:03:00.0
[ 58.310294] nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 58.310299] nvme 0000:03:00.0: device [15b7:5011] error status/mask=00000001/0000e000
[ 58.310304] nvme 0000:03:00.0: [ 0] RxErr

Disable IOMMU, the issue is gone.

The AER error and recovery could make race issue when resume from S3.
It caused igb hang. The rootcause is from AER error.
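
The `status/mask=` pairs in the report above are plain bitmasks, and the bracketed number on the following log line (`[21] ACSViol`, `[15] HeaderOF`, ...) is the index of the set status bit. A minimal sketch of that decoding follows; the bit-name tables contain only the four names that actually appear in this report, every other bit is printed as `bit N`, and the function names are made up for illustration.

```python
# Names taken from the log lines in this report only; not a complete table.
UNCORRECTED_BITS = {20: "UnsupReq", 21: "ACSViol"}
CORRECTED_BITS = {0: "RxErr", 15: "HeaderOF"}


def set_bits(value):
    """Return the indices of all bits set in a 32-bit register value."""
    return [bit for bit in range(32) if value & (1 << bit)]


def decode(status_hex, mask_hex, names):
    """Decode one 'status/mask=SSSSSSSS/MMMMMMMM' pair from an AER log line."""
    status = int(status_hex, 16)
    mask = int(mask_hex, 16)

    def label(bit):
        return names.get(bit, f"bit {bit}")

    return {
        "reported": [label(b) for b in set_bits(status)],
        "masked_bits": set_bits(mask),
        # Errors that are set in status and not suppressed by the mask.
        "unmasked": [label(b) for b in set_bits(status & ~mask)],
    }


if __name__ == "__main__":
    print(decode("00200000", "00010000", UNCORRECTED_BITS))  # 0000:00:06.0
    print(decode("00100000", "00004000", UNCORRECTED_BITS))  # 0000:00:1d.0
    print(decode("00008000", "00002000", CORRECTED_BITS))    # 0000:00:1d.0
    print(decode("00000001", "0000e000", CORRECTED_BITS))    # 0000:03:00.0
```

In all four cases the reported bit is not covered by the mask, which matches the fact that the kernel logged each of them.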