Bug 209149

Summary: "iommu/vt-d: Enable PCI ACS for platform opt in hint" makes NVMe config space not accessible after S3
Product: Drivers
Reporter: Kai-Heng Feng (kai.heng.feng)
Component: PCI
Assignee: drivers_pci (drivers_pci)
Status: NEW
Severity: normal
CC: baolu.lu, bjorn, wse
Priority: P1
Hardware: All
OS: Linux
Kernel Version: mainline
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
dmesg with dynamic debug enabled
lspci -vvnn
lspci -t
workaround patch
dmesg with the quirk patch applied
dmesg
lspci -tv, Intel NVMe
Proposed patch
Print AER status
dmesg with AER status printed
proposed patch

Description Kai-Heng Feng 2020-09-04 14:31:20 UTC
Here's the error:
[   50.947816] pcieport 0000:00:1b.0: DPC: containment event, status:0x1f01 source:0x0000
[   50.947817] pcieport 0000:00:1b.0: DPC: unmasked uncorrectable error detected
[   50.947829] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[   50.947830] pcieport 0000:00:1b.0:   device [8086:06ac] error status/mask=00200000/00010000
[   50.947831] pcieport 0000:00:1b.0:    [21] ACSViol                (First)
[   50.947841] pcieport 0000:00:1b.0: AER: broadcast error_detected message
[   50.947843] nvme nvme0: frozen state error detected, reset controller
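
For readers decoding the register values in the log above, here is a minimal sketch. It assumes the bit layout given by the PCI_ERR_UNC_* and PCI_EXP_DPC_STATUS_* definitions in the Linux header include/uapi/linux/pci_regs.h. The helper names (decode_bits, decode_dpc_status) and the descriptive bit labels are made up for illustration and are not kernel APIs. Only a subset of the AER bits is listed.

```python
# Subset of the AER Uncorrectable Error Status bits (bit number -> label).
# Labels are descriptive names chosen for this sketch, not kernel strings.
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    12: "Poisoned TLP",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    18: "Malformed TLP",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode_bits(value, names):
    """Return the labels of all bits set in a 32-bit register value."""
    return [names.get(bit, f"bit{bit}") for bit in range(32) if (value >> bit) & 1]


def decode_dpc_status(status):
    """Split a 16-bit DPC Status value into the fields used in this sketch."""
    return {
        "triggered": bool(status & 0x0001),                  # DPC Trigger Status
        "trigger_reason": (status & 0x0006) >> 1,            # 0 = unmasked uncorrectable error
        "trigger_reason_ext": (status & 0x0060) >> 5,        # only meaningful when reason == 3
        "rp_pio_first_error_pointer": (status & 0x1F00) >> 8,
    }


if __name__ == "__main__":
    # Values taken from the log: status/mask=00200000/00010000, DPC status:0x1f01
    print("AER status:", decode_bits(0x00200000, AER_UNCOR_BITS))
    print("AER mask:  ", decode_bits(0x00010000, AER_UNCOR_BITS))
    print("DPC status:", decode_dpc_status(0x1F01))
```

Under these assumptions, the AER status has only bit 21 (ACS Violation) set, which agrees with the "[21] ACSViol" line, and the DPC trigger reason field is 0, which agrees with "unmasked uncorrectable error detected".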
Comment 1 Kai-Heng Feng 2020-09-04 14:32:13 UTC
Created attachment 292327 [details]
dmesg with dynamic debug enabled
Comment 2 Kai-Heng Feng 2020-09-04 14:32:33 UTC
Created attachment 292329 [details]
lspci -vvnn
Comment 3 Kai-Heng Feng 2020-09-04 14:32:51 UTC
Created attachment 292331 [details]
lspci -t
Comment 4 Kai-Heng Feng 2020-09-04 14:33:50 UTC
Created attachment 292333 [details]
workaround patch

With a quirk applied to the root port, the issue is gone.
Comment 5 Kai-Heng Feng 2020-09-04 14:34:17 UTC
Created attachment 292335 [details]
dmesg with the quirk patch applied
Comment 6 Kai-Heng Feng 2020-09-04 14:34:44 UTC
So I wonder whether an ACS quirk is also required for Comet Lake?
Comment 7 Kai-Heng Feng 2020-09-23 05:28:29 UTC
Created attachment 292565 [details]
dmesg

The same issue occurs on an Intel NVMe drive with the ACS quirk applied.
Comment 8 Kai-Heng Feng 2020-09-23 05:28:52 UTC
Created attachment 292567 [details]
lspci -tv, Intel NVMe
Comment 9 Kai-Heng Feng 2020-09-23 05:30:26 UTC
Created attachment 292569 [details]
Proposed patch

Unconditionally disabling ACS redirect for Intel bridges works around the issue.
Comment 10 Kai-Heng Feng 2020-10-15 15:33:00 UTC
The patch in comment #9 is just a placebo. The issue is still reproducible with ACS redirect forcibly disabled.
Comment 11 Kai-Heng Feng 2021-02-05 15:05:05 UTC
Created attachment 295077 [details]
Print AER status
Comment 12 Kai-Heng Feng 2021-02-05 15:05:37 UTC
Created attachment 295079 [details]
dmesg with AER status printed
Comment 13 Bjorn Helgaas 2024-06-18 21:32:51 UTC
Created attachment 306473 [details]
proposed patch

I'm not completely clear on the mechanism here, but this is a possible fix for this issue (at least, this bug is mentioned in the commit log).
Comment 14 Kai-Heng Feng 2024-06-19 06:06:29 UTC
Confirmed the patch solves the issue.
Comment 15 Werner Sembach [TUXEDO] 2024-07-01 14:50:36 UTC
(In reply to Bjorn Helgaas from comment #13)
> Created attachment 306473 [details]
> proposed patch
> 
> I'm not completely clear on the mechanism here, but this is a possible fix
> for this issue (at least, this bug is mentioned in the commit log).

Also works for me (applied without conflict against 6.8.12; I couldn't use 6.10-rc5 because the nvidia driver does not yet support that kernel).