Bug 215027

Summary: "PCIe Bus Error: severity=Corrected, type=Physical Layer" flood on Intel VMD + Samsung NVMe combination
Product: Drivers
Reporter: Kai-Heng Feng (kai.heng.feng)
Component: PCI
Assignee: drivers_pci (drivers_pci)
Status: NEW
Severity: normal
CC: bjorn, francisco.munoz.ruiz, jonathan.derrick, mika.westerberg, naveennaidu479
Priority: P1
Hardware: All
OS: Linux
Kernel Version: mainline, linux-next
Regression: No
Attachments:
- dmesg with AER flood
- lspci -vvnn
- Fix long standing AER Error Handling Issues - with debug statements - to figure out why it does not work
- dmesg with debug patch

Description Kai-Heng Feng 2021-11-15 07:17:01 UTC
The following tests (and any combination of them) don't help:
- Change NVMe LTR value to 0 or any other number
- Disable NVMe APST
- Disable PCIe ASPM
- Any version of kernel, including linux-next
- "Fix long standing AER Error Handling Issues" patch series [1]

[1] https://lore.kernel.org/linux-pci/cover.1635179600.git.naveennaidu479@gmail.com/
Comment 1 Kai-Heng Feng 2021-11-15 07:20:59 UTC
Created attachment 299571 [details]
dmesg with AER flood
Comment 2 Kai-Heng Feng 2021-11-15 07:21:17 UTC
Created attachment 299573 [details]
lspci -vvnn
Comment 3 Naveen Naidu 2021-11-16 09:38:37 UTC
Created attachment 299599 [details]
Fix long standing AER Error Handling Issues - with debug statements - to figure out why it does not work

Hello Kai-Heng o/

Thank you very much for the detailed bug report, and thank you also for testing my patch series "Fix long standing AER Error Handling Issues" [1].

[1] https://lore.kernel.org/linux-pci/cover.1635179600.git.naveennaidu479@gmail.com/

IIUC, even this patch series was not able to fix the AER message spew. I've added a few debug statements in the new patch series attached, which might help me figure out why it did not work for you.

If you have some free time, could you please test the attached patch series and upload the dmesg output?

It would be really helpful if you could test it in two scenarios:

1. Test the patch series as it is and capture the dmesg output.
2. First disable PCIe ASPM, then test the patch series and capture the dmesg output.

Thanks,
Naveen
Comment 4 Kai-Heng Feng 2021-11-24 14:25:41 UTC
Created attachment 299701 [details]
dmesg with debug patch
Comment 5 Kai-Heng Feng 2021-11-24 14:26:58 UTC
The issue is Intel VMD specific. If VMD is turned off, the NVMe is under a regular PCIe root port, and the issue is not observed.
Comment 6 Kai-Heng Feng 2021-11-25 23:03:41 UTC
The Samsung NVMe in question is PCIe Gen4. Gen3 NVMes are not affected by this issue.
Comment 7 Francisco Munoz-Ruiz 2021-11-29 21:43:12 UTC
(In reply to Kai-Heng Feng from comment #0)
> The following tests (and any combination of them) don't help:
> - Change NVMe LTR value to 0 or any other number
> - Disable NVMe APST
> - Disable PCIe ASPM
> - Any version of kernel, including linux-next
> - "Fix long standing AER Error Handling Issues" patch series [1]
> 
> [1] https://lore.kernel.org/linux-pci/cover.1635179600.git.naveennaidu479@gmail.com/

Hello Kai-Heng,

Can you please help me with the exact command? I'm able to inject AERs into disks outside the VMD domain. However, the command-line tool gives me an error for disks in the VMD domain:

    aer-inject -s 10000:01:00.0 correctable_vmd
    Error: Can not parse PCI_ID: 10000:01:00.0 

I tried to fix it by changing aer-inject:

    diff --git a/aer.y b/aer.y
    index a8ad063..52e1438 100644
    --- a/aer.y
    +++ b/aer.y
    @@ -98,7 +98,7 @@ int parse_pci_id(const char *str, struct aer_error_inj *aerr)
     {
            int cnt;

    -       cnt = sscanf(str, "%04hx:%02hhx:%02hhx.%01hhx",
    +       cnt = sscanf(str, "%05hx:%02hhx:%02hhx.%01hhx",
                     &aerr->domain, &aerr->bus, &aerr->dev, &aerr->fn);

then I get:
    Error: Failed to write, No such device
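
(Editorial aside, purely an assumption based on the %hx/%hhx conversions in the diff above: if aerr->domain is a 16-bit field, a segment number like 0x10000 cannot fit in it even after widening the format to %05hx, which could explain why the write still fails. Below is a standalone sketch of parsing the ID into a 32-bit domain field, for illustration only; it is not the aer-inject or kernel aer_inject interface.)

    /*
     * Illustration only, not the actual aer-inject code: parse a PCI ID
     * whose segment/domain may exceed 16 bits (e.g. the VMD domain 0x10000)
     * into a 32-bit field.  If the real aerr->domain is a 16-bit field,
     * widening only the sscanf format string cannot help.
     */
    #include <stdio.h>

    struct pci_id {
            unsigned int  domain;           /* 32 bits, so 0x10000 fits */
            unsigned char bus, dev, fn;
    };

    static int parse_pci_id(const char *str, struct pci_id *id)
    {
            return sscanf(str, "%x:%hhx:%hhx.%hhx",
                          &id->domain, &id->bus, &id->dev, &id->fn) == 4;
    }

    int main(void)
    {
            struct pci_id id;

            if (parse_pci_id("10000:01:00.0", &id))
                    printf("domain %x bus %02x dev %02x fn %x\n",
                           id.domain, id.bus, id.dev, id.fn);
            return 0;
    }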

Do you have any suggestions as to why aer-inject fails for disks in a VMD domain?

Can you provide the model of the Gen4 Samsung SSD used to reproduce this issue?

Thanks,
Francisco
Comment 8 Kai-Heng Feng 2021-12-01 06:06:42 UTC
I found out why the issue doesn't happen in non-VMD mode: AER is disabled there.

Will send a patch to resolve the issue.
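
Roughly, the idea would be for VMD to honor the same _OSC result as the regular host bridge, so AER stays disabled behind VMD as well. A sketch of one possible shape for such a change is below, assuming the native_* ownership flags in struct pci_host_bridge; it is an illustration, not the actual patch.

    /*
     * Illustrative sketch (not the actual patch): copy the _OSC-negotiated
     * PCIe feature ownership from the ACPI root bridge to the synthetic VMD
     * host bridge, so features firmware did not grant to the OS in domain
     * 0000 (such as AER here) stay disabled behind VMD too.
     */
    #include <linux/pci.h>

    static void vmd_copy_host_bridge_flags(struct pci_host_bridge *root_bridge,
                                           struct pci_host_bridge *vmd_bridge)
    {
            vmd_bridge->native_pcie_hotplug = root_bridge->native_pcie_hotplug;
            vmd_bridge->native_shpc_hotplug = root_bridge->native_shpc_hotplug;
            vmd_bridge->native_aer          = root_bridge->native_aer;
            vmd_bridge->native_pme          = root_bridge->native_pme;
            vmd_bridge->native_ltr          = root_bridge->native_ltr;
            vmd_bridge->native_dpc          = root_bridge->native_dpc;
    }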
Comment 9 Bjorn Helgaas 2024-04-24 22:37:16 UTC
(In reply to Kai-Heng Feng from comment #5)
> The issue is Intel VMD specific. If VMD is turned off, the NVMe is under
> regular PCIe root port, and the issue is not observed.

When VMD is turned off, the Samsung NVMe will be in domain 0000.  If _OSC remains the same as when VMD is turned on, it will indicate that AER is not supported in that domain, so we wouldn't expect to see the issue.

Here's the negotiation from comment #1 when VMD is turned on:

[    0.408990] ACPI: PCI Root Bridge [PC00] (domain 0000 [bus 00-e0])
[    0.408995] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
[    0.410076] acpi PNP0A08:00: _OSC: platform does not support [AER]
[    0.412207] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug SHPCHotplug PME PCIeCapability LTR]

and AER is disabled on all the domain 0000 Root Ports in the comment #2 lspci output.

It would be interesting to boot with VMD turned off and with the "pcie_ports=native" parameter.  Then we should ignore _OSC and turn on AER even if firmware doesn't grant ownership.

If we see the Correctable Errors in that configuration, it suggests some issue between VMD and the Samsung NVMe.
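
(For reference, the per-device corrected-error counters can also be read from sysfs, independent of the dmesg flood. The sketch below assumes CONFIG_PCIEAER and a kernel that exposes the aer_dev_correctable attribute; the device address is only a placeholder taken from this thread.)

    /*
     * Minimal sketch: dump a device's AER corrected-error counters from
     * sysfs.  Assumes CONFIG_PCIEAER and a kernel that provides the
     * aer_dev_correctable attribute; pass the device's domain:bus:dev.fn
     * as the first argument (the default below is only a placeholder).
     */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
            const char *bdf = argc > 1 ? argv[1] : "10000:01:00.0";
            char path[256], line[128];
            FILE *f;

            snprintf(path, sizeof(path),
                     "/sys/bus/pci/devices/%s/aer_dev_correctable", bdf);
            f = fopen(path, "r");
            if (!f) {
                    perror(path);
                    return EXIT_FAILURE;
            }
            while (fgets(line, sizeof(line), f))
                    fputs(line, stdout);    /* one counter per line */
            fclose(f);
            return EXIT_SUCCESS;
    }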
Comment 10 Kai-Heng Feng 2024-04-26 05:49:00 UTC
> It would be interesting to boot with VMD turned off and with the
> "pcie_ports=native" parameter.  Then we should ignore _OSC and turn on AER
> even if firmware doesn't grant ownership.

I remember I tried that, and the AER error flood started to appear.

If you want to see the dmesg with that, I'll need to dig the laptop out of the lab.
Comment 11 Bjorn Helgaas 2024-04-26 16:41:37 UTC
Thanks, you have a fantastic memory!  No need to dig out the laptop for now.

This suggests to me that this isn't related to the VMD functionality itself.  It could be an underlying hardware issue, e.g., a signal integrity issue, slot connector issue, etc., with this specific platform or NVMe device.

But I see several similar reports involving this and other devices that say "pcie_aspm=off" is a workaround, which makes me wonder if there's an ASPM configuration issue involved.  (These are from a search for "144d:a80a" "AER: Corrected error" and for "corrected error" site:launchpad.net)

https://forums.unraid.net/topic/118286-nvme-drives-throwing-errors-filling-logs-instantly-how-to-resolve/ is an ASUS X99 Deluxe II with Kingston A2000 NVMe, "pcie_aspm=off" stopped the errors.

https://forum.proxmox.com/threads/pve-kernel-panics-on-reboots.144481/ is ASUS Pro WS with Samsung PM9A1/PM9A3/980PRO NVMe, similar errors, no info about workaround.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2015670 is a Dell Inc. Inspiron 3793, similar errors with Realtek RTL810xE NIC and "pcie_aspm=off" is a workaround; another reporter with similar errors with Samsung PM9A1/PM9A3/980PRO NVMe.

https://www.eevblog.com/forum/general-computing/linux-mint-21-02-clone-replace-1tb-nvme-with-a-2tb-nvme/ is Lenovo Thinkpad where Samsung 980 PRO 2 TB NVMe shows similar errors but WD SN 570 does not, no workaround info.

https://linux-hardware.org/?probe=7c13a64c8a&log=dmesg is v6.4.6 on ASUSTek VivoBook with VMD enabled and Samsung PM9A1/PM9A3/980PRO NVMe, similar errors for NVMe behind VMD, [10ec:8168] Realtek RTL8111/8168/8211/8411 NIC, Genesys GL9755 SDHCI controller.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2043665 AMI MPG Z690 FORCE WIFI with ASM1062 SATA, "pcie_aspm=off" workaround, but errors mysteriously evaporated even without "pcie_aspm=off".

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1888523 Gigabyte Aorus Gaming 7 with [10ec:5762] Realtek NVMe (XPG NVMe?), errors don't happen with NVMe from different manufacturer.
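
(As a narrower experiment than booting with "pcie_aspm=off", ASPM can also be disabled for a single device through its sysfs link/ attributes. The sketch below is an illustration only; it assumes CONFIG_PCIEASPM, that the OS owns ASPM control, and a placeholder device address.)

    /*
     * Sketch: disable ASPM L1 for one device via its sysfs link/ attributes,
     * as a narrower experiment than "pcie_aspm=off".  Assumes CONFIG_PCIEASPM
     * and OS ownership of ASPM; needs root.  The default address below is a
     * placeholder; pass the real domain:bus:dev.fn as the first argument.
     */
    #include <stdio.h>
    #include <stdlib.h>

    static int set_link_attr(const char *bdf, const char *attr, int value)
    {
            char path[256];
            FILE *f;

            snprintf(path, sizeof(path),
                     "/sys/bus/pci/devices/%s/link/%s", bdf, attr);
            f = fopen(path, "w");
            if (!f) {
                    perror(path);
                    return -1;
            }
            fprintf(f, "%d\n", value);
            return fclose(f);
    }

    int main(int argc, char **argv)
    {
            const char *bdf = argc > 1 ? argv[1] : "10000:01:00.0";

            /* 0 disables ASPM L1 on this device's link; 1 re-enables it. */
            return set_link_attr(bdf, "l1_aspm", 0) ? EXIT_FAILURE : EXIT_SUCCESS;
    }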
Comment 12 Kai-Heng Feng 2024-04-30 02:54:52 UTC
> This suggests to me that this isn't related to the VMD functionality itself. 
> It could be an underlying hardware issue, e.g., a signal integrity issue,
> slot connector issue, etc., with this specific platform or NVMe device.

Agreed. Unfortunately, ODMs often toggle bits in _OSC to work around issues. I don't really blame them, because they have deadlines to meet.