The following tests (and any combination of them) don't help:
- Change NVMe LTR value to 0 or any other number
- Disable NVMe APST
- Disable PCIe ASPM
- Any version of kernel, including linux-next
- "Fix long standing AER Error Handling Issues" patch series [1]

[1] https://lore.kernel.org/linux-pci/cover.1635179600.git.naveennaidu479@gmail.com/
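For reference, the APST and ASPM tests above correspond roughly to the usual kernel parameters below (a sketch of the common knobs, not necessarily the exact command line used in this report; the LTR change was done separately by writing the device's LTR registers):

# Disable NVMe APST (a max latency of 0 disallows autonomous power state
# transitions) and disable PCIe ASPM entirely:
nvme_core.default_ps_max_latency_us=0 pcie_aspm=off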
Created attachment 299571 [details] dmesg with AER flood
Created attachment 299573 [details] lspci -vvnn
Created attachment 299599 [details]
Fix long standing AER Error Handling Issues - with debug statements - to figure out why it does not work

Hello Kai-Heng o/

Thank you very much for the detailed bug report, and also for testing my patch series "Fix long standing AER Error Handling Issues" [1].

[1] https://lore.kernel.org/linux-pci/cover.1635179600.git.naveennaidu479@gmail.com/

IIUC, even this patch series was not able to fix the AER message spew. I've added a few debug statements in the attached patch series, which might help me figure out why it did not work for you.

If you have some free time, could you please test the attached patch series and upload the dmesg output? It would be really helpful if you could test it in two scenarios:

1. Test the patch series as it is and capture the dmesg output.
2. First disable PCIe ASPM, then test the patch series and capture the dmesg output.

Thanks,
Naveen
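A minimal test workflow could look something like this (a sketch only; the mbox file name is a placeholder, and "pcie_aspm=off" is just one way to disable ASPM for scenario 2):

# Apply the attached series on top of the kernel tree (file name is hypothetical)
git am fix-aer-handling-with-debug.mbox

# Build and install the kernel as usual, then after boot capture the log:
dmesg > dmesg-scenario1.log

# Scenario 2: reboot with ASPM disabled by adding "pcie_aspm=off" to the
# kernel command line, then capture the log again:
dmesg > dmesg-scenario2.log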
Created attachment 299701 [details] dmesg with debug patch
The issue is Intel VMD specific. If VMD is turned off, the NVMe is under a regular PCIe root port, and the issue is not observed.
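One way to check which case a system is in (a sketch; class strings come from lspci and the addresses will differ per machine): with VMD enabled the NVMe shows up in a synthetic 10000: segment behind the VMD controller, and with VMD disabled it appears in segment 0000 under a normal root port.

# Show full domains; with VMD on the NVMe appears in a 10000: segment,
# with VMD off it moves to a 0000:xx:yy.z address under a regular root port.
lspci -D | grep -i "Non-Volatile memory"
lspci -D | grep -i "Volume Management Device"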
The Samsung NVMe in question is PCIe Gen4. Gen3 NVMes are not affected by this issue.
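To confirm which generation a given drive is actually linked at (a sketch; the address is a placeholder, and 16 GT/s corresponds to Gen4 while 8 GT/s is Gen3):

# LnkSta reports the negotiated link speed/width, LnkCap the maximum supported
sudo lspci -s 10000:01:00.0 -vvv | grep -E "LnkCap:|LnkSta:"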
(In reply to Kai-Heng Feng from comment #0)
> The following tests (and any combination of them) don't help:
> - Change NVMe LTR value to 0 or any other number
> - Disable NVMe APST
> - Disable PCIe ASPM
> - Any version of kernel, including linux-next
> - "Fix long standing AER Error Handling Issues" patch series [1]
>
> [1]
> https://lore.kernel.org/linux-pci/cover.1635179600.git.naveennaidu479@gmail.
> com/

Hello Kai-Heng,

Can you please help me with the exact command? I'm able to inject AERs into disks outside the VMD domain. However, the command-line tool gives me an error for disks in the VMD domain:

aer-inject -s 10000:01:00.0 correctable_vmd
Error: Can not parse PCI_ID: 10000:01:00.0

I tried to fix it by making this change in aer-inject:

diff --git a/aer.y b/aer.y
index a8ad063..52e1438 100644
--- a/aer.y
+++ b/aer.y
@@ -98,7 +98,7 @@ int parse_pci_id(const char *str, struct aer_error_inj *aerr)
 {
 	int cnt;

-	cnt = sscanf(str, "%04hx:%02hhx:%02hhx.%01hhx",
+	cnt = sscanf(str, "%05hx:%02hhx:%02hhx.%01hhx",
 		     &aerr->domain, &aerr->bus, &aerr->dev, &aerr->fn);

but then I get:

Error: Failed to write, No such device

Can you provide any suggestion about why I'm having errors with aer-inject for disks in a VMD domain? Can you also provide the model of the Gen4 Samsung SSD used to reproduce this issue?

Thanks,
Francisco
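One possible explanation for the second error (an assumption based only on the %04hx/%05hx conversions above, which store into a 16-bit field): the VMD segment 0x10000 needs 17 bits, so even with a field width of 5 the value gets truncated to 0x0000 and the injector ends up looking for the device in segment 0000 instead of 10000, which would produce "No such device". A quick demonstration of the truncation:

# 0x10000 masked to 16 bits collapses to 0, i.e. segment 0000 instead of 10000
printf '0x%04x\n' $(( 0x10000 & 0xffff ))    # prints 0x0000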
I found out why the issue doesn't happen in non-VMD mode: AER is disabled there. Will send a patch to resolve the issue.
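For anyone wanting to confirm this on their own system, a quick check (a sketch; substitute the root port's address, and note the exact field names vary by lspci version) is to look at the error-reporting enable bits and the AER capability in the lspci output:

# DevCtl shows CorrErr+/NonFatalErr+/FatalErr+ when error reporting is enabled,
# and the Advanced Error Reporting capability shows the AER mask/severity setup.
sudo lspci -s 0000:00:1c.0 -vvv | grep -E "DevCtl:|Advanced Error Reporting"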
(In reply to Kai-Heng Feng from comment #5)
> The issue is Intel VMD specific. If VMD is turned off, the NVMe is under
> regular PCIe root port, and the issue is not observed.

When VMD is turned off, the Samsung NVMe will be in domain 0000. If _OSC remains the same as when VMD is turned on, it will indicate that AER is not supported in that domain, so we wouldn't expect to see the issue.

Here's the negotiation from comment #1 when VMD is turned on:

[ 0.408990] ACPI: PCI Root Bridge [PC00] (domain 0000 [bus 00-e0])
[ 0.408995] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
[ 0.410076] acpi PNP0A08:00: _OSC: platform does not support [AER]
[ 0.412207] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug SHPCHotplug PME PCIeCapability LTR]

and AER is disabled on all the domain 0000 Root Ports in the comment #2 lspci output.

It would be interesting to boot with VMD turned off and with the "pcie_ports=native" parameter. Then we should ignore _OSC and turn on AER even if firmware doesn't grant ownership. If we see the Correctable Errors then, it suggests some issue between VMD and the Samsung NVMe.
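A sketch of how to run that experiment and compare the outcome (just the kernel parameter plus grep patterns; nothing here is specific to this machine):

# 1. Turn VMD off in firmware setup and add this to the kernel command line:
#      pcie_ports=native
# 2. After boot, confirm what happened with _OSC and watch for the flood:
dmesg | grep -i "_OSC"
dmesg | grep -i "AER: Corrected error"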
> It would be interesting to boot with VMD turned off and with the
> "pcie_ports=native" parameter. Then we should ignore _OSC and turn on AER
> even if firmware doesn't grant ownership.

I remember I tried that, and the AER error flood started to appear. If you want to see the dmesg from that, I'll need to dig the laptop out of the lab.
Thanks, you have a fantastic memory! No need to dig out the laptop for now.

This suggests to me that this isn't related to the VMD functionality itself. It could be an underlying hardware issue, e.g., a signal integrity issue, slot connector issue, etc., with this specific platform or NVMe device.

But I see several similar reports involving this and other devices that say "pcie_aspm=off" is a workaround, which makes me wonder if there's an ASPM configuration issue involved. (These are from a search for "144d:a80a" "AER: Corrected error" and for "corrected error" site:launchpad.net)

https://forums.unraid.net/topic/118286-nvme-drives-throwing-errors-filling-logs-instantly-how-to-resolve/ is an ASUS X99 Deluxe II with Kingston A2000 NVMe, "pcie_aspm=off" stopped the errors.

https://forum.proxmox.com/threads/pve-kernel-panics-on-reboots.144481/ is an ASUS Pro WS with Samsung PM9A1/PM9A3/980PRO NVMe, similar errors, no info about a workaround.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2015670 is a Dell Inc. Inspiron 3793, similar errors with a Realtek RTL810xE NIC and "pcie_aspm=off" is a workaround; another reporter sees similar errors with Samsung PM9A1/PM9A3/980PRO NVMe.

https://www.eevblog.com/forum/general-computing/linux-mint-21-02-clone-replace-1tb-nvme-with-a-2tb-nvme/ is a Lenovo Thinkpad where a Samsung 980 PRO 2 TB NVMe shows similar errors but a WD SN 570 does not, no workaround info.

https://linux-hardware.org/?probe=7c13a64c8a&log=dmesg is v6.4.6 on an ASUSTek VivoBook with VMD enabled and Samsung PM9A1/PM9A3/980PRO NVMe, similar errors for the NVMe behind VMD, a [10ec:8168] Realtek RTL8111/8168/8211/8411 NIC, and a Genesys GL9755 SDHCI controller.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2043665 is an AMI MPG Z690 FORCE WIFI with ASM1062 SATA, "pcie_aspm=off" workaround, but the errors mysteriously evaporated even without "pcie_aspm=off".

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1888523 is a Gigabyte Aorus Gaming 7 with a [10ec:5762] Realtek NVMe (XPG NVMe?), errors don't happen with an NVMe from a different manufacturer.
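For reporters hitting this, a quick way to see whether ASPM is in play (a sketch; the device address is a placeholder) is to check the kernel's ASPM policy and the link's ASPM fields before and after adding "pcie_aspm=off":

# Current kernel ASPM policy ([default], performance, powersave, ...)
cat /sys/module/pcie_aspm/parameters/policy

# Per-link ASPM support and state on the endpoint (address is an example)
sudo lspci -s 10000:01:00.0 -vvv | grep -E "LnkCap:|LnkCtl:|LnkSta:"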
> This suggests to me that this isn't related to the VMD functionality itself.
> It could be an underlying hardware issue, e.g., a signal integrity issue,
> slot connector issue, etc., with this specific platform or NVMe device.

Agree. Unfortunately, ODMs often toggle bits in _OSC to work around issues. I don't really blame them, because they have a deadline to meet.