Created attachment 301596 [details]
dmesg

Hardware:
CPU: Intel i7-12700K (Alder Lake)
GPU: AMD RX 6700 XT [1002:73df]
Motherboard: ASUS Prime Z690-A

Problem:
After upgrading to v6.0-rc1 the kernel is now reporting uncorrected PCI errors for my GPU.

I have bisected this issue to:
[8795e182b02dc87e343c79e73af6b8b7f9c5e635] PCI/portdrv: Don't disable AER reporting in get_port_device_capability()

Reverting that commit causes the errors to cease.

I have also tried Kai-Heng Feng's patch[1] which seems to resolve a similar problem, but it did not fix my issue.

[1] https://lore.kernel.org/linux-pci/20220706123244.18056-1-kai.heng.feng@canonical.com/

dmesg snippet:

pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
amdgpu 0000:03:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
amdgpu 0000:03:00.0: device [1002:73df] error status/mask=00100000/00000000
amdgpu 0000:03:00.0: [20] UnsupReq (First)
amdgpu 0000:03:00.0: AER: TLP Header: 40000001 0000000f 95e7f000 00000000
[...]
amdgpu 0000:03:00.0: [drm] fb0: amdgpudrmfb frame buffer device
[drm] PCI error: detected callback, state(1)!!
pci 0000:03:00.1: AER: can't recover (no error_detected callback)
pcieport 0000:02:00.0: AER: device recovery failed
pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:03:00.0
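For anyone reading the "error status/mask" values by hand: each set bit in the status word names one uncorrectable error type, and the dmesg output above already identifies bit 20 as UnsupReq. The following is a minimal Python sketch of that decoding. The bit-to-name table is partial and reflects my understanding of the short names the kernel prints, and decode_uncor_status is a hypothetical helper, not a kernel or pciutils interface.

```python
# Partial map from bit position in the AER Uncorrectable Error Status
# register to the short name printed in dmesg. Assumption: only a subset
# of the defined bits is listed here; unknown bits are reported as "bitN".
UNCOR_BITS = {
    4: "DLP",
    12: "TLP",
    13: "FCP",
    14: "CmpltTO",
    15: "CmpltAbrt",
    16: "UnxCmplt",
    17: "RxOF",
    18: "MalfTLP",
    19: "ECRC",
    20: "UnsupReq",
    21: "ACSViol",
}


def decode_uncor_status(status_hex, mask_hex="00000000"):
    """Return a list of (bit, name, masked) tuples for every set status bit."""
    status = int(status_hex, 16)
    mask = int(mask_hex, 16)
    decoded = []
    for bit in range(32):
        if status & (1 << bit):
            name = UNCOR_BITS.get(bit, "bit%d" % bit)
            decoded.append((bit, name, bool(mask & (1 << bit))))
    return decoded


# Values taken from the "error status/mask=00100000/00000000" line.
for bit, name, masked in decode_uncor_status("00100000", "00000000"):
    print("[%2d] %s%s" % (bit, name, " (masked)" if masked else ""))
```

With the values from this report, the only set bit is 20, and it is not masked, which is consistent with the "[20] UnsupReq" line.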
Created attachment 301597 [details] lspci -vvnn
Adding "pci=noaer" to the kernel command-line masks/hides these errors. Thanks to Bjorn Helgaas for the suggestion.
Created attachment 301606 [details]
dmesg output from debug patch

I applied Bjorn's debugging patch to v6.0-rc1 and there are multiple stack traces similar to:

amdgpu: ** writing 0x00000000 to ffffa7930187f000
CPU: 10 PID: 457 Comm: systemd-udevd Not tainted 6.0.0-rc1+ #3
Hardware name: iBUYPOWER INTEL/PRIME Z690-A, BIOS 1720 08/12/2022
Call Trace:
 <TASK>
 dump_stack_lvl+0x37/0x4a
 amdgpu_device_wreg.part.0.cold+0xb/0x17 [amdgpu]
 gmc_v10_0_hw_init+0xa8/0x180 [amdgpu]
 amdgpu_device_init.cold+0x1592/0x1d18 [amdgpu]
 ? acpi_pci_irq_enable+0x115/0x230
 ? pci_conf1_read+0x9f/0x100
 amdgpu_driver_load_kms+0x19/0x110 [amdgpu]
 amdgpu_pci_probe+0x136/0x350 [amdgpu]
 local_pci_probe+0x42/0x80
 pci_device_probe+0xb6/0x1f0
 ? sysfs_do_create_link_sd+0x6e/0xe0
 really_probe+0xdb/0x380
 ? pm_runtime_barrier+0x54/0x90
 __driver_probe_device+0x78/0x170
 driver_probe_device+0x1f/0x90
 __driver_attach+0xc2/0x1c0
 ? __device_attach_driver+0xe0/0xe0
 bus_for_each_dev+0x6a/0xa0
 bus_add_driver+0x1b2/0x200
 driver_register+0x8d/0xe0
 ? 0xffffffffc1019000
 do_one_initcall+0x45/0x200
 ? kmem_cache_alloc_trace+0x14f/0x2e0
 do_init_module+0x4a/0x1e0
 __do_sys_finit_module+0x93/0xf0
 do_syscall_64+0x3b/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f9a66e15f3d
Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b3 ce 0e 00 f7 d8 64 89 01 48
RSP: 002b:00007ffc881134c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
RAX: ffffffffffffffda RBX: 000055c810127fd0 RCX: 00007f9a66e15f3d
RDX: 0000000000000000 RSI: 00007f9a66f7c43c RDI: 000000000000001a
RBP: 00007f9a66f7c43c R08: 0000000000000000 R09: 000055c81012ddc0
R10: 000000000000001a R11: 0000000000000246 R12: 0000000000020000
R13: 000055c81012ba40 R14: 0000000000000000 R15: 000055c81012c9c0
 </TASK>
Created attachment 301607 [details] amdgpu debug patch 1 from Bjorn
Created attachment 301642 [details] dmesg from Lijo's patch I have applied a patch sent by amdgpu developer Lijo Lazar [1] to v6.0-rc2, and it does appear to resolve (or at least hide) the uncorrected PCI errors I have been seeing. [1] https://lore.kernel.org/linux-pci/30671d88-85a1-0cdf-03db-3a77d6ef96e9@amd.com/T/#m4b7397327e636ccc656500f25be6e0b7a6670737
Created attachment 301643 [details] Patch from Lijo Lazar
I can reproduce the same issue on 5.19.5 (with gentoo patches applied). I see that the offending commit has been applied to v5.19.5 upstream:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/pci/pcie/portdrv_core.c?h=v5.19.5&id=65e393fddc5379b2c41ca7e73cd4bb9572c4d90e

Hardware:
CPU: Ryzen Threadripper 1950X
MB: ASRock X399 Taichi
GPU: Radeon Vega 64 [1002:687f]
Created attachment 301700 [details]
Filtered dmesg (truncated as it overwrote itself)

For some reason, snd_hda_intel is mentioned. The problem seems to originate on device [1022:1471], which is [AMD] Vega 10 PCIe Bridge.

[ 19.786024] pcieport 0000:43:00.0: AER: device recovery failed
[ 19.795058] pcieport 0000:40:03.1: AER: Uncorrected (Non-Fatal) error received: 0000:43:00.0
[ 19.800911] pcieport 0000:43:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 19.802559] pcieport 0000:43:00.0: device [1022:1471] error status/mask=00100000/00000000
[ 19.804179] pcieport 0000:43:00.0: [20] UnsupReq (First)
[ 19.805664] pcieport 0000:43:00.0: AER: TLP Header: 34000000 44000010 00000000 84288428
[ 19.807124] [drm] PCI error: detected callback, state(1)!!
[ 19.808500] snd_hda_intel 0000:44:00.1: AER: can't recover (no error_detected callback)
Created attachment 301718 [details] possible fix Does this patch also fix the issue?
Unfortunately, the kernel does not boot with this patch applied on top of 5.19.5. The screen went blank just after the bootloader, and the USB devices (such as the keyboard) also turned off.
Created attachment 301719 [details] possible fix How about this one?
Does not apply cleanly because in 5.19.5 (and also 6.0-rc3) the condition contains an additional `&& !amdgpu_sriov_vf(adev)`:

https://elixir.bootlin.com/linux/v5.19.5/source/drivers/gpu/drm/amd/amdgpu/nv.c#L1039
https://elixir.bootlin.com/linux/v5.19.5/source/drivers/gpu/drm/amd/amdgpu/soc15.c#L1247

Is it ok to ignore that? Or should I add this to the gmc* hunks?
I figured this expression deals with virtualized GPUs, so I ignored it. With the patch applied the kernel does boot, but there are still AER errors.
(In reply to Gustaw Smolarczyk from comment #12)
> Does not apply cleanly because of 5.19.5 (and also 6.0-rc3) the condition
> contains additional `&& !amdgpu_sriov_vf(adev)`:
>
> https://elixir.bootlin.com/linux/v5.19.5/source/drivers/gpu/drm/amd/amdgpu/nv.c#L1039
>
> https://elixir.bootlin.com/linux/v5.19.5/source/drivers/gpu/drm/amd/amdgpu/soc15.c#L1247
>
> Is it ok to ignore that? Or should I add this to the gmc* hunks?

You can just drop all of this code in soc15.c and nv.c on that kernel:

    if (adev->nbio.funcs->remap_hdp_registers && !amdgpu_sriov_vf(adev))
        adev->nbio.funcs->remap_hdp_registers(adev);

If you left it there, that may explain the AER errors.
No, I applied your patch in all 4 files (i.e. I have moved the remap calls from soc15/nv to gmc*). In that state there were still AER errors.
Created attachment 301720 [details] possible fix This should be functionally equivalent to Lijo's original patch.
Isn't gmc_v10_0 present in Navi10 and later? My Vega10 is probably using gmc_v9_0 instead. Will test a similar patch for soc15 and gmc9 (will call just before gmc_v9_0_init_golden_registers()).
No change.

Please note that the AER errors have a different payload than what Tom reported. The PCI device is also different (instead of the GPU it is the Vega 10 PCIe bridge).

Maybe I should have opened a separate bug report...
(In reply to Gustaw Smolarczyk from comment #18)
> No change.
>
> Please note that the AER errors have different payload than what Tom
> reported. The PCI device is also different (instead of the GPU it is the
> Vega 10 PCIe bridge).
>
> Maybe I should have opened a separate bug report...

Ah, sorry, yeah, I was thinking of Tom's config. Can you apply Bjorn's debugging patch from comment 4 and attach the dmesg output from that?
Created attachment 301721 [details]
dmesg with the tracing patch (beginning)

Attached is the beginning of dmesg, with the tracing messages from Bjorn's patch present. This time the log starts at boot, as I have learned about the log_buf_len cmdline option.
Can you attach the output of lspci -vvnn as well?
Created attachment 301722 [details] lspci -vvnn on vega10 system Attached.
Created attachment 301723 [details]
disable HDP remapping on vega10

Let's try this to narrow down the problem on vega10. This removes the HDP remapping and just uses the original register. If this doesn't fix the issue, then it's likely something else. I'm not sure why the error is being reported against the vega10 bridge rather than the GPU itself, but that might just be the hardware design.
No difference, other than Bjorn's tracing messages being gone. Maybe the bad PCI accesses do not originate from amdgpu?

My limited knowledge of TLPs (from a little googling [1]) suggests this is a generic "Message Request without data" with routing "Local - Terminate at Receiver" from 44:00 (which is the Vega10 GPU itself), message code 0x10. The documentation suggests the last two dwords are "reserved", so they should probably be zero, but the last one is not. I have confirmed that this is a 4-dword TLP (from the leading "3"), so the last dword is significant. Maybe this is a hardware error?

TLP Header: 34000000 44000010 00000000 84288428

[1] https://www.cl.cam.ac.uk/~djm202/pdf/specifications/pcie/PCI_Express_Base_11.pdf
> TLP Header: 34000000 44000010 00000000 84288428

This is an LTR message from 44:00.0. The last dword contains the no-snoop/snoop latencies. LTR was added after PCIe r1.1, which is why that spec says message code 0x10 is reserved.

From your lspci:

  43:00.0 Downstream Port to [bus 44]
    DevCtl2: LTR-
  44:00.0 Vega 10 XL/XT
    DevCtl2: LTR+

44:00.0 has LTR enabled, so it will send LTR messages upstream periodically. 43:00.0 has LTR disabled, so when it receives those messages, it will log a UR (Unsupported Request) error. This is an illegal configuration.

Can you boot with "pci=earlydump" and attach the dmesg log? Either BIOS left this illegal config, the PCI core set it this way, or amdgpu did.
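To make the LTR reading of that header concrete, here is a small Python sketch that pulls out the requester ID, message code, and the two latency fields from the four logged dwords. It is only a sketch under stated assumptions: the 16-bit latency layout used here (bit 15 = requirement, bits 12:10 = scale in steps of 32x, bits 9:0 = value) and the placement of the no-snoop field in the upper half of the last dword are my reading of the LTR message format, and decode_ltr_message is a hypothetical helper name.

```python
LTR_MESSAGE_CODE = 0x10


def decode_latency(field):
    """Decode one 16-bit LTR latency field into (requirement, nanoseconds).

    Assumed layout: bit 15 = requirement, bits 12:10 = scale, bits 9:0 = value,
    with each scale step multiplying the unit by 32 (1 ns, 32 ns, 1024 ns, ...).
    """
    requirement = bool(field & 0x8000)
    scale = (field >> 10) & 0x7
    value = field & 0x3FF
    if scale > 5:
        raise ValueError("scale %d is not a permitted encoding" % scale)
    return requirement, value * (32 ** scale)


def decode_ltr_message(header_text):
    """Decode an AER 'TLP Header' string that carries an LTR message."""
    dw = [int(word, 16) for word in header_text.split()]
    if len(dw) != 4:
        raise ValueError("expected four header dwords")
    tlp_type = (dw[0] >> 24) & 0x1F
    requester = dw[1] >> 16
    bus, devfn = requester >> 8, requester & 0xFF
    return {
        "is_message": (tlp_type >> 3) == 0b10,   # Msg types are 10rrr
        "routing": tlp_type & 0x7,               # 0b100 = local, terminate at receiver
        "requester": "%02x:%02x.%x" % (bus, devfn >> 3, devfn & 0x7),
        "message_code": dw[1] & 0xFF,
        "no_snoop": decode_latency(dw[3] >> 16),
        "snoop": decode_latency(dw[3] & 0xFFFF),
    }


info = decode_ltr_message("34000000 44000010 00000000 84288428")
print(info)
```

Under those assumptions, both latency fields in this header (0x8428) decode to a requirement of 40 units of 32 ns, and the requester ID decodes to 44:00.0 with message code 0x10, matching the analysis above.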
Created attachment 301724 [details] debug patch for LTR config Add debug output for PCI core config of LTR. Gustaw, it'd be great if you could try this. You can combine it with a "pci=earlydump" boot.
Created attachment 301725 [details] dmesg with pci=earlydump (vanilla v5.19.5)
This was without the LTR debug output patch, will retry with it.
Created attachment 301726 [details] dmesg with pci=earlydump (v5.19.5 + ltr debug patch) I see LTR disabled and not changed.
Thanks. Unless I goofed in the debug patch, that means amdgpu is enabling LTR when it shouldn't. Things like nbio_v2_3_program_aspm() and nbio_v2_3_program_ltr() look possibly relevant. BIF_CFG_DEV0_EPF0_DEVICE_CNTL2__LTR_EN_MASK matches the value of PCI_EXP_DEVCTL2_LTR_EN (0x0400), but smnBIF_CFG_DEV0_EPF0_DEVICE_CNTL2 (0x1014008c) is not a PCI config offset, so maybe amdgpu is configuring this via a device-specific MMIO access path. This all looks highly irregular to me. It is absolutely not safe to configure ASPM and LTR for an endpoint without coordinating with other devices upstream.
Gustaw, would you mind opening a separate bugzilla for this LTR issue? This is a completely different problem than the original issue.

Tom's original report was an Unsupported Request error logged by the AMDGPU device when it received a 32-bit MMIO write to 0x95e7f000 performed by the driver:

  amdgpu 0000:03:00.0: device [1002:73df] error status/mask=00100000/00000000
  amdgpu 0000:03:00.0: [20] UnsupReq (First)
  amdgpu 0000:03:00.0: AER: TLP Header: 40000001 0000000f 95e7f000 00000000

The issue you're seeing is an Unsupported Request error logged by a Switch Downstream Port when it received an LTR message sent by 44:00.0 when the Switch has LTR disabled:

  pcieport 0000:43:00.0: device [1022:1471] error status/mask=00100000/00000000
  pcieport 0000:43:00.0: [20] UnsupReq (First)
  pcieport 0000:43:00.0: AER: TLP Header: 34000000 44000010 00000000 84288428
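The distinction between the two logged headers can also be checked mechanically from the first dword. The Python sketch below classifies a header by its Fmt and Type fields and, for memory requests, extracts the length, byte enables, and address. It is a rough illustration only: it handles just memory requests and messages, assumes the 3-bit Fmt encoding of current PCIe revisions, and classify_tlp is a hypothetical helper.

```python
def classify_tlp(header_text):
    """Classify an AER 'TLP Header' string as MRd/MWr/Msg/MsgD or 'other'."""
    dw = [int(word, 16) for word in header_text.split()]
    if len(dw) != 4:
        raise ValueError("expected four header dwords")

    fmt = (dw[0] >> 29) & 0x7        # header size and data presence
    tlp_type = (dw[0] >> 24) & 0x1F
    info = {"fmt": fmt, "type": tlp_type, "kind": "other",
            "requester_id": dw[1] >> 16}
    if fmt > 0b011:                  # TLP prefixes etc. are not handled here
        return info

    four_dw = bool(fmt & 0b001)      # 4-dword header (64-bit address or Msg)
    has_data = bool(fmt & 0b010)

    if tlp_type == 0b00000:          # memory request
        info["kind"] = "MWr" if has_data else "MRd"
        info["length_dw"] = (dw[0] & 0x3FF) or 1024   # 0 encodes 1024 dwords
        info["first_be"] = dw[1] & 0xF
        info["last_be"] = (dw[1] >> 4) & 0xF
        if four_dw:
            info["address"] = (dw[2] << 32) | (dw[3] & ~0x3)
        else:
            info["address"] = dw[2] & ~0x3
    elif (tlp_type >> 3) == 0b10:    # message request, type 10rrr
        info["kind"] = "MsgD" if has_data else "Msg"
        info["routing"] = tlp_type & 0x7
        info["message_code"] = dw[1] & 0xFF
    return info


for header in ("40000001 0000000f 95e7f000 00000000",
               "34000000 44000010 00000000 84288428"):
    print(header, "->", classify_tlp(header))
```

The first header classifies as a one-dword memory write with a 32-bit address of 0x95e7f000, and the second as a message without data carrying message code 0x10, in line with the two descriptions above.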
Does setting amdgpu.aspm=0 prevent the issue?
I have tested on v5.19.6 with amdgpu.aspm=0 set. There are no longer AER errors, and I have confirmed that there is no LTR+ in DevCtl2:

# lspci -vvnn | grep 'LTR+' | grep DevCtl2
#

I will open a separate bugzilla issue.
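The grep above only shows whether any device has LTR enabled. A slightly more targeted check, sketched below in Python, looks for the specific illegal pairing discussed in this thread: a device with LTR+ whose upstream bridge has LTR-. This is a rough sketch, not a tested tool. It assumes lspci -vv style text in which each device starts on an unindented line, bridges print a "Bus: primary=..., secondary=..." line, and the DevCtl2 line carries "LTR+" or "LTR-". The SAMPLE text is made up and abbreviated (including the primary bus number and device descriptions), modeled on the two DevCtl2 excerpts quoted earlier, and parse_lspci and ltr_mismatches are hypothetical helper names.

```python
import re

# Unindented device line, with or without a PCI domain prefix.
DEVICE_RE = re.compile(r"^((?:[0-9a-f]{4}:)?([0-9a-f]{2}):[0-9a-f]{2}\.[0-7])\s")
SECONDARY_RE = re.compile(r"secondary=([0-9a-f]{2})")


def parse_lspci(text):
    """Map device address -> {'bus', 'ltr', 'secondary'} from lspci -vv text."""
    devices = {}
    current = None
    for line in text.splitlines():
        match = DEVICE_RE.match(line)
        if match:
            current = {"bus": match.group(2), "ltr": None, "secondary": None}
            devices[match.group(1)] = current
            continue
        if current is None:
            continue
        if "DevCtl2:" in line:
            if "LTR+" in line:
                current["ltr"] = True
            elif "LTR-" in line:
                current["ltr"] = False
        if "Bus:" in line:
            secondary = SECONDARY_RE.search(line)
            if secondary:
                current["secondary"] = secondary.group(1)
    return devices


def ltr_mismatches(devices):
    """Return (device, upstream_bridge) pairs where LTR+ sits below LTR-."""
    by_secondary = {d["secondary"]: addr
                    for addr, d in devices.items() if d["secondary"]}
    mismatches = []
    for addr, dev in devices.items():
        parent = by_secondary.get(dev["bus"])
        if dev["ltr"] and parent and devices[parent]["ltr"] is False:
            mismatches.append((addr, parent))
    return mismatches


# Hypothetical, abbreviated sample; not copied from the attached lspci output.
SAMPLE = """\
43:00.0 PCI bridge [0604]: Example switch downstream port [1022:1471]
\tBus: primary=42, secondary=44, subordinate=44, sec-latency=0
\t\tDevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR-
44:00.0 VGA compatible controller [0300]: Example GPU [1002:687f]
\t\tDevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+
"""

print(ltr_mismatches(parse_lspci(SAMPLE)))
```

On a real system the text would come from saving the output of lspci -vv (run as root so the capability registers are readable) and passing it to parse_lspci. An empty result would correspond to the empty grep output shown above.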
See bug 216455
@tseewald@gmail.com Does attachment 301718 [details] fix the issue for you?
Patches to fix this: https://patchwork.freedesktop.org/series/108375/
This has been fixed in mainline by commit a8671493d2074950553da3cf07d1be43185ef6c6 (drm/amdgpu: make sure to init common IP before gmc). https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a8671493d2074950553da3cf07d1be43185ef6c6