Bug 216373 - Unsupported Request error for amdgpu access to invalid MMIO address
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: Intel Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
Reported: 2022-08-17 23:45 UTC by Tom Seewald
Modified: 2022-09-24 16:15 UTC

Kernel Version: v6.0-rc1
Regression: Yes


Attachments
dmesg (129.95 KB, text/plain), 2022-08-17 23:45 UTC, Tom Seewald
lspci -vvnn (63.47 KB, text/plain), 2022-08-17 23:45 UTC, Tom Seewald
dmesg output from debug patch (549.19 KB, text/plain), 2022-08-19 18:42 UTC, Tom Seewald
amdgpu debug patch 1 from Bjorn (5.40 KB, patch), 2022-08-19 18:42 UTC, Tom Seewald
dmesg from Lijo's patch (98.59 KB, text/plain), 2022-08-24 14:38 UTC, Tom Seewald
Patch from Lijo Lazar (2.85 KB, patch), 2022-08-24 14:39 UTC, Tom Seewald
Filtered dmesg (truncated as it overwrote itself) (260.84 KB, text/plain), 2022-08-30 20:19 UTC, Gustaw Smolarczyk
possible fix (2.57 KB, patch), 2022-09-01 18:51 UTC, Alex Deucher
possible fix (3.97 KB, patch), 2022-09-01 19:35 UTC, Alex Deucher
possible fix (1.78 KB, patch), 2022-09-01 20:12 UTC, Alex Deucher
dmesg with the tracing patch (beginning) (236.87 KB, text/plain), 2022-09-01 20:57 UTC, Gustaw Smolarczyk
lspci -vvnn on vega10 system (141.39 KB, text/plain), 2022-09-01 21:06 UTC, Gustaw Smolarczyk
disable HDP remapping on vega10 (2.14 KB, patch), 2022-09-01 21:20 UTC, Alex Deucher
debug patch for LTR config (1.72 KB, patch), 2022-09-01 22:29 UTC, Bjorn Helgaas
dmesg with pci=earlydump (vanilla v5.19.5) (162.92 KB, text/plain), 2022-09-01 22:34 UTC, Gustaw Smolarczyk
dmesg with pci=earlydump (v5.19.5 + ltr debug patch) (167.39 KB, text/plain), 2022-09-01 22:52 UTC, Gustaw Smolarczyk

Description Tom Seewald 2022-08-17 23:45:15 UTC
Created attachment 301596 [details]
dmesg

Hardware:
CPU: Intel i7-12700K (Alder Lake)
GPU: AMD RX 6700 XT [1002:73df]
Motherboard: ASUS Prime Z690-A

Problem:
After upgrading to v6.0-rc1 the kernel is now reporting uncorrected PCI errors for my GPU.

I have bisected this issue to: [8795e182b02dc87e343c79e73af6b8b7f9c5e635] PCI/portdrv: Don't disable AER reporting in get_port_device_capability()
Reverting that commit causes the errors to cease.

I have also tried Kai-Heng Feng's patch[1] which seems to resolve a similar problem, but it did not fix my issue.

[1] https://lore.kernel.org/linux-pci/20220706123244.18056-1-kai.heng.feng@canonical.com/

dmesg snippet:

pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
amdgpu 0000:03:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
amdgpu 0000:03:00.0:   device [1002:73df] error status/mask=00100000/00000000
amdgpu 0000:03:00.0:    [20] UnsupReq               (First)
amdgpu 0000:03:00.0: AER:   TLP Header: 40000001 0000000f 95e7f000 00000000
[...]
amdgpu 0000:03:00.0: [drm] fb0: amdgpudrmfb frame buffer device
[drm] PCI error: detected callback, state(1)!!
pci 0000:03:00.1: AER: can't recover (no error_detected callback)
pcieport 0000:02:00.0: AER: device recovery failed
pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:03:00.0
Comment 1 Tom Seewald 2022-08-17 23:45:45 UTC
Created attachment 301597 [details]
lspci -vvnn
Comment 2 Tom Seewald 2022-08-18 23:48:40 UTC
Adding "pci=noaer" to the kernel command-line masks/hides these errors. Thanks to Bjorn Helgaas for the suggestion.
Comment 3 Tom Seewald 2022-08-19 18:42:13 UTC
Created attachment 301606 [details]
dmesg output from debug patch

I applied Bjorn's debugging patch to v6.0-rc1 and there are multiple stack traces similar to:

amdgpu: ** writing 0x00000000 to ffffa7930187f000
CPU: 10 PID: 457 Comm: systemd-udevd Not tainted 6.0.0-rc1+ #3
Hardware name: iBUYPOWER INTEL/PRIME Z690-A, BIOS 1720 08/12/2022
Call Trace:
<TASK>
dump_stack_lvl+0x37/0x4a
amdgpu_device_wreg.part.0.cold+0xb/0x17 [amdgpu]
gmc_v10_0_hw_init+0xa8/0x180 [amdgpu]
amdgpu_device_init.cold+0x1592/0x1d18 [amdgpu]
? acpi_pci_irq_enable+0x115/0x230
? pci_conf1_read+0x9f/0x100
amdgpu_driver_load_kms+0x19/0x110 [amdgpu]
amdgpu_pci_probe+0x136/0x350 [amdgpu]
local_pci_probe+0x42/0x80
pci_device_probe+0xb6/0x1f0
? sysfs_do_create_link_sd+0x6e/0xe0
really_probe+0xdb/0x380
? pm_runtime_barrier+0x54/0x90
__driver_probe_device+0x78/0x170
driver_probe_device+0x1f/0x90
__driver_attach+0xc2/0x1c0
? __device_attach_driver+0xe0/0xe0
bus_for_each_dev+0x6a/0xa0
bus_add_driver+0x1b2/0x200
driver_register+0x8d/0xe0
? 0xffffffffc1019000
do_one_initcall+0x45/0x200
? kmem_cache_alloc_trace+0x14f/0x2e0
do_init_module+0x4a/0x1e0
__do_sys_finit_module+0x93/0xf0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f9a66e15f3d
Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b3 ce 0e 00 f7 d8 64 89 01 48
RSP: 002b:00007ffc881134c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
RAX: ffffffffffffffda RBX: 000055c810127fd0 RCX: 00007f9a66e15f3d
RDX: 0000000000000000 RSI: 00007f9a66f7c43c RDI: 000000000000001a
RBP: 00007f9a66f7c43c R08: 0000000000000000 R09: 000055c81012ddc0
R10: 000000000000001a R11: 0000000000000246 R12: 0000000000020000
R13: 000055c81012ba40 R14: 0000000000000000 R15: 000055c81012c9c0
</TASK>
Comment 4 Tom Seewald 2022-08-19 18:42:58 UTC
Created attachment 301607 [details]
amdgpu debug patch 1 from Bjorn
Comment 5 Tom Seewald 2022-08-24 14:38:30 UTC
Created attachment 301642 [details]
dmesg from Lijo's patch

I have applied a patch sent by amdgpu developer Lijo Lazar [1] to v6.0-rc2, and it does appear to resolve (or at least hide) the uncorrected PCI errors I have been seeing.

[1] https://lore.kernel.org/linux-pci/30671d88-85a1-0cdf-03db-3a77d6ef96e9@amd.com/T/#m4b7397327e636ccc656500f25be6e0b7a6670737
Comment 6 Tom Seewald 2022-08-24 14:39:03 UTC
Created attachment 301643 [details]
Patch from Lijo Lazar
Comment 7 Gustaw Smolarczyk 2022-08-30 19:57:38 UTC
I can reproduce the same issue on 5.19.5 (with Gentoo patches applied). The offending commit has also been applied to upstream v5.19.5:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/pci/pcie/portdrv_core.c?h=v5.19.5&id=65e393fddc5379b2c41ca7e73cd4bb9572c4d90e

Hardware:
CPU: Ryzen Threadripper 1950X
MB: Asrock X399 Taichi
GPU: Radeon Vega 64 [1002:687f]
Comment 8 Gustaw Smolarczyk 2022-08-30 20:19:55 UTC
Created attachment 301700 [details]
Filtered dmesg (truncated as it overwrote itself)

For some reason, snd_hda_intel is mentioned. The problem seems to originate on device [1022:1471], which is [AMD] Vega 10 PCIe Bridge.

[   19.786024] pcieport 0000:43:00.0: AER: device recovery failed
[   19.795058] pcieport 0000:40:03.1: AER: Uncorrected (Non-Fatal) error received: 0000:43:00.0
[   19.800911] pcieport 0000:43:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[   19.802559] pcieport 0000:43:00.0:   device [1022:1471] error status/mask=00100000/00000000
[   19.804179] pcieport 0000:43:00.0:    [20] UnsupReq               (First)
[   19.805664] pcieport 0000:43:00.0: AER:   TLP Header: 34000000 44000010 00000000 84288428
[   19.807124] [drm] PCI error: detected callback, state(1)!!
[   19.808500] snd_hda_intel 0000:44:00.1: AER: can't recover (no error_detected callback)
Comment 9 Alex Deucher 2022-09-01 18:51:49 UTC
Created attachment 301718 [details]
possible fix

Does this patch also fix the issue?
Comment 10 Gustaw Smolarczyk 2022-09-01 19:16:17 UTC
Unfortunately, the kernel does not boot with this patch applied on top of 5.19.5. The screen went blank just after the bootloader, and the USB devices (such as the keyboard) also turned off.
Comment 11 Alex Deucher 2022-09-01 19:35:05 UTC
Created attachment 301719 [details]
possible fix

How about this one?
Comment 12 Gustaw Smolarczyk 2022-09-01 19:44:22 UTC
It does not apply cleanly because on 5.19.5 (and also 6.0-rc3) the condition contains an additional `&& !amdgpu_sriov_vf(adev)`:

https://elixir.bootlin.com/linux/v5.19.5/source/drivers/gpu/drm/amd/amdgpu/nv.c#L1039

https://elixir.bootlin.com/linux/v5.19.5/source/drivers/gpu/drm/amd/amdgpu/soc15.c#L1247

Is it ok to ignore that? Or should I add this to the gmc* hunks?
Comment 13 Gustaw Smolarczyk 2022-09-01 19:54:03 UTC
I figured this expression deals with virtualized GPUs, so I ignored it.

This patch does boot but there are still AER errors.
Comment 14 Alex Deucher 2022-09-01 20:02:26 UTC
(In reply to Gustaw Smolarczyk from comment #12)
> Does not apply cleanly because of 5.19.5 (and also 6.0-rc3) the condition
> contains additional `&& !amdgpu_sriov_vf(adev)`:
> 
> https://elixir.bootlin.com/linux/v5.19.5/source/drivers/gpu/drm/amd/amdgpu/nv.c#L1039
> 
> https://elixir.bootlin.com/linux/v5.19.5/source/drivers/gpu/drm/amd/amdgpu/soc15.c#L1247
> 
> Is it ok to ignore that? Or should I add this to the gmc* hunks?

You can just drop all of this code in soc15.c and nv.c on that kernel:

	if (adev->nbio.funcs->remap_hdp_registers && !amdgpu_sriov_vf(adev))
		adev->nbio.funcs->remap_hdp_registers(adev);

If you left it there, that may explain the AER errors.
Comment 15 Gustaw Smolarczyk 2022-09-01 20:04:52 UTC
No, I applied your patch in all 4 files (i.e. I have moved the remap calls from soc15/nv to gmc*). In that state there were still AER errors.
Comment 16 Alex Deucher 2022-09-01 20:12:31 UTC
Created attachment 301720 [details]
possible fix

This should be functionally equivalent to Lijo's original patch.
Comment 17 Gustaw Smolarczyk 2022-09-01 20:24:29 UTC
Isn't gmc_v10_0 present in Navi10 and later? My Vega10 is probably using gmc_v9_0 instead.

I will test a similar patch for soc15 and gmc9 (calling the remap just before gmc_v9_0_init_golden_registers()).
Comment 18 Gustaw Smolarczyk 2022-09-01 20:30:00 UTC
No change.

Please note that the AER errors have a different payload than what Tom reported. The PCI device is also different (instead of the GPU, it is the Vega 10 PCIe bridge).

Maybe I should have opened a separate bug report...
Comment 19 Alex Deucher 2022-09-01 20:36:57 UTC
(In reply to Gustaw Smolarczyk from comment #18)
> No change.
> 
> Please note that the AER errors have different payload than what Tom
> reported. The PCI device is also different (instead of the GPU it is the
> Vega 10 PCIe bridge).
> 
> Maybe I should have opened a separate bug report...

Ah, sorry, yeah, I was thinking of Tom's config.  Can you apply Bjorn's debugging patch from comment 4 and attach the dmesg output from that?
Comment 20 Gustaw Smolarczyk 2022-09-01 20:57:00 UTC
Created attachment 301721 [details]
dmesg with the tracing patch (beginning)

Attached is the beginning of dmesg, with the tracing messages from Bjorn's patch present.

This time the log starts from boot, as I have learned about the log_buf_len cmdline option.
Comment 21 Alex Deucher 2022-09-01 21:04:07 UTC
Can you attach the output of lspci -vvnn as well?
Comment 22 Gustaw Smolarczyk 2022-09-01 21:06:35 UTC
Created attachment 301722 [details]
lspci -vvnn on vega10 system

Attached.
Comment 23 Alex Deucher 2022-09-01 21:20:24 UTC
Created attachment 301723 [details]
disable HDP remapping on vega10

Let's try this to narrow down the problem on vega10.  This removes the HDP remapping and just uses the original register.  If this doesn't fix the issue, then it's likely something else.  I'm not sure why the error is reported by the vega10 bridge rather than the GPU itself, but that just might be the hardware design.
Comment 24 Gustaw Smolarczyk 2022-09-01 22:01:58 UTC
No difference, other than Bjorn's tracing messages being gone.

Maybe the wrong PCI accesses do not originate from amdgpu?

My limited knowledge regarding TLP (from googling for a little [1]) suggests this is a generic "Message Request without data" with routing "Local - Terminate at Receiver" from 44:00 (which is Vega10 GPU itself), message code 0x10. The documentation suggests the last two dwords are "reserved" so probably should be zero - but the last one is not. I have confirmed that this is a 4-dword TLP (from the first "3"), so the last dword is significant.

Maybe this is a hardware error?

TLP Header: 34000000 44000010 00000000 84288428

[1] https://www.cl.cam.ac.uk/~djm202/pdf/specifications/pcie/PCI_Express_Base_11.pdf
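
The manual decoding above can be reproduced with a few shifts and masks. The following is a minimal sketch, not part of the bug report: the function name `decode_message_tlp` is hypothetical, and it assumes the header layout of current PCIe specifications (a 3-bit Fmt field in bits 31:29 and a 5-bit Type field in bits 28:24 of the first dword, Requester ID in the upper half of the second dword, message code in its lowest byte).

```python
# Routing sub-field (low 3 bits of Type) for Message requests.
ROUTING = {
    0: "to Root Complex",
    1: "by address",
    2: "by ID",
    3: "broadcast from Root Complex",
    4: "local, terminate at receiver",
    5: "gathered and routed to Root Complex",
}


def decode_message_tlp(header: str) -> dict:
    """Decode the dwords printed after 'TLP Header:' for a Message request."""
    dw = [int(word, 16) for word in header.split()]
    if len(dw) != 4:
        raise ValueError("expected four 32-bit words")

    fmt = (dw[0] >> 29) & 0x7        # bit 0: 4DW header, bit 1: has data
    tlp_type = (dw[0] >> 24) & 0x1F  # 0b10rrr means Message, rrr = routing
    if (tlp_type >> 3) != 0b10:
        raise ValueError("not a Message request")

    requester_id = (dw[1] >> 16) & 0xFFFF
    bus = requester_id >> 8
    dev = (requester_id >> 3) & 0x1F
    fn = requester_id & 0x7

    return {
        "header_dwords": 4 if fmt & 0x1 else 3,
        "has_data": bool(fmt & 0x2),
        "routing": ROUTING.get(tlp_type & 0x7, "reserved"),
        "requester": f"{bus:02x}:{dev:02x}.{fn}",
        "message_code": dw[1] & 0xFF,
    }


if __name__ == "__main__":
    print(decode_message_tlp("34000000 44000010 00000000 84288428"))
```

For the header quoted in this comment, the sketch reports a 4-dword header without data, local routing, requester 44:00.0 and message code 0x10, which matches the reading given above.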
Comment 25 Bjorn Helgaas 2022-09-01 22:15:57 UTC
> TLP Header: 34000000 44000010 00000000 84288428

This is an LTR message from 44:00.0.  The last dword contains the no-snoop/snoop latencies.  LTR was added after PCIe r1.1, which is why that spec says message code 0x10 is reserved.

From your lspci,

  43:00.0 Downstream Port to [bus 44]
    DevCtl2: LTR-

  44:00.0 Vega 10 XL/XT
    DevCtl2: LTR+

44:00.0 has LTR enabled, so it will send LTR messages upstream periodically.  43:00.0 has LTR disabled, so when it receives those messages, it will log a UR error.

This is an illegal configuration.  Can you boot with "pci=earlydump" and attach the dmesg log?  Either BIOS left this illegal config, the PCI core set it this way, or amdgpu did.
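
The check described here (a device with LTR+ sitting below a port with LTR-) can be automated over `lspci -vv` output. This is only a rough sketch: the function name `find_ltr_mismatches` is hypothetical, the sample text is a hand-written abbreviation rather than the attached log, and the parser assumes bus addresses are printed without a domain prefix and that bridges carry a `Bus: primary=..., secondary=...` line.

```python
import re

DEVICE_RE = re.compile(r"^([0-9a-f]{2}):[0-9a-f]{2}\.[0-7]\s")
SECONDARY_RE = re.compile(r"Bus:.*secondary=([0-9a-f]{2})")
LTR_RE = re.compile(r"DevCtl2:.*\bLTR([+-])")


def find_ltr_mismatches(lspci_text: str) -> list:
    """Return (port, device) pairs where the device has LTR+ but the
    port directly upstream of it has LTR-."""
    devices = {}
    current = None
    for line in lspci_text.splitlines():
        match = DEVICE_RE.match(line)
        if match:
            current = {"bus": match.group(1), "secondary": None, "ltr": None}
            devices[line.split()[0]] = current
            continue
        if current is None:
            continue
        match = SECONDARY_RE.search(line)
        if match:
            current["secondary"] = match.group(1)
        match = LTR_RE.search(line)
        if match:
            current["ltr"] = match.group(1) == "+"

    mismatches = []
    for bdf, dev in devices.items():
        if dev["ltr"] is not True:
            continue
        for port_bdf, port in devices.items():
            if port["secondary"] == dev["bus"] and port["ltr"] is False:
                mismatches.append((port_bdf, bdf))
    return mismatches


# Hand-written sample in the shape of lspci -vv output (not the real log).
SAMPLE = (
    "43:00.0 PCI bridge [0604]: example downstream port\n"
    "\tBus: primary=43, secondary=44, subordinate=44, sec-latency=0\n"
    "\t\tDevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR-\n"
    "44:00.0 VGA compatible controller [0300]: example GPU\n"
    "\t\tDevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+\n"
)

if __name__ == "__main__":
    print(find_ltr_mismatches(SAMPLE))
```

On a real system the input would come from `lspci -vv` run as root, since the DevCtl2 line is usually hidden from unprivileged users.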
Comment 26 Bjorn Helgaas 2022-09-01 22:29:46 UTC
Created attachment 301724 [details]
debug patch for LTR config

Add debug output for PCI core config of LTR.  Gustaw, it'd be great if you could try this.  You can combine it with a "pci=earlydump" boot.
Comment 27 Gustaw Smolarczyk 2022-09-01 22:34:44 UTC
Created attachment 301725 [details]
dmesg with pci=earlydump (vanilla v5.19.5)
Comment 28 Gustaw Smolarczyk 2022-09-01 22:36:05 UTC
This was without the LTR debug output patch, will retry with it.
Comment 29 Gustaw Smolarczyk 2022-09-01 22:52:18 UTC
Created attachment 301726 [details]
dmesg with pci=earlydump (v5.19.5 + ltr debug patch)

I see LTR disabled and not changed.
Comment 30 Bjorn Helgaas 2022-09-01 23:48:08 UTC
Thanks.  Unless I goofed in the debug patch, that means amdgpu is enabling LTR when it shouldn't.

Things like nbio_v2_3_program_aspm() and nbio_v2_3_program_ltr() look possibly relevant.  BIF_CFG_DEV0_EPF0_DEVICE_CNTL2__LTR_EN_MASK matches the value of PCI_EXP_DEVCTL2_LTR_EN (0x0400), but smnBIF_CFG_DEV0_EPF0_DEVICE_CNTL2 (0x1014008c) is not a PCI config offset, so maybe amdgpu is configuring this via a device-specific MMIO access path.  This all looks highly irregular to me.  It is absolutely not safe to configure ASPM and LTR for an endpoint without coordinating with other devices upstream.
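
Whichever path sets the bit, the resulting state is visible in standard config space, which is where lspci reads DevCtl2 from. As an illustration, the sketch below walks the capability list in a raw config-space dump and tests PCI_EXP_DEVCTL2_LTR_EN. The constant names and values follow the kernel's pci_regs.h; the function name `ltr_enabled` is hypothetical, and the sysfs path in the usage note is an assumption about the reporter's system.

```python
import struct

PCI_STATUS = 0x06
PCI_STATUS_CAP_LIST = 0x10
PCI_CAPABILITY_LIST = 0x34
PCI_CAP_ID_EXP = 0x10
PCI_EXP_DEVCTL2 = 0x28            # offset inside the PCIe capability
PCI_EXP_DEVCTL2_LTR_EN = 0x0400


def ltr_enabled(config: bytes):
    """Return True/False for the LTR enable bit in Device Control 2,
    or None if no PCI Express capability can be found in the dump."""
    if len(config) < 0x40:
        return None
    (status,) = struct.unpack_from("<H", config, PCI_STATUS)
    if not status & PCI_STATUS_CAP_LIST:
        return None

    pos = config[PCI_CAPABILITY_LIST] & 0xFC
    for _ in range(48):  # bound the walk in case the list loops
        if pos < 0x40 or pos + 2 > len(config):
            return None
        cap_id, next_pos = config[pos], config[pos + 1]
        if cap_id == PCI_CAP_ID_EXP:
            if pos + PCI_EXP_DEVCTL2 + 2 > len(config):
                return None
            (devctl2,) = struct.unpack_from("<H", config, pos + PCI_EXP_DEVCTL2)
            return bool(devctl2 & PCI_EXP_DEVCTL2_LTR_EN)
        pos = next_pos & 0xFC
    return None
```

A possible use, run as root so that more than the first 64 bytes are readable, would be `ltr_enabled(open("/sys/bus/pci/devices/0000:44:00.0/config", "rb").read())`.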
Comment 31 Bjorn Helgaas 2022-09-06 14:29:22 UTC
Gustaw, would you mind opening a separate bugzilla for this LTR issue?  This is a completely different problem than the original issue.

Tom's original report was an Unsupported Request error logged by the AMDGPU device when it received a 32-bit MMIO write to 0x95e7f000 performed by the driver:

  amdgpu 0000:03:00.0:   device [1002:73df] error status/mask=00100000/00000000
  amdgpu 0000:03:00.0:    [20] UnsupReq               (First)
  amdgpu 0000:03:00.0: AER:   TLP Header: 40000001 0000000f 95e7f000 00000000

The issue you're seeing is an Unsupported Request error logged by a Switch Downstream Port when it received an LTR message sent by 44:00.0 when the Switch has LTR disabled:

  pcieport 0000:43:00.0:   device [1022:1471] error status/mask=00100000/00000000
  pcieport 0000:43:00.0:    [20] UnsupReq               (First)
  pcieport 0000:43:00.0: AER:   TLP Header: 34000000 44000010 00000000 84288428
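
For comparison, the header from the original report decodes as a memory request rather than a message. A minimal sketch of that decoding follows; the function name `decode_memory_request` is hypothetical and the field positions are assumed from the standard PCIe memory request header layout.

```python
def decode_memory_request(header: str) -> dict:
    """Decode the dwords printed after 'TLP Header:' for a memory read/write."""
    dw = [int(word, 16) for word in header.split()]
    if len(dw) != 4:
        raise ValueError("expected four 32-bit words")

    fmt = (dw[0] >> 29) & 0x7        # bit 0: 4DW (64-bit address), bit 1: has data
    tlp_type = (dw[0] >> 24) & 0x1F  # 0b00000 is a memory request
    if tlp_type != 0:
        raise ValueError("not a memory request")

    if fmt & 0x1:
        # 64-bit addressing: upper half in dword 2, lower half in dword 3.
        address = (dw[2] << 32) | (dw[3] & 0xFFFFFFFC)
    else:
        # 32-bit addressing: address in dword 2, fourth logged dword unused.
        address = dw[2] & 0xFFFFFFFC

    return {
        "is_write": bool(fmt & 0x2),
        "length_dw": dw[0] & 0x3FF,   # note: 0 encodes 1024 dwords
        "first_be": dw[1] & 0xF,
        "last_be": (dw[1] >> 4) & 0xF,
        "address": address,
    }


if __name__ == "__main__":
    info = decode_memory_request("40000001 0000000f 95e7f000 00000000")
    print(info["is_write"], hex(info["address"]))
```

Applied to Tom's header, this yields a one-dword write with all four byte enables set to address 0x95e7f000, consistent with the 32-bit MMIO write described above.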
Comment 32 Alex Deucher 2022-09-06 15:31:20 UTC
Does setting amdgpu.aspm=0 prevent the issue?
Comment 33 Gustaw Smolarczyk 2022-09-06 16:35:45 UTC
I have tested on v5.19.6 with amdgpu.aspm=0 set. There are no longer AER errors, and I have confirmed that there is no LTR+ in DevCtl2:

# lspci -vvnn | grep 'LTR+' | grep DevCtl2
#

I will open a separate bugzilla issue.
Comment 34 Gustaw Smolarczyk 2022-09-06 16:44:34 UTC
See bug 216455
Comment 35 Alex Deucher 2022-09-06 22:04:03 UTC
@tseewald@gmail.com

Does attachment 301718 [details] fix the issue for you?
Comment 36 Alex Deucher 2022-09-13 14:57:35 UTC
Patches to fix this:
https://patchwork.freedesktop.org/series/108375/
Comment 37 Tom Seewald 2022-09-24 16:15:48 UTC
This has been fixed in mainline by commit a8671493d2074950553da3cf07d1be43185ef6c6 (drm/amdgpu: make sure to init common IP before gmc).

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a8671493d2074950553da3cf07d1be43185ef6c6
