Bug 217797 - [amdgpu/mm?] HSA_AMD_SVM=y causes/triggers PAT issues
Status: RESOLVED ANSWERED
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: AMD Linux
Importance: P3 normal
Assignee: drivers_video-dri
Reported: 2023-08-15 17:11 UTC by Žilvinas Žaltiena
Modified: 2023-08-16 14:18 UTC

Regression: No

Description Žilvinas Žaltiena 2023-08-15 17:11:35 UTC
I have a hunch this might be an MM/HMM issue, but I am reporting it as an amdgpu bug because the problematic behavior is triggered by loading amdgpu compiled with HSA_AMD_SVM=y. I reproduced the behavior on kernels 6.4 and 6.5-rc6; however, I have seen people say it started with 5.14.

My system is on the X99 platform with an Intel Broadwell-E CPU. It has multiple GPUs: an AMD W6600 (which drives the display) and an NVIDIA RTX 3080 (used for compute and vfio). The IOMMU is on and not in PT mode. HSA_AMD_SVM=y somehow messes up the PAT entries for the NVIDIA card. An example follows.

The NVIDIA card has two relevant BARs:
Region 1: Memory at 380000000000 (64-bit, prefetchable) [size=16G]
Region 3: Memory at 380400000000 (64-bit, prefetchable) [size=32M]

The example assumes "cat /sys/kernel/debug/x86/pat_memtype_list | grep 380" is used to check PAT entries.
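A plain "grep 380" also matches unrelated addresses that happen to contain "380". As a rough sketch, the same check can be done by address range instead. The BAR ranges below are taken from the two Region lines above; the parser only assumes that each entry line contains a start and an end address written as 0x-prefixed hex and that the range is half-open, which may not hold on every kernel version. The names BARS, parse_entry, entries_in_bars and scan are made up for illustration.

```python
import re

GIB = 1 << 30
MIB = 1 << 20

# BAR base and size as reported for the NVIDIA card in this report.
BARS = {
    "Region 1": (0x380000000000, 16 * GIB),
    "Region 3": (0x380400000000, 32 * MIB),
}

# Assumption: every entry line carries two 0x-prefixed hex addresses (start, end).
HEX = re.compile(r"0x([0-9a-fA-F]+)")


def parse_entry(line):
    """Return (start, end, text) for a line with two hex addresses, else None."""
    found = HEX.findall(line)
    if len(found) < 2:
        return None
    return int(found[0], 16), int(found[1], 16), line.strip()


def entries_in_bars(lines, bars=BARS):
    """Return (bar_name, line) pairs for entries overlapping one of the BARs."""
    hits = []
    for line in lines:
        entry = parse_entry(line)
        if entry is None:
            continue  # header or unrecognised line
        start, end, text = entry
        for name, (base, size) in bars.items():
            # Assumption: ranges are half-open, [start, end).
            if start < base + size and end > base:
                hits.append((name, text))
    return hits


def scan(path="/sys/kernel/debug/x86/pat_memtype_list"):
    """Read the live list; needs root and a mounted debugfs."""
    with open(path) as f:
        return entries_in_bars(f)
```

Calling scan() as root should list only the entries that fall inside the two BARs, each tagged with the region it overlaps.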

1) Fresh system start, amdgpu is loaded (blacklisting it prevents the issue), and the NVIDIA card is deliberately not bound to any driver on boot. No PAT entries for it are visible - good.
2) The card is bound to vfio-pci and passed to a VM; multiple PAT entries are visible - good.
3) The VM is stopped and the card is unbound from vfio-pci. This is where the difference is seen. If HSA_AMD_SVM=n, then no PAT entries are visible - good; however, with HSA_AMD_SVM=y two PAT entries remain - BAD. In addition, the number of these entries depends on how many times the card has been passed through. It looks as if some cleanup routine fails.
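The comparison between step 1 and step 3 can be made mechanical by saving the grep output at both points and diffing the two snapshots. The sketch below is only an illustration: leftover_entries is a hypothetical helper, and it assumes each snapshot is a list of text lines as printed by the pat_memtype_list command. A multiset difference is used so that entries accumulating over repeated pass-through cycles are counted rather than collapsed.

```python
from collections import Counter


def leftover_entries(before, after):
    """Return a Counter of lines that occur more often in `after` than in `before`."""
    def normalize(lines):
        # Ignore blank lines and surrounding whitespace.
        return Counter(line.strip() for line in lines if line.strip())

    # Counter subtraction keeps only positive counts, i.e. the extra entries.
    return normalize(after) - normalize(before)
```

An empty result after step 3 would correspond to the "good" case; a non-empty one lists the stale entries and how many copies of each remain.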

The above example is designed to avoid requiring the out-of-tree NVIDIA drivers; however, the same thing can be reproduced (probably with less hassle) by just binding the card to the nvidia driver, running a compute/render task, unbinding it and then checking for leftover PAT entries. This also shows it is not a vfio-pci-only issue.

It looks benign at first, but in my real use case the card has to be switched from the nvidia driver to vfio-pci and back without restarting the system. This PAT issue breaks that, because leftover PAT entries from one driver are not compatible with the other: vfio-pci needs UC-, otherwise the VM throws lots of ioremap/memtype errors, and the nvidia driver prefers WC entries for performance reasons.
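To illustrate why a stale entry gets in the way of the next driver, here is a deliberately simplified model: a leftover entry is treated as a conflict when it overlaps the requested range and has a different memory type. This is an assumption made for the sketch, not the kernel's actual rule set, which is more nuanced. The function name and the type strings ("uncached-minus", "write-combining") are illustrative as well.

```python
def conflicting(leftovers, start, end, wanted):
    """Return leftover (start, end, type) tuples that overlap [start, end)
    and carry a memory type other than `wanted` (simplified model)."""
    return [
        (s, e, t)
        for (s, e, t) in leftovers
        if s < end and e > start and t != wanted
    ]


# Example: a WC entry left over the 16G BAR after unbinding the nvidia driver.
stale = [(0x380000000000, 0x380400000000, "write-combining")]

# A later UC- request for the same BAR would collide with the stale WC entry.
clash = conflicting(stale, 0x380000000000, 0x380400000000, "uncached-minus")
```

In this model, clash contains the stale WC entry, whereas a request for "write-combining" over the same range would return an empty list.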

If amdgpu is just a trigger and the issue is in the general MM part of the kernel, please CC the relevant people.
Comment 1 Artem S. Tashkinov 2023-08-16 14:18:24 UTC
Please report here instead:

https://gitlab.freedesktop.org/drm/amd/-/issues
