Bug 217797

Summary: [amdgpu/mm?] HSA_AMD_SVM=y causes/triggers PAT issues
Product: Drivers
Reporter: Žilvinas Žaltiena (zaltys)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED ANSWERED    
Severity: normal    
Priority: P3    
Hardware: AMD   
OS: Linux   
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:

Description Žilvinas Žaltiena 2023-08-15 17:11:35 UTC
I have a hunch this might be an MM/HMM issue, but I am reporting it as an amdgpu bug because the problematic behavior is triggered by loading amdgpu compiled with HSA_AMD_SVM=y. I verified the problematic behavior on kernels 6.4 and 6.5-rc6; however, I have seen people saying it started with 5.14.

My system is on the X99 platform with an Intel Broadwell-E CPU. It has multiple GPUs: an AMD W6600 (which drives the display) and an NVIDIA RTX 3080 (used for compute and vfio). The IOMMU is on and not in PT mode. HSA_AMD_SVM=y somehow messes up the PAT entries for the NVIDIA card. An example follows.

The NVIDIA card has two relevant BARs:
Region 1: Memory at 380000000000 (64-bit, prefetchable) [size=16G]
Region 3: Memory at 380400000000 (64-bit, prefetchable) [size=32M]

The example assumes that "cat /sys/kernel/debug/x86/pat_memtype_list | grep 380" is used to check the PAT entries.
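A plain "grep 380" can also match unrelated addresses, so a range-based filter may be easier to read. The following is only a minimal sketch: it assumes each entry line looks like "PAT: [mem 0x<start>-0x<end>] <type>" with an exclusive end address (the format seen on recent kernels, which may differ on older ones). The helper names are made up for illustration; the BAR windows are copied from the lspci output above.

```python
import re

# Assumed entry format (may differ between kernel versions), e.g.:
#   PAT: [mem 0x0000380000000000-0x0000380010000000] write-combining
ENTRY_RE = re.compile(r"\[mem 0x([0-9a-fA-F]+)-0x([0-9a-fA-F]+)\]\s+(\S+)")

# BAR windows from the report: (start address, size in bytes).
NVIDIA_BARS = [
    (0x380000000000, 16 << 30),  # Region 1, 16G
    (0x380400000000, 32 << 20),  # Region 3, 32M
]


def parse_entries(text):
    """Return (start, end, memtype) tuples for every line that matches."""
    entries = []
    for line in text.splitlines():
        m = ENTRY_RE.search(line)
        if m:
            entries.append((int(m.group(1), 16), int(m.group(2), 16), m.group(3)))
    return entries


def entries_in_bars(text, bars=NVIDIA_BARS):
    """Keep entries whose [start, end) range overlaps any BAR window."""
    hits = []
    for start, end, memtype in parse_entries(text):
        for bar_start, size in bars:
            if start < bar_start + size and end > bar_start:
                hits.append((start, end, memtype))
                break
    return hits


def read_pat_list(path="/sys/kernel/debug/x86/pat_memtype_list"):
    """Read the debugfs file (needs root and a mounted debugfs)."""
    with open(path) as f:
        return f.read()
```

As root, entries_in_bars(read_pat_list()) would then list only the entries that fall inside the two NVIDIA BARs.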

1) Fresh system start: amdgpu is loaded (blacklisting it prevents the issue), and the NVIDIA card is deliberately not bound to any driver on boot. No PAT entries for it are visible - good.
2) The card is bound to vfio-pci and passed to a VM; multiple PAT entries are visible - good.
3) The VM is stopped and the card is unbound from vfio-pci. This is where the difference is seen. With HSA_AMD_SVM=n, no PAT entries are visible - good; with HSA_AMD_SVM=y, however, two PAT entries remain - BAD. In addition, the number of these entries depends on how many times the card has been passed through. It looks as if some cleanup routine fails.
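The comparison between step 1 and step 3 can be automated by diffing two snapshots of pat_memtype_list. This is a rough sketch, assuming that every entry line starts with "PAT:" and that identical leftover ranges show up as repeated lines; the function names are hypothetical.

```python
from collections import Counter


def pat_lines(text):
    """Return the entry lines of a snapshot, skipping the header and blanks."""
    return [ln.strip() for ln in text.splitlines() if ln.strip().startswith("PAT:")]


def leftover_entries(baseline, after_unbind):
    """Map each entry line to how many more times it appears after the unbind."""
    # Counter subtraction keeps only positive counts, i.e. lines that are
    # new or more frequent than in the baseline snapshot.
    extra = Counter(pat_lines(after_unbind)) - Counter(pat_lines(baseline))
    return dict(extra)
```

Taking one snapshot right after boot and another after each bind/unbind cycle should show whether the leftover count grows with every pass-through.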

The above example is constructed to avoid requiring out-of-tree NVIDIA drivers; however, the same result can be reproduced (probably with less hassle) by just binding the card to the nvidia driver, running a compute/render task, unbinding it, and then checking for leftover PAT entries. This also shows that the issue is not specific to vfio-pci.

It looks benign at first, but in my real use case the card has to be switched from the nvidia driver to vfio-pci and back without restarting the system. This PAT issue breaks that, because the leftover PAT entries from one driver are not compatible with the other: vfio-pci needs UC-, otherwise the VM throws lots of ioremap/memtype errors, while the nvidia driver prefers WC entries for performance reasons.
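For reference, the driver switch can be done through the generic PCI sysfs files (driver/unbind, driver_override, drivers_probe). The sketch below only illustrates the write sequence and is not the exact procedure used on the reporter's system: the PCI address 0000:03:00.0 is a placeholder, and the helper names are hypothetical.

```python
import os

SYSFS_PCI = "/sys/bus/pci"


def rebind_plan(bdf, driver, sysfs=SYSFS_PCI):
    """Return the ordered (path, value) writes that move a device to `driver`."""
    dev = os.path.join(sysfs, "devices", bdf)
    return [
        (os.path.join(dev, "driver", "unbind"), bdf),     # detach the current driver
        (os.path.join(dev, "driver_override"), driver),   # pin the next driver
        (os.path.join(sysfs, "drivers_probe"), bdf),      # ask the kernel to re-probe
    ]


def apply_plan(plan):
    """Perform the writes (needs root); skip the unbind if no driver is bound."""
    for path, value in plan:
        if path.endswith("/driver/unbind") and not os.path.exists(path):
            continue
        with open(path, "w") as f:
            f.write(value)
```

For example, apply_plan(rebind_plan("0000:03:00.0", "vfio-pci")) followed later by apply_plan(rebind_plan("0000:03:00.0", "nvidia")) would mimic one switch cycle; checking pat_memtype_list between the two calls is where the leftover entries should show up.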

If amdgpu is just a trigger and the issue is in the general MM part of the kernel, please CC the relevant people.
Comment 1 Artem S. Tashkinov 2023-08-16 14:18:24 UTC
Please report here instead:

https://gitlab.freedesktop.org/drm/amd/-/issues