Bug 214859

Summary: drm-amdgpu-init-iommu~fd-device-init.patch introduces bug
Product: Drivers
Reporter: towo
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: NEW
Severity: normal
CC: alexdeucher, bjo, chexum, jamesz, sd, spasswolf
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 5.14.15, 5.15.0
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: patch to fix, analysis for this issue

Description towo 2021-10-28 18:23:07 UTC
After commit d60096b3b2c2..cd8cc7d31b49 100644 (drm-amdgpu-init-iommu~fd-device-init.patch), X can't really start with kernel 5.14.15 on most Ryzen notebooks.
It takes a long time before X starts, and dmesg is spammed with failure messages like

Okt 28 10:28:08 kernel: amdgpu 0000:04:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
Okt 28 10:28:21 kernel: amdgpu 0000:04:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706
Okt 28 10:28:34 kernel: amdgpu 0000:04:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
Okt 28 10:28:47 kernel: amdgpu 0000:04:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706
Okt 28 10:29:01 kernel: amdgpu 0000:04:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
Okt 28 10:29:14 kernel: amdgpu 0000:04:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706
Okt 28 10:29:27 kernel: amdgpu 0000:04:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706
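
To check whether a given machine shows the same pattern, the messages above can be tallied per register pair. The following is only a rough sketch: the regular expression is derived from the log lines quoted in this report (not from the driver source), and the function name `count_failed_writes` is made up for illustration.

```python
import re
from collections import Counter

# Pattern taken from the messages quoted in this report; it is an assumption
# that other affected systems print the message in exactly this form.
FAILED_WRITE = re.compile(r"failed to write reg ([0-9a-f]+) wait reg ([0-9a-f]+)")


def count_failed_writes(log_text):
    """Count how often each (write reg, wait reg) pair is reported."""
    return Counter(m.groups() for m in FAILED_WRITE.finditer(log_text))


# Three lines copied from the excerpt above.
sample = """\
Okt 28 10:28:08 kernel: amdgpu 0000:04:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
Okt 28 10:28:21 kernel: amdgpu 0000:04:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706
Okt 28 10:28:34 kernel: amdgpu 0000:04:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
"""

for (write_reg, wait_reg), n in count_failed_writes(sample).items():
    print(f"write 0x{write_reg} / wait 0x{wait_reg}: {n}")
```

In the excerpt only two register pairs alternate, roughly 13 seconds apart.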

and/or

Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:128 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu:   in page starting at address 0x0000000000872000 from IH client 0x1b (UTCL2)
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00040D00
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu:          Faulty UTCL2 client ID: CPG (0x6)
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu:          MORE_FAULTS: 0x0
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu:          WALKER_ERROR: 0x0
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu:          MAPPING_ERROR: 0x1
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu:          RW: 0x1
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:128 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu:   in page starting at address 0x0000000000872000 from IH client 0x1b (UTCL2)
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00040D00
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu:          Faulty UTCL2 client ID: CPG (0x6)
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu:          MORE_FAULTS: 0x0
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu:          WALKER_ERROR: 0x0
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu:          MAPPING_ERROR: 0x1
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu:          RW: 0x1
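
For reference, the fields the driver prints under VM_L2_PROTECTION_FAULT_STATUS can be recovered from the raw value 0x00040D00. This is a minimal sketch: the bit layout in `FIELDS` (shifts and widths) is an assumption, not taken from this report, although it does reproduce the values shown in the log above. The name `decode_fault_status` is made up for illustration.

```python
# (name, shift, width) for each field. ASSUMED layout; it is only checked
# here against the values the driver printed in the log excerpt.
FIELDS = [
    ("MORE_FAULTS", 0, 1),
    ("WALKER_ERROR", 1, 3),
    ("PERMISSION_FAULTS", 4, 4),
    ("MAPPING_ERROR", 8, 1),
    ("CID", 9, 9),  # printed by the driver as "Faulty UTCL2 client ID"
    ("RW", 18, 1),
]


def decode_fault_status(status):
    """Split a raw status value into its named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, shift, width in FIELDS
    }


decoded = decode_fault_status(0x00040D00)
for name, value in decoded.items():
    print(f"{name}: 0x{value:x}")
```

Under that layout the value decodes to client ID 0x6 with MAPPING_ERROR and RW set, matching the lines the driver logged.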

After reverting that commit, the kernel works normally again.
Here are the related reports from our users (ignore the nvidia posts).
https://forum.siduction.org/index.php?topic=8439.0
Comment 1 Sebastian Dalfuß 2021-10-31 09:46:13 UTC
I can confirm this for a
"04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Picasso (rev c2)".
Comment 2 towo 2021-11-01 19:42:28 UTC
The relevant commit is 714d9e4574d54596973ee3b0624ee4a16264d700
Comment 3 towo 2021-11-01 20:05:26 UTC
Additional info: after installing the kernel from a working system, the first boot with that kernel works flawlessly. On rebooting with that kernel, the boot hangs for a long time, then the desktop starts but the system is not really usable. None of these problems occur after reverting 714d9e4574d54596973ee3b0624ee4a16264d700.
Comment 4 Alex Deucher 2021-11-02 20:30:40 UTC
I think this patch set should address the issue:
https://patchwork.freedesktop.org/series/96508/
Comment 5 James Zhu 2021-11-03 01:46:19 UTC
Created attachment 299413 [details]
patch to fix

I suggest upgrading to 5.15-rc7, applying this patch, and then testing.
Comment 6 James Zhu 2021-11-03 14:29:22 UTC
Created attachment 299437 [details]
analysis for this issue

Linux 5.14.15 + afd1818 fixes the issue.

In Linux 5.15-rc7, "init iommu after amdkfd device init" and "move iommu_resume before ip init/resume" were re-applied, which overwrote afd1818 and caused the issue again.

714d9e4 drm/amdgpu: init iommu after amdkfd device init

f02abeb drm/amdgpu: move iommu_resume before ip init/resume

afd1818 drm/amdkfd: fix boot failure when iommu is disabled in Picasso.

286826d drm/amdgpu: init iommu after amdkfd device init

9cec53c drm/amdgpu: move iommu_resume before ip init/resume
Comment 7 towo 2021-11-04 16:55:08 UTC
With Linux 5.14.17-rc1 and 5.15.1-rc1 the problem is gone.
So I think the bug is resolved.
Comment 8 spasswolf 2021-11-17 16:38:59 UTC
*** Bug 214901 has been marked as a duplicate of this bug. ***