Bug 218900

Summary: amdgpu: Fatal error during GPU init
Product: Drivers
Reporter: Jean-Christophe Guillain (jean-christophe)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED CODE_FIX
Severity: blocking
CC: alexdeucher, bp, dreamlike_clinking040, i.r.e.c.c.a.k.u.n+bugzilla.kernel.org, jd.girard, mario.limonciello, regressions, suravee.suthikulpanit, vasant.hegde
Priority: P3
Hardware: AMD
OS: Linux
Kernel Version: 6.10.0-rc1
Subsystem:
Regression: Yes
Bisected commit-id: c4cb23111103a841c2df30058597398443bcad5f
Attachments: Full logs of the boot.
Check Enhanced PPR support before enabling PPR
Full dmesg after applying Vasant's patch
Complete dmesg

Description Jean-Christophe Guillain 2024-05-27 14:52:00 UTC
Hello !

Trying the new kernel RC today (6.10.0-rc1), I no longer have video.
It works with 6.9.1.

Lenovo ThinkCentre M715q

00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev e4)


In the journal, I have multiple entries like this one :
May 27 14:24:22 youpi kernel: iommu ivhd0: AMD-Vi: Event logged [ILLEGAL_DEV_TABLE_ENTRY device=0000:00:01.0 pasid=0x00000 address=0x102e89980 flags=0x0080]
May 27 14:24:22 youpi kernel: AMD-Vi: DTE[0]: 7190000000000003
May 27 14:24:22 youpi kernel: AMD-Vi: DTE[1]: 00001001034f0002
May 27 14:24:22 youpi kernel: AMD-Vi: DTE[2]: 200000010022a013
May 27 14:24:22 youpi kernel: AMD-Vi: DTE[3]: 0000000000000000
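Entries like the one above can be pulled out of a journal dump programmatically. The following is a minimal sketch, assuming the line format is exactly as quoted in this report; the regular expressions and the helper name `parse_iommu_events` are illustrative and not part of any kernel tooling.

```python
import re

# Patterns modelled on the journal lines quoted in this report (assumption:
# other kernel versions may format these messages differently).
EVENT_RE = re.compile(
    r"AMD-Vi: Event logged \[(?P<event>\w+) "
    r"device=(?P<device>[0-9a-f:.]+) "
    r"pasid=(?P<pasid>0x[0-9a-f]+) "
    r"address=(?P<address>0x[0-9a-f]+) "
    r"flags=(?P<flags>0x[0-9a-f]+)\]"
)
DTE_RE = re.compile(r"AMD-Vi: DTE\[(?P<index>\d)\]: (?P<value>[0-9a-f]{16})")


def parse_iommu_events(lines):
    """Group each 'Event logged' line with the DTE words that follow it."""
    events = []
    for line in lines:
        m = EVENT_RE.search(line)
        if m:
            events.append({**m.groupdict(), "dte": {}})
            continue
        m = DTE_RE.search(line)
        if m and events:
            # Store each DTE word as an integer, keyed by its index.
            events[-1]["dte"][int(m["index"])] = int(m["value"], 16)
    return events


SAMPLE = [
    "May 27 14:24:22 youpi kernel: iommu ivhd0: AMD-Vi: Event logged "
    "[ILLEGAL_DEV_TABLE_ENTRY device=0000:00:01.0 pasid=0x00000 "
    "address=0x102e89980 flags=0x0080]",
    "May 27 14:24:22 youpi kernel: AMD-Vi: DTE[0]: 7190000000000003",
    "May 27 14:24:22 youpi kernel: AMD-Vi: DTE[1]: 00001001034f0002",
    "May 27 14:24:22 youpi kernel: AMD-Vi: DTE[2]: 200000010022a013",
    "May 27 14:24:22 youpi kernel: AMD-Vi: DTE[3]: 0000000000000000",
]

for ev in parse_iommu_events(SAMPLE):
    print(ev["event"], ev["device"], hex(ev["dte"][0]))
```

Grouping the DTE words with their event makes it easier to compare the device table entry across boots or across machines.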



Then, multiple entries like that one :
May 27 14:24:22 youpi kernel: amdgpu 0000:00:01.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
May 27 14:24:22 youpi kernel: [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <gfx_v8_0> failed -110
May 27 14:24:22 youpi kernel: amdgpu 0000:00:01.0: amdgpu: amdgpu_device_ip_init failed
May 27 14:24:22 youpi kernel: amdgpu 0000:00:01.0: amdgpu: Fatal error during GPU init
May 27 14:24:22 youpi kernel: amdgpu 0000:00:01.0: amdgpu: amdgpu: finishing device.
May 27 14:24:22 youpi kernel: ------------[ cut here ]------------
May 27 14:24:22 youpi kernel: WARNING: CPU: 0 PID: 179 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:630 amdgpu_irq_put+0x45/0x70 [amdgpu]
May 27 14:24:22 youpi kernel: Modules linked in: sd_mod usbhid uas hid usb_storage amdgpu(+) amdxcp drm_exec gpu_sched drm_buddy i2c_algo_bit drm_suballoc_helper drm>
May 27 14:24:22 youpi kernel: CPU: 0 PID: 179 Comm: (udev-worker) Not tainted 6.10.0-rc1-jcg #1
May 27 14:24:22 youpi kernel: Hardware name: LENOVO 10VGS02P00/3130, BIOS M1XKT57A 02/10/2022
May 27 14:24:22 youpi kernel: RIP: 0010:amdgpu_irq_put+0x45/0x70 [amdgpu]
May 27 14:24:22 youpi kernel: Code: 48 8b 4e 10 48 83 39 00 74 2c 89 d1 48 8d 04 88 8b 08 85 c9 74 14 f0 ff 08 b8 00 00 00 00 74 05 e9 80 d8 a3 fc e9 6b fd ff ff <0f>
May 27 14:24:22 youpi kernel: RSP: 0018:ffffbc9c80813a48 EFLAGS: 00010246
May 27 14:24:22 youpi kernel: RAX: ffff985ad74e3780 RBX: ffff985a82f18878 RCX: 0000000000000000
May 27 14:24:22 youpi kernel: RDX: 0000000000000000 RSI: ffff985a82f254b8 RDI: ffff985a82f00000
May 27 14:24:22 youpi kernel: RBP: ffff985a82f10208 R08: 0000000000000000 R09: 0000000000000003
May 27 14:24:22 youpi kernel: R10: ffffbc9c80813880 R11: ffffffffbdec7828 R12: ffff985a82f105e8
May 27 14:24:22 youpi kernel: R13: ffff985a82f00010 R14: ffff985a82f00000 R15: ffff985a82f254b8
May 27 14:24:22 youpi kernel: FS:  00007f18ca0058c0(0000) GS:ffff985b57600000(0000) knlGS:0000000000000000
May 27 14:24:22 youpi kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 27 14:24:22 youpi kernel: CR2: 00005563a55b3a68 CR3: 000000010f8bc000 CR4: 00000000001506f0
May 27 14:24:22 youpi kernel: Call Trace:
May 27 14:24:22 youpi kernel:  <TASK>
May 27 14:24:22 youpi kernel:  ? __warn+0x7c/0x120
May 27 14:24:22 youpi kernel:  ? amdgpu_irq_put+0x45/0x70 [amdgpu]
May 27 14:24:22 youpi kernel:  ? report_bug+0x155/0x170
May 27 14:24:22 youpi kernel:  ? handle_bug+0x3f/0x80
May 27 14:24:22 youpi kernel:  ? exc_invalid_op+0x13/0x60
May 27 14:24:22 youpi kernel:  ? asm_exc_invalid_op+0x16/0x20
May 27 14:24:22 youpi kernel:  ? amdgpu_irq_put+0x45/0x70 [amdgpu]
May 27 14:24:22 youpi kernel:  amdgpu_fence_driver_hw_fini+0xfa/0x130 [amdgpu]
May 27 14:24:22 youpi kernel:  amdgpu_device_fini_hw+0xa2/0x3f0 [amdgpu]
May 27 14:24:22 youpi kernel:  amdgpu_driver_load_kms+0x79/0xb0 [amdgpu]
May 27 14:24:22 youpi kernel:  amdgpu_pci_probe+0x182/0x4f0 [amdgpu]
May 27 14:24:22 youpi kernel:  local_pci_probe+0x41/0x90
May 27 14:24:22 youpi kernel:  pci_device_probe+0xbb/0x1e0
May 27 14:24:22 youpi kernel:  really_probe+0xd6/0x390
May 27 14:24:22 youpi kernel:  ? __pfx___driver_attach+0x10/0x10
May 27 14:24:22 youpi kernel:  __driver_probe_device+0x78/0x150
May 27 14:24:22 youpi kernel:  driver_probe_device+0x1f/0x90
May 27 14:24:22 youpi kernel:  __driver_attach+0xce/0x1c0
May 27 14:24:22 youpi kernel:  bus_for_each_dev+0x84/0xd0
May 27 14:24:22 youpi kernel:  bus_add_driver+0x10e/0x240
May 27 14:24:22 youpi kernel:  driver_register+0x55/0x100
May 27 14:24:22 youpi kernel:  ? __pfx_amdgpu_init+0x10/0x10 [amdgpu]
May 27 14:24:22 youpi kernel:  do_one_initcall+0x57/0x320
May 27 14:24:22 youpi kernel:  do_init_module+0x60/0x230
May 27 14:24:22 youpi kernel:  init_module_from_file+0x86/0xc0
May 27 14:24:22 youpi kernel:  idempotent_init_module+0x11b/0x2b0
May 27 14:24:22 youpi kernel:  __x64_sys_finit_module+0x5a/0xb0
May 27 14:24:22 youpi kernel:  do_syscall_64+0x7e/0x190
May 27 14:24:22 youpi kernel:  ? ksys_mmap_pgoff+0x14e/0x1f0
May 27 14:24:22 youpi kernel:  ? syscall_exit_to_user_mode+0x71/0x1e0
May 27 14:24:22 youpi kernel:  ? do_syscall_64+0x8a/0x190
May 27 14:24:22 youpi kernel:  ? do_syscall_64+0x8a/0x190
May 27 14:24:22 youpi kernel:  ? do_syscall_64+0x8a/0x190
May 27 14:24:22 youpi kernel:  ? __irq_exit_rcu+0x38/0xb0
May 27 14:24:22 youpi kernel:  entry_SYSCALL_64_after_hwframe+0x76/0x7e
May 27 14:24:22 youpi kernel: RIP: 0033:0x7f18c9e79719
May 27 14:24:22 youpi kernel: Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48>
May 27 14:24:22 youpi kernel: RSP: 002b:00007ffd56f52208 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
May 27 14:24:22 youpi kernel: RAX: ffffffffffffffda RBX: 00005563a558e400 RCX: 00007f18c9e79719
May 27 14:24:22 youpi kernel: RDX: 0000000000000000 RSI: 00007f18ca01defd RDI: 0000000000000015
May 27 14:24:22 youpi kernel: RBP: 00007f18ca01defd R08: 0000000000000000 R09: 00005563a55902b0
May 27 14:24:22 youpi kernel: R10: 0000000000000015 R11: 0000000000000246 R12: 0000000000020000
May 27 14:24:22 youpi kernel: R13: 0000000000000000 R14: 00005563a5591f30 R15: 000055638158bec1
May 27 14:24:22 youpi kernel:  </TASK>
May 27 14:24:22 youpi kernel: ---[ end trace 0000000000000000 ]---

I suspect this commit : https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c?id=db5d28c0bfe566908719bec8e25443aabecbb802

Let me know if you need more information.

Cheers,
jC
Comment 1 Jean-Christophe Guillain 2024-05-27 15:08:53 UTC
Created attachment 306354 [details]
Full logs of the boot.

I added the full log of the boot process showing all the errors.
Comment 2 Alex Deucher 2024-05-27 15:22:43 UTC
Can you bisect?  https://docs.kernel.org/admin-guide/bug-bisect.html
Comment 3 Jean-Christophe Guillain 2024-05-27 15:46:18 UTC
Bisecting: 5720 revisions left to test after this (roughly 13 steps)

I'll try, but it will take some time. My machine is not very powerful...
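The "roughly 13 steps" figure is consistent with a binary search over 5720 revisions, since each test halves the remaining range. A rough sketch of that arithmetic follows; the helper name is hypothetical, and git's own estimate is computed internally and may differ slightly from this simple formula.

```python
import math


def estimate_bisect_steps(revisions_left):
    """Approximate number of halving steps needed to isolate one revision."""
    if revisions_left <= 1:
        return 0
    return math.ceil(math.log2(revisions_left))


# 2**12 = 4096 < 5720 <= 8192 = 2**13, so about 13 builds are needed.
print(estimate_bisect_steps(5720))
```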
Comment 4 Mario Limonciello (AMD) 2024-05-28 18:01:39 UTC
Possibly the same as this report:

https://lore.kernel.org/all/20240527192159.GEZlTdV7OoOuJrHmI0@fat_crate.local/
Comment 5 Vasant Hegde 2024-05-29 06:15:51 UTC
Created attachment 306364 [details]
Check Enhanced PPR support before enabling PPR
Comment 6 Vasant Hegde 2024-05-29 06:16:29 UTC
Hi,

Attached patch should fix this issue. Can you please test it?

I will send a proper patch to the mailing list soon.

-Vasant
Comment 7 Vasant Hegde 2024-05-29 07:11:24 UTC
Also can you please attach full dmesg? I want to see IOMMU feature list and confirm what I am doing is right.

-Vasant
Comment 8 Jean-Christophe Guillain 2024-05-29 07:42:08 UTC
Hi,

I plan to finish the bisection today, and I'll test your patch.

jC
Comment 9 Vasant Hegde 2024-05-29 10:41:54 UTC
(In reply to Jean-Christophe Guillain from comment #8)
> Hi,
> 
> I plan to finish the bisection today, and I'll test your patch.
> 

You mean bisecting for this issue? If so, we already know the culprit commit. The issue happens because the IOMMU driver tried to enable the PPR bit in the DTE without checking for Enhanced PPR support in the EFR register.



-Vasant
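The PPR bit described above can be checked against the DTE[0] value from the original report. This is a minimal sketch, assuming the bit positions below; they are taken from the Linux AMD IOMMU driver headers as commonly documented (DEV_ENTRY_VALID, DEV_ENTRY_TRANSLATION, DEV_ENTRY_PPR) and are not stated anywhere in this thread, so verify them against the kernel source before relying on them.

```python
# Assumed bit positions within DTE[0] (not confirmed by this bug report).
DTE_VALID_BIT = 0         # assumed DEV_ENTRY_VALID
DTE_TRANSLATION_BIT = 1   # assumed DEV_ENTRY_TRANSLATION
DTE_PPR_BIT = 0x34        # assumed DEV_ENTRY_PPR, i.e. bit 52


def dte_bit(dte0, bit):
    """Return True if the given bit is set in the 64-bit DTE[0] word."""
    return bool((dte0 >> bit) & 1)


# DTE[0] as logged in the description of this bug.
dte0 = 0x7190000000000003

print("valid:      ", dte_bit(dte0, DTE_VALID_BIT))
print("translation:", dte_bit(dte0, DTE_TRANSLATION_BIT))
print("ppr:        ", dte_bit(dte0, DTE_PPR_BIT))
```

Under these assumptions the PPR bit is set in the logged entry, which matches the explanation that PPR was enabled in the DTE on hardware that does not advertise Enhanced PPR support.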
Comment 10 Jean-Christophe Guillain 2024-05-29 12:41:05 UTC
I applied your patch to the 6.10.0-rc1 kernel, and I confirm that it fixes this bug.

Thank you very much !

jC

(full dmesg attached)
Comment 11 Jean-Christophe Guillain 2024-05-29 12:43:41 UTC
Created attachment 306367 [details]
Full dmesg after applying Vasant's patch
Comment 12 Jean-Christophe Guillain 2024-05-29 16:17:50 UTC
(I still finished my bisection, and as you said, c4cb23111103a841c2df30058597398443bcad5f is the first bad commit.)
Comment 13 Vasant Hegde 2024-05-30 06:20:38 UTC
Thanks Jean for testing. I will send patch with your Tested-by today.

-Vasant
Comment 14 Hanabishi 2024-06-06 12:26:11 UTC
*** Bug 218921 has been marked as a duplicate of this bug. ***
Comment 15 Hanabishi 2024-06-07 15:18:11 UTC
(In reply to Vasant Hegde from comment #5)
> Created attachment 306364 [details]
> Check Enhanced PPR support before enabling PPR

I applied your patch on top of rc2 and also confirm that it works.
Thank you.
Comment 16 Vasant Hegde 2024-06-10 15:02:24 UTC
(In reply to Hanabishi from comment #15)
> (In reply to Vasant Hegde from comment #5)
> > Created attachment 306364 [details]
> > Check Enhanced PPR support before enabling PPR
> 
> I applied your patch on top of rc2 and also confirm that it works.
> Thank you.

Thanks Hanabishi for testing.

FYI. Patches merged into -rc3.

-Vasant
Comment 17 Jean-Denis Girard 2024-06-25 17:10:06 UTC
I seem to have a similar problem on 6.10-rc5 after suspend. I get a black screen on resume.

[  269.157149] amdgpu 0000:02:00.0: amdgpu: reserve 0x400000 from 0xf41f800000 for PSP TMR
[  269.159956] iommu ivhd0: AMD-Vi: Event logged [ILLEGAL_DEV_TABLE_ENTRY device=0000:02:00.0 pasid=0x00000 address=0x131400000 flags=0x0180]
[  269.159960] AMD-Vi: DTE[0]: 6190000000000003
[  269.159962] AMD-Vi: DTE[1]: 00001001049e000b
[  269.159963] AMD-Vi: DTE[2]: 200000013c610013
[  269.159963] AMD-Vi: DTE[3]: 0000000000000000
[  269.160104] amdgpu 0000:02:00.0: amdgpu: failed to load ucode SDMA0(0x1) 
[  269.160108] amdgpu 0000:02:00.0: amdgpu: psp gfx command LOAD_IP_FW(0x6) failed and response status is (0xF)
Comment 18 Jean-Denis Girard 2024-06-25 17:11:10 UTC
Created attachment 306495 [details]
Complete dmesg
Comment 19 Vasant Hegde 2024-06-25 17:14:26 UTC
Unfortunately there was another bug in suspend/resume path. Can you please test with below patch?

https://lore.kernel.org/linux-iommu/ZnqzXyCU8bn32j4-@8bytes.org/T/#m1cd1520facb8b758efdf7a8c0261f9ee2ec217d7



-Vasant
Comment 20 Jean-Denis Girard 2024-06-25 17:55:20 UTC
Yes, I confirm the patch "iommu/amd: Fix GT feature enablement again" applied to 6.10-rc5 fixes resume on my machine.

Thanks for the prompt reply!
Comment 21 dreamlike_clinking040 2024-06-27 16:00:11 UTC
(In reply to Vasant Hegde from comment #19)
> Unfortunately there was another bug in suspend/resume path. Can you please
> test with below patch?
> 
> https://lore.kernel.org/linux-iommu/ZnqzXyCU8bn32j4-@8bytes.org/T/#m1cd1520facb8b758efdf7a8c0261f9ee2ec217d7
> 
> 
> 
> -Vasant

Can confirm this patch also fixes my suspend/resume issue, thanks!
Comment 22 Vasant Hegde 2024-06-28 09:13:43 UTC
(In reply to dreamlike_clinking040 from comment #21)
> (In reply to Vasant Hegde from comment #19)

> 
> Can confirm this patch also fixes my suspend/resume issue, thanks!

Thanks a lot.

-Vasant