Dear Maintainer, compiling the current Debian source with a linux-5.8.2 kernel gives the following trace on a B550I AORUS PRO AX with an AMD Ryzen 5 PRO 4650G:
[ 3.974191] ------------[ cut here ]------------
[ 3.974265] WARNING: CPU: 9 PID: 175 at drivers/gpu/drm/amd/amdgpu/../display/dc/clk_mgr/dcn21/rn_clk_mgr.c:654 rn_clk_mgr_construct+0x11e/0x390 [amdgpu]
[ 3.974268] Modules linked in: hid_generic(E) usbhid(E) hid(E) amdgpu(E+) gpu_sched(E) i2c_algo_bit(E) ttm(E) drm_kms_helper(E) cec(E) ahci(E) libahci(E) nvme(E) xhci_pci(E) nvme_core(E) crc32_pclmul(E) xhci_hcd(E) r8169(E) t10_pi(E) crc32c_intel(E) realtek(E) libata(E) crc_t10dif(E) drm(E) i2c_piix4(E) crct10dif_generic(E) mfd_core(E) crct10dif_pclmul(E) libphy(E) usbcore(E) crct10dif_common(E) scsi_mod(E) usb_common(E) wmi(E) video(E) gpio_amdpt(E) gpio_generic(E) button(E)
[ 3.974284] CPU: 9 PID: 175 Comm: systemd-udevd Tainted: G E 5.8.0-trunk-amd64 #1 Debian 5.8.2-1
[ 3.974285] Hardware name: Gigabyte Technology Co., Ltd. B550I AORUS PRO AX/B550I AORUS PRO AX, BIOS F2a 06/16/2020
[ 3.974348] RIP: 0010:rn_clk_mgr_construct+0x11e/0x390 [amdgpu]
[ 3.974351] Code: 00 00 00 41 8b 8c c4 80 00 00 00 41 89 c1 89 c7 85 c9 74 10 41 8b 94 c4 84 00 00 00 85 d2 0f 85 87 01 00 00 48 83 e8 01 73 d9 <0f> 0b 83 7b 20 01 74 0c 81 bd e8 00 00 00 ff 14 37 00 7f 27 48 8b
[ 3.974353] RSP: 0018:ffffa98a8068f850 EFLAGS: 00010297
[ 3.974355] RAX: ffffffffffffffff RBX: ffff9a36d7eb2540 RCX: 0000000000000640
[ 3.974356] RDX: 0000000000000000 RSI: ffffa98a8068f878 RDI: 0000000000000000
[ 3.974357] RBP: ffff9a3625cf9800 R08: 0000000000000000 R09: 0000000000000000
[ 3.974358] R10: 7fc9117fffffffff R11: ffff9a36d7d51000 R12: ffffa98a8068f878
[ 3.974359] R13: ffff9a36d7eb2cc0 R14: ffff9a36bccc0000 R15: ffff9a36d7eb2540
[ 3.974361] FS: 00007f53ebac18c0(0000) GS:ffff9a371f240000(0000) knlGS:0000000000000000
[ 3.974362] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3.974363] CR2: 00007f53ebaaaee0 CR3: 00000003d7f38000 CR4: 0000000000340ee0
[ 3.974365] Call Trace:
[ 3.974427] dc_clk_mgr_create+0x179/0x1a0 [amdgpu]
[ 3.974488] dc_create+0x238/0x700 [amdgpu]
[ 3.974493] ? _cond_resched+0x16/0x40
[ 3.974554] amdgpu_dm_init.isra.0+0x15b/0x1c0 [amdgpu]
[ 3.974614] dm_hw_init+0xe/0x20 [amdgpu]
[ 3.974676] amdgpu_device_init.cold+0x17a7/0x192b [amdgpu]
[ 3.974722] amdgpu_driver_load_kms+0x5c/0x220 [amdgpu]
[ 3.974766] amdgpu_pci_probe+0x15f/0x1f0 [amdgpu]
[ 3.974770] local_pci_probe+0x42/0x80
[ 3.974772] ? _cond_resched+0x16/0x40
[ 3.974773] pci_device_probe+0xfa/0x1b0
[ 3.974776] really_probe+0x160/0x400
[ 3.974777] driver_probe_device+0xe1/0x150
[ 3.974779] device_driver_attach+0xa1/0xb0
[ 3.974780] __driver_attach+0x8a/0x150
[ 3.974781] ? device_driver_attach+0xb0/0xb0
[ 3.974782] ? device_driver_attach+0xb0/0xb0
[ 3.974784] bus_for_each_dev+0x78/0xc0
[ 3.974786] bus_add_driver+0x12b/0x1e0
[ 3.974787] driver_register+0x8b/0xe0
[ 3.974789] ? 0xffffffffc0a6b000
[ 3.974791] do_one_initcall+0x46/0x200
[ 3.974792] ? _cond_resched+0x16/0x40
[ 3.974794] ? kmem_cache_alloc_trace+0x192/0x220
[ 3.974796] ? do_init_module+0x23/0x250
[ 3.974798] do_init_module+0x5c/0x250
[ 3.974799] __do_sys_finit_module+0xac/0x110
[ 3.974802] do_syscall_64+0x4d/0xc0
[ 3.974804] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3.974805] RIP: 0033:0x7f53ebf6ba79
[ 3.974807] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e7 53 0c 00 f7 d8 64 89 01 48
[ 3.974809] RSP: 002b:00007ffde26b8228 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[ 3.974811] RAX: ffffffffffffffda RBX: 000055c5cb205da0 RCX: 00007f53ebf6ba79
[ 3.974812] RDX: 0000000000000000 RSI: 00007f53ec0f6e4d RDI: 0000000000000012
[ 3.974813] RBP: 0000000000020000 R08: 0000000000000000 R09: 000055c5cb205fb8
[ 3.974814] R10: 0000000000000012 R11: 0000000000000246 R12: 00007f53ec0f6e4d
[ 3.974815] R13: 0000000000000000 R14: 000055c5cb206e20 R15: 000055c5cb205da0
[ 3.974817] ---[ end trace 071eac41bffe7f9b ]---
best regards, Florian La Roche
Is this a regression? If so, can you bisect? Please attach your full dmesg output.
This is a new system, so I have no earlier working kernel to compare against and cannot call it a regression. I'll try some older kernels over the next few days and report back here. Thanks a lot, Florian La Roche
Created attachment 292039: dmesg output. Full dmesg output of the system.
Hi, have you solved your problem? I have the same problem as you. Here is my environment:
- ASRock A520M-ITX/AC + AMD 4750G
- Ubuntu 20.04.1 LTS + AMDGPU-pro-20.30-1109583 - Ubuntu-20.04.tar.xz
I have tested kernels 5.6.19, 5.7.10 and 5.7.17 and they all show this problem. I assume your report means this also happens on a 5.4.x kernel (Ubuntu 20.04 LTS).
The display seems to work fine, and I am mostly using this as a server machine, so it may not be a huge problem. Still, there is a trace on each reboot... :-) (The kernel source around the warning mentions DisplayPort, clock values for power management etc. ???)
It also seems to depend on BIOS data (?), so I'll check again with future BIOS versions as well as future kernel source code for fixes.
[ 4.207712] smu driver if version = 0x0000000b, smu fw if version = 0x0000000e, smu fw version = 0x00374100 (55.65.0)
[ 4.207717] SMU driver if version not matched
[ 4.207795] SMU is initialized successfully!
best regards, Florian La Roche
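For anyone comparing dmesg output across BIOS versions, the SMU line quoted above can be picked apart with a few lines of Python. This is only a rough sketch: the helper name parse_smu_line is made up for illustration, the regular expression assumes the exact wording of the log line shown in this comment, and the major.minor.debug byte split is inferred from the fact that 0x00374100 is printed as (55.65.0).

```python
import re

# Pattern assumes the exact message wording seen in the dmesg line above.
SMU_LINE = re.compile(
    r"smu driver if version = (0x[0-9a-fA-F]+), "
    r"smu fw if version = (0x[0-9a-fA-F]+), "
    r"smu fw version = (0x[0-9a-fA-F]+)"
)


def parse_smu_line(line):
    """Hypothetical helper: return the SMU versions found in one dmesg line."""
    match = SMU_LINE.search(line)
    if match is None:
        return None
    driver_if, fw_if, fw_version = (int(group, 16) for group in match.groups())
    return {
        "driver_if": driver_if,
        "fw_if": fw_if,
        # Assumed layout: one byte each for major, minor and debug.
        "fw_version": (
            (fw_version >> 16) & 0xFF,
            (fw_version >> 8) & 0xFF,
            fw_version & 0xFF,
        ),
        "if_matches": driver_if == fw_if,
    }


line = (
    "[ 4.207712] smu driver if version = 0x0000000b, "
    "smu fw if version = 0x0000000e, smu fw version = 0x00374100 (55.65.0)"
)
info = parse_smu_line(line)
print(info)
```

On the line from my system this reports driver interface 11 against firmware interface 14, which is consistent with the "SMU driver if version not matched" message that follows it.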
There is a similar trace with a Radeon RX 5500M at the end of this report: https://bugzilla.kernel.org/show_bug.cgi?id=209225 It might have the same cause.
The same happens with the following setup:
- Kernel 5.9-rc8 with mostly Debian kernel config
- AMD Ryzen 5 4650G CPU
- MSI MAG B550M Mortar mainboard
- MSI AMD RX460 graphics card
The same happened with the following setup:
- Kernel 5.8.14 and 5.9 with mostly Gentoo kernel config
- AMD Ryzen 7 PRO 4750G CPU+iGPU
- ASRock A520M-ITX/ac mainboard + ECC UDIMM memory
The trace mentioned above disappeared when I updated the BIOS (v. 1.20 from 2020/9/18, which contains AGESA 1.0.8.0). However, I'm still not able to run ROCm OpenCL (tried various versions, including 3.7 and 3.8): the system either hangs or, if the program is killed early, dmesg shows
Evicting PASID 0x8001 queues
BTW, clinfo causes GPU resets and leaves GPU utilization at 99%, while dmesg shows something like
qcm fence wait loop timeout expired
The cp might be in an unrecoverable state due to an unsuccessful queues preemption
amdgpu: Failed to evict process queues
amdgpu: Failed to quiesce KFD
amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
[drm] free PSP TMP buffer
amdgpu 0000:07:00.0: amdgpu: GPU reset succeeded, trying to resume
...(and similarly for kernel 5.9.0)
This is probably off-topic, but it seems to be related to the amdgpu driver, and I don't know how to move forward (somebody reported that the ROCk 3.7 driver works well with the Renoir APU).
Hello,
On Wed, Oct 14, 2020 at 11:44, <bugzilla-daemon@bugzilla.kernel.org> wrote:
> - Kernel 5.8.14 and 5.9 with mostly Gentoo kernel config
> - AMD Ryzen 7 PRO 4750G CPU+iGPU
> - ASRock A520M-ITX/ac mainboard + ECC UDIMM memory
>
> The trace mentioned above disappeared when I updated BIOS (v. 1.20 from
> 2020/9/18, it contains AGESA 1.0.8.0). However, I'm still not able to run
> ROCm
I have updated my Gigabyte B550I AORUS PRO AX motherboard to BIOS F10 from 09/18/2020 with AMD AGESA ComboV2 1.0.8.1. The trace is still present, so this issue is still open for me.
> OpenCL (tried various versions, including 3.7 and 3.8), system either hangs,
> or (if the program is killed early) dmesg shows
>
> Evicting PASID 0x8001 queues
>
> BTW, clinfo causes GPU resets, and leaves 99% GPU utilization, while dmesg
> shows something like
>
> qcm fence wait loop timeout expired
> The cp might be in an unrecoverable state due to an unsuccessful queues
> preemption
> amdgpu: Failed to evict process queues
> amdgpu: Failed to quiesce KFD
> amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
> [drm] free PSP TMP buffer
> amdgpu 0000:07:00.0: amdgpu: GPU reset succeeded, trying to resume
> ...(and similarly for kernel 5.9.0)
>
> It is probably an off-topic, but it seems to be related to amdgpu driver, and
> I don't know how to move forward (and somebody reported that ROCk 3.7 driver
> works well with APU Renoir).
This all seems to be unrelated to my bug report.
best regards, Florian La Roche
This seems to be fixed after updating to BIOS F12 from 2021-01-18, BIOS Revision: 5.17. There are even newer BIOS revisions available, but they only work with RAM at 2133 MT/s instead of the usual 3200 MT/s and seem to be unstable. best regards, Florian La Roche