Bug 208981

Summary: trace with B550I AORUS PRO AX and AMD Ryzen 5 PRO 4650G
Product: Drivers
Reporter: Florian La Roche (florian.laroche)
Component: Video (DRI - non Intel)
Assignee: drivers_video-dri
Status: NEW
Severity: normal
CC: alexdeucher, anton, arthurborsboom, liliorg, tino+kernel
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 5.8.2
Subsystem:
Regression: No
Bisected commit-id:
Attachments: dmesg output

Description Florian La Roche 2020-08-20 19:03:59 UTC
Dear Maintainer,

compiling the current Debian source with a linux-5.8.2 kernel gives the
following trace on a B550I AORUS PRO AX with an AMD Ryzen 5 PRO 4650G:



[    3.974191] ------------[ cut here ]------------
[    3.974265] WARNING: CPU: 9 PID: 175 at drivers/gpu/drm/amd/amdgpu/../display/dc/clk_mgr/dcn21/rn_clk_mgr.c:654 rn_clk_mgr_construct+0x11e/0x390 [amdgpu]
[    3.974268] Modules linked in: hid_generic(E) usbhid(E) hid(E) amdgpu(E+) gpu_sched(E) i2c_algo_bit(E) ttm(E) drm_kms_helper(E) cec(E) ahci(E) libahci(E) nvme(E) xhci_pci(E) nvme_core(E) crc32_pclmul(E) xhci_hcd(E) r8169(E) t10_pi(E) crc32c_intel(E) realtek(E) libata(E) crc_t10dif(E) drm(E) i2c_piix4(E) crct10dif_generic(E) mfd_core(E) crct10dif_pclmul(E) libphy(E) usbcore(E) crct10dif_common(E) scsi_mod(E) usb_common(E) wmi(E) video(E) gpio_amdpt(E) gpio_generic(E) button(E)
[    3.974284] CPU: 9 PID: 175 Comm: systemd-udevd Tainted: G            E     5.8.0-trunk-amd64 #1 Debian 5.8.2-1
[    3.974285] Hardware name: Gigabyte Technology Co., Ltd. B550I AORUS PRO AX/B550I AORUS PRO AX, BIOS F2a 06/16/2020
[    3.974348] RIP: 0010:rn_clk_mgr_construct+0x11e/0x390 [amdgpu]
[    3.974351] Code: 00 00 00 41 8b 8c c4 80 00 00 00 41 89 c1 89 c7 85 c9 74 10 41 8b 94 c4 84 00 00 00 85 d2 0f 85 87 01 00 00 48 83 e8 01 73 d9 <0f> 0b 83 7b 20 01 74 0c 81 bd e8 00 00 00 ff 14 37 00 7f 27 48 8b
[    3.974353] RSP: 0018:ffffa98a8068f850 EFLAGS: 00010297
[    3.974355] RAX: ffffffffffffffff RBX: ffff9a36d7eb2540 RCX: 0000000000000640
[    3.974356] RDX: 0000000000000000 RSI: ffffa98a8068f878 RDI: 0000000000000000
[    3.974357] RBP: ffff9a3625cf9800 R08: 0000000000000000 R09: 0000000000000000
[    3.974358] R10: 7fc9117fffffffff R11: ffff9a36d7d51000 R12: ffffa98a8068f878
[    3.974359] R13: ffff9a36d7eb2cc0 R14: ffff9a36bccc0000 R15: ffff9a36d7eb2540
[    3.974361] FS:  00007f53ebac18c0(0000) GS:ffff9a371f240000(0000) knlGS:0000000000000000
[    3.974362] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    3.974363] CR2: 00007f53ebaaaee0 CR3: 00000003d7f38000 CR4: 0000000000340ee0
[    3.974365] Call Trace:
[    3.974427]  dc_clk_mgr_create+0x179/0x1a0 [amdgpu]
[    3.974488]  dc_create+0x238/0x700 [amdgpu]
[    3.974493]  ? _cond_resched+0x16/0x40
[    3.974554]  amdgpu_dm_init.isra.0+0x15b/0x1c0 [amdgpu]
[    3.974614]  dm_hw_init+0xe/0x20 [amdgpu]
[    3.974676]  amdgpu_device_init.cold+0x17a7/0x192b [amdgpu]
[    3.974722]  amdgpu_driver_load_kms+0x5c/0x220 [amdgpu]
[    3.974766]  amdgpu_pci_probe+0x15f/0x1f0 [amdgpu]
[    3.974770]  local_pci_probe+0x42/0x80
[    3.974772]  ? _cond_resched+0x16/0x40
[    3.974773]  pci_device_probe+0xfa/0x1b0
[    3.974776]  really_probe+0x160/0x400
[    3.974777]  driver_probe_device+0xe1/0x150
[    3.974779]  device_driver_attach+0xa1/0xb0
[    3.974780]  __driver_attach+0x8a/0x150
[    3.974781]  ? device_driver_attach+0xb0/0xb0
[    3.974782]  ? device_driver_attach+0xb0/0xb0
[    3.974784]  bus_for_each_dev+0x78/0xc0
[    3.974786]  bus_add_driver+0x12b/0x1e0
[    3.974787]  driver_register+0x8b/0xe0
[    3.974789]  ? 0xffffffffc0a6b000
[    3.974791]  do_one_initcall+0x46/0x200
[    3.974792]  ? _cond_resched+0x16/0x40
[    3.974794]  ? kmem_cache_alloc_trace+0x192/0x220
[    3.974796]  ? do_init_module+0x23/0x250
[    3.974798]  do_init_module+0x5c/0x250
[    3.974799]  __do_sys_finit_module+0xac/0x110
[    3.974802]  do_syscall_64+0x4d/0xc0
[    3.974804]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[    3.974805] RIP: 0033:0x7f53ebf6ba79
[    3.974807] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e7 53 0c 00 f7 d8 64 89 01 48
[    3.974809] RSP: 002b:00007ffde26b8228 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[    3.974811] RAX: ffffffffffffffda RBX: 000055c5cb205da0 RCX: 00007f53ebf6ba79
[    3.974812] RDX: 0000000000000000 RSI: 00007f53ec0f6e4d RDI: 0000000000000012
[    3.974813] RBP: 0000000000020000 R08: 0000000000000000 R09: 000055c5cb205fb8
[    3.974814] R10: 0000000000000012 R11: 0000000000000246 R12: 00007f53ec0f6e4d
[    3.974815] R13: 0000000000000000 R14: 000055c5cb206e20 R15: 000055c5cb205da0
[    3.974817] ---[ end trace 071eac41bffe7f9b ]---


best regards,

Florian La Roche
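
For anyone who wants to check quickly whether a given dmesg capture still contains this warning (for example after a BIOS or kernel update), a minimal Python sketch follows. The regular expression and the helper name find_warnings are illustrative assumptions, not part of any kernel tooling, and the pattern assumes the WARNING line has not been wrapped by the terminal or mail client.

```python
import re

# Assumed layout of the kernel WARNING line, based on the trace in this report:
#   WARNING: CPU: N PID: M at <file>:<line> <func>+<offset>/<size> [<module>]
WARN_RE = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at "
    r"(?P<file>\S+):(?P<line>\d+) "
    r"(?P<func>[\w.]+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
    r"(?: \[(?P<module>\w+)\])?"
)


def find_warnings(dmesg_text):
    """Return one dict per WARNING line found in the given dmesg text."""
    return [m.groupdict() for m in WARN_RE.finditer(dmesg_text)]


# The WARNING line from the description above, unwrapped onto one line.
sample = (
    "[    3.974265] WARNING: CPU: 9 PID: 175 at "
    "drivers/gpu/drm/amd/amdgpu/../display/dc/clk_mgr/dcn21/rn_clk_mgr.c:654 "
    "rn_clk_mgr_construct+0x11e/0x390 [amdgpu]"
)

hits = find_warnings(sample)
for hit in hits:
    print(hit["module"], hit["func"], hit["file"], hit["line"])
```

Feeding it the full text of a saved dmesg file instead of the sample string gives an empty list when the trace is gone.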
Comment 1 Alex Deucher 2020-08-20 19:08:18 UTC
Is this a regression?  If so, can you bisect?  Please attach your full dmesg output.
Comment 2 Florian La Roche 2020-08-20 19:28:01 UTC
This is a new system, so I cannot say whether it is a regression. I'll try
some older kernels over the next few days and report back here.

Thanks a lot,

Florian La Roche
Comment 3 Florian La Roche 2020-08-20 19:29:07 UTC
Created attachment 292039
dmesg output

Full dmesg output of the system.
Comment 4 ahren 2020-09-06 00:43:20 UTC
Hi, have you solved your problem? I have the same problem as you.
Here is my environment:


asrock A520M-ITX/AC + AMD 4750G
Ubuntu 20.04.1 LTS + AMDGPU-pro-20.30-1109583 - Ubuntu-20.04.tar.xz
Comment 5 Florian La Roche 2020-09-06 08:14:12 UTC
I have tested kernels 5.6.19, 5.7.10 and 5.7.17 and they all show this problem.
I assume your report means this also happens on a 5.4.x kernel (Ubuntu 20.04 LTS).

The display seems to work OK and I am mostly using this as a server machine, so
maybe not a huge problem, but still a trace on each reboot... :-)
(The kernel source mentions Display Port, clock values for power management
etc. ???)

Also seems to depend on BIOS data (?), so I'll check again on future
BIOS versions as well as future kernel source code for fixes.

[    4.207712] smu driver if version = 0x0000000b, smu fw if version = 0x0000000e, smu fw version = 0x00374100 (55.65.0)
[    4.207717] SMU driver if version not matched
[    4.207795] SMU is initialized successfully!

best regards,

Florian La Roche
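
To compare the SMU lines above across BIOS versions, a small Python sketch can pull the three values out of the log line. The helper name parse_smu_line is hypothetical, and the split of the firmware version into major.minor.patch bytes is an assumption that happens to reproduce the "(55.65.0)" printed in this log; it is not taken from the driver source.

```python
import re

# Matches the "smu driver if version = ..., smu fw if version = ..., smu fw version = ..." line.
SMU_RE = re.compile(
    r"smu driver if version = (0x[0-9a-fA-F]+), "
    r"smu fw if version = (0x[0-9a-fA-F]+), "
    r"smu fw version = (0x[0-9a-fA-F]+)"
)


def parse_smu_line(line):
    """Extract interface and firmware versions; return None if the line does not match."""
    m = SMU_RE.search(line)
    if m is None:
        return None
    driver_if, fw_if, fw = (int(group, 16) for group in m.groups())
    return {
        "driver_if": driver_if,
        "fw_if": fw_if,
        # Assumed byte layout: bits 16-23 major, bits 8-15 minor, bits 0-7 patch.
        "fw_version": f"{(fw >> 16) & 0xFF}.{(fw >> 8) & 0xFF}.{fw & 0xFF}",
        "if_matched": driver_if == fw_if,
    }


sample_line = (
    "[    4.207712] smu driver if version = 0x0000000b, "
    "smu fw if version = 0x0000000e, smu fw version = 0x00374100 (55.65.0)"
)
info = parse_smu_line(sample_line)
print(info)
```

Here the driver interface version (0x0b) differs from the firmware interface version (0x0e), which is what the "SMU driver if version not matched" message reports.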
Comment 6 Arthur Borsboom 2020-09-12 09:44:35 UTC
Similar trace with Radeon RX 5500M, see at the end of this report:

https://bugzilla.kernel.org/show_bug.cgi?id=209225

It might be related to the same cause.
Comment 7 Tino Mettler 2020-10-07 07:39:24 UTC
The same happens with the following setup:

- Kernel 5.9-rc8 with mostly Debian kernel config
- AMD Ryzen 5 4650G CPU
- MSI MAG B550M Mortar mainboard
- MSI AMD RX460 graphics card
Comment 8 Anton Repko 2020-10-14 09:44:26 UTC
The same happened with the following setup:

- Kernel 5.8.14 and 5.9 with mostly Gentoo kernel config
- AMD Ryzen 7 PRO 4750G CPU+iGPU
- ASRock A520M-ITX/ac mainboard + ECC UDIMM memory

The trace mentioned above disappeared when I updated the BIOS (v. 1.20 from 2020/9/18, which contains AGESA 1.0.8.0). However, I'm still not able to run ROCm OpenCL (tried various versions, including 3.7 and 3.8): the system either hangs or, if the program is killed early, dmesg shows

 Evicting PASID 0x8001 queues

BTW, clinfo causes GPU resets, and leaves 99% GPU utilization, while dmesg shows something like

 qcm fence wait loop timeout expired
 The cp might be in an unrecoverable state due to an unsuccessful queues preemption
 amdgpu: Failed to evict process queues
 amdgpu: Failed to quiesce KFD
 amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
 [drm] free PSP TMP buffer
 amdgpu 0000:07:00.0: amdgpu: GPU reset succeeded, trying to resume
...(and similarly for kernel 5.9.0)

This is probably off-topic, but it seems to be related to the amdgpu driver, and I don't know how to move forward (somebody reported that the ROCk 3.7 driver works well with Renoir APUs).
Comment 9 florian.laroche 2020-10-16 08:36:37 UTC
Hello,

On Wed, 14 Oct 2020 at 11:44,
<bugzilla-daemon@bugzilla.kernel.org> wrote:
> - Kernel 5.8.14 and 5.9 with mostly Gentoo kernel config
> - AMD Ryzen 7 PRO 4750G CPU+iGPU
> - ASRock A520M-ITX/ac mainboard + ECC UDIMM memory
>
> The trace mentioned above disappeared when I updated BIOS (v. 1.20 from
> 2020/9/18, it contains AGESA 1.0.8.0). However, I'm still not able to run
> ROCm

I have updated my motherboard Gigabyte B550I AORUS PRO AX to
BIOS F10 from 09/18/2020 with AMD AGESA ComboV2 1.0.8.1.

The trace is still present, so this issue is still open for me.


> OpenCL (tried various versions, including 3.7 and 3.8), system either hangs,
> or
> (if the program is killed early) dmesg shows
>
>  Evicting PASID 0x8001 queues
>
> BTW, clinfo causes GPU resets, and leaves 99% GPU utilization, while dmesg
> shows something like
>
>  qcm fence wait loop timeout expired
>  The cp might be in an unrecoverable state due to an unsuccessful queues
>  preemption
>  amdgpu: Failed to evict process queues
>  amdgpu: Failed to quiesce KFD
>  amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
>  [drm] free PSP TMP buffer
>  amdgpu 0000:07:00.0: amdgpu: GPU reset succeeded, trying to resume
> ...(and similarly for kernel 5.9.0)
>
> It is probably an off-topic, but it seems to be related to amdgpu driver, and
> I
> don't know how to move forward (and somebody reported that ROCk 3.7 driver
> works well with APU Renoir).


This all seems to be unrelated to my bug report.

best regards,

Florian La Roche
Comment 10 Florian La Roche 2021-08-04 13:19:02 UTC
This seems to be fixed after updating to BIOS F12 from 2021-01-18,
BIOS Revision: 5.17.

There are even newer BIOS revisions available, but they only work with RAM at
2133 MT/s instead of the usual 3200 MT/s and seem to be unstable.

best regards,

Florian La Roche