Bug 214807 - AMD 5700G iGPU crash / freeze on OpenCL application
Summary: AMD 5700G iGPU crash / freeze on OpenCL application
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(Other)
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_video-other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-10-25 04:34 UTC by Liberty
Modified: 2022-01-14 09:27 UTC

See Also:
Kernel Version: 5.15.0 RC6
Subsystem:
Regression: No
Bisected commit-id:


Description Liberty 2021-10-25 04:34:46 UTC
I am unable to get OpenCL working reliably for darktable or DaVinci Resolve on linux 5.14.14 (with linux-firmware) or linux-mainline 5.15.0.rc6 (with linux-firmware-git), using opencl-amd 21.30.1290604-1. I cannot use opencl-mesa, since it supports none of the programs I need to run.
On both the stable and mainline kernels, clinfo and "darktable-cltest" work only some of the time. Then it crashes, the GPU is reset, and then it freezes.

[code]
[   51.181768] ------------[ cut here ]------------
[   51.181771] WARNING: CPU: 9 PID: 235 at drivers/gpu/drm/ttm/ttm_bo.c:409 ttm_bo_release+0x2d1/0x300 [ttm]
[   51.181782] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device cfg80211 uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev mc mousedev joydev uas usb_storage intel_rapl_msr nct6775 intel_rapl_common hwmon_vid amdgpu snd_hda_codec_realtek edac_mce_amd snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi eeepc_wmi kvm_amd snd_hda_intel gpu_sched asus_wmi nls_iso8859_1 i2c_algo_bit sparse_keymap snd_intel_dspcfg platform_profile drm_ttm_helper snd_intel_sdw_acpi vfat rfkill kvm snd_hda_codec usbhid video wmi_bmof ttm fat irqbypass snd_hda_core crct10dif_pclmul drm_kms_helper crc32_pclmul ghash_clmulni_intel snd_hwdep aesni_intel snd_pcm cec crypto_simd agpgart snd_timer cryptd syscopyarea snd sp5100_tco sysfillrect rapl sysimgblt pcspkr k10temp i2c_piix4 ccp fb_sys_fops soundcore igc wmi tpm_crb mac_hid tpm_tis tpm_tis_core tpm gpio_amdpt pinctrl_amd gpio_generic rng_core acpi_cpufreq sg crypto_user drm fuse bpf_preload ip_tables
[   51.181817]  x_tables btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq xhci_pci crc32c_intel xhci_pci_renesas
[   51.181821] CPU: 9 PID: 235 Comm: kworker/9:2 Not tainted 5.15.0-rc6-1-mainline #1 2eb2dce07dbd87701c12affdd03a7e57c707456d
[   51.181823] Hardware name: ASUS System Product Name/ROG STRIX B550-I GAMING, BIOS 2423 08/11/2021
[   51.181824] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
[   51.181905] RIP: 0010:ttm_bo_release+0x2d1/0x300 [ttm]
[   51.181909] Code: 8d b6 b8 fe ff ff e8 7e 12 9a ff 49 8b 76 08 48 89 ef e8 b2 24 00 00 49 8b 7e 98 e9 70 fd ff ff e8 b4 b3 6e d5 e9 aa fd ff ff <0f> 0b e9 50 fd ff ff e8 b3 b1 6e d5 e9 f8 fe ff ff be 03 00 00 00
[   51.181910] RSP: 0018:ffffb02a80d67cc0 EFLAGS: 00010202
[   51.181911] RAX: 0000000000000001 RBX: ffffb02a80d67d08 RCX: 000000008040002d
[   51.181912] RDX: 0000000000000001 RSI: 000000008040002d RDI: ffff9f5ae5001db8
[   51.181913] RBP: ffff9f5ae6b05270 R08: ffff9f5ae5001db8 R09: 0000000000000000
[   51.181913] R10: 0000000000000001 R11: 0000000000000000 R12: ffff9f5b1bb35e30
[   51.181914] R13: ffff9f5ae5001c58 R14: ffff9f5ae5001db8 R15: dead000000000100
[   51.181914] FS:  0000000000000000(0000) GS:ffff9f65bde40000(0000) knlGS:0000000000000000
[   51.181915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   51.181916] CR2: 000055a15fff4d38 CR3: 00000002bee10000 CR4: 0000000000750ee0
[   51.181916] PKRU: 55555554
[   51.181917] Call Trace:
[   51.181921]  amdgpu_bo_unref+0x1a/0x30 [amdgpu 15265334394386c2d975b46dc2248cfec063d665]
[   51.181977]  amdgpu_gem_object_free+0x30/0x50 [amdgpu 15265334394386c2d975b46dc2248cfec063d665]
[   51.182030]  amdgpu_amdkfd_gpuvm_free_memory_of_gpu+0x359/0x3c0 [amdgpu 15265334394386c2d975b46dc2248cfec063d665]
[   51.182098]  kfd_process_device_free_bos+0x9f/0xf0 [amdgpu 15265334394386c2d975b46dc2248cfec063d665]
[   51.182158]  kfd_process_wq_release+0x20d/0x2e0 [amdgpu 15265334394386c2d975b46dc2248cfec063d665]
[   51.182215]  process_one_work+0x1e8/0x3c0
[   51.182219]  worker_thread+0x50/0x3b0
[   51.182220]  ? process_one_work+0x3c0/0x3c0
[   51.182221]  kthread+0x132/0x160
[   51.182223]  ? set_kthread_struct+0x40/0x40
[   51.182223]  ret_from_fork+0x22/0x30
[   51.182226] ---[ end trace c1ead71d4c485365 ]---
[   62.360928] amdgpu: qcm fence wait loop timeout expired
[   62.360933] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
[   62.360940] amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
[   62.360935] amdgpu: Failed to evict process queues
[   62.360958] amdgpu: Failed to quiesce KFD
[   62.391040] [drm] free PSP TMR buffer
[   62.419108] amdgpu 0000:07:00.0: amdgpu: MODE2 reset
[   62.419684] amdgpu 0000:07:00.0: amdgpu: GPU reset succeeded, trying to resume
[   62.419799] [drm] PCIE GART of 1024M enabled.
[   62.419801] [drm] PTB located at 0x000000F400900000
[   62.420039] [drm] PSP is resuming...
[   62.440067] [drm] reserve 0x400000 from 0xf7ff800000 for PSP TMR
[   62.519663] amdgpu 0000:07:00.0: amdgpu: RAS: optional ras ta ucode is not available
[   62.527921] amdgpu 0000:07:00.0: amdgpu: RAP: optional rap ta ucode is not available
[   62.527923] amdgpu 0000:07:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[   62.527925] amdgpu 0000:07:00.0: amdgpu: SMU is resuming...
[   62.528936] amdgpu 0000:07:00.0: amdgpu: SMU is resumed successfully!
[   62.718941] [drm] kiq ring mec 2 pipe 1 q 0
[   62.720190] [drm] DMUB hardware initialized: version=0x01010019
[   62.779619] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[   62.779665] [drm] JPEG decode initialized successfully.
[   62.779667] amdgpu 0000:07:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
[   62.779669] amdgpu 0000:07:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[   62.779670] amdgpu 0000:07:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[   62.779671] amdgpu 0000:07:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[   62.779672] amdgpu 0000:07:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[   62.779672] amdgpu 0000:07:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[   62.779673] amdgpu 0000:07:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[   62.779674] amdgpu 0000:07:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[   62.779674] amdgpu 0000:07:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[   62.779675] amdgpu 0000:07:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[   62.779676] amdgpu 0000:07:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
[   62.779677] amdgpu 0000:07:00.0: amdgpu: ring vcn_dec uses VM inv eng 1 on hub 1
[   62.779677] amdgpu 0000:07:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 4 on hub 1
[   62.779678] amdgpu 0000:07:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 5 on hub 1
[   62.779678] amdgpu 0000:07:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 1
[   62.782386] amdgpu 0000:07:00.0: amdgpu: recover vram bo from shadow start
[   62.782388] amdgpu 0000:07:00.0: amdgpu: recover vram bo from shadow done
[   62.782398] amdgpu 0000:07:00.0: amdgpu: GPU reset(1) succeeded!
[/code]
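
For anyone comparing their own logs against the one above, here is a minimal sketch (not part of the original report) that scans a saved dmesg dump for the same sequence of events: the ttm_bo.c warning, the fence wait timeout, and the start and end of the GPU reset. The marker strings are copied from the log above; the names `MARKERS` and `find_events` are hypothetical and chosen only for this illustration.

```python
import re

# Matches a dmesg line of the form "[   62.360940] message".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

# Substrings taken verbatim from the log in this report.
MARKERS = {
    "ttm_warning": "drivers/gpu/drm/ttm/ttm_bo.c",
    "fence_timeout": "qcm fence wait loop timeout expired",
    "reset_begin": "GPU reset begin!",
    "reset_done": "GPU reset(",
}


def find_events(log_text):
    """Return a list of (timestamp, marker_name) for each marker line found."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines without a dmesg timestamp
        timestamp, message = float(match.group(1)), match.group(2)
        for name, needle in MARKERS.items():
            if needle in message:
                events.append((timestamp, name))
                break
    return events


# Four lines copied from the log above, used as sample input.
sample = """\
[   51.181771] WARNING: CPU: 9 PID: 235 at drivers/gpu/drm/ttm/ttm_bo.c:409 ttm_bo_release+0x2d1/0x300 [ttm]
[   62.360928] amdgpu: qcm fence wait loop timeout expired
[   62.360940] amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
[   62.782398] amdgpu 0000:07:00.0: amdgpu: GPU reset(1) succeeded!
"""

for timestamp, name in find_events(sample):
    print(f"{timestamp:12.6f}  {name}")
```

The sketch only reports which markers appear and when. It makes no claim about the root cause, and other kernels or firmware versions may word these messages differently.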

The hardware is an Asus Strix B550 board with a 5700G, 64G of RAM, and 16G of gfx RAM. OpenCL works fine on Windows. I disabled the RAM XMP profile but still get crashes and freezes.
Sometimes OpenCL programs work for a short while. hashcat -b shows that performance on 5.15 is 10x that of the 5.14 kernel, but it crashes within a minute or so.

This is probably a kernel bug, possibly in combination with the amdgpu userspace driver.
