Bug 217141

Summary: [amdgpu] ring gfx_0.0.0 timeout steam deck AMD APU
Product: Drivers
Reporter: Serg Podtynnyi (serg)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED ANSWERED
Severity: high
CC: regressions
Priority: P1
Hardware: AMD
OS: Linux
Kernel Version: 6.1.12
Subsystem:
Regression: No
Bisected commit-id:

Description Serg Podtynnyi 2023-03-05 15:32:25 UTC
[  257.182206] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=26043, emitted[64/36172]
[  257.182668] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process NMS.exe pid 2571 thread NMS.exe pid 2571
[  257.183084] amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
[  257.183094] ------------[ cut here ]------------
[  257.183095] Evicting all processes
[  257.183151] WARNING: CPU: 6 PID: 745 at drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c:1935 kfd_suspend_all_processes+0x100/0x110 [amdgpu]
[  257.183562] Modules linked in: uinput snd_seq_dummy snd_hrtimer snd_seq snd_seq_device ccm algif_aead cbc des_generic libdes ecb md4 cmac algif_hash algif_skcipher af_alg bnep ramoops reed_solomon snd_acp5x_pcm_dma snd_soc_acp5x_mach snd_acp5x_i2s snd_sof_amd_rembrandt rtw88_8822ce snd_sof_amd_renoir rtw88_8822c snd_sof_amd_acp rtw88_pci intel_rapl_msr snd_sof_pci intel_rapl_common rtw88_core snd_sof edac_mce_amd snd_sof_utils btusb kvm_amd btrtl snd_pci_ps mac80211 snd_hda_codec_hdmi btbcm snd_soc_cs35l41_spi btintel kvm snd_soc_cs35l41 snd_rpl_pci_acp6x snd_hda_intel btmtk snd_soc_wm_adsp snd_intel_dspcfg cs_dsp snd_acp_pci libarc4 leds_steamdeck extcon_steamdeck snd_pci_acp6x snd_intel_sdw_acpi snd_soc_nau8821 snd_soc_cs35l41_lib steamdeck_hwmon irqbypass bluetooth snd_hda_codec snd_pci_acp5x snd_soc_core rapl snd_rn_pci_acp3x cfg80211 pcspkr snd_hda_core snd_compress i2c_piix4 mousedev cdc_acm ac97_bus snd_acp_config joydev ecdh_generic snd_pcm_dmaengine snd_hwdep snd_soc_acpi
[  257.183627]  snd_pci_acp3x snd_pcm dwc3_pci rfkill ina2xx_adc kfifo_buf snd_timer opt3001 ltrf216a steamdeck spi_amd ina2xx industrialio snd acpi_cpufreq mac_hid soundcore fuse ip_tables x_tables overlay ext4 crc16 mbcache jbd2 hid_steam usbhid amdgpu vfat fat gpu_sched drm_buddy serio_raw sdhci_pci nvme_tcp drm_display_helper atkbd cqhci libps2 nvme_fabrics crct10dif_pclmul vivaldi_fmap crc32_pclmul polyval_clmulni sdhci polyval_generic cec i8042 gf128mul nvme hid_multitouch drm_ttm_helper ghash_clmulni_intel xhci_pci sha512_ssse3 nvme_core aesni_intel crypto_simd sp5100_tco cryptd wdat_wdt ttm xhci_pci_renesas ccp mmc_core nvme_common serio video i2c_hid_acpi wmi 8250_dw i2c_hid btrfs blake2b_generic xor raid6_pq libcrc32c crc32c_generic crc32c_intel dm_mirror dm_region_hash dm_log dm_mod pkcs8_key_parser crypto_user
[  257.183700] CPU: 6 PID: 745 Comm: kworker/u32:7 Not tainted 6.1.12-valve2-1-neptune-61 #1 4091faa51bd1be3bbac5fd4c3ce3432202f24d92
[  257.183704] Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
[  257.183708] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
[  257.183718] RIP: 0010:kfd_suspend_all_processes+0x100/0x110 [amdgpu]
[  257.184119] Code: c7 c7 00 b3 3f c1 41 5c 41 5d e9 cb 4f 5f f1 be 03 00 00 00 e8 d1 89 a3 f1 e9 59 ff ff ff 48 c7 c7 14 a2 24 c1 e8 12 d6 06 f2 <0f> 0b e9 24 ff ff ff 0f 0b eb c5 0f 1f 44 00 00 66 0f 1f 00 0f 1f
[  257.184122] RSP: 0018:ffffad1140f67cf8 EFLAGS: 00010286
[  257.184125] RAX: 0000000000000000 RBX: ffff993b46b68400 RCX: 0000000000000027
[  257.184127] RDX: ffff993e6eda0728 RSI: 0000000000000001 RDI: ffff993e6eda0720
[  257.184128] RBP: ffff993b44620000 R08: 0000000000000000 R09: ffffad1140f67b78
[  257.184130] R10: 0000000000000003 R11: ffff993e7ef7ffe8 R12: ffffad1140f67dd0
[  257.184131] R13: 0000000000000000 R14: ffff993b89dbe400 R15: 0000000000000000
[  257.184133] FS:  0000000000000000(0000) GS:ffff993e6ed80000(0000) knlGS:0000000000000000
[  257.184135] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  257.184137] CR2: 000055d62521f000 CR3: 0000000108b04000 CR4: 0000000000350ee0
[  257.184139] Call Trace:
[  257.184143]  <TASK>
[  257.184147]  kgd2kfd_suspend.part.0+0x3d/0x40 [amdgpu ad613437896db6c29581f2be9152cc5a6dd35ad7]
[  257.184571]  kgd2kfd_pre_reset+0x47/0x60 [amdgpu ad613437896db6c29581f2be9152cc5a6dd35ad7]
[  257.184965]  amdgpu_device_gpu_recover.cold+0x119/0xb40 [amdgpu ad613437896db6c29581f2be9152cc5a6dd35ad7]
[  257.185430]  amdgpu_job_timedout+0x1dc/0x220 [amdgpu ad613437896db6c29581f2be9152cc5a6dd35ad7]
[  257.185866]  ? try_to_wake_up+0xd9/0x560
[  257.185874]  drm_sched_job_timedout+0x7a/0x110 [gpu_sched 32db77b2b4e1fdeaf45e32d64ce206e5c0ca90ae]
[  257.185885]  process_one_work+0x1c7/0x380
[  257.185892]  worker_thread+0x51/0x390
[  257.185897]  ? rescuer_thread+0x3b0/0x3b0
[  257.185901]  kthread+0xde/0x110
[  257.185905]  ? kthread_complete_and_exit+0x20/0x20
[  257.185909]  ret_from_fork+0x22/0x30
[  257.185917]  </TASK>
[  257.185918] ---[ end trace 0000000000000000 ]---
[  257.284610] amdgpu 0000:04:00.0: amdgpu: MODE2 reset
[  257.294783] amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume

cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-linux-neptune-61 console=tty1 rd.luks=0 rd.lvm=0 rd.md=0 rd.dm=0 rd.systemd.gpt_auto=no amdgpu.noretry=0 amdgpu.ppfeaturemask=0xffffbfff amdgpu.lockup_timeout=20000 amdgpu.job_hang_limit=2 drm.debug=0x1ff amdgpu.debug_evictions=true1 tsc=directsync module_blacklist=tpm log_buf_len=4M amd_iommu=off amdgpu.gttsize=8128 spi_amd.speed_dev=1 audit=0 fbcon=rotate:1 loglevel=3 splash quiet plymouth.ignore-serial-consoles fbcon=vc:4-6 steamos.efi=PARTUUID=8bdf3e52-bf2f-7c45-9f00-45e568aa5af0


Linux Thorax 6.1.12-valve2-1-neptune-61 #1 SMP PREEMPT_DYNAMIC Mon, 27 Feb 2023 21:06:42 +0000 x86_64 GNU/Linux


Devices:
========
GPU0:
        apiVersion         = 4206830 (1.3.238)
        driverVersion      = 96469091 (0x5c00063)
        vendorID           = 0x1002
        deviceID           = 0x163f
        deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
        deviceName         = AMD Custom GPU 0405 (RADV VANGOGH)
        driverID           = DRIVER_ID_MESA_RADV
        driverName         = radv
        driverInfo         = Mesa 23.1.0-devel (git-16283f7b97)
        conformanceVersion = 1.3.0.0
        deviceUUID         = 00000000-0400-0000-0000-000000000000
        driverUUID         = 414d442d-4d45-5341-2d44-525600000000

Comment 1 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-03-06 08:55:18 UTC
Sorry for causing you trouble (note: I'm just the messenger here), but most of the core graphics driver developers, like many other kernel developers, don't really look at this bug tracker. Please report the issue to the following place instead, as that's where the developers of the driver in question expect issues to be reported: https://gitlab.freedesktop.org/drm/amd/-/issues

If you do so, it would be great if you could afterwards share the link to your report here.