Bug 203111

Summary: Unrecoverable GPU crash with DiRT 4
Product: Drivers
Reporter: Thomas (v10lator)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED INVALID
Severity: normal
CC: alexdeucher
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 5.0.4
Subsystem:
Regression: No
Bisected commit-id:

Description Thomas 2019-03-30 09:29:31 UTC
I just played the Linux version of DiRT 4, and after a few rounds of driving the screen froze. The game sound kept playing, but the keyboard didn't respond to any input. So I decided to SSH into the PC and check the logs. This is what I found:

> [52700.498697] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, signaled
> seq=1423558, emitted seq=1423560
> [52700.498702] [drm:amdgpu_job_timedout] *ERROR* Process information: process
> Dirt4 pid 10332 thread WebViewRenderer pid 10391
> [52700.498705] amdgpu 0000:01:00.0: GPU reset begin!
> [52710.728397] [drm:amdgpu_dm_atomic_check] *ERROR* [CRTC:47:crtc-0] hw_done
> or flip_done timed out

After some time sound stopped and the log showed:

> [52873.699280] WARNING: CPU: 2 PID: 4034 at kernel/kthread.c:529
> kthread_park+0x67/0x78
> [52873.699283] Modules linked in: nfsd
> [52873.699287] CPU: 2 PID: 4034 Comm: TaskSchedulerFo Not tainted 5.0.4 #1
> [52873.699288] Hardware name: To be filled by O.E.M. To be filled by
> O.E.M./SABERTOOTH 990FX R2.0, BIOS 2901 05/04/2016
> [52873.699290] RIP: 0010:kthread_park+0x67/0x78
> [52873.699291] Code: 18 e8 9d 78 aa 00 be 40 00 00 00 48 89 df e8 60 72 00 00
> 48 85 c0 74 1b 31 c0 5b 5d c3 0f 0b eb ae 0f 0b b8 da ff ff ff eb f0 <0f> 0b
> b8 f0 ff ff ff eb e7 0f 0b eb e3 0f 1f 40 00 f6 47 26 20 74
> [52873.699293] RSP: 0018:ffffa0144460fb78 EFLAGS: 00210202
> [52873.699294] RAX: 0000000000000004 RBX: ffff9155631210c0 RCX:
> 0000000000000000
> [52873.699295] RDX: ffff9155ef427428 RSI: ffff9155631210c0 RDI:
> ffff9155ef9bbfc0
> [52873.699296] RBP: ffff9155f013b8a0 R08: ffff9155f2a97480 R09:
> ffff9155f2a94a00
> [52873.699297] R10: 0000d46d0abbfe3a R11: 000033d8b581bc78 R12:
> ffff9155ef422790
> [52873.699298] R13: ffff9155a2f83c00 R14: 0000000000000202 R15:
> dead000000000100
> [52873.699299] FS:  00007fc756cff700(0000) GS:ffff9155f2a80000(0000)
> knlGS:0000000000000000
> [52873.699301] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [52873.699302] CR2: 00007fc7650b8070 CR3: 0000000322b86000 CR4:
> 00000000000406e0
> [52873.699302] Call Trace:
> [52873.699307]  drm_sched_entity_fini+0x32/0x180
> [52873.699309]  amdgpu_vm_fini+0xa8/0x520
> [52873.699311]  ? idr_destroy+0x78/0xc0
> [52873.699313]  amdgpu_driver_postclose_kms+0x14c/0x268
> [52873.699316]  drm_file_free.part.7+0x21a/0x2f8
> [52873.699318]  drm_release+0xa5/0x120
> [52873.699320]  __fput+0x9a/0x1c8
> [52873.699323]  task_work_run+0x8a/0xb0
> [52873.699325]  do_exit+0x2b5/0xb30
> [52873.699326]  do_group_exit+0x35/0x98
> [52873.699328]  get_signal+0xbd/0x690
> [52873.699331]  ? _raw_spin_unlock+0xd/0x20
> [52873.699333]  ? do_signal+0x2b/0x6b8
> [52873.699335]  ? __x64_sys_futex+0x137/0x178
> [52873.699337]  ? exit_to_usermode_loop+0x46/0xa0
> [52873.699338]  ? do_syscall_64+0x14c/0x178
> [52873.699339]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [52873.699341] ---[ end trace 1e1efc0508ef22df ]---
> [52875.619562] [drm] Skip scheduling IBs!
> [52875.625247] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser
> -125!
> [52885.826983] [drm:drm_atomic_helper_wait_for_flip_done] *ERROR*
> [CRTC:47:crtc-0] flip_done timed out
> [52896.066581] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR*
> [CRTC:47:crtc-0] flip_done timed out
> [52906.306280] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR*
> [PLANE:45:plane-5] flip_done timed out

I tried a soft reboot through SSH but it didn't work, so in the end I had to do a hard reset by removing power. This is on a Radeon RX 580.
Comment 1 Alex Deucher 2019-04-01 16:02:46 UTC
This is probably a mesa bug.  I'd suggest trying a new version of mesa or filing a mesa bug.
Comment 2 Thomas 2019-04-02 07:38:45 UTC
(In reply to Alex Deucher from comment #1)
> This is probably a mesa bug.  I'd suggest trying a new version of mesa

That helped, thank you.
Comment 3 Thomas 2019-04-05 18:44:44 UTC
(In reply to Alex Deucher from comment #1)
> I'd suggest trying a new version of mesa

I was too quick to close this: it crashes with newer mesa too, just (subjectively) less frequently.

Here's a log from mesa 19.0.1:

> [178793.032358] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, signaled
> seq=12332054, emitted seq=12332056
> [178793.032362] [drm:amdgpu_job_timedout] *ERROR* Process information:
> process Dirt4 pid 31348 thread WebViewRenderer pid 31422
> [178793.032365] amdgpu 0000:01:00.0: GPU reset begin!
> [178803.262008] [drm:amdgpu_dm_atomic_check] *ERROR* [CRTC:47:crtc-0] hw_done
> or flip_done timed out

And from git (26e161b1e9):

> [ 7819.095648] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, signaled
> seq=2652771, emitted seq=2652773
> [ 7819.095652] [drm:amdgpu_job_timedout] *ERROR* Process information: process
> Dirt4 pid 3075 thread WebViewRenderer pid 3152
> [ 7819.095655] amdgpu 0000:01:00.0: GPU reset begin!
> [ 7829.315220] [drm:amdgpu_dm_atomic_check] *ERROR* [CRTC:47:crtc-0] hw_done
> or flip_done timed out

Not sure if the log is shorter because of the new mesa or the new kernel (updated from 5.0.4 to 5.0.5).

Are you sure this could be a mesa bug? Just asking because to me a hanging kernel sounds like a kernel bug.
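
For anyone comparing the three timeout messages above, here is a minimal sketch that pulls out the two sequence numbers and prints their difference. It assumes the wrapped quote lines have been joined back into single lines, and it assumes that "emitted" minus "signaled" is the number of fences still outstanding on the ring when the timeout fired. The names `TIMEOUT_RE` and `pending_fences` are made up for this example.

```python
import re

# Matches the amdgpu ring timeout message as it appears in the logs above,
# after the wrapped quote lines have been joined into one line.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\w+) timeout, signaled seq=(?P<signaled>\d+), "
    r"emitted seq=(?P<emitted>\d+)"
)


def pending_fences(line):
    """Return (ring name, emitted - signaled) for a timeout line, else None."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    return m.group("ring"), int(m.group("emitted")) - int(m.group("signaled"))


# The three timeout lines from this report (kernel 5.0.4, mesa 19.0.1, mesa git).
LOGS = [
    "[52700.498697] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, "
    "signaled seq=1423558, emitted seq=1423560",
    "[178793.032358] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, "
    "signaled seq=12332054, emitted seq=12332056",
    "[ 7819.095648] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, "
    "signaled seq=2652771, emitted seq=2652773",
]

for line in LOGS:
    print(pending_fences(line))  # each of the three lines yields ('gfx', 2)
```

In all three logs the gap is the same, which at least suggests the hang looks similar across the mesa versions tried.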
Comment 4 Alex Deucher 2019-04-05 20:31:41 UTC
(In reply to Thomas from comment #3)
> 
> Are you sure this could be a mesa bug? Just asking because to me a hanging
> kernel sounds like a kernel bug.

Likely a mesa bug.  Mesa submits gfx/video/compute jobs to the kernel driver.  If there are subtle bugs in those jobs, the GPU can hang.  The kernel driver can reset the GPU, but the display server needs to catch the reset and properly re-initialize its context and buffers.  At the moment, none of the display servers do this so you need to restart them after a GPU reset.

The:
[drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
error is because userspace tried to submit more work to the kernel after a reset without re-initializing its context, so the kernel rejects it.
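
As a rough illustration of how to read that number: kernel log messages like this print a negated Linux errno value, and 125 is ECANCELED in Linux's generic errno numbering. The sketch below decodes it from the log line. The table is a small hand-copied subset of Linux errno values, hardcoded so the result does not depend on the machine running the script; the names `LINUX_ERRNO` and `decode_parser_error` are invented for this example.

```python
import re

# Small subset of Linux errno values (asm-generic numbering), hardcoded on
# purpose so the lookup gives the same answer on any platform.
LINUX_ERRNO = {
    12: "ENOMEM",
    22: "EINVAL",
    62: "ETIME",
    125: "ECANCELED",
}


def decode_parser_error(line):
    """Return the errno name for a 'Failed to initialize parser -N!' line."""
    m = re.search(r"Failed to initialize parser (-\d+)", line)
    if m is None:
        return None
    code = -int(m.group(1))  # the log prints the negated errno value
    return LINUX_ERRNO.get(code, f"errno {code}")


line = "[drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!"
print(decode_parser_error(line))  # ECANCELED
```

ECANCELED ("operation canceled") fits the explanation above: the submission is refused rather than executed.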
Comment 5 Thomas 2019-04-05 21:15:40 UTC
Thanks a lot for the detailed answer. I'm still not sure if I understand everything correctly (shouldn't the kernel driver validate the command stream from userspace/mesa and stop bad things before they hit hardware / hang the GPU?) but I'll close this now and check for or open a new mesa bug report tomorrow (I really need sleep now).

Damn, if this wouldn't be the wrong place I would ask for more details about your last reply (the thing about the display servers not catching up with the GPU reset - aren't there drivers which perform GPU resets just nice under X11 already? What about Wayland?). It's so freaking nice, I bet I would learn a lot if we would continue the discussion... Anyway, thanks again for explaining and sorry for me going a bit off topic in this reply.


One last thing... It's extremely off topic but I already derailed this reply and it has to be told: Thank you Alex for being the guy you are. I bet AMD doesn't pay you to explain technical details to stupid end users like me but that's very appreciated. You're a hero, keep on rockin'!
Comment 6 Alex Deucher 2019-04-09 01:39:44 UTC
(In reply to Thomas from comment #5)
> Thanks a lot for the detailed answer. I'm still not sure if I understand
> everything correctly (shouldn't the kernel driver validate the command
> stream from userspace/mesa and stop bad things before they hit hardware /
> hang the GPU?) 

It's not really feasible.  For one, it adds a lot of CPU overhead.  There is also so much state in the 3D pipeline it's nearly impossible to validate all of the possible cases that could cause a hang.  In some cases, you may not even know that a particular combination is bad until it gets hit.

> 
> Damn, if this wouldn't be the wrong place I would ask for more details about
> your last reply (the thing about the display servers not catching up with
> the GPU reset - aren't there drivers which perform GPU resets just nice
> under X11 already? What about Wayland?). It's so freaking nice, I bet I
> would learn a lot if we would continue the discussion... Anyway, thanks again
> for explaining and sorry for me going a bit off topic in this reply.

I'm not sure if other drivers silently reset the GPU when they encounter a hang.  It's generally easier to deal with on integrated GPUs since they operate on system memory.  On dGPUs, the contents of vram might be lost after a GPU reset as the memory controller is reset.  If vram is lost, the application that is running needs to reload its vram state.  Also for reliability, applications should really be made aware of a GPU reset so they can validate their data.  E.g., you don't want a scientific application to silently get bad data because the GPU was reset silently in the background.
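
To make the "application needs to be aware of the reset" point concrete, here is a toy model in Python. It is only a sketch under stated assumptions: `ToyGpu`, `ToyContext`, `ContextLost`, `upload` and `render` are all hypothetical names, not any real driver or Mesa API. In real graphics APIs the notification would come from something like OpenGL's `glGetGraphicsResetStatus` (robustness) or Vulkan's `VK_ERROR_DEVICE_LOST`; the toy only models the idea that a context created before a reset is rejected, and that the application has to recreate it and re-upload its data.

```python
class ContextLost(Exception):
    """Raised when work is submitted on a context that predates a GPU reset."""


class ToyGpu:
    """Hypothetical device: counts resets and loses its vram on each one."""

    def __init__(self):
        self.reset_count = 0
        self.vram = {}

    def reset(self):
        self.reset_count += 1
        self.vram.clear()  # models the "contents of vram might be lost" case


class ToyContext:
    """Hypothetical context: only valid for the reset epoch it was created in."""

    def __init__(self, gpu):
        self.gpu = gpu
        self.created_at = gpu.reset_count

    def submit(self, buffer_name):
        if self.created_at != self.gpu.reset_count:
            # Analogous to the kernel rejecting work from a stale context.
            raise ContextLost("context is stale after GPU reset")
        return self.gpu.vram[buffer_name]


def upload(gpu, data):
    """Copy the application's buffers into the toy vram."""
    gpu.vram.update(data)


def render(gpu, ctx, assets, name):
    """Submit work; on a detected reset, rebuild context and vram, retry once."""
    try:
        return ctx, ctx.submit(name)
    except ContextLost:
        ctx = ToyContext(gpu)  # re-initialize the context
        upload(gpu, assets)    # reload the vram state from the app's own copy
        return ctx, ctx.submit(name)


gpu = ToyGpu()
assets = {"track": "mesh-v1"}
upload(gpu, assets)
ctx = ToyContext(gpu)
ctx, out = render(gpu, ctx, assets, "track")  # normal frame
gpu.reset()                                   # simulated hang + reset
ctx, out = render(gpu, ctx, assets, "track")  # recovers explicitly
print(out)
```

The important part is that recovery happens in the application, which knows the reset occurred and still has a trusted copy of its data; nothing is silently patched up behind its back.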

> 
> 
> One last thing... It's extremely off topic but I already derailed this reply
> and it has to be told: Thank you Alex for being the guy you are. I bet AMD
> doesn't pay you to explain technical details to stupid end users like me but
> that's very appreciated. You're a hero, keep on rockin'!

Thanks!  Glad to help.