Bug 203111
| Summary: | Unrecoverable GPU crash with DiRT 4 | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | Thomas (v10lator) |
| Component: | Video(DRI - non Intel) | Assignee: | drivers_video-dri |
| Status: | RESOLVED INVALID | | |
| Severity: | normal | CC: | alexdeucher |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 5.0.4 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description
Thomas
2019-03-30 09:29:31 UTC
This is probably a mesa bug. I'd suggest trying a new version of mesa or filing a mesa bug.

(In reply to Alex Deucher from comment #1)
> This is probably a mesa bug. I'd suggest trying a new version of mesa

That helped, thank you.

(In reply to Alex Deucher from comment #1)
> I'd suggest trying a new version of mesa

I was too fast with closing this: it crashes with newer mesa, too, just (subjectively) less frequently. Here's a log from mesa 19.0.1:

> [178793.032358] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, signaled seq=12332054, emitted seq=12332056
> [178793.032362] [drm:amdgpu_job_timedout] *ERROR* Process information: process Dirt4 pid 31348 thread WebViewRenderer pid 31422
> [178793.032365] amdgpu 0000:01:00.0: GPU reset begin!
> [178803.262008] [drm:amdgpu_dm_atomic_check] *ERROR* [CRTC:47:crtc-0] hw_done or flip_done timed out

And from git (26e161b1e9):

> [ 7819.095648] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, signaled seq=2652771, emitted seq=2652773
> [ 7819.095652] [drm:amdgpu_job_timedout] *ERROR* Process information: process Dirt4 pid 3075 thread WebViewRenderer pid 3152
> [ 7819.095655] amdgpu 0000:01:00.0: GPU reset begin!
> [ 7829.315220] [drm:amdgpu_dm_atomic_check] *ERROR* [CRTC:47:crtc-0] hw_done or flip_done timed out

Not sure if the log is shorter because of the new mesa or the new kernel (updated from 5.0.4 to 5.0.5).

Are you sure this could be a mesa bug? Just asking because, to me, a hanging kernel sounds like a kernel bug.

(In reply to Thomas from comment #3)
> Are you sure this could be a mesa bug? Just asking because, to me, a hanging
> kernel sounds like a kernel bug.

Likely a mesa bug. Mesa submits gfx/video/compute jobs to the kernel driver. If there are subtle bugs in those jobs, the GPU can hang. The kernel driver can reset the GPU, but the display server needs to catch the reset and properly re-initialize its context and buffers.
At the moment, none of the display servers do this, so you need to restart them after a GPU reset. The:
[drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
error is because userspace tried to submit more work to the kernel after a reset without re-initializing its context, so the kernel rejects it.

Thanks a lot for the detailed answer. I'm still not sure if I understand everything correctly (shouldn't the kernel driver validate the command stream from userspace/mesa and stop bad things before they hit hardware / hang the GPU?) but I'll close this now and check for or open a new mesa bug report tomorrow (I really need sleep now).

Damn, if this weren't the wrong place I would ask for more details about your last reply (the thing about the display servers not catching up with the GPU reset - aren't there drivers which perform GPU resets just fine under X11 already? What about Wayland?). It's so freaking nice, I bet I would learn a lot if we would continue the discussion... Anyway, thanks again for explaining, and sorry for going a bit off topic in this reply.

One last thing... It's extremely off topic but I already derailed this reply and it has to be told: Thank you Alex for being the guy you are. I bet AMD doesn't pay you to explain technical details to stupid end users like me, but it's very much appreciated. You're a hero, keep on rockin'!

(In reply to Thomas from comment #5)
> Thanks a lot for the detailed answer. I'm still not sure if I understand
> everything correctly (shouldn't the kernel driver validate the command
> stream from userspace/mesa and stop bad things before they hit hardware /
> hang the GPU?)

It's not really feasible. For one, it adds a lot of CPU overhead. There is also so much state in the 3D pipeline that it's nearly impossible to validate all of the possible cases that could cause a hang. In some cases, you may not even know that a particular combination is bad until it gets hit.
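For readers wondering about the -125 in the parser error quoted earlier in this comment: kernel functions conventionally return negated errno values, and on Linux errno 125 is ECANCELED, which would fit the explanation that work submitted from a stale context is cancelled. A minimal Python sketch for decoding such a code follows; the helper name `describe_kernel_error` is made up for illustration and is not part of any driver or tool.

```python
import errno
import os


def describe_kernel_error(code: int) -> str:
    """Map a negative kernel return code (e.g. -125) to 'NAME: message'.

    Hypothetical helper for illustration only. The numeric values are
    platform specific; 125 maps to ECANCELED on Linux.
    """
    number = abs(code)
    # errno.errorcode maps numeric codes to symbolic names on this platform.
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name}: {os.strerror(number)}"


# Decode the value from "Failed to initialize parser -125!"
print(describe_kernel_error(-125))
```

Run on a Linux machine, this should name the code as ECANCELED; other operating systems number their errno values differently.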
> Damn, if this weren't the wrong place I would ask for more details about
> your last reply (the thing about the display servers not catching up with
> the GPU reset - aren't there drivers which perform GPU resets just fine
> under X11 already? What about Wayland?). It's so freaking nice, I bet I
> would learn a lot if we would continue the discussion... Anyway, thanks again
> for explaining, and sorry for going a bit off topic in this reply.

I'm not sure if other drivers silently reset the GPU when they encounter a hang. It's generally easier to deal with on integrated GPUs since they operate on system memory. On dGPUs, the contents of vram might be lost after a GPU reset as the memory controller is reset. If vram is lost, the application that is running needs to reload its vram state. Also, for reliability, applications should really be made aware of a GPU reset so they can validate their data. E.g., you don't want a scientific application to silently get bad data because the GPU was reset silently in the background.

> One last thing... It's extremely off topic but I already derailed this reply
> and it has to be told: Thank you Alex for being the guy you are. I bet AMD
> doesn't pay you to explain technical details to stupid end users like me, but
> it's very much appreciated. You're a hero, keep on rockin'!

Thanks! Glad to help.
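To make the reset discussion in this thread concrete, here is a toy Python model of the behaviour described: after a reset, submissions from a context created before the reset are rejected until the client notices and rebuilds its context. Everything here (`ToyGpu`, `ToyContext`, `submit_with_recovery`) is a hypothetical simplification written for this illustration. It is not the amdgpu, DRM, or Mesa API, and real recovery also involves re-uploading any lost vram contents.

```python
import errno


class ToyGpu:
    """Hypothetical stand-in for a kernel driver; not a real API."""

    def __init__(self):
        self.reset_count = 0

    def reset(self):
        # A reset invalidates every context created before it.
        self.reset_count += 1

    def submit(self, ctx, job):
        # Reject work from contexts that predate the latest reset.
        if ctx.created_at_reset != self.reset_count:
            return -errno.ECANCELED
        return 0


class ToyContext:
    """Hypothetical client context, tagged with the reset generation."""

    def __init__(self, gpu):
        self.created_at_reset = gpu.reset_count


def submit_with_recovery(gpu, ctx, job):
    """Submit a job; on ECANCELED, rebuild the context and retry once.

    Returns (context_in_use, return_code, recovered_flag) so the caller
    is told explicitly that a reset happened and can revalidate its data.
    """
    rc = gpu.submit(ctx, job)
    if rc == -errno.ECANCELED:
        ctx = ToyContext(gpu)  # re-initialize instead of reusing stale state
        rc = gpu.submit(ctx, job)
        return ctx, rc, True
    return ctx, rc, False


gpu = ToyGpu()
ctx = ToyContext(gpu)
gpu.reset()  # simulate the hang and reset
stale_rc = gpu.submit(ctx, "draw")  # the stale context is rejected
ctx, rc, recovered = submit_with_recovery(gpu, ctx, "draw")
print(stale_rc == -errno.ECANCELED, rc, recovered)
```

The `recovered` flag reflects the point made above about reliability: the caller learns that a reset occurred rather than having it papered over silently.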