Created attachment 286079: dmesg tail from immediately after a lockup

I have been encountering issues with the AMDGPU driver completely failing when loading games. After loading into a game, the display freezes completely as soon as I make one click or move the mouse. I can't tab out or switch to a TTY at all. I can SSH into the box and do things such as getting the attached dmesg tail, but even killing the process doesn't unfreeze the display, which keeps showing a still image of the game. Only rebooting unlocks it. Basically, it seems to time out and then can't recover. This happens all the time on certain games, but it is inconsistent as to which environment it happens in. Some lock it up on Xorg but work fine on Wayland. Some work fine on Wayland but break on Xorg. Some never work at all. My graphics card is a Navi10, RX 5700. I'm on the 5.4 kernel, but this was happening on 5.3 as well.
Thanks for the bug report. The sdma0 timeout issue (from your dmesg) has already been reported. The most active bug report is: https://gitlab.freedesktop.org/drm/amd/issues/892

Note that sdma usage for Navi is disabled in Mesa 19.3 and 19.2.5, so this issue shouldn't occur if you use one of these releases.

Other related issues:
- https://bugzilla.kernel.org/show_bug.cgi?id=205169: has a patch, but it needs to be applied manually until it makes it into an upstream release
- gfx timeout issues: those are likely to be game specific and are probably a bug in Mesa (https://gitlab.freedesktop.org/mesa/mesa/issues)
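To check which of these cases a given dmesg tail falls into, something like the following might help. It is only a rough sketch: it assumes the amdgpu timeout lines contain the text "ring <name> timeout" (for example "ring sdma0 timeout"), and the function names and the script name in the usage line are made up for illustration.

```python
import re
import sys

# Assumption: amdgpu job timeouts show up in dmesg as "... ring <name> timeout, ...".
RING_TIMEOUT = re.compile(r"ring (\S+) timeout")


def classify_ring_timeouts(lines):
    """Count timeout messages per ring name (hypothetical helper)."""
    counts = {}
    for line in lines:
        match = RING_TIMEOUT.search(line)
        if match:
            ring = match.group(1)
            counts[ring] = counts.get(ring, 0) + 1
    return counts


def suggest_tracker(ring):
    """Map a ring name to the tracker mentioned in this comment (hypothetical helper)."""
    if ring.startswith("sdma"):
        return "known sdma timeout: https://gitlab.freedesktop.org/drm/amd/issues/892"
    if ring.startswith("gfx"):
        return "likely game specific, probably Mesa: https://gitlab.freedesktop.org/mesa/mesa/issues"
    return "unclassified"


if __name__ == "__main__":
    # Read dmesg output from stdin and print one summary line per ring.
    for ring, count in sorted(classify_ring_timeouts(sys.stdin).items()):
        print(f"{ring}: {count} timeout(s) -> {suggest_tracker(ring)}")
```

It could be run as `dmesg | tail -n 200 | python3 classify_timeouts.py`. If only gfx rings show up, the Mesa tracker is probably the better place to report.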
Hello! Glad to see there are others looking at this. My Mesa is at 19.3.0_rc4, but the issue is still happening. The related bug in your second link mentions that the 5.4 kernel is out but doesn't have the fix. Is this something that might show up in 5.4.1? I might try going to Mesa 19.2.5 or 19.2.6 specifically later, though, in case the Mesa-side disable isn't in rc4 for some reason. It seems like it might be a kernel issue, though.
Update: roughly around the time of the last comment here, I manually applied that fix and it was working for me. However, I have since updated both Mesa and the kernel, and now the issue appears to be back. I have updated this issue with my current specifications. I'm on the 5.5.2 kernel now, with my package manager reporting the Mesa version as 20.0.0_rc1. LLVM is at 9.0.1. I'll also add an attachment with a more recent dmesg tail. I checked whether I could manually re-apply the patch to the file, but it looks like those lines of code are already there, yet the issue still persists.
Created attachment 287263: newer dmesg tail from a lockup on 5.5.2