Bug 206017

Summary: Kernel 5.4.x unusable with GUI due to crashes (some hard)
Product: Drivers
Reporter: udo (udovdh)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED CODE_FIX
Severity: blocking
CC: alexdeucher, kernel, paul.e.hill2, priit, reuben_p
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 5.4.x
Subsystem:
Regression: No
Bisected commit-id:

Description udo 2019-12-30 16:02:23 UTC
AMD Ryzen 5 3400G with Radeon Vega Graphics on Gigabyte X570 AORUS PRO, running Fedora 31, kernel.org kernels, git mesa, etc.
Soon after booting into 5.4, the GUI freezes, e.g. right when I start Firefox.
On 5.3.18 it takes days for amdgpu to crash.
Soft recovery does not help.
5.3.18 is EOL, so the 5.4 issues need priority:

(..)
[   12.884828] pps pps0: new PPS source serial0
[   12.884832] pps pps0: source "/dev/ttyS0" added
[   12.898511] it87: it87 driver version <not provided>
[   12.898635] it87: Found IT8792E/IT8795E chip at 0xa60, revision 3
[   12.898675] it87: Beeping is supported
[   14.244524] igb 0000:04:00.0: changing MTU from 1500 to 7200
[   17.328845] igb 0000:04:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[   17.331973] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   18.564142] io scheduler mq-deadline registered
[   22.352636] igb 0000:04:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[   31.464130] fuse: init (API version 7.31)
[   75.198868] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[   80.318799] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
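The two amdgpu error lines above are the signature of this problem. As a rough aid for checking other logs for the same pattern, here is a minimal Python sketch. The function name `find_amdgpu_timeouts` and the regular expression are illustrative assumptions based only on the lines quoted here, not part of any kernel tooling.

```python
import re

# Illustrative pattern (an assumption): matches the two amdgpu error
# messages quoted in this report, nothing more.
AMDGPU_ERROR = re.compile(
    r"\[drm:(amdgpu_dm_atomic_commit_tail|amdgpu_job_timedout) \[amdgpu\]\] "
    r"\*ERROR\* (.*)"
)

def find_amdgpu_timeouts(log_text):
    """Return (function, message) pairs for each matching log line."""
    hits = []
    for line in log_text.splitlines():
        m = AMDGPU_ERROR.search(line)
        if m:
            hits.append((m.group(1), m.group(2).strip()))
    return hits

if __name__ == "__main__":
    sample = (
        "[   31.464130] fuse: init (API version 7.31)\n"
        "[   75.198868] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] "
        "*ERROR* Waiting for fences timed out!\n"
        "[   80.318799] [drm:amdgpu_job_timedout [amdgpu]] "
        "*ERROR* ring gfx timeout, but soft recovered\n"
    )
    for func, msg in find_amdgpu_timeouts(sample):
        print(func, "->", msg)
```

Saved dmesg or journalctl output can be passed in as a string; lines from other drivers are ignored.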
Comment 1 udo 2019-12-30 16:03:00 UTC
See https://gitlab.freedesktop.org/drm/amd/issues/934 for more details.
Comment 2 Alex Deucher 2020-01-01 13:48:03 UTC
There is no need to file another bug report.  Let's keep this all in one place.
Comment 3 udo 2020-01-01 14:31:42 UTC
amdgpu.noretry=0 appears to help on 5.4.6.
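For anyone who wants to confirm that the option is actually on the boot command line, a minimal Python sketch follows. It assumes a Linux system where the command line of the running kernel is readable at /proc/cmdline; the helper name `amdgpu_params` is made up for illustration.

```python
def amdgpu_params(cmdline):
    """Collect amdgpu.<name>=<value> options from a kernel command line string."""
    params = {}
    for token in cmdline.split():
        if token.startswith("amdgpu.") and "=" in token:
            key, value = token[len("amdgpu."):].split("=", 1)
            params[key] = value
    return params

if __name__ == "__main__":
    try:
        # /proc/cmdline holds the command line the running kernel was booted with.
        with open("/proc/cmdline") as f:
            params = amdgpu_params(f.read())
    except OSError:
        params = {}  # not on Linux, or /proc is unavailable
    print("noretry =", params.get("noretry", "(not set on the command line)"))
```

This only shows what was passed at boot; it does not tell whether the driver honoured the value.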
Comment 4 udo 2020-01-03 06:06:37 UTC
But 5.4.x is not really stable; it crashes easily within a day where 5.3.18 can stay up for a few days.
Comment 5 udo 2020-01-04 04:53:04 UTC
Firefox is still the trigger.
When I do not use it the system remains usable.
When I use Firefox the system crashes hard within a few hours.
Comment 6 udo 2020-01-05 11:53:34 UTC
5.4.8 also suffers from the hard hang; Firefox is involved, playing YouTube and such.
Comment 7 udo 2020-01-05 12:22:45 UTC
And it happened again, without youtube playing but while browsing.
5.3.18 takes a lot longer to crash/hang or whatever.
Comment 8 udo 2020-01-05 15:38:42 UTC
Does the screen corruption I see now and then have something to do with this issue?
Comment 9 udo 2020-01-06 11:52:09 UTC
5.4.8 runs less than 12 hours until hard crash when used.
Comment 10 udo 2020-01-06 12:02:28 UTC
More like 6 hours or less.
Comment 11 udo 2020-01-08 15:23:35 UTC
I.e., it is stable and works OK with e.g. an mkv playing. Then we start Firefox and boom: the system freezes.
Comment 12 udo 2020-01-10 16:56:57 UTC
5.4.9 also has this issue.
Runs OK when Firefox is not being used, as far as I can test and detect.
With Firefox the system locks up hard after a while.
Comment 13 Paul 2020-01-17 13:36:51 UTC
Hello! 

I am experiencing the same issue on 5.4.10 (Fedora 31, KDE Spin). I'm going to attempt the 'amdgpu.noretry=0' fix later today.
 
I made the below bug report with Fedora:
https://ask.fedoraproject.org/t/fedora-kde-amdgpu-issue/5026


Summarized: 
gpu: Radeon Vega 10

Issue: I found a lot of these entries in journalctl and dmesg after GUI freezes:

kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!

Thank you!
Comment 14 Paul 2020-01-18 23:51:26 UTC
(In reply to Paul from comment #13)
Just wanted to report in that the 'amdgpu.noretry=0' workaround resolved my issues. Thanks!
Comment 15 Alex Deucher 2020-01-20 15:07:23 UTC
Should be fixed in:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7aec9ec1cf324d5c5a8d17b9c78a34c388e5f17b
which should also be landing in various stable kernels.
Comment 16 udo 2020-01-20 15:14:50 UTC
amdgpu.noretry=0 works as a workaround, so the commit should fix things.
Thanks for the commit!

Still looking for the right component for https://bugzilla.kernel.org/show_bug.cgi?id=206191 :-/
Comment 17 udo 2020-04-20 09:54:25 UTC
Kernel 5.6.x works very well.
Git mesa might help too.
Comment 18 Priit O. 2020-06-17 10:05:24 UTC
And it's back after several months.

5.6.16-1-MANJARO
mesa 20.0.7-3

amdgpu.ppfeaturemask=0xfffd7fff amdgpu.noretry=0 amdgpu.lockup_timeout=0 amdgpu.gpu_recovery=1 amdgpu.audio=0 amdgpu.deep_color=1 amd_iommu=on iommu=pt
Comment 19 udo 2020-06-17 13:50:17 UTC
Appears to work OK for me:

AMD Ryzen 5 3400G with Radeon Vega Graphics on Gigabyte X570 AORUS PRO,
Fedora 31, git mesa, kernel.org 5.6.x, etc

amdgpu.gttsize=8192 amdgpu.lockup_timeout=1000 amdgpu.gpu_recovery=1 amdgpu.noretry=0 amdgpu.ppfeaturemask=0xfffd3fff
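Since the parameter lists in comments 18 and 19 differ, a quick way to see exactly where is sketched below in Python. The helper names `parse_params` and `diff_params` are hypothetical, and the comparison only looks at the strings quoted in those two comments; it says nothing about which difference, if any, matters for the hang.

```python
def parse_params(cmdline):
    """Map each name=value token on a kernel command line to its value."""
    return dict(tok.split("=", 1) for tok in cmdline.split() if "=" in tok)

def diff_params(a, b):
    """Return {name: (value_in_a, value_in_b)} for names that differ or are missing."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }

# Parameter strings copied from comments 18 and 19.
comment_18 = parse_params(
    "amdgpu.ppfeaturemask=0xfffd7fff amdgpu.noretry=0 amdgpu.lockup_timeout=0 "
    "amdgpu.gpu_recovery=1 amdgpu.audio=0 amdgpu.deep_color=1 amd_iommu=on iommu=pt"
)
comment_19 = parse_params(
    "amdgpu.gttsize=8192 amdgpu.lockup_timeout=1000 amdgpu.gpu_recovery=1 "
    "amdgpu.noretry=0 amdgpu.ppfeaturemask=0xfffd3fff"
)

for name, (v18, v19) in diff_params(comment_18, comment_19).items():
    print(f"{name}: comment 18 = {v18}, comment 19 = {v19}")

# The two ppfeaturemask values differ in a single bit.
mask_18 = int(comment_18["amdgpu.ppfeaturemask"], 16)
mask_19 = int(comment_19["amdgpu.ppfeaturemask"], 16)
print(f"ppfeaturemask bits that differ: {mask_18 ^ mask_19:#x}")  # 0x4000
```

Both setups share amdgpu.noretry=0 and amdgpu.gpu_recovery=1; the differences are lockup_timeout, one ppfeaturemask bit, and the options present in only one of the lists.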
Comment 20 Priit O. 2020-08-20 12:22:34 UTC
kernel 5.8.0-2-MANJARO; Vega 64 GPU; mesa 20.1.5; xf86-video-amdgpu 19.1.0

aug   20 12:58:47 Zen kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
aug   20 12:58:47 Zen kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
aug   20 12:58:52 Zen kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
aug   20 12:58:52 Zen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=158674961, emitted seq=158674963
aug   20 12:58:52 Zen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 933 thread Xorg:cs0 pid 941
aug   20 12:58:53 Zen kernel: amdgpu: [powerplay] Failed to send message: 0x63, ret value: 0xffffffff
aug   20 12:58:53 Zen kernel: amdgpu: [powerplay] Failed to send message: 0x26, ret value: 0xffffffff
aug   20 12:58:53 Zen kernel: amdgpu: [powerplay] Failed to send message: 0x61, ret value: 0xffffffff
aug   20 12:58:53 Zen kernel: amdgpu: [powerplay] Failed message: 0x37, input parameter: 0x0, error code: 0xffffffff
aug   20 12:58:53 Zen kernel: amdgpu: [powerplay] Failed to send message: 0x63, ret value: 0xffffffff
aug   20 12:58:53 Zen kernel: amdgpu: [powerplay] Failed to send message: 0x26, ret value: 0xffffffff
aug   20 12:58:53 Zen kernel: amdgpu: [powerplay] Failed to send message: 0x61, ret value: 0xffffffff
aug   20 12:58:53 Zen kernel: amdgpu: [powerplay] Failed message: 0x37, input parameter: 0x0, error code: 0xffffffff
aug   20 12:58:56 Zen systemd-coredump[109412]: Process 933 (Xorg) of user 0 dumped core.
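To make the repeated powerplay failures in a log like the one above easier to read, the message IDs can be tallied. The following Python sketch is only an illustration: the name `count_failed_messages` and the regular expression are assumptions derived from the two line formats shown in this comment.

```python
import re
from collections import Counter

# Matches both "Failed to send message: 0x63" and "Failed message: 0x37"
# as they appear in the log excerpt above.
POWERPLAY_FAIL = re.compile(
    r"\[powerplay\] Failed (?:to send )?message: (0x[0-9a-fA-F]+)"
)

def count_failed_messages(log_text):
    """Count how often each powerplay message ID appears in a failure line."""
    return Counter(m.group(1) for m in POWERPLAY_FAIL.finditer(log_text))

if __name__ == "__main__":
    sample = (
        "kernel: amdgpu: [powerplay] Failed to send message: 0x63, ret value: 0xffffffff\n"
        "kernel: amdgpu: [powerplay] Failed message: 0x37, input parameter: 0x0, "
        "error code: 0xffffffff\n"
        "kernel: amdgpu: [powerplay] Failed to send message: 0x63, ret value: 0xffffffff\n"
    )
    for msg_id, n in count_failed_messages(sample).most_common():
        print(msg_id, n)
```

The counts only summarize the log; they do not explain what each message ID means to the firmware.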