Bug 213889 - soft lockup after screen does a sleep cycle (dpms off then on) with AMD rx590
Summary: soft lockup after screen does a sleep cycle (dpms off then on) with AMD rx590
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(Other)
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: drivers_video-other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-07-28 14:34 UTC by Marc
Modified: 2021-07-28 14:48 UTC

See Also:
Kernel Version: 5.10.53
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg using 5.10.53 (112.50 KB, text/plain)
2021-07-28 14:34 UTC, Marc
lspci (23.08 KB, text/plain)
2021-07-28 14:45 UTC, Marc

Description Marc 2021-07-28 14:34:17 UTC
Created attachment 298081
dmesg using 5.10.53

Original report in Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=991546

The issue seems to be resolved in more recent kernel versions.

Originally found using Debian's linux-image-5.10.0-8-amd64, but also confirmed with the latest vanilla 5.10.53 kernel.

When the display is switched off and back on, the system becomes unresponsive and eventually crashes.

The easiest way to reproduce in my case is:
 - start a sway session
 - start firefox
 - cycle the display with: swaymsg "output * dpms off"; sleep 10; swaymsg "output * dpms on"
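
Since the problem does not show up on every cycle, the last step can be repeated in a loop. This is only a sketch: the function name cycle_dpms, the SWAYMSG override (for dry runs) and the 5-second pause between cycles are assumptions, not part of the original steps.

```sh
# Hypothetical helper: repeat the DPMS off/on cycle from the steps above.
# SWAYMSG may be overridden for a dry run; it defaults to the real swaymsg.
cycle_dpms() {
    count=${1:-1}      # number of off/on cycles
    off_secs=${2:-10}  # seconds with outputs off (10 as in the report)
    on_secs=${3:-5}    # assumed pause with outputs on, to inspect firefox
    i=1
    while [ "$i" -le "$count" ]; do
        "${SWAYMSG:-swaymsg}" "output * dpms off" || return 1
        sleep "$off_secs"
        "${SWAYMSG:-swaymsg}" "output * dpms on" || return 1
        sleep "$on_secs"
        i=$((i + 1))
    done
}
```

For example, cycle_dpms 5 runs five cycles with the default timings from inside the sway session.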

When the display is switched back on:
 - if firefox is correctly displayed, then there's no issue
 - if the top part of firefox is not displayed correctly (it appears transparent), then the issue has been triggered. Firefox is also unresponsive, and after a few seconds the kernel logs the following:

Jul 25 21:13:40 arrakis kernel: [  109.007130] 	(t=5250 jiffies g=8557 q=4735)
Jul 25 21:13:40 arrakis kernel: [  109.007131] NMI backtrace for cpu 8
Jul 25 21:13:40 arrakis kernel: [  109.007133] CPU: 8 PID: 1797 Comm: Xwayland Tainted: G S          E     5.10.53 #1
Jul 25 21:13:40 arrakis kernel: [  109.007134] Hardware name: Gigabyte Technology Co., Ltd. AB350M-Gaming 3/AB350M-Gaming 3-CF, BIOS F42d 10/18/2019
Jul 25 21:13:40 arrakis kernel: [  109.007135] Call Trace:
Jul 25 21:13:40 arrakis kernel: [  109.007137]  <IRQ>
Jul 25 21:13:40 arrakis kernel: [  109.007142]  dump_stack+0x6b/0x83
Jul 25 21:13:40 arrakis kernel: [  109.007143]  nmi_cpu_backtrace.cold+0x32/0x69
Jul 25 21:13:40 arrakis kernel: [  109.007146]  ? lapic_can_unplug_cpu+0x80/0x80
Jul 25 21:13:40 arrakis kernel: [  109.007148]  nmi_trigger_cpumask_backtrace+0xd7/0xe0
Jul 25 21:13:40 arrakis kernel: [  109.007150]  rcu_dump_cpu_stacks+0xa2/0xd0
Jul 25 21:13:40 arrakis kernel: [  109.007152]  rcu_sched_clock_irq.cold+0x1ff/0x3d6
Jul 25 21:13:40 arrakis kernel: [  109.007154]  update_process_times+0x8c/0xc0
Jul 25 21:13:40 arrakis kernel: [  109.007156]  tick_sched_handle+0x22/0x60
Jul 25 21:13:40 arrakis kernel: [  109.007158]  tick_sched_timer+0x7c/0xb0
Jul 25 21:13:40 arrakis kernel: [  109.007159]  ? tick_do_update_jiffies64.part.0+0xc0/0xc0
Jul 25 21:13:40 arrakis kernel: [  109.007160]  __hrtimer_run_queues+0x12a/0x270
Jul 25 21:13:40 arrakis kernel: [  109.007161]  hrtimer_interrupt+0x110/0x2c0
Jul 25 21:13:40 arrakis kernel: [  109.007163]  __sysvec_apic_timer_interrupt+0x5f/0xd0
Jul 25 21:13:40 arrakis kernel: [  109.007164]  asm_call_irq_on_stack+0x12/0x20
Jul 25 21:13:40 arrakis kernel: [  109.007165]  </IRQ>
Jul 25 21:13:40 arrakis kernel: [  109.007167]  sysvec_apic_timer_interrupt+0x72/0x80
Jul 25 21:13:40 arrakis kernel: [  109.007168]  asm_sysvec_apic_timer_interrupt+0x12/0x20
Jul 25 21:13:40 arrakis kernel: [  109.007182] RIP: 0010:__drm_dbg+0x3e/0x90 [drm]
Jul 25 21:13:40 arrakis kernel: [  109.007184] Code: 4c 24 48 4c 89 44 24 50 4c 89 4c 24 58 65 48 8b 04 25 28 00 00 00 48 89 44 24 28 31 c0 23 3d 51 1c 05 00 75 12 48 8b 44 24 28 <65> 48 2b 04 25 28 00 00 00 75 40 c9 c3 48 8d 45 10 48 89 34 24 48
Jul 25 21:13:40 arrakis kernel: [  109.007185] RSP: 0018:ffffb880836d7ba0 EFLAGS: 00000246
Jul 25 21:13:40 arrakis kernel: [  109.007187] RAX: 4f8e6fb112e3c800 RBX: ffffb880836d7d38 RCX: 0000000200000000
Jul 25 21:13:40 arrakis kernel: [  109.007188] RDX: 0000000404000000 RSI: ffffffffc09d01f8 RDI: 0000000000000000
Jul 25 21:13:40 arrakis kernel: [  109.007188] RBP: ffffb880836d7c00 R08: 0000000000000000 R09: 0000000000000000
Jul 25 21:13:40 arrakis kernel: [  109.007189] R10: 000000000000000a R11: 0000000404000000 R12: ffffb880836d7d38
Jul 25 21:13:40 arrakis kernel: [  109.007189] R13: 00000000fffffff4 R14: ffff90a352d80000 R15: ffffb880836d7e28
Jul 25 21:13:40 arrakis kernel: [  109.007252]  amdgpu_bo_do_create+0x2a4/0x4f0 [amdgpu]
Jul 25 21:13:40 arrakis kernel: [  109.007305]  amdgpu_bo_create+0x40/0x270 [amdgpu]
Jul 25 21:13:40 arrakis kernel: [  109.007359]  amdgpu_gem_create_ioctl+0x123/0x310 [amdgpu]
Jul 25 21:13:40 arrakis kernel: [  109.007413]  ? amdgpu_gem_object_close+0x200/0x200 [amdgpu]
Jul 25 21:13:40 arrakis kernel: [  109.007423]  drm_ioctl_kernel+0xaa/0xf0 [drm]
Jul 25 21:13:40 arrakis kernel: [  109.007433]  drm_ioctl+0x20f/0x3a0 [drm]
Jul 25 21:13:40 arrakis kernel: [  109.007486]  ? amdgpu_gem_object_close+0x200/0x200 [amdgpu]
Jul 25 21:13:40 arrakis kernel: [  109.007487]  ? do_setitimer+0x179/0x210
Jul 25 21:13:40 arrakis kernel: [  109.007539]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
Jul 25 21:13:40 arrakis kernel: [  109.007541]  __x64_sys_ioctl+0x83/0xb0
Jul 25 21:13:40 arrakis kernel: [  109.007543]  do_syscall_64+0x33/0x80
Jul 25 21:13:40 arrakis kernel: [  109.007545]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 25 21:13:40 arrakis kernel: [  109.007546] RIP: 0033:0x7fb8fffa2cc7
Jul 25 21:13:40 arrakis kernel: [  109.007548] Code: 00 00 00 48 8b 05 c9 91 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 99 91 0c 00 f7 d8 64 89 01 48
Jul 25 21:13:40 arrakis kernel: [  109.007548] RSP: 002b:00007fff2f0db438 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jul 25 21:13:40 arrakis kernel: [  109.007550] RAX: ffffffffffffffda RBX: 00007fff2f0db490 RCX: 00007fb8fffa2cc7
Jul 25 21:13:40 arrakis kernel: [  109.007550] RDX: 00007fff2f0db490 RSI: 00000000c0206440 RDI: 000000000000000a
Jul 25 21:13:40 arrakis kernel: [  109.007551] RBP: 00000000c0206440 R08: 00000000ffffffff R09: 00007fb90006cbe0
Jul 25 21:13:40 arrakis kernel: [  109.007551] R10: 0000000000000100 R11: 0000000000000246 R12: 0000555ce28ba2a0
Jul 25 21:13:40 arrakis kernel: [  109.007552] R13: 000000000000000a R14: 0000000404000000 R15: 0000000000200000
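
To check whether a saved kernel log shows the same signature, one can look for the RCU stall dump together with the amdgpu buffer-object allocation frame seen in the trace above. This is a rough sketch: the function name has_amdgpu_stall is made up, and matching on these two symbols is an assumption about what characterizes this bug, not a confirmed diagnostic.

```sh
# Hypothetical helper: succeed if the log on stdin contains both the RCU
# stall dump and the amdgpu BO creation frame from the trace above.
has_amdgpu_stall() {
    log=$(cat)
    printf '%s\n' "$log" | grep -q 'rcu_dump_cpu_stacks' &&
        printf '%s\n' "$log" | grep -q 'amdgpu_bo_do_create'
}
```

For example: dmesg | has_amdgpu_stall && echo "stall signature found"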

After being instructed to test with the latest stable kernel (5.13, no issue) and to bisect to find the point at which the kernel's behavior changes, I found this commit:

commit 89fa15ecdca7eb46a711476b961f70a74765bbe4
Author: Huang Rui <ray.huang@amd.com>
Date:   Sat Jan 30 17:14:30 2021 +0800

    drm/amdgpu: fix the issue that retry constantly once the buffer is oversize

    We cannot modify initial_domain every time while the retry starts. That
    will cause the busy waiting that unable to switch to GTT while the vram
    is not enough.

    Fixes: f8aab60422c3 ("drm/amdgpu: Initialise drm_gem_object_funcs for imported BOs")

    Signed-off-by: Huang Rui <ray.huang@amd.com>
    Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    Cc: stable@vger.kernel.org

As a very naive test, I applied it blindly on top of v5.10 and can confirm that I can no longer reproduce the problem, but I have no idea whether this is correct.
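
For anyone wanting to repeat that naive test, it could look roughly like the sketch below. It assumes a kernel git clone that contains both the base tag and the mainline commit; the function name, the branch name and the GIT override (for dry runs) are made up for illustration.

```sh
# Hypothetical helper: apply the candidate fix on top of a 5.10 base.
# GIT may be overridden for a dry run; the branch name is only an example.
apply_candidate_fix() {
    base=${1:-v5.10}
    fix=89fa15ecdca7eb46a711476b961f70a74765bbe4
    "${GIT:-git}" checkout -b test-gem-retry-fix "$base" || return 1
    # -x records the original commit ID in the new commit message
    "${GIT:-git}" cherry-pick -x "$fix"
}
```

For example, apply_candidate_fix v5.10.53 would start from the stable tag instead, after which the kernel is rebuilt and the display cycle repeated.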

I've been asked to file this issue for a possible backport in the 5.10.y line.

I'll be happy to help if necessary.

Thank you for your work!
Comment 1 Marc 2021-07-28 14:45:33 UTC
Created attachment 298083
lspci
Comment 2 Marc 2021-07-28 14:48:51 UTC
As mentioned in the Debian bug report, my kernel is tainted:

[  101.233439] CPU: 9 PID: 1811 Comm: Xwayland Tainted: G S          E     5.11.0-rc4-00375-ga692a610d7ed #6

** Tainted: S (4)
 * SMP kernel oops on an officially SMP incapable processor

The taint comes from my Ryzen 1600 being faulty: I need to disable its C6 state by writing to an MSR (using https://github.com/r4m0n/ZenStates-Linux).
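
The taint letters in the trace header map to bits in /proc/sys/kernel/tainted: S is bit 2 (value 4, as quoted above) and E is bit 13 (value 8192, an unsigned module was loaded). A small sketch for decoding such a mask follows; the function name decode_taint is made up, and only bits 0 to 13 are covered.

```sh
# Hypothetical helper: print the letters for the taint bits set in a mask.
# Covers bits 0..13 only, in the order used by the kernel's taint flags.
decode_taint() {
    mask=$1
    bit=0
    out=""
    for letter in P F S R M B U D A W C I O E; do
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            out="$out$letter"
        fi
        bit=$((bit + 1))
    done
    printf '%s\n' "$out"
}
```

For example, decode_taint "$(cat /proc/sys/kernel/tainted)" decodes the running kernel's mask. The kernel itself prints G in place of P when bit 0 is clear, which this sketch does not reproduce.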
