Bug 212293

Summary: [amdgpu] divide error: 0000 on resume from S3
Product: Drivers
Reporter: Sefa Eyeoglu
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED CODE_FIX
Severity: normal
CC: alexdeucher
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 5.11.6
Subsystem:
Regression: No
Bisected commit-id:
Attachments: kernel log since resume
             git bisect log

Description Sefa Eyeoglu 2021-03-15 16:49:29 UTC
Created attachment 295869 [details]
kernel log since resume

My system hits a kernel panic in amdgpu when resuming from S3.
The GPU has to be in a specific state for this to happen: it mainly occurs when my desktop environment turns off the screens after some inactivity and the system is subsequently suspended.

This issue only occurs with kernel versions 5.11.x. 
I tested KDE Plasma / KWin on both Xorg and Wayland and could only reproduce this on Wayland; Xorg seems to work fine.


REPRODUCTION
1. Start KDE Plasma / KWin on Wayland
2. Set Screen Energy Saving "Switch off after" to a low value like 1min
3. Wait until Plasma has turned off screens
4. Suspend the system (via SSH for example)
5. Try to wake from sleep


SYSTEM INFO
CPU: AMD Ryzen 9 3900X
Mainboard: ASUS ROG STRIX B450-F GAMING II
GPU: GIGABYTE Radeon RX VEGA 56 GAMING OC 8G


ATTACHMENTS
I attached the kernel panic log that I captured via ttyS0.
Comment 1 Sefa Eyeoglu 2021-03-15 16:57:55 UTC
ADDITIONAL SYSTEM INFO
OS: Arch Linux (with testing repos)

Kernels with this issue: 5.11.6.arch1, 5.11.6.zen1, 5.12-rc2 (built from Arch Linux User Repository)
Kernels without this issue: 5.10.23-1-lts
Comment 2 Alex Deucher 2021-03-15 17:35:55 UTC
Can you bisect?  https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html
Comment 3 Sefa Eyeoglu 2021-03-16 18:18:01 UTC
This took some time, as I apparently went down the wrong path a few times.

Anyways.

I bisected between tags v5.10 (good) and v5.11 (bad), while only looking at path "drivers/gpu/drm/amd".

At the end I landed at commit 12f4849a1cfd69f3c37cca042f2e9c512f923741 by Simon Ser (emersion).

I will do some debugging myself to see if it's the real deal, but that change might very well be it.
Comment 4 Sefa Eyeoglu 2021-03-16 18:18:44 UTC
Created attachment 295887 [details]
git bisect log
Comment 5 Sefa Eyeoglu 2021-03-16 18:19:38 UTC
I was unable to add Simon Ser to CC.
Comment 6 Sefa Eyeoglu 2021-03-16 19:54:22 UTC
Okay, I tried to debug it with a printk.

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 573cf17262da..8e6b890ad611 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -9271,6 +9271,8 @@ static int dm_check_crtc_cursor(struct drm_atomic_state *state,
 		return 0;
 	}
 
+	printk("SCRUMPLEX_DEBUG %d %d %d %d", new_cursor_state->src_w, new_cursor_state->src_h, new_primary_state->src_w, new_primary_state->src_h);
+
 	cursor_scale_w = new_cursor_state->crtc_w * 1000 /
 			 (new_cursor_state->src_w >> 16);
 	cursor_scale_h = new_cursor_state->crtc_h * 1000 /
-- 
2.31.0


This adds my very professional printk, which prints every value that is later used as a divisor.


While reproducing the issue I got the following output

[   89.850437] SCRUMPLEX_DEBUG 8388608 8388608 0 0


So some weird state is causing the src_w and src_h values of "new_primary_state" to be 0.
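
To make the numbers concrete: the src_w and src_h fields of drm_plane_state are 16.16 fixed-point values, which is why the code in the diff shifts them right by 16 before dividing. The following is a minimal userspace sketch, not kernel code, and the helper name fixed16_to_int is made up for illustration. It shows that the logged cursor value 8388608 corresponds to 128 pixels, while the logged primary value 0 stays 0 and would therefore end up as a zero divisor.

```c
#include <stdint.h>

/*
 * Hypothetical helper: integer part of a 16.16 fixed-point value,
 * i.e. the same `>> 16` that dm_check_crtc_cursor applies to src_w/src_h.
 */
static uint32_t fixed16_to_int(uint32_t v)
{
	return v >> 16;
}

/* Values copied from the SCRUMPLEX_DEBUG line above. */
static const uint32_t logged_cursor_src_w = 8388608;
static const uint32_t logged_primary_src_w = 0;
```

Here fixed16_to_int(logged_cursor_src_w) yields 128 and fixed16_to_int(logged_primary_src_w) yields 0, so any division by the latter raises the divide error.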

That would explain the divide error. I don't know enough about drm_plane_state and drm_atomic_get_new_plane_state to say why the state looks like this, but as with most issues of this kind, a simple check before the division would avoid the crash.
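
As an illustration of such a check, here is a small userspace sketch under stated assumptions. It is not the patch that was submitted and not the actual amdgpu code: struct plane_rect and plane_scale_w are hypothetical names that only mirror the crtc_w and src_w fields used in the diff above. The idea is to refuse to compute the scale factor when the source width is zero instead of dividing.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the drm_plane_state fields used in the diff. */
struct plane_rect {
	uint32_t crtc_w; /* on-screen width in pixels */
	uint32_t src_w;  /* source width in 16.16 fixed point */
};

/*
 * Computes crtc_w * 1000 / (src_w >> 16), like the scale calculation in
 * the diff, but returns false without dividing when the source width is 0.
 */
static bool plane_scale_w(const struct plane_rect *p, uint32_t *scale)
{
	uint32_t src_px = p->src_w >> 16;

	if (src_px == 0)
		return false; /* nothing sensible to compare, and no division */

	*scale = p->crtc_w * 1000 / src_px;
	return true;
}
```

A caller could skip the cursor/primary scale comparison whenever this returns false for either plane. Whether skipping is the right behaviour in the driver is a separate question that this sketch does not answer.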
Comment 7 Sefa Eyeoglu 2021-03-17 08:19:32 UTC
I submitted a patch here: https://lists.freedesktop.org/archives/amd-gfx/2021-March/060754.html
Comment 8 Sefa Eyeoglu 2021-05-29 15:07:44 UTC
Fixed in 5.11 and 5.12