Bug 204227 - Visual artefacts and crash from suspend on amdgpu
Summary: Visual artefacts and crash from suspend on amdgpu
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: x86-64 Linux
Importance: P1 high
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-07-18 22:49 UTC by Łukasz Żarnowiecki
Modified: 2019-09-23 07:36 UTC (History)

See Also:
Kernel Version: 5.2.1
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg (124.63 KB, text/plain)
2019-07-18 22:49 UTC, Łukasz Żarnowiecki
lspci (36.68 KB, text/plain)
2019-07-18 22:50 UTC, Łukasz Żarnowiecki
syslog from 5.2.16 with warnings (2.02 MB, text/plain)
2019-09-19 21:11 UTC, Mirek Kratochvil

Description Łukasz Żarnowiecki 2019-07-18 22:49:01 UTC
Created attachment 283823 [details]
dmesg

After upgrading the kernel from 5.1.14 to 5.2.1, I see many visual artifacts during a desktop session. Also, when resuming from suspend, the external monitor turns green and the kernel crashes. See the attached dmesg.
Comment 1 Łukasz Żarnowiecki 2019-07-18 22:50:06 UTC
Created attachment 283825 [details]
lspci
Comment 2 Alex Deucher 2019-07-19 19:16:52 UTC
Can you bisect?
Comment 3 Łukasz Żarnowiecki 2019-07-20 15:27:43 UTC
Well, that took me some time...

Looks like this is the cause...

005440066f929ba0dca8f4e0aebfbf8daac592cc is the first bad commit
commit 005440066f929ba0dca8f4e0aebfbf8daac592cc
Author: Huang Rui <ray.huang@amd.com>
Date:   Wed Mar 13 20:21:00 2019 +0800

    drm/amdgpu: enable gfxoff again on raven series (v2)
    
    This patch enables gfxoff and stutter mode again, since we take more testing on
    raven series. For raven2 and picasso, we can enable it directly. And for raven,
    we need check the RLC/SMC ucode version cannot be less than #531/0x1e45.
    
    v2: add smc version checking for raven.
    
    Signed-off-by: Huang Rui <ray.huang@amd.com>
    Reviewed-by: Alex Deucher <alexander.deucher@amd.com> (v1)
    Tested-by: Likun Gao <Likun.Gao@amd.com> (v2)
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c          |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c             |  4 ++--
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c               | 21 +++++++++++++++++++++
 drivers/gpu/drm/amd/powerplay/hwmgr/smu10_hwmgr.c   | 13 ++++---------
 drivers/gpu/drm/amd/powerplay/smumgr/smu10_smumgr.c |  4 ++++
 5 files changed, 33 insertions(+), 11 deletions(-)
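
The gating rule stated in the commit message can be illustrated with a minimal Python sketch. This is not the amdgpu driver code: the function name, the chip-name strings, and the parameter names are hypothetical, and only the thresholds (#531 for RLC, 0x1e45 for SMC) come from the commit message itself.

```python
# Hypothetical sketch of the rule described in the commit message.
# Not the amdgpu driver code; all names here are illustrative assumptions.

RLC_MIN_VERSION = 531      # "#531" from the commit message
SMC_MIN_VERSION = 0x1E45   # "0x1e45" from the commit message


def gfxoff_allowed(chip: str, rlc_version: int, smc_version: int) -> bool:
    """Return True if gfxoff would be enabled under the commit's stated rule."""
    if chip in ("raven2", "picasso"):
        # Enabled directly, with no firmware version check.
        return True
    if chip == "raven":
        # Both ucode versions must be at or above the stated minimums.
        return rlc_version >= RLC_MIN_VERSION and smc_version >= SMC_MIN_VERSION
    # Other chips are outside the scope of the commit message.
    return False
```

Under this reading, a Raven system with older RLC or SMC firmware would keep gfxoff disabled, while one with newer firmware would have it enabled by the bisected commit.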
Comment 4 tones111 2019-08-26 23:42:58 UTC
I'm seeing the same problems when running 5.2.x that were not present in 5.1.  The commit above is the source of the visual artifacts, but I believe the lockup issue was introduced later.  Is there any help I can provide in testing a fix?  

It looks like there might have been some previous effort here:
https://www.spinics.net/lists/amd-gfx/msg32192.html


I created https://bugzilla.kernel.org/show_bug.cgi?id=204611 that can be used to track the lockup issue.
Comment 6 Łukasz Żarnowiecki 2019-08-27 18:07:36 UTC
(In reply to Alex Deucher from comment #5)
> This issue should be fixed with this patch:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=98f58ada2d37e68125c056f1fc005748251879c2

Is this patch going to be backported to 5.2?
Comment 7 Alex Deucher 2019-08-27 18:31:22 UTC
Yes.
Comment 8 tones111 2019-08-28 00:47:29 UTC
I applied this to 5.2.10 and I'm still seeing artifacts.
Comment 9 tones111 2019-08-28 01:48:51 UTC
(In reply to tones111 from comment #8)
> I applied this to 5.2.10 and I'm still seeing artifacts.

Sorry, I realized that statement doesn't give much context to work with.  My system has an R5 2500U.  lspci shows the following:

05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] (rev c4) (prog-if 00 [VGA controller])
	Subsystem: Lenovo Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]
	Flags: bus master, fast devsel, latency 0, IRQ 51
	Memory at b0000000 (64-bit, prefetchable) [size=256M]
	Memory at c0000000 (64-bit, prefetchable) [size=2M]
	I/O ports at 1000 [size=256]
	Memory at c0600000 (32-bit, non-prefetchable) [size=512K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
	Capabilities: [64] Express Legacy Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable+ Count=1/4 Maskable- 64bit+
	Capabilities: [c0] MSI-X: Enable- Count=3 Masked-
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [200] Resizable BAR <?>
	Capabilities: [270] Secondary PCI Express <?>
	Capabilities: [2b0] Address Translation Service (ATS)
	Capabilities: [2c0] Page Request Interface (PRI)
	Capabilities: [2d0] Process Address Space ID (PASID)
	Capabilities: [320] Latency Tolerance Reporting
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu
Comment 10 Mirek Kratochvil 2019-09-09 08:29:31 UTC
Hello everyone,

Do the artefacts look like the ones in this picture, or am I having a different issue?
http://e-x-a.org/stuff/amdgpu-artefacts.jpg
(Taken with a phone, as the artefacts are not screenshottable.)

The squares appear around small things that change (especially terminal text) and disappear in around half a second. Notably, they are only seen in Xfce (I suspect a compositor is needed), not in LightDM (which does not do compositing), nor around any frequently refreshed or accelerated surface (glxgears and animations in Firefox are clean).

Mine is:

05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] (rev c3) (prog-if 00 [VGA controller])
	Subsystem: Lenovo Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]
	Flags: bus master, fast devsel, latency 0, IRQ 58
	Memory at b0000000 (64-bit, prefetchable) [size=256M]
	Memory at c0000000 (64-bit, prefetchable) [size=2M]
	I/O ports at 1000 [size=256]
	Memory at c0800000 (32-bit, non-prefetchable) [size=512K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
	Capabilities: [64] Express Legacy Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable+ Count=1/4 Maskable- 64bit+
	Capabilities: [c0] MSI-X: Enable- Count=3 Masked-
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [200] Resizable BAR <?>
	Capabilities: [270] Secondary PCI Express <?>
	Capabilities: [2b0] Address Translation Service (ATS)
	Capabilities: [2c0] Page Request Interface (PRI)
	Capabilities: [2d0] Process Address Space ID (PASID)
	Capabilities: [320] Latency Tolerance Reporting
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu

The problem happens on all 5.2 kernels I tried (from Debian). The "Debian stable" 4.19 kernel and the one 5.1 kernel I tried are OK.

If this is a different kind of artifact, please let me know (I'd open a separate bug).

Thanks in advance!
-mk
Comment 11 tones111 2019-09-09 11:07:15 UTC
Some good news. After a BIOS update to Lenovo's E485/E585 1.54, I no longer need to provide additional boot arguments for the machine to come up, and the visual artifacts have gone away.

I would see issues with some fonts in Firefox that looked similar to your screenshot.  The easiest way for me to reproduce the problem was to resize my terminal (Alacritty) or scroll around in gitk or gvim.

After a few days running on the new BIOS I haven't seen the artifacts, so this bug looks to be resolved for me since kernel 5.2.11.

Thanks!
Comment 12 Mirek Kratochvil 2019-09-09 11:14:41 UTC
That sounds great, thank you very much for the information and confirmation. I will try to update the BIOS and confirm ASAP.
Comment 13 Mirek Kratochvil 2019-09-19 21:09:01 UTC
After the BIOS upgrade the kernel parameters can be removed, but the kernel (5.2.16) now locks up when entering Xfce (it survives LightDM, though). The error is almost the same as in the posted dmesg; I'll attach mine with backtraces in a few seconds.

Highlights:

This gets printed out before each warning:
[   66.159175] [drm] pstate TEST_DEBUG_DATA: 0x36F60000

R08 increases by some value between 49 and 56 with each subsequent warning (the value is sometimes in R10 instead).

Userspace otherwise seems to be working (the logs are from syslog); the display just doesn't show anything.

I will try a few other kernels available for Debian and eventually bisect.
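
For anyone who wants to check the R08 pattern in their own log, here is a rough Python sketch. It assumes the register dump lines contain `R08: <16 hex digits>`, which should be verified against the actual syslog; the helper names are hypothetical, and the sample lines are synthetic apart from the `pstate TEST_DEBUG_DATA` marker quoted above.

```python
import re

# Marker line format taken from the highlight quoted in this comment.
PSTATE_RE = re.compile(r"\[drm\] pstate TEST_DEBUG_DATA: (0x[0-9A-Fa-f]+)")
# Assumed register-dump format; adjust if the real log differs.
R08_RE = re.compile(r"\bR08: ([0-9a-fA-F]{16})\b")


def count_pstate_markers(lines):
    """Count lines that contain the pstate TEST_DEBUG_DATA marker."""
    return sum(1 for line in lines if PSTATE_RE.search(line))


def r08_deltas(lines):
    """Return the differences between successive R08 values found in lines."""
    values = [int(m.group(1), 16) for line in lines if (m := R08_RE.search(line))]
    return [b - a for a, b in zip(values, values[1:])]


# Synthetic register values for illustration only, not from the attached syslog.
sample = [
    "[   66.159175] [drm] pstate TEST_DEBUG_DATA: 0x36F60000",
    "RBP: 0000000000000000 R08: 0000000000000100 R09: 0000000000000000",
    "RBP: 0000000000000000 R08: 0000000000000132 R09: 0000000000000000",
    "RBP: 0000000000000000 R08: 0000000000000169 R09: 0000000000000000",
]
print(count_pstate_markers(sample), r08_deltas(sample))
```

To run it against a real file, pass `open(path).read().splitlines()` instead of `sample`. Since the value is sometimes in R10, the register name in `R08_RE` may need to be changed accordingly.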
Comment 14 Mirek Kratochvil 2019-09-19 21:11:15 UTC
Created attachment 285069 [details]
syslog from 5.2.16 with warnings
Comment 15 Łukasz Żarnowiecki 2019-09-23 07:36:43 UTC
I updated the kernel to 5.3 and the problem disappeared. I did not update the BIOS or anything like that. Perhaps the problem you are facing is different from the one originally reported.
