Bug 201727 - Hardware Error reported on Ryzen 5 2500U
Summary: Hardware Error reported on Ryzen 5 2500U
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-11-19 09:25 UTC by Michal Herko
Modified: 2019-01-25 20:24 UTC
CC List: 6 users

See Also:
Kernel Version: 4.20-rc3
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg (77.52 KB, text/plain)
2018-11-19 09:25 UTC, Michal Herko
/proc/cpuinfo (10.88 KB, text/plain)
2018-11-19 09:26 UTC, Michal Herko
lspci -vvv (39.47 KB, text/plain)
2018-11-19 09:26 UTC, Michal Herko
.config (149.20 KB, text/plain)
2018-11-19 09:27 UTC, Michal Herko
revert of 284dec (1.54 KB, patch)
2018-11-19 11:48 UTC, Michal Herko
v5.0-rc1 boot lockup (106.25 KB, text/plain)
2019-01-07 04:30 UTC, tones111

Description Michal Herko 2018-11-19 09:25:33 UTC
Created attachment 279511
dmesg

How to reproduce: Boot and start any graphical application, for example gdm
Expected: gdm will start
Actual: the screen goes black with the cursor visible in the top-left corner, and the log fills with [Hardware Error] messages.

I have bisected the bug; it appears to have been introduced by commit 284dec4317c8e76f45d3ce922f673c80331812f1.

Let me know if you think the hardware is actually broken, but I have had no issues on Windows and no hardware errors on the 4.19 kernel.
Comment 1 Michal Herko 2018-11-19 09:26:13 UTC
Created attachment 279513
/proc/cpuinfo
Comment 2 Michal Herko 2018-11-19 09:26:36 UTC
Created attachment 279515
lspci -vvv
Comment 3 Michal Herko 2018-11-19 09:27:12 UTC
Created attachment 279517
.config
Comment 4 Mike Lothian 2018-11-19 10:58:07 UTC
If you've bisected then it's unlikely to be a hardware issue

If you revert the commit does everything work again?
Comment 5 Michel Dänzer 2018-11-19 11:03:29 UTC
That's

commit 284dec4317c8e76f45d3ce922f673c80331812f1
Author: Christian König <christian.koenig@amd.com>
Date:   Wed Aug 22 16:44:56 2018 +0200

    drm/amdgpu: enable GTT PD/PT for raven v3
Comment 6 Christian König 2018-11-19 11:40:42 UTC
Currently GTT is only used for PD/PT as a last resort, when so little stolen memory is assigned to the APU that it won't work at all otherwise.

The faulting address looks suspiciously like we fail to handle an error code correctly somewhere and instead use the value as a DMA address.

What is your BIOS setting for the stolen VRAM?
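
To illustrate the kind of mistake suspected above, here is a minimal Python sketch of what a negative errno looks like once it is reinterpreted as a 64-bit address. It is not taken from the amdgpu code: the helper names are hypothetical, and the range check only mirrors the kernel's `IS_ERR_VALUE()` convention, where `MAX_ERRNO` is 4095. The actual faulting address is in the attached dmesg and is not reproduced here.

```python
import errno

MAX_ERRNO = 4095          # same bound the kernel uses in include/linux/err.h
U64_MASK = (1 << 64) - 1  # reinterpret a value as an unsigned 64-bit integer


def errno_as_address(err: int) -> int:
    """Return the 64-bit value a negative errno becomes if used unchecked.

    `err` is the positive errno number (for example errno.ENOMEM).
    """
    return (-err) & U64_MASK


def looks_like_errno(addr: int) -> bool:
    """True if addr lies in the top 4095 values of the 64-bit range."""
    return addr >= ((-MAX_ERRNO) & U64_MASK)


# Example: -ENOMEM mistakenly used as a DMA address.
bogus = errno_as_address(errno.ENOMEM)
print(hex(bogus), looks_like_errno(bogus))
```

An address that passes `looks_like_errno` is only a hint that an error path was missed; it does not identify which call returned the error.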
Comment 7 Michal Herko 2018-11-19 11:47:26 UTC
Everything works after a revert. There was a conflict, so I am attaching a diff of the revert.
Comment 8 Michal Herko 2018-11-19 11:48:23 UTC
Created attachment 279521
revert of 284dec
Comment 9 Michal Herko 2018-11-19 12:50:44 UTC
> What is your BIOS setting for the stolen VRAM?
I can't find any such setting in the BIOS. It really has no options regarding video.
Comment 10 Alex Deucher 2018-11-19 16:31:08 UTC
(In reply to Christian König from comment #6)
> 
> What is your BIOS setting for the stolen VRAM?

[    2.665594] [drm] amdgpu: 256M of VRAM memory ready
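
The stolen VRAM size can be read from the boot log, as the quoted line shows. Below is a minimal sketch for pulling that figure out of saved dmesg output. The function name `vram_megabytes` is hypothetical, and the pattern is based only on the message format quoted in this report, so it may need adjusting for other kernel versions.

```python
import re
from typing import Optional

# Matches lines such as:
#   [    2.665594] [drm] amdgpu: 256M of VRAM memory ready
VRAM_RE = re.compile(r"amdgpu: (\d+)M of VRAM memory ready")


def vram_megabytes(dmesg_text: str) -> Optional[int]:
    """Return the VRAM size (the number before 'M') reported by amdgpu.

    Returns None when no matching line is present in the text.
    """
    match = VRAM_RE.search(dmesg_text)
    return int(match.group(1)) if match else None


line = "[    2.665594] [drm] amdgpu: 256M of VRAM memory ready"
print(vram_megabytes(line))  # prints 256
```

Running it over a whole saved dmesg file works the same way, since the function searches for the first matching line in the text.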
Comment 11 Michal Herko 2018-12-09 21:01:31 UTC
The bug is no longer present on v4.20-rc5.
Comment 12 Rafał Miłecki 2018-12-17 05:43:46 UTC
I was curious what could have fixed that. I tried to reproduce it on a totally different notebook with a 2500U (EliteBook 745 G5), but I wasn't getting any MCE-reported errors with commit 284dec4317c8 ("drm/amdgpu: enable GTT PD/PT for raven v3"). Probably because it has more stolen VRAM:
[    5.232179] [drm] amdgpu: 1024M of VRAM memory ready

It's probably one of those:
git log --oneline v4.20-rc2..v4.20-rc5 drivers/gpu/drm/amd/amdgpu/
ad97d9de4583 drm/amdgpu: Add delay after enable RLC ucode
1954db153d18 drm/amdgpu: Avoid endless loop in GPUVM fragment processing
9ce2b991f7ea drm/amdgpu: Cast to uint64_t before left shift
a5d0f4565996 drm/amdgpu: Enable HDP memory light sleep
8d4d7c589947 drm/amdgpu: Add missing firmware entry for HAINAN
919a52fc4ca1 drm/amdgpu: Fix oops when pp_funcs->switch_power_profile is unset
69756c6ff0de drm/amdgpu: Add amdgpu "max bpc" connector property (v2)
c1a17777eb45 drm/amdgpu: fix huge page handling on Vega10
c837243ff401 drm/amdgpu: fix bug with IH ring setup
5581c670fb7e drm/amdgpu: set system aperture to cover whole FB region
Comment 13 Michal Herko 2018-12-29 21:06:11 UTC
I am reopening this because I have realized the bug is still present.
I had accidentally tested with the "iommu=soft" kernel parameter. With this parameter the bug does not appear and the system is usable.
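
Since a leftover kernel parameter changed the test result here, it may help to check the boot command line before retesting. The following is a rough sketch that reads `/proc/cmdline` and looks up a parameter. The helper names are hypothetical, the example command line is made up for illustration, and quoted values containing spaces are not handled.

```python
from typing import Optional


def kernel_param(cmdline: str, name: str) -> Optional[str]:
    """Return the value of the last `name=value` token, or None if absent."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name and sep:
            value = val
    return value


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read().strip()


# Hypothetical command line, for illustration only.
example = "root=/dev/sda2 ro quiet iommu=soft"
print(kernel_param(example, "iommu"))  # prints soft
```

On the machine under test, `kernel_param(read_cmdline(), "iommu")` returning `None` would indicate that no `iommu=` override is in effect.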
Comment 14 tones111 2019-01-05 01:20:14 UTC
I'm experiencing this bug on a Lenovo E585.  My boot logs also show only "256M of VRAM memory ready" when running 4.19.12 or commit 284dec4317c8.  4.19.12 boots fine, but the commit in question causes a lockup.

Is there any data I can collect or patches to test to support resolving this?  Thanks for any insight or direction.
Comment 15 tones111 2019-01-07 04:30:00 UTC
Created attachment 280291
v5.0-rc1 boot lockup

Boot log running v5.0-rc1 attached
Comment 17 Michal Herko 2019-01-24 12:53:39 UTC
I confirm the fix works. v5.0-rc3 also works.
Let me know if you need any more testing.
