Bug 201273

Summary: Fatal error during GPU init amdgpu RX560
Product: Drivers
Reporter: qbl
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED CODE_FIX
Severity: normal
CC: alexdeucher, marco.rodolfi
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 4.18.9, 4.18.10 and possibly earlier; 4.19, 4.20 and 5.0 too
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
dmesg lsmod lspci lsusb cpuinfo url
dmesg old + new
dmesg + amdgpu_pm_info
config+dmesg for patched 4.19.1
test fix
dmesg lsmod additional infos
dmesg amd_gpu_firmware_info
boot.omsg amdgpu_firmware_info messages.part Xorg.0.log.old

Description qbl 2018-09-28 16:37:07 UTC
Created attachment 278827 [details]
dmesg lsmod lspci lsusb cpuinfo url

Since installing an AMD Radeon RX 560 in an APU-based system, the system sometimes shows a black screen at bootup (the USB keyboard hangs too, so no SysRq -> reset).
Sometimes the system boots to the GUI but the console is garbled, dmesg shows a fatal error during GPU init, and reboot/shutdown hangs.
BIOS: APU set to auto or disabled ("on" not tested)
Attached:
dmesg with/without Error
lsmod with/without Error
lspci / lsusb
/proc/cpuinfo
url to MB and graphics card
Comment 1 Alex Deucher 2018-10-09 17:36:03 UTC
Is this a regression?  If so, can you bisect?
Comment 2 qbl 2018-10-09 21:44:12 UTC
Created attachment 278973 [details]
dmesg old + new

The RX560 is new. The error happens only from time to time and only at bootup, and I mostly have to press reset. While bisecting https://bugzilla.kernel.org/show_bug.cgi?id=201275
I created two more files with dmesg output. I have included all of them in the archive. Maybe v4.18.0 is affected too. I could use an older kernel to verify, but since this is not a predictable case, I would have to use it for several days to check. Rarely, the monitor blacks out (really black, backlight and sound off too), but this may be a hardware issue. Fiddling with the HDMI plug on the monitor side resolves that. This could matter if the init code is affected, but I don't think it is.
I could disable CONFIG_DRM_AMDGPU_CIK to test whether this is related too.
Comment 3 qbl 2018-10-10 15:47:07 UTC
(In reply to Alex Deucher from comment #1)
> Is this a regression?  If so, can you bisect?

Which is the first release supporting the RX560?
Comment 4 qbl 2018-10-17 05:56:54 UTC
I mounted my monitor on the wall about a week ago. Hence the torsional moment on the HDMI plug has changed. None of the described errors has occurred since. So the crash may be triggered by a bad HDMI signal caused by a bad plug or even a hairline crack in the monitor's board. I have used v4.18.12 and earlier.
Comment 5 qbl 2018-10-17 09:27:06 UTC
It just happened again. (v4.18.14)
For now I will attach another monitor with another cable. Checking that may take more than a week.
Comment 6 qbl 2018-10-18 13:21:23 UTC
Created attachment 279089 [details]
dmesg + amdgpu_pm_info

New monitor and HDMI cable. The bug is not impressed, i.e. the system still hangs at bootup sometimes (v4.18.14).

Maybe another Bug:
The system booted normally, but graphics got stuck later.
dmesg shows error messages at bootup:
[    9.941637] amdgpu: [powerplay] 
                failed to send message 148 ret is 0
...

This kind of message was triggered again later by using sensors. Graphics got stuck and cat /sys/.../amdgpu_pm_info took some seconds.
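
For anyone comparing several dmesg captures for this symptom, the powerplay failure quoted in this comment can be picked out mechanically. The following is only a minimal sketch: the pattern is derived from the two-line excerpt above, and the function name and sample string are illustrative, not part of any amdgpu tooling.

```python
import re

# Pattern based solely on the excerpt in this comment: a timestamped
# "amdgpu: [powerplay]" prefix, then the message text on the next line.
# Other kernel versions may format the message differently (assumption).
POWERPLAY_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+amdgpu: \[powerplay\]\s*"
    r"failed to send message (?P<msg>\w+) ret is (?P<ret>\w+)"
)


def find_powerplay_failures(dmesg_text):
    """Return (timestamp, message id, return value) for each failure found."""
    return [
        (float(m.group("ts")), m.group("msg"), m.group("ret"))
        for m in POWERPLAY_RE.finditer(dmesg_text)
    ]


# The two lines quoted in this comment, used as sample input.
SAMPLE = (
    "[    9.941637] amdgpu: [powerplay] \n"
    "                failed to send message 148 ret is 0\n"
)

if __name__ == "__main__":
    for ts, msg, ret in find_powerplay_failures(SAMPLE):
        print(f"{ts:.6f}s: message {msg} failed, ret={ret}")
```

Feeding it the saved dmesg text of a good and a bad boot would show whether the message appears only in the failing case.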
Comment 7 qbl 2018-10-23 14:00:46 UTC
Bug is still alive. v4.19
Comment 8 qbl 2018-11-05 06:22:09 UTC
I replaced the HDMI cable with DisplayPort about 2 weeks ago. No bug visible since. (Firmware update about 1 week ago.)
Maybe HDMI is broken, or the implementation in the monitors/graphics board is bad, or the cables are bad, or the implementation in amdgpu is bad.
HDMI with the old monitor and old cable worked well for about 2 years with radeon and the APU.
Comment 9 Alex Deucher 2018-11-05 15:39:54 UTC
Does this patch help?
https://patchwork.freedesktop.org/patch/259364/
Comment 10 qbl 2018-11-05 19:52:50 UTC
On 05.11.18 at 16:39, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=201273
> 
> --- Comment #9 from Alex Deucher (alexdeucher@gmail.com) --- Does
> this patch help? https://patchwork.freedesktop.org/patch/259364/
> 
The bug was just on a short leave. It is still alive with v4.19, even
with DisplayPort. I'll check the mentioned patch, but it may not help,
as the mainboard supports PCIe 3.0 (with limitations). However, it may
take several weeks to verify.

https://asrock.com/MB/AMD/FM2A78M%20Pro3+/index.de.asp?cat=Download
- 1 x PCI Express 3.0 x16 Slot (PCIE1 @ x16 mode)
...
*PCIE 3.0 is only supported with FM2+ CPU. With FM2 CPU, it only
supports PCIE 2.0.
Comment 11 qbl 2018-11-09 17:24:10 UTC
Created attachment 279393 [details]
config+dmesg for patched 4.19.1

Bug is still alive.
A more precise description:
The system boots and shows some plymouth boot messages at low resolution. On a normal boot, the console then switches to high resolution and shows more messages:
Console: switching to colour frame buffer device 240x67
amdgpu 0000:01:00.0: fb0: amdgpudrmfb frame buffer device
This may be the point where the system hangs, because the high-resolution messages have never been seen when the system crashed.
Comment 12 qbl 2018-11-21 18:10:00 UTC
Bug is still alive. v4.19.3 + patch
Comment 13 qbl 2018-11-23 16:00:11 UTC
(In reply to quirin.blaeser from comment #11)

Addendum:
- The number of plymouth boot messages at low res is not constant.
- The error message has not been seen for at least 4 weeks.
- Behaviour has changed somewhat:
 · old: plymouth low res - clear screen - switch off backlight - system hangs
 · new: plymouth low res - clear screen - system hangs
Comment 14 Alex Deucher 2018-11-23 21:34:17 UTC
Created attachment 279635 [details]
test fix

Does this patch help?
Comment 15 qbl 2018-11-24 18:03:45 UTC
(In reply to Alex Deucher from comment #14)
> Created attachment 279635 [details]
> test fix
> 
> Does this patch help?

I'll check that, but it may require at least a week to verify.
Comment 16 qbl 2018-11-25 20:41:20 UTC
(In reply to quirin.blaeser from comment #15)

Bug is still alive: v4.19.3 + patches
Comment 17 qbl 2018-12-08 07:12:26 UTC
Bug is still alive: v4.19.5, no patches
Comment 18 qbl 2018-12-10 19:51:31 UTC
Bug is still alive: v4.19.7
Comment 19 qbl 2018-12-13 06:36:43 UTC
Created attachment 279983 [details]
dmesg lsmod additional infos

The error message has returned (v4.19.7, no patches).
The archive contains dmesg and lsmod output, both for the error case and for a normal startup.
It also contains some additional information, e.g. a description of the changed behavior of my monitor at bootup, Xorg.0.log (normal) and Xorg.0.log.old (error).
Comment 21 qbl 2018-12-13 17:24:08 UTC
(In reply to Alex Deucher from comment #20)
> Can you try one of these branches?
> https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
> https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-4.21-wip
> Do they help?

As I understand it, the first one matches v4.19, so this may be a candidate, but your server is slow. I'll try again when the US is sleeping.
Comment 22 Alex Deucher 2018-12-13 18:25:22 UTC
You'll also need this updated firmware:
https://people.freedesktop.org/~agd5f/radeon_ucode/polaris11_k_mc.bin
Comment 23 qbl 2018-12-14 07:59:53 UTC
Created attachment 280009 [details]
dmesg amd_gpu_firmware_info

(In reply to Alex Deucher from comment #22)
> You'll also need this updated firmware:
> https://people.freedesktop.org/~agd5f/radeon_ucode/polaris11_k_mc.bin

I downloaded this firmware and copied it to /lib/firmware/amdgpu.
At the next boot the system crashed with slightly different behavior:
plymouth low res -> clear -> backlights off -> backlights on -> plymouth high res -> crash
Last message: Starting X display manager

The archive contains dmesg (normal) and amdgpu_firmware_info from
/sys/kernel/debug/dri/1/
Comment 24 qbl 2018-12-16 10:49:26 UTC
Bug is still alive. amd-staging-drm-next 4da295299bda146326b44f22d8eeaa797d6acb38
Comment 25 qbl 2018-12-17 14:53:02 UTC
Bug is still alive. amd-staging-drm-next
75061f375e7f627acbc3aa5466bc1ee552f3f22c
plymouth low res -> clear -> crash
(4da295299bda146326b44f22d8eeaa797d6acb38 too)
Comment 26 qbl 2018-12-22 17:57:11 UTC
Bug is still alive. amd-staging-drm-next
0050e1bd9b509dc764c0180ad5010d4591755abf
Comment 27 qbl 2018-12-24 11:32:03 UTC
Bug is still alive. amd-staging-drm-next
d9c54d61df327dc93374b718d7941a09e02e32e1
Comment 28 qbl 2019-01-05 22:46:31 UTC
Bug is still alive. amd-staging-drm-next
d004289b58fe5ddc71c86411c53f52db2327b3fb

Firmware:
0f22c8527439eaaf5c3fcf87b31c89445b6fa84d
git://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git
Comment 29 qbl 2019-01-13 09:22:30 UTC
Bug is still alive. amd-staging-drm-next
9698024e8a191481321574bec1fe886bbce797cf
Comment 30 qbl 2019-01-22 16:58:09 UTC
Bug is still alive. amd-staging-drm-next
d2d07f246b126b23d02af0603b83866a3c3e2483
Comment 31 qbl 2019-01-23 09:29:21 UTC
Bug is still alive. amd-staging-drm-next
9846725b3f1c54bdc072b42c9b67b8dd6fdf9a90
Comment 32 qbl 2019-01-27 12:42:00 UTC
Bug is still alive. amd-staging-drm-next
7990e4b7b3b63e417a8a8cb8b33bc732b7e9e32f
Comment 33 qbl 2019-02-09 20:50:34 UTC
Bug is still alive. amd-staging-drm-next
8202c53d8f8e1045b8d1ec2db9401618b8889614
Comment 34 qbl 2019-02-09 21:04:30 UTC
git pull failed.
drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c: needs merge
drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c: needs merge
drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c: needs merge
drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h: needs merge
drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c: needs merge
drivers/gpu/drm/amd/amdgpu/soc15.c: needs merge
drivers/gpu/drm/amd/amdkfd/kfd_crat.c: needs merge
drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c: needs merge
...

I suggest you fix the repository. Meanwhile I'll switch to mainline.
Comment 35 qbl 2019-02-19 17:25:40 UTC
Bug is still alive. v4.20.9
Comment 36 qbl 2019-03-06 17:20:25 UTC
Bug is still alive. v5.0
Comment 37 qbl 2019-03-07 21:01:43 UTC
Created attachment 281575 [details]
boot.omsg amdgpu_firmware_info messages.part Xorg.0.log.old

Possibly a new bug.
No plymouth messages were seen and the screen turned its backlight off (no signal), but the USB keyboard worked -> SysRq
Archive contains /var/log/boot.omsg, part of /var/log/messages, /var/log/Xorg.0.log.old and /sys/kernel/debug/dri/1/amdgpu_firmware_info.

boot.omsg and messages.part contain a backtrace.
Comment 38 qbl 2019-03-25 18:14:33 UTC
Bug is still alive. v5.0.4
Comment 39 qbl 2019-04-04 21:00:52 UTC
Bug is still alive. v5.0.5
Comment 40 qbl 2019-04-05 06:25:24 UTC
Bug is still alive. v5.0.6
Comment 41 qbl 2019-04-11 19:54:24 UTC
Bug is still alive. v5.0.7
Comment 42 qbl 2019-04-24 12:16:12 UTC
Bug is still alive. v5.0.9
An additionally attached PS/2 keyboard is as dead as the USB one.
Comment 43 Alex Deucher 2019-04-24 19:44:56 UTC
Does booting with any of the following options on the kernel command line in grub help?
amd_iommu=off
idle=nomwait
iommu=pt
pci=noats
Can you also try different IOMMU and SVM settings in the sbios?  E.g., change from "auto" to "enabled" or "disabled".
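
When cycling through these options over many boots, it helps to confirm which ones were actually in effect for a given boot. The sketch below assumes the running kernel exposes its command line at /proc/cmdline and uses simple whitespace splitting (quoted values containing spaces are not handled); the function names are illustrative.

```python
from pathlib import Path

# The four options suggested in this comment.
SUGGESTED = ("amd_iommu=off", "idle=nomwait", "iommu=pt", "pci=noats")


def active_options(cmdline, wanted=SUGGESTED):
    """Map each wanted option to True if it appears as a whole token.

    Whole-token matching keeps "iommu=pt" from being confused with
    "amd_iommu=..." or similar longer parameters.
    """
    tokens = set(cmdline.split())
    return {opt: opt in tokens for opt in wanted}


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    return Path(path).read_text()


if __name__ == "__main__":
    for opt, present in active_options(read_cmdline()).items():
        print(f"{opt}: {'set' if present else 'not set'}")
```

Recording this output next to each saved dmesg makes it easier to tell later which combination a crash occurred with.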
Comment 44 qbl 2019-04-26 11:43:31 UTC
(In reply to Alex Deucher from comment #43)
> Does booting with any of the following options on the kernel command line in
> grub help?
> amd_iommu=off
> idle=nomwait
> iommu=pt
> pci=noats
> Can you also try different IOMMU and SVM settings in the sbios?  E.g.,
> change from "auto" to "enabled" or "disabled".

I'll try idle=nomwait iommu=pt pci=noats for now.
IOMMU and SVM are still enabled in the BIOS.
Comment 45 qbl 2019-04-27 19:22:22 UTC
(In reply to quirin.blaeser from comment #44)

Bug is still alive. v5.0.9
Next Step:
iommu and svm disabled in BIOS
changed cmdline to amd_iommu=off
Comment 46 qbl 2019-04-28 15:46:48 UTC
(In reply to quirin.blaeser from comment #45)

Bug is still alive. v5.0.9
now iommu and svm enabled in BIOS
cmdline not changed: amd_iommu=off
Comment 47 qbl 2019-04-30 19:12:04 UTC
(In reply to quirin.blaeser from comment #46)

Bug is still alive. v5.0.9
Any more suggestions?
Comment 48 Marco 2019-05-01 18:43:54 UTC
Same here on a B450 board with an RX580 GPU: a random black screen with a complete lockup of the system (a panic, probably?), sometimes after booting, between 5 seconds and two minutes after the graphical boot. If the system survives this window, it is stable until shutdown.

Quite annoying, frankly. I still need to properly test the module parameter amdgpu.dc=0, which seems to disable the firmware load of the Display Controller (if I remember correctly); but which features of the card (if any) will not work if the firmware is not loaded?

The monitor is a 120 Hz model connected to the GPU via DVI-D. Since someone suggested this was simply a lockup of the output, fixed by replugging the monitor, I tried connecting the monitor via HDMI after it froze and black-screened; however, there was no output there either, and no response from the system when I pressed the power button. Hence this looks to me like a kernel panic.

Is the open source driver or the DC firmware itself the cause of these panics? Has anyone tried to contact AMD about this issue if it's the latter?
Comment 49 Alex Deucher 2019-05-01 19:25:50 UTC
Can you still log in remotely via ssh and get an updated dmesg?  If it's a blank screen, can you try another display connector on the board?  The amdgpu.dc option switches between the old simpler display code (dc=0) and the newer display code that supports more advanced features like audio and DP MST (dc=1).  There is no firmware in the display engine for this asic.
Comment 50 qbl 2019-05-02 20:36:23 UTC
Bug is still alive. v5.0.10
Comment 51 Marco 2019-05-03 10:33:00 UTC
(In reply to Alex Deucher from comment #49)
> Can you still log in remotely via ssh and get an updated dmesg?  If it's a
> blank screen, can you try another display connector on the board?  The
> amdgpu.dc option switches between the old simpler display code (dc=0) and
> the newer display code that support more advanced features like audio and DP
> MST (dc=1).  There is no firmware in the display engine for this asic.

As stated above, it was connected through DVI-D and I already tried HDMI, but there was no signal anywhere.

Regarding the firmware, that was my mistake: I was reading a version number when it was enabled in the module and automatically assumed it was a firmware version uploaded to the ASIC. My bad.

Regardless of the above, I rebooted 25 times yesterday and couldn't reproduce the problem again, unfortunately (which is fine for me anyway :P). If the problem occurs again, I now have an SSH server enabled, so I'll post the dmesg (although I doubt I'll be able to connect, since when it occurred I wasn't even able to trigger a graceful ACPI powerdown by pressing the power button).
Comment 52 Marco 2019-05-03 10:34:10 UTC
(In reply to Marco from comment #51)

(Anywhere besides DP, since I don't have a monitor with DP at hand.)
Comment 53 Marco 2019-05-03 17:04:39 UTC
For the first time, it triggered while I was using the system, after it had been running for some time; as expected, the system disappeared from the network, so no dmesg is available. I'll try without DC enabled to see if this fixes the problem. If you know of any additional kernel parameters to debug this issue, I'm happy to try them out.
Comment 54 Marco 2019-05-09 09:27:25 UTC
It seems that I've managed to "fix" it by loading the amdgpu module early at boot via mkinitcpio.conf; after 30 to 40 reboots I haven't seen a crash since. But because the problem appears so randomly, I'm not sure whether it is just luck or whether something specific that I'm no longer doing triggers it.

I'll post this here in the hope that it will help someone else.
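
The comment above does not show the exact change. A common way to load a module early with mkinitcpio is to list it in the MODULES array of /etc/mkinitcpio.conf (for example MODULES=(amdgpu)) and then regenerate the initramfs, typically with mkinitcpio -P. Assuming that is what was done here, the following sketch checks whether a given config lists amdgpu; the function name is illustrative, and both the array form and the older quoted-string form are accepted.

```python
import re
from pathlib import Path

# Matches an uncommented MODULES= assignment in either the array form
# MODULES=(a b) or the older string form MODULES="a b".
MODULES_RE = re.compile(r'^\s*MODULES=[("](?P<body>[^)"]*)[)"]', re.MULTILINE)


def early_modules(conf_text):
    """Return module names from the last uncommented MODULES= assignment."""
    matches = MODULES_RE.findall(conf_text)
    return matches[-1].split() if matches else []


if __name__ == "__main__":
    conf = Path("/etc/mkinitcpio.conf")  # default location (assumption)
    if conf.exists():
        mods = early_modules(conf.read_text())
        state = "is" if "amdgpu" in mods else "is not"
        print(f"amdgpu {state} listed for early loading: {mods}")
    else:
        print("no /etc/mkinitcpio.conf found on this system")
```

This only inspects the configuration file; it does not confirm that the initramfs was rebuilt afterwards.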
Comment 55 qbl 2019-06-08 09:09:39 UTC
No crash since upgrade to v5.1.0, a month ago.
Comment 56 Marco 2019-06-08 11:23:13 UTC
(In reply to quirin.blaeser from comment #55)
> No crash since upgrade to v5.1.0, a month ago.

Same here. Haven’t seen any crashes lately.
Comment 57 qbl 2022-07-28 15:35:30 UTC
Three years without a crash should be enough to prove: It's fixed.