Bug 201273
Summary: | Fatal error during GPU init amdgpu RX560 | ||
---|---|---|---|
Product: | Drivers | Reporter: | qbl |
Component: | Video(DRI - non Intel) | Assignee: | drivers_video-dri |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | alexdeucher, marco.rodolfi |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | 4.18.9, 4.18.10 and possibly earlier; 4.19, 4.20 and 5.0 too | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | dmesg lsmod lspci lsusb cpuinfo url; dmesg old + new; dmesg + amdgpu_pm_info; config+dmesg for patched 4.19.1; test fix; dmesg lsmod additional infos; dmesg amd_gpu_firmware_info; boot.omsg amdgpu_firmware_info messages.part Xorg.0.log.old | ||
Is this a regression? If so, can you bisect?

Created attachment 278973 [details]
dmesg old + new
The RX560 is new. The error happens only occasionally and only at bootup, and I mostly have to press reset. While bisecting https://bugzilla.kernel.org/show_bug.cgi?id=201275 I created two more files with dmesg output. I have included all of them in the archive.
Maybe v4.18.0 is affected too. I could use an older kernel to verify, but since this is not a predictable case, I would have to use it for several days to check.
Rarely, the monitor blacks out (really black, backlight and sound off too), but this may be a hardware issue. Fiddling with the HDMI plug on the monitor's side resolves that. This may be an issue if code is affected at init, but I don't think it is.
I could disable CONFIG_DRM_AMDGPU_CIK to test whether this is related too.

(In reply to Alex Deucher from comment #1)
> Is this a regression? If so, can you bisect?
Which is the first release supporting the RX560?

I mounted my monitor on the wall about a week ago. Hence the torsional moment at the HDMI plug has changed. None of the described errors has occurred since. So the crash may be triggered by a bad HDMI signal caused by a bad plug or even a hairline crack on the monitor's board. I have used v4.18.12 and earlier.

It just happened again (v4.18.14). For now I will attach another monitor with another cable. Checking that may take more than a week.

Created attachment 279089 [details]
dmesg + amdgpu_pm_info
New Monitor and HDMI-cable. Bug is not impressed - i.e. System hangs at bootup sometimes (v4.18.14).
Maybe another Bug:
System booted normally, but graphics got stuck later.
dmesg shows error messages at bootup:
[ 9.941637] amdgpu: [powerplay]
failed to send message 148 ret is 0
...
Messages of this kind were triggered again later by using sensors. Graphics got stuck and cat /sys/.../amdgpu_pm_info took a few seconds.
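To compare a normal boot with a bad one, it can help to pull these powerplay failures out of a saved dmesg dump. The following is only a minimal Python sketch: the function name `find_powerplay_failures` is made up for illustration, and it assumes the message text looks like the two-line excerpt quoted above (timestamp on the first line, "failed to send message ... ret is ..." on the next).

```python
import re

# Second half of the two-line powerplay error quoted above.
# The ids are kept as strings because their number base is not known here.
FAILED_MSG = re.compile(r"failed to send message\s+(\w+)\s+ret is\s+(\w+)")
# Kernel timestamp at the start of a dmesg line, e.g. "[    9.941637]".
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def find_powerplay_failures(dmesg_text):
    """Return a list of (timestamp, message_id, ret) tuples.

    The timestamp is the most recent one seen at or before the failure
    line, because the failure text may sit on a continuation line that
    carries no timestamp of its own.
    """
    failures = []
    last_ts = None
    for line in dmesg_text.splitlines():
        ts = TIMESTAMP.match(line)
        if ts:
            last_ts = float(ts.group(1))
        hit = FAILED_MSG.search(line)
        if hit:
            failures.append((last_ts, hit.group(1), hit.group(2)))
    return failures


if __name__ == "__main__":
    sample = (
        "[    9.941637] amdgpu: [powerplay]\n"
        " failed to send message 148 ret is 0\n"
    )
    for ts, msg_id, ret in find_powerplay_failures(sample):
        print(f"t={ts}s message={msg_id} ret={ret}")
```

Run against the dmesg files from a good and a bad boot, the two result lists show whether the failures appear only at init or also later, for example when sensors is used.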
Bug is still alive. v4.19

I replaced the HDMI cable with DisplayPort about 2 weeks ago. No bug visible. (Firmware update about 1 week ago.) Maybe HDMI is broken, or the implementation in the monitors or the graphics board is bad, or the cables are bad, or the implementation in amdgpu is bad. HDMI with the old monitor and old cable worked well for about 2 years with radeon and the APU.

Does this patch help?
https://patchwork.freedesktop.org/patch/259364/

On 05.11.18 at 16:39, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=201273
>
> --- Comment #9 from Alex Deucher (alexdeucher@gmail.com) ---
> Does this patch help? https://patchwork.freedesktop.org/patch/259364/
The bug just took a short leave. It is still alive with v4.19, even with DisplayPort.
I'll check the mentioned patch, but it may not work, as the mainboard supports PCIe 3.0 (with limitations). However, it may take several weeks to verify.
https://asrock.com/MB/AMD/FM2A78M%20Pro3+/index.de.asp?cat=Download
- 1 x PCI Express 3.0 x16 Slot (PCIE1 @ x16 mode)
...
*PCIE 3.0 is only supported with FM2+ CPU. With FM2 CPU, it only supports PCIE 2.0.

Created attachment 279393 [details]
config+dmesg for patched 4.19.1
Bug is still alive.
A more precise description:
The system boots and shows some plymouth boot messages at low res. On a normal boot, the console then switches to high res and shows more messages:
Console: switching to colour frame buffer device 240x67
amdgpu 0000:01:00.0: fb0: amdgpudrmfb frame buffer device
This may be the point where the system hangs, because the high-res messages have never been seen when the system crashed.
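Whether a given boot got past this point can be checked from a saved kernel log, if one survived the hang. A small Python sketch under that assumption follows; the helper names `boot_progress` and `reached_high_res` are hypothetical, and the marker strings are taken from the two console lines quoted above.

```python
# Marker substrings copied from the two console lines quoted above.
MARKERS = (
    "Console: switching to colour frame buffer device",
    "fb0: amdgpudrmfb frame buffer device",
)


def boot_progress(log_text):
    """Map each marker to True if it appears anywhere in the log text."""
    return {marker: marker in log_text for marker in MARKERS}


def reached_high_res(log_text):
    """True only if both framebuffer hand-over lines were logged."""
    return all(boot_progress(log_text).values())


if __name__ == "__main__":
    good = (
        "Console: switching to colour frame buffer device 240x67\n"
        "amdgpu 0000:01:00.0: fb0: amdgpudrmfb frame buffer device\n"
    )
    print(reached_high_res(good))
```

A log from a hung boot may be truncated or missing entirely, so a False result only says the lines were not written to that file, not where the hang happened.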
Bug is still alive. v4.19.3 + patch

(In reply to quirin.blaeser from comment #11)
Addendum:
- The count of plymouth boot messages in low res is not constant.
- The error message has not been seen for at least 4 weeks.
- Behaviour has changed somewhat:
  · old: plymouth low res - clear screen - switch off backlight - system hangs
  · new: plymouth low res - clear screen - system hangs

Created attachment 279635 [details]
test fix
Does this patch help?
(In reply to Alex Deucher from comment #14)
> Created attachment 279635 [details]
> test fix
>
> Does this patch help?
I'll check that, but it may require at least a week to verify.

(In reply to quirin.blaeser from comment #15)
> I'll check that, but it may require at least a week to verify.
Bug is still alive: v4.19.3 + patches

Bug is still alive: v4.19.5, no patches

Bug is still alive: v4.19.7

Created attachment 279983 [details]
dmesg lsmod additional infos
Error message returned. v4.19.7 no patches
Archive contains dmesg and lsmod, both for error and normal startup.
It also contains some additional infos, e.g. describing changed behavior of my monitor at bootup, Xorg.0.log (normal) and Xorg.0.log.old (error).
Can you try one of these branches?
https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-4.21-wip
Do they help?

(In reply to Alex Deucher from comment #20)
> Can you try one of these branches?
> https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
> https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-4.21-wip
> Do they help?
As I understand it, the first one matches v4.19. So this may be a candidate, but your server is slow. I'll try it again when the US is sleeping.

You'll also need this updated firmware:
https://people.freedesktop.org/~agd5f/radeon_ucode/polaris11_k_mc.bin

Created attachment 280009 [details]
dmesg amd_gpu_firmware_info
(In reply to Alex Deucher from comment #22)
> You'll also need this updated firmware:
> https://people.freedesktop.org/~agd5f/radeon_ucode/polaris11_k_mc.bin
I've downloaded this firmware and copied it to /lib/firmware/amdgpu. At the next boot the system crashed with a slightly modified behavior:
plymouth low res -> clear -> backlights off -> backlights on -> plymouth high res -> crash
Last message: Starting X display manager
Archive contains dmesg (normal) and amd_gpu_firmware_info from /sys/kernel/debug/dri/1/

Bug is still alive. amd-staging-drm-next 4da295299bda146326b44f22d8eeaa797d6acb38

Bug is still alive. amd-staging-drm-next 75061f375e7f627acbc3aa5466bc1ee552f3f22c
plymouth low res -> clear -> crash (4da295299bda146326b44f22d8eeaa797d6acb38 too)

Bug is still alive. amd-staging-drm-next 0050e1bd9b509dc764c0180ad5010d4591755abf

Bug is still alive. amd-staging-drm-next d9c54d61df327dc93374b718d7941a09e02e32e1

Bug is still alive. amd-staging-drm-next d004289b58fe5ddc71c86411c53f52db2327b3fb
Firmware: 0f22c8527439eaaf5c3fcf87b31c89445b6fa84d git://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git

Bug is still alive. amd-staging-drm-next 9698024e8a191481321574bec1fe886bbce797cf

Bug is still alive. amd-staging-drm-next d2d07f246b126b23d02af0603b83866a3c3e2483

Bug is still alive. amd-staging-drm-next 9846725b3f1c54bdc072b42c9b67b8dd6fdf9a90

Bug is still alive. amd-staging-drm-next 7990e4b7b3b63e417a8a8cb8b33bc732b7e9e32f

Bug is still alive. amd-staging-drm-next 8202c53d8f8e1045b8d1ec2db9401618b8889614

git pull failed.
drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c: needs merge
drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c: needs merge
drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c: needs merge
drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h: needs merge
drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c: needs merge
drivers/gpu/drm/amd/amdgpu/soc15.c: needs merge
drivers/gpu/drm/amd/amdkfd/kfd_crat.c: needs merge
drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c: needs merge
...
I suggest you fix the repository. Meanwhile I'll switch to mainline.

Bug is still alive. v4.20.9

Bug is still alive. v5.0

Created attachment 281575 [details]
boot.omsg amdgpu_firmware_info messages.part Xorg.0.log.old
Possibly new Bug.
No plymouth messages seen; the screen turned its backlight off (no signal). USB keyboard worked -> SysRq
Archive contains /var/log/boot.omsg, part of /var/log/messages, /var/log/Xorg.0.log.old and /sys/kernel/debug/dri/1/amdgpu_firmware_info.
boot.omsg and messages.part contain backtrace
Bug is still alive. v5.0.4

Bug is still alive. v5.0.5

Bug is still alive. v5.0.6

Bug is still alive. v5.0.7

Bug is still alive. v5.0.9
An additional attached PS/2 keyboard is as dead as the USB one.

Does booting with any of the following options on the kernel command line in grub help?
amd_iommu=off
idle=nomwait
iommu=pt
pci=noats
Can you also try different IOMMU and SVM settings in the sbios? E.g., change from "auto" to "enabled" or "disabled".

(In reply to Alex Deucher from comment #43)
I'll try idle=nomwait iommu=pt pci=noats for now.
IOMMU and SVM are still enabled in BIOS.

(In reply to quirin.blaeser from comment #44)
Bug is still alive. v5.0.9
Next step:
IOMMU and SVM disabled in BIOS
changed cmdline to amd_iommu=off

(In reply to quirin.blaeser from comment #45)
Bug is still alive. v5.0.9
Now IOMMU and SVM enabled in BIOS
cmdline not changed: amd_iommu=off

(In reply to quirin.blaeser from comment #46)
Bug is still alive. v5.0.9
More suggestions?

Same here on a B450 and an RX580 GPU. Random black screen with complete lockup of the system (panic, probably?) sometimes after booting up the system, between 5 seconds and two minutes after graphical boot. If the system survives this time, it's stable until shutdown. Quite annoying, frankly. I still need to properly test the module parameter amdgpu.dc=0, which seems to disable the firmware load of the Display Controller (if I remember correctly); but what feature of the card (if any at all) will not work if the firmware is not loaded? The monitor used is a 120 Hz one connected via DVI-D to the GPU.
Since someone mentioned that it was simply a lockup of the output, fixed by replugging the monitor, I tried connecting the monitor via HDMI after it froze and went black; however, there was no output there either, and no response from the system when I pressed the power button; hence, to me that looks like a kernel panic. Is the cause of these panics a problem in the open source driver or in the DC firmware itself? Has anyone tried to contact AMD about this issue if it's the latter?

Can you still log in remotely via ssh and get an updated dmesg? If it's a blank screen, can you try another display connector on the board? The amdgpu.dc option switches between the old simpler display code (dc=0) and the newer display code that supports more advanced features like audio and DP MST (dc=1). There is no firmware in the display engine for this asic.

Bug is still alive. v5.0.10

(In reply to Alex Deucher from comment #49)
> Can you still log in remotely via ssh and get an updated dmesg? If it's a
> blank screen, can you try another display connector on the board? The
> amdgpu.dc option switches between the old simpler display code (dc=0) and
> the newer display code that support more advanced features like audio and DP
> MST (dc=1). There is no firmware in the display engine for this asic.
As stated above, it was connected through DVI-D and I already tried HDMI, but there was no signal anywhere.
For the firmware, it was my mistake: I was reading a version number when it was enabled in the module, and I automatically assumed that it was a firmware version uploaded to the ASIC. My bad.
Regardless of the above, I rebooted 25 times yesterday and I can't reproduce the problem again, unfortunately (which is fine for me anyway :P). If I get the problem again, I now have an SSH server enabled, so I'll surely post the dmesg (even if I doubt I'll be able to connect, since when it happened I wasn't even able to trigger a graceful ACPI powerdown by pressing the power button).

(In reply to Marco from comment #51)
(Anywhere besides DP, since I don't have a monitor with DP close at hand.)

For the first time, it triggered while I was using the system after some time; and as expected the system disappeared from the network, so no dmesg is available. I'll try without DC enabled to see if this fixes the problem. If you know any additional kernel parameters to debug this issue, I'm here to try them out.
It seems that I've managed to "fix" it by loading the amdgpu module early at boot via mkinitcpio.conf; after 30/40 reboots I still haven't seen a crash. But since the problem appears so randomly, I'm not sure whether it's just luck or whether something specific that I'm not doing triggers it. I'll post this here in the hope that it will help someone else.

No crash since upgrade to v5.1.0, a month ago.

(In reply to quirin.blaeser from comment #55)
> No crash since upgrade to v5.1.0, a month ago.
Same here. Haven't seen any crashes lately.

Three years without a crash should be enough to prove: It's fixed. |
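On the "is it just luck" question raised above: for an intermittent boot hang, a run of clean reboots only says something once it is long compared to the failure rate. A rough Python sketch of that arithmetic follows. It assumes each boot fails independently with the same probability, which nobody in this thread has measured; the 10% rate in the example and the function names are purely illustrative.

```python
import math


def prob_all_clean(p_crash, boots):
    """Chance of seeing `boots` clean boots in a row if the bug is still
    present and each boot hangs independently with probability p_crash."""
    return (1.0 - p_crash) ** boots


def boots_needed(p_crash, confidence=0.95):
    """Smallest number of consecutive clean boots n such that
    (1 - p_crash) ** n <= 1 - confidence."""
    if not 0.0 < p_crash < 1.0:
        raise ValueError("p_crash must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_crash))


if __name__ == "__main__":
    # Illustrative only: assume 1 boot in 10 hangs while the bug is present.
    print(boots_needed(0.10))
    print(round(prob_all_clean(0.10, 30), 4))
```

Under that assumed rate, a few dozen clean reboots are already fairly convincing; with a much rarer failure, the same run of reboots would prove little, which is why the long crash-free periods reported later carry more weight.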
Created attachment 278827 [details]
dmesg lsmod lspci lsusb cpuinfo url
Since an AMD Radeon RX 560 was installed in an APU-based system, it sometimes shows a black screen at bootup (USB keyboard hangs too, no SysRq -> reset).
Sometimes the system boots to the GUI with a garbled console; dmesg shows a fatal error during GPU init, and reboot/shutdown hangs.
BIOS: APU auto or disabled (on not tested)
Attached:
dmesg with/without error
lsmod with/without error
lspci / lsusb
/proc/cpuinfo
url to MB and graphics card