Bug 193651 - Amdgpu error messages at boot with Amd RX460
Summary: Amdgpu error messages at boot with Amd RX460
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: x86-64 Linux
Importance: P1 low
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-01-30 13:58 UTC by fin4478
Modified: 2019-11-11 11:05 UTC
CC List: 6 users

See Also:
Kernel Version: 4.11-wip
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg logfile (55.03 KB, text/plain)
2017-01-30 13:58 UTC, fin4478
dmesg log (168.66 KB, text/x-log)
2017-01-31 16:26 UTC, Steven A. Falco
xorg log file (38.40 KB, text/x-log)
2017-01-31 16:26 UTC, Steven A. Falco
journal log file (456.84 KB, text/x-log)
2017-01-31 16:26 UTC, Steven A. Falco
/var/log/messages (307.23 KB, application/octet-stream)
2017-01-31 16:27 UTC, Steven A. Falco
New dmesg file (102.17 KB, text/x-log)
2017-02-02 22:26 UTC, Steven A. Falco
Kernel 4.11-rc3 config file for RX400 series, add drivers for your hardware (98.14 KB, text/x-mpsub)
2017-03-26 10:18 UTC, fin4478
attachment-474-0.html (991 bytes, text/html)
2017-10-24 17:18 UTC, Milo

Description fin4478 2017-01-30 13:58:17 UTC
Created attachment 253581 [details]
dmesg logfile

I have a Gigabyte RX460 2GB GPU card, Debian testing with Xfce, and the agd5f drm-next-4.11-wip kernel, downloaded and compiled today. The computer works OK, but the dmesg command shows the following boot errors that might interest amdgpu driver developers. Mounting my home partition appears to make the amdgpu IB tests fail:

[    7.001953] [drm] ib test on ring 12 succeeded
[    7.055163] EXT4-fs (sda5): mounted filesystem with ordered data mode. Opts: (null)
[    8.011874] [drm:0xffffffffa01360ce] *ERROR* amdgpu: IB test timed out.
[    8.011910] [drm:0xffffffffa00e1b4b] *ERROR* amdgpu: failed testing IB on ring 13 (-110).
[    8.011943] [drm:0xffffffffa00be574] *ERROR* ib ring test failed (-110).


Some powerplay errors:
[    4.888584] amdgpu: [powerplay] [AVFS] Something is broken. See log!
[    4.891452] amdgpu: [powerplay] Can't find requested voltage id in vdd_dep_on_sclk table!
[    4.894807] amdgpu: [powerplay] 
                failed to send message 309 ret is 254 
[    4.894824] amdgpu: [powerplay] 
                failed to send pre message 14e ret is 254 


BIOS recognition errors:
[    4.729628] [drm] BIOS signature incorrect 20 7
[    4.729635] amdgpu 0000:01:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
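
For anyone triaging similar logs: the -110 in the IB test lines is the negated Linux errno 110 (ETIMEDOUT), which agrees with the "IB test timed out" message. Below is a minimal Python sketch, not part of the driver, that extracts the *ERROR* lines from dmesg text and names the return code. The function name and the hard-coded errno table are illustrative assumptions.

```python
import re

# Linux errno numbers, hard-coded so the sketch does not depend on the host OS.
LINUX_ERRNO = {110: "ETIMEDOUT"}

# "[    8.011910] message" -> timestamp and message
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")
# A trailing "(-110)." style return code
CODE_RE = re.compile(r"\((-\d+)\)\.?$")


def parse_error_lines(log_text):
    """Return (timestamp, message, errno_name) for every *ERROR* line."""
    rows = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.rstrip())
        if not match or "*ERROR*" not in match.group(2):
            continue
        message = match.group(2)
        code = CODE_RE.search(message)
        # errno_name is None when the line carries no return code at all.
        name = LINUX_ERRNO.get(-int(code.group(1)), "unknown") if code else None
        rows.append((float(match.group(1)), message, name))
    return rows


SAMPLE = """\
[    7.001953] [drm] ib test on ring 12 succeeded
[    8.011874] [drm:0xffffffffa01360ce] *ERROR* amdgpu: IB test timed out.
[    8.011910] [drm:0xffffffffa00e1b4b] *ERROR* amdgpu: failed testing IB on ring 13 (-110).
[    8.011943] [drm:0xffffffffa00be574] *ERROR* ib ring test failed (-110).
"""

for ts, message, name in parse_error_lines(SAMPLE):
    print(ts, name, message)
```

The powerplay lines are not matched here because they do not carry the *ERROR* tag.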
Comment 1 Steven A. Falco 2017-01-31 16:26:03 UTC
Created attachment 253671 [details]
dmesg log
Comment 2 Steven A. Falco 2017-01-31 16:26:21 UTC
Created attachment 253681 [details]
xorg log file
Comment 3 Steven A. Falco 2017-01-31 16:26:43 UTC
Created attachment 253691 [details]
journal log file
Comment 4 Steven A. Falco 2017-01-31 16:27:06 UTC
Created attachment 253701 [details]
/var/log/messages
Comment 5 Steven A. Falco 2017-01-31 16:33:30 UTC
I began having problems with my AMD GPU when Fedora 25 switched from their 4.8.16-300.fc25 kernel to a 4.9.3 kernel, as described here:

https://bugzilla.redhat.com/show_bug.cgi?id=1414025

The initial symptom was that there was no kernel frame buffer, so the system dropped back to using an accelerated video interface.

With the latest Fedora kernel (4.9.6-200.fc25), the system eventually runs normally, but it takes upwards of 6 minutes for the system to boot.  As shown in the files I attached, I too get many messages of the form:

[  346.235933] 
                failed to send pre message 148 ret is 0 
[  346.455587] 
                failed to send message 148 ret is 0 

I'd like the importance of this bug raised to medium or high, as it is a clear regression from the 4.8.16 kernel to the 4.9.3 kernel.
Comment 6 Steven A. Falco 2017-01-31 16:35:47 UTC
Typo in the above comment:

s/an accelerated video/an un-accelerated video/
Comment 7 Alex Deucher 2017-02-01 22:11:02 UTC
Does using the new ucode here help?

https://people.freedesktop.org/~agd5f/radeon_ucode/polaris/
Comment 8 fin4478 2017-02-02 10:19:56 UTC
Alex, thanks for the new firmware. I still get BIOS recognition errors at boot, but otherwise everything is OK.

[    3.461112] [drm] BIOS signature incorrect 20 7
[    3.461117] amdgpu 0000:01:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff

Steven, you are using a Tonga GPU with the radeon kernel driver, which fails at boot, so the system falls back to the VESA driver. The amdgpu driver supports Tonga, but you need to build a custom 4.11-wip kernel; stock distribution kernels do not have stable amdgpu code. To create a custom kernel in Debian, use the command:
git clone -b drm-next-4.11-wip git://people.freedesktop.org/~agd5f/linux

The kernel configuration files of the official Debian kernels are available in /boot, named after the kernel release. Copy one into the linux directory as .config. Connect all your devices and run the command: make localmodconfig. You can also use the command make defconfig to create an initial .config file.

Run the command make xconfig and check that you have enabled Reroute Broken IRQ, Virtualization KVM and the 300Hz CPU timer. I also disabled Swap, Kernel Debug, CPU Freq scaling and CPU handling in ACPI, and use the BIOS to control the CPU and devices instead. Under drivers->graphics->amdgpu, enable CIK support for a GCN 1.1 GPU and SI support for a GCN 1.0 GPU.

Create the Debian kernel package:
export CONCURRENCY_LEVEL=4
fakeroot make-kpkg --initrd kernel_image

Install the kernel package with Gdebi. To make the custom kernel boot, add this line to /etc/initramfs-tools/modules:
unix
Then run: sudo update-initramfs -u
Reboot.
Comment 9 fin4478 2017-02-02 10:32:34 UTC
After updating the firmware I still have powerplay errors:
[    3.574222] amdgpu: [powerplay] [AVFS] Something is broken. See log!
[    3.577052] amdgpu: [powerplay] Can't find requested voltage id in vdd_dep_on_sclk table!
Comment 10 Steven A. Falco 2017-02-02 15:01:02 UTC
Thanks for the information on building a new kernel.  I'll give that a try.  I'm running Fedora 25, but I think I can follow your Debian instructions.
Comment 11 Alex Deucher 2017-02-02 15:49:30 UTC
(In reply to fin4478 from comment #8)
> Alex, thanks for the new firmware. Still Bios recognition errors at boot,
> but otherwise ok.
> 
> [    3.461112] [drm] BIOS signature incorrect 20 7
> [    3.461117] amdgpu 0000:01:00.0: Invalid PCI ROM header signature:
> expecting 0xaa55, got 0xffff
> 

This is harmless. The driver tries several methods to fetch the vbios image.  The driver would not load at all if it failed to fetch the vbios image.
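
To illustrate the fallback described above: a PCI expansion ROM image starts with the bytes 0x55 0xAA, which read as the little-endian 16-bit value 0xaa55 named in the log message. The Python sketch below is a simplified model, not the driver's actual code; the method names ("rom_bar", "acpi") and the `fetch_vbios` helper are hypothetical.

```python
import struct

PCI_ROM_SIGNATURE = 0xAA55  # the value the log message says it expects


def has_valid_rom_signature(image: bytes) -> bool:
    """Check the first two bytes of a ROM image against 0xaa55."""
    if len(image) < 2:
        return False
    (signature,) = struct.unpack_from("<H", image, 0)  # little-endian u16
    return signature == PCI_ROM_SIGNATURE


def fetch_vbios(methods):
    """Try each (name, fetch) pair in order and return the first valid image.

    A failed attempt only logs a message; the caller gives up (returns None)
    only when every method has failed.
    """
    for name, fetch in methods:
        image = fetch()
        if image is not None and has_valid_rom_signature(image):
            return name, image
        print(f"{name}: invalid or missing ROM image, trying next method")
    return None
```

In this model, a log line about a bad signature from one method says nothing about whether a later method succeeded, which is why such a message can appear on a system where the driver loads normally.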
Comment 12 Steven A. Falco 2017-02-02 22:26:01 UTC
Created attachment 253891 [details]
New dmesg file
Comment 13 Steven A. Falco 2017-02-02 22:28:30 UTC
I successfully built a custom kernel.  It appears to be working well.  Thanks for the help!

I included a new dmesg.log file because I still see messages like:

[    9.719278] amdgpu: [powerplay] 
                failed to send pre message 15b ret is 0 
[   10.158327] amdgpu: [powerplay] 
                failed to send message 15b ret is 0

Are these harmless or do they indicate a problem?
Comment 14 Steven A. Falco 2017-02-02 22:31:27 UTC
One other error message I just noticed:

[    5.538117] amdgpu: [powerplay] Can't find requested voltage id in vdd_dep_on_sclk table!
Comment 15 Milo 2017-03-25 20:58:13 UTC
(In reply to fin4478 from comment #8)
> Alex, thanks for the new firmware. Still Bios recognition errors at boot,
> but otherwise ok.
> 
> [    3.461112] [drm] BIOS signature incorrect 20 7
> [    3.461117] amdgpu 0000:01:00.0: Invalid PCI ROM header signature:
> expecting 0xaa55, got 0xffff
> 
> Steven, you are using Tonga gpu and radeon kernel driver that fails at  boot
> and the system using VESA driver. The amdgpu driver has support for Tonga,
> but you need to make a custom 4.11-wip kernel. Stock distribution kernels do
> not have stable amdgpu code. Creating a custom kernel in Debian:
> Use the command: 
> git clone -b drm-next-4.11-wip git://people.freedesktop.org/~agd5f/linux
> 
> The kernel configuration file of Debian Official kernel are available in
> /boot, named after the kernel release. Copy the .config file to the linux
> directory. Connect all your devices and run the command: make
> localmodconfig. You can use the command make defconfig too for creating
> initial .config file. 
> 
> Use the command: make xconfig and check that you have enabled: Reroute
> Broken IRQ, Virtualization KVM and 300Hz CPU timer, I also disabled Swap,
> Kernel Debug, CPU Freq scaling , Cpu handling in Acpi, Used Bios to control
> CPU and devices. In the drivers->graphics->amdgpu enable cik support for a
> gcn 1.1 gpu and si support for a gcn 1.0 gpu.
> 
> Create debian kernel package:
> export CONCURRENCY_LEVEL=4
> fakeroot make-kpkg --initrd kernel_image
> 
> Install the kernel package with Gdebi. To make a custom kernel to boot, add
> a line to /etc/initramfs-tools/modules:
> unix
> And run: sudo update-initramfs
> Reboot.

I tried the above but noticed that it fails when trying to build the headers.

If I am trying to build a 4.10.5 kernel, could I copy files from what was downloaded when I ran the git command?
Comment 16 fin4478 2017-03-26 03:22:29 UTC
(In reply to Milo from comment #15)

> 
> I tried the above but notice that it fails when trying to build headers. 

You do not need kernel headers unless you are using some DKMS drivers. There is currently a temperature bug in the wip kernel; the 4.11-rc3 kernel from kernel.org works.

For some kernel versions the headers build fails because the REPORTING-BUGS file is missing. Create that file in the linux directory.
Comment 17 Milo 2017-03-26 07:50:19 UTC
(In reply to fin4478 from comment #16)
> (In reply to Milo from comment #15)
> 
> > 
> > I tried the above but notice that it fails when trying to build headers. 
> 
> You do not need kernel headers unless you are using some dkms drivers.
> Currently there is a temperature bug in wip kernel. 4.11-rc3 kernel from
> kernel.org works. 
> 
> Some kernel version headers build failed because missing BUG-REPORTS file.
> Create the file into the linux directory.

Thanks.

Yes, I do have some DKMS packages installed that need headers. As you suggested, creating REPORTING-BUGS solved the headers build failure.

So it built as version 4.10.0-rc5-gec3fa8e6ca19. When I booted from it, the screen was still blank after 20 minutes, so I rebooted.

I had built 4.9.10 from kernel.org before, which eventually booted into X after 10 minutes. I used that .config along with your comments in comment #8 to build 4.10.0-rc5-gec3fa8e6ca19.
Comment 18 fin4478 2017-03-26 10:16:22 UTC
(In reply to Milo from comment #17)

> So it  built as version 4.10.0-rc5-gec3fa8e6ca19. When I booted from it,
> after 20 minutes the screen was still blank so I rebooted.

All your software must be in sync, so use Debian testing with Xfce, the yakkety version of the Oibaf PPA, and the latest kernels. I will post my kernel 4.11-rc config.
Comment 19 fin4478 2017-03-26 10:18:14 UTC
Created attachment 255553 [details]
Kernel 4.11-rc3 config file for RX400 series, add drivers for your hardware
Comment 20 Christian Lanig 2017-04-03 22:51:42 UTC
This bug report is partly a duplicate of that one:
https://bugs.freedesktop.org/show_bug.cgi?id=100443

I'm getting the same AVFS/Powerplay messages; updating the firmware didn't help.

The topic headline is very unspecific and the replies appear very confusing to me.
Has this issue been solved or not?
Does a custom Kernel change the messages? - I'm using the newest Ubuntu mainline Kernel. Is something wrong with the configuration used to build this Kernel by the Kernel team?
How is this issue related to Tonga GPUs? Polaris is the first dGPU with AVFS.
How many issues do we count here and which replies belong to which issue?

Perhaps someone could make a summary or something - that would be very pleasant.
Comment 21 Steven A. Falco 2017-04-04 12:15:26 UTC
As I previously reported, building a custom kernel as suggested in comment 8 allows me to use my video card.

I do continue to get the powerplay error messages, but aside from slowing down boot a little, they don't seem to do any harm.
Comment 22 Christian Lanig 2017-04-04 13:36:49 UTC
Thanks for the clarification. That means building the kernel did not fix the messages; it only got the driver to start with Tonga.

So the AVFS issue and the missing or wrong value in the voltage dependency table still exist, as do your remaining Powerplay messages.
Comment 23 Milo 2017-04-05 18:59:23 UTC
(In reply to Christian Lanig from comment #22)
> Thanks for clarification. That means building the Kernel was not a possible
> fix for the messages but for getting the driver to start with Tonga.
> 
> So the AVFS- issue and missing/wrong value in the voltage dependency table
> is still existent. As well as your remaining Powerplay messages.
I didn't have the AVFS issue, but I did see the other two on my R9 M390, and booting took as long as 10 minutes when I moved from kernel 4.8 to 4.9/4.10.

Having built 4.11-rc4, my boot times are back down to around 30 seconds, but the messages still persist in dmesg, and it seems I have gained some IB-related messages.

So I was just confirming that, from my perspective, one issue was solved, though not the messages that led me here.
Comment 24 Steven A. Falco 2017-06-01 12:34:52 UTC
With kernel 4.11.3-200.fc25.x86_64, I no longer need a custom kernel to use my video card.  I still see messages like:

[   13.599542] [drm] ib test on ring 13 succeeded
[   13.606627] [drm:amdgpu_device_init [amdgpu]] *ERROR* ib ring test failed (-110).
[   14.500572] amdgpu: [powerplay] 
                failed to send pre message 260 ret is 0 
[   14.983369] amdgpu: [powerplay] 
                failed to send message 260 ret is 0 
[   15.120567] amdgpu: [powerplay] 
                failed to send pre message 155 ret is 0 
[   15.609965] amdgpu: [powerplay] 
                failed to send message 155 ret is 0 
[   16.014478] amdgpu: [powerplay] 
                failed to send pre message 260 ret is 0 
[   16.165919] amdgpu: [powerplay] 
                failed to send pre message 15b ret is 0 
[   16.570511] amdgpu: [powerplay] 
                failed to send message 260 ret is 0 
[   16.721456] amdgpu: [powerplay] 
                failed to send message 15b ret is 0 
[   17.498715] amdgpu: [powerplay] 
                failed to send pre message 260 ret is 0 
[   17.951062] amdgpu: [powerplay] 
                failed to send message 260 ret is 0 
[   18.843123] amdgpu: [powerplay] 
                failed to send pre message 260 ret is 0 
[   19.295427] amdgpu: [powerplay] 
                failed to send message 260 ret is 0 
[   20.187154] amdgpu: [powerplay] 
                failed to send pre message 260 ret is 0 
[   20.639852] amdgpu: [powerplay] 
                failed to send message 260 ret is 0 
[   21.531857] amdgpu: [powerplay] 
                failed to send pre message 260 ret is 0 
[   21.984781] amdgpu: [powerplay] 
                failed to send message 260 ret is 0 
[   21.998448] [drm] Initialized amdgpu 3.10.0 20150101 for 0000:05:00.0 on minor 0

but at least the card is otherwise functional.
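
To put a number on how much boot time these messages cover, the log above can be summarized per message id. This is a minimal Python sketch assuming the two-line dmesg layout shown in this comment (a timestamped "amdgpu: [powerplay]" line followed by an indented "failed to send ..." line); the function name is illustrative.

```python
import re
from collections import Counter

# One entry spans two lines: the timestamped prefix, then the indented failure text.
ENTRY_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+amdgpu: \[powerplay\]\s*\n"
    r"\s*failed to send (pre )?message ([0-9a-f]+) ret is (\d+)"
)


def summarize_powerplay_failures(log_text):
    """Return (count per message id, seconds between first and last failure)."""
    counts = Counter()
    timestamps = []
    for ts, _pre, message_id, _ret in ENTRY_RE.findall(log_text):
        counts[message_id] += 1
        timestamps.append(float(ts))
    span = max(timestamps) - min(timestamps) if timestamps else 0.0
    return counts, span
```

Calling `summarize_powerplay_failures(open("dmesg.log").read())` on a saved log returns the counts and the time span; "dmesg.log" is a placeholder file name.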
Comment 25 Milo 2017-09-15 21:48:18 UTC
I am trying to install a 4.13 kernel, and boot times are back to more than 5 minutes (the "failed to send message" lines appear for more than 5 minutes when I run journalctl -e).

journalctl log - https://pastebin.com/7dScPxNn
Comment 26 fin4478 2017-10-23 18:11:05 UTC
Milo, try adding amdgpu.audio=0 to the kernel command line.
Comment 27 Milo 2017-10-24 17:18:02 UTC
Created attachment 260365 [details]
attachment-474-0.html

I have added it to /etc/default/grub as follows:
GRUB_CMDLINE_LINUX_DEFAULT="nointremap quiet amdgpu.audio=0"
and then ran update-grub.

However, it still takes more than 5 minutes to boot.
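
One way to confirm that the option actually reached the running kernel after update-grub and a reboot is to inspect /proc/cmdline. The Python sketch below splits a command line into parameters; the helper names are illustrative, and it does not handle quoted values that contain spaces.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {parameter: value or None}."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        # Flags such as "quiet" have no "=" and map to None.
        params[key] = value if sep else None
    return params


def read_running_cmdline(path: str = "/proc/cmdline") -> dict:
    """Parse the command line of the currently running kernel (Linux only)."""
    with open(path) as handle:
        return parse_cmdline(handle.read())


# Example with the options from the GRUB line above:
example = parse_cmdline("nointremap quiet amdgpu.audio=0")
print(example.get("amdgpu.audio"))
```

If `read_running_cmdline().get("amdgpu.audio")` returns None on the booted system, the GRUB change did not take effect.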

Comment 28 fin4478 2017-11-13 07:51:58 UTC
(In reply to Alex Deucher from comment #11)
> (In reply to fin4478 from comment #8)
> > Alex, thanks for the new firmware. Still Bios recognition errors at boot,
> > but otherwise ok.
> > 
> > [    3.461112] [drm] BIOS signature incorrect 20 7
> > [    3.461117] amdgpu 0000:01:00.0: Invalid PCI ROM header signature:
> > expecting 0xaa55, got 0xffff
> > 
> 
> This is harmless. The driver tries several methods to fetch the vbios image.
> The driver would not load at all if it failed to fetch the vbios image.

All messages that use the dev_err function slow down booting and make it look ugly. AMD should arrange this fix in the PCI driver:
Change the "Invalid PCI ROM header signature" message in drivers/pci/rom.c to use the dev_info function.
