Bug 212691

Summary: [Regression] amdgpu driver broken on AMD HD7770 GHz edition.
Product: Other Reporter: deference
Component: Other Assignee: other_other
Status: RESOLVED ANSWERED    
Severity: normal CC: alexdeucher, deference
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: Subsystem:
Regression: Yes Bisected commit-id:
Attachments: My .config.

Description deference 2021-04-16 02:15:33 UTC
Created attachment 296399
My .config.

Hello,
I have tested and found this bug to occur from the bisected commit
below through Linux kernel version 5.11.12.
I am running Devuan (Debian) Linux with a hand created kernel config. I'm
attaching it.

I've bisected the kernel and found the broken commit. Here's how I got
there in case you're curious: for i in "start v4.19-rc1 v4.18" bad good
good skip skip good good skip good skip good skip good good bad good good
bad good bad good bad bad good; do git bisect $i; done

The broken commit is this one:
commit 8eaf2b1faaf4358c6337785f2192055c6ef41e0d
Author: Alex Deucher <alexander.deucher@amd.com>
Date:   Mon Jul 2 14:35:36 2018 -0500

    drm/amdgpu: switch firmware path for SI parts
    
    Use separate firmware path for amdgpu to avoid conflicts
    with radeon on SI parts.
    
    Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

 drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c | 56 +++++++++++++++++------------------
 drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c | 14 ++++-----
 drivers/gpu/drm/amd/amdgpu/si_dpm.c   | 22 +++++++-------
 3 files changed, 46 insertions(+), 46 deletions(-)

The last few messages printed before the whole system (including the
kernel's network and USB subsystems) freezes up or powers down are:
#15
#16
#17
#18
#19
#20
#21
#22
#23

I think that's telling me that I have 24 threads in my 3900X.

During boot up, I don't even reach init. The whole system freezes. I
don't see any boot up messages that suggest a problem with anything. I
see only a very few messages at all. I did enable early printk.

Steps to reproduce:
1: Build kernel with my config on a bad commit.
2: Boot new kernel (with an AMD HD7770 GPU installed, of course).
3: profit.

If you even think of suggesting I upgrade to a newer GPU, recall also
that there are none in stock -- and any GPU that is available is priced
unbelievably high. I'd love a newer GPU. But I'm going to have to wait
like I have been for years for RDNA's big Navi (I was hoping it was
compute/gaming oriented like a Titan. It looks like it's not.
Grrrr, more waiting...)

Thanks,
David
Comment 1 Alex Deucher 2021-04-16 05:34:16 UTC
I suspect the kernel is stalled looking for firmware that it can't find.  Do you have the firmware in the new location in your initrd or filesystem in the appropriate place?  I.e., the files moved from radeon/ to amdgpu/.  Check if the firmware for your card exists in /lib/firmware/amdgpu/ or wherever your distro puts the firmware and make sure your initrd is up to date.
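
A minimal sketch of that check, not a definitive tool. The helper name `missing_firmware` is hypothetical, the directory is an assumption (your distro may use another path), and the file list is the Cape Verde set named elsewhere in this report; adjust it for your card.

```shell
#!/bin/sh
# Hypothetical helper: print each expected firmware file that is missing
# from the given directory. Returns non-zero if anything is missing.
missing_firmware() {
    dir=$1
    shift
    rc=0
    for name in "$@"; do
        if [ ! -f "$dir/$name" ]; then
            echo "$name"
            rc=1
        fi
    done
    return $rc
}

# Example: check the Cape Verde files under the amdgpu firmware directory.
# /lib/firmware/amdgpu is an assumption; distros may place firmware elsewhere.
missing_firmware /lib/firmware/amdgpu \
    verde_ce.bin verde_mc.bin verde_me.bin verde_pfp.bin \
    verde_rlc.bin verde_smc.bin || echo "some amdgpu firmware is missing"
```

This only inspects the filesystem. It says nothing about what is inside the initrd or built into the kernel image, so those still need to be checked separately.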
Comment 2 deference 2021-04-16 23:17:07 UTC
Because this is a custom kernel, I decided to build the FW into the binary. A quick grep of my config would have told you that.

'CONFIG_EXTRA_FIRMWARE="radeon/verde_ce.bin radeon/verde_mc.bin radeon/verde_me.bin radeon/verde_pfp.bin radeon/verde_rlc.bin radeon/verde_smc.bin radeon/TAHITI_uvd.bin"'

For some reason, when my system booted on the 4.14 series (I think) that I was using, it wanted to load the TAHITI FW also. My card is a Cape Verde. So I also included one of its members.
Comment 3 deference 2021-04-16 23:19:48 UTC
Just in case the path changed, I ran a quick ls for you:

% ls /lib/firmware/{radeon/verde_ce.bin,radeon/verde_mc.bin,radeon/verde_me.bin,radeon/verde_pfp.bin,radeon/verde_rlc.bin,radeon/verde_smc.bin,radeon/TAHITI_uvd.bin}
/lib/firmware/radeon/TAHITI_uvd.bin  /lib/firmware/radeon/verde_me.bin   /lib/firmware/radeon/verde_smc.bin
/lib/firmware/radeon/verde_ce.bin    /lib/firmware/radeon/verde_pfp.bin
/lib/firmware/radeon/verde_mc.bin    /lib/firmware/radeon/verde_rlc.bin

It's all there.
Comment 4 deference 2021-04-18 01:33:07 UTC
Having thought about it, it's possible that more than 1 TAHITI*.bin file is desired by the amdgpu driver. It's even possible that all of the radeon firmware is desired by the amdgpu driver.
What do you think?
Comment 5 deference 2021-04-25 19:07:44 UTC
Just out of curiosity I decided to look at the firmware and see if there were any differences in the count, the naming, or binary data.

% ls /lib/firmware/radeon/ | grep -i verde
VERDE_ce.bin
VERDE_mc.bin
VERDE_mc2.bin
VERDE_me.bin
VERDE_pfp.bin
VERDE_rlc.bin
VERDE_smc.bin
verde_ce.bin
verde_k_smc.bin
verde_mc.bin
verde_me.bin
verde_pfp.bin
verde_rlc.bin
verde_smc.bin

% ls /lib/firmware/amdgpu/ | grep -i verde
verde_ce.bin
verde_k_smc.bin
verde_mc.bin
verde_me.bin
verde_pfp.bin
verde_rlc.bin
verde_smc.bin

% diff /lib/firmware/{radeon,amdgpu}/verde_ce.bin
Binary files /lib/firmware/radeon/verde_ce.bin and /lib/firmware/amdgpu/verde_ce.bin differ
% diff /lib/firmware/{radeon,amdgpu}/verde_k_smc.bin
% diff /lib/firmware/{radeon,amdgpu}/verde_mc.bin
% diff /lib/firmware/{radeon,amdgpu}/verde_me.bin
Binary files /lib/firmware/radeon/verde_me.bin and /lib/firmware/amdgpu/verde_me.bin differ
% diff /lib/firmware/{radeon,amdgpu}/verde_pfp.bin
Binary files /lib/firmware/radeon/verde_pfp.bin and /lib/firmware/amdgpu/verde_pfp.bin differ
% diff /lib/firmware/{radeon,amdgpu}/verde_rlc.bin
Binary files /lib/firmware/radeon/verde_rlc.bin and /lib/firmware/amdgpu/verde_rlc.bin differ

I'll rebuild my kernel and test with the different firmware and see if that changes anything.
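
The file-by-file comparison above can also be done in one loop. This is a rough sketch: the helper name `compare_firmware` is hypothetical, and the two directories are the ones on my system, so they are assumptions anywhere else.

```shell
#!/bin/sh
# Hypothetical helper: compare same-named files in two firmware directories
# and report whether each pair is identical, differs, or is missing on a side.
compare_firmware() {
    dir_a=$1
    dir_b=$2
    shift 2
    for name in "$@"; do
        if [ ! -f "$dir_a/$name" ] || [ ! -f "$dir_b/$name" ]; then
            echo "$name: missing"
        elif cmp -s "$dir_a/$name" "$dir_b/$name"; then
            echo "$name: identical"
        else
            echo "$name: differs"
        fi
    done
}

# Example: compare the Cape Verde files present in both directories.
compare_firmware /lib/firmware/radeon /lib/firmware/amdgpu \
    verde_ce.bin verde_k_smc.bin verde_mc.bin verde_me.bin \
    verde_pfp.bin verde_rlc.bin verde_smc.bin
```

`cmp -s` is used instead of `diff` because only the yes/no answer matters for binary files.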
Comment 6 deference 2021-04-29 15:39:36 UTC
Ok, I've rebuilt the 4.18 kernel above and it works with the firmware from the amdgpu directory instead of the firmware from the radeon directory.

However, the 5.11 series kernel still exhibits the above bug. I'm going to rebuild it with the latest firmware instead of the firmware from 2019 which Devuan (Debian) Linux offers in their package manager.

It should be noted that including all the FW from the AMDGPU and Radeon dirs causes the kernel to boot noticeably slower. It takes about 5-10s to load the correct firmware and continue onto init. Therefore, I hope to come up with a more minimalistic FW config in the future.
Comment 7 deference 2021-05-06 20:47:11 UTC
Thanks for the help! I would never have guessed that the FW differed between dirs. I also wouldn't normally guess that it would change for a card as old as mine.

Solution:
Use the FW from the amdgpu dir as opposed to the radeon dir.
Build kernel with the latest FW.
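
For anyone else building the FW into the kernel, here is a small sketch for generating the config line from the amdgpu dir. The helper name `extra_firmware_line` is hypothetical and /lib/firmware is an assumption about where the firmware lives. The paths it prints are relative to the firmware root, which is how CONFIG_EXTRA_FIRMWARE entries are resolved against CONFIG_EXTRA_FIRMWARE_DIR. It only covers files matching the given prefix, so anything else your card needs has to be added by hand.

```shell
#!/bin/sh
# Hypothetical helper: print a CONFIG_EXTRA_FIRMWARE line listing every
# <prefix>*.bin file found in <fw_root>/amdgpu, as paths relative to <fw_root>.
extra_firmware_line() {
    fw_root=$1
    prefix=$2
    list=""
    for path in "$fw_root"/amdgpu/"$prefix"*.bin; do
        [ -f "$path" ] || continue   # skip the literal pattern if nothing matched
        list="${list:+$list }amdgpu/${path##*/}"
    done
    echo "CONFIG_EXTRA_FIRMWARE=\"$list\""
}

# Example: Cape Verde files; /lib/firmware is an assumed firmware root.
extra_firmware_line /lib/firmware verde_
```

Paste the printed line into the kernel .config in place of the old radeon/ entries, then rebuild.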