Bug 204689

Summary: AMD RAVEN RIDGE STABILITY REGRESSION
Product: Drivers
Reporter: wolfgang.gruenstern
Component: Video(Other)
Assignee: drivers_video-other
Status: NEW
Severity: blocking
CC: alexdeucher, bjo, Changfeng.Zhu, haxk612, huangrui, postix, rmalinverni, subsentient, wolfgang.gruenstern
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: Since version 5.2 to 5.3-rc5
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: BUILD instruction, dmesg, amdgpu_firmware_info, raven_rlc.bin

Description wolfgang.gruenstern 2019-08-25 08:55:49 UTC
Created attachment 284591 [details]
BUILD instruction

First of all, between Linux 4.19 and Linux 5.1.21 inclusive, the Raven Ridge APU (in my case a Ryzen 3 2200G) works rock stable. Before Linux 4.19 it was not usable at all because of frequent system freezes. Unfortunately, it is obvious that since Linux 5.2 there has been a huge regression, so I can't use the APU with newer kernels. The freezes are back again in certain circumstances. It is easy to reproduce a frozen system:

In my case it is enough to start the FlightGear flight simulator. The computer freezes on the FlightGear loading screen. After the crash I have to use the reset button. To help reproduce this, I attached build instructions for FlightGear.
I would really appreciate it if the issue could be solved. I am happy to answer any additional questions.
Comment 1 Alex Deucher 2019-08-26 15:05:17 UTC
Can you bisect?  Did you also update mesa?
Comment 2 wolfgang.gruenstern 2019-08-26 16:39:52 UTC
In Debian I use the following packages:
xserver-xorg-video-amdgpu - 19.0.1-1(testing)
firmware-amd-graphics - 20190717-1(testing)
currently installed mesa - 19.2.0~rc1-1(experimental)

I have also already tried mesa 19.1.4-1(testing), with no difference.
The issue seems to depend only on linux-image.

Debian linux-image-4.19.0-5-amd64 is stable.
Debian linux-image-5.2.0-2-amd64 is unstable.
linux-image-5.1.21, compiled by myself with the Debian .config settings of 5.2.0-2, is stable.
linux-image-5.3.0-rc5, compiled by myself with the Debian .config settings of 5.2.0-2, is unstable.

I have never bisected before, but I could try it.

I need the right address of the kernel git repository so that I do not clone the wrong Linux tree.

Best Regards,
Wolfgang
Comment 3 Alex Deucher 2019-08-26 17:40:41 UTC
Here's the tree:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
Comment 4 wolfgang.gruenstern 2019-08-27 15:09:19 UTC
Thank you. I think the issue is found. The bisect log follows:

005440066f929ba0dca8f4e0aebfbf8daac592cc is the first bad commit
commit 005440066f929ba0dca8f4e0aebfbf8daac592cc
Author: Huang Rui <ray.huang@amd.com>
Date:   Wed Mar 13 20:21:00 2019 +0800

    drm/amdgpu: enable gfxoff again on raven series (v2)
    
    This patch enables gfxoff and stutter mode again, since we take more testing on
    raven series. For raven2 and picasso, we can enable it directly. And for raven,
    we need check the RLC/SMC ucode version cannot be less than #531/0x1e45.
    
    v2: add smc version checking for raven.
    
    Signed-off-by: Huang Rui <ray.huang@amd.com>
    Reviewed-by: Alex Deucher <alexander.deucher@amd.com> (v1)
    Tested-by: Likun Gao <Likun.Gao@amd.com> (v2)
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

:040000 040000 76a7156f7ff7f32be629f1dffe761499360e49f7 f903deb8648b1a3dbe98fe15a78661bc6646cadd M	drivers

I am happy to answer any additional questions if needed.

Best Regards
Wolfgang
Comment 6 wolfgang.gruenstern 2019-08-27 15:37:52 UTC
Dear Mr. Deucher,
if problems continue to occur, I will inform your whole team as fast as possible.

Best Regards
Wolfgang
Comment 7 wolfgang.gruenstern 2019-08-27 16:49:59 UTC
Dear Mr. Deucher,
I just compiled Linux 5.3-rc6, which already contains this patch, but the patch is not working. I also tested newer firmware files from the amdgpu tree at https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git

The patch works neither with the newer firmware nor with the older firmware from Debian.
The computer keeps freezing. There seems to be more to investigate.

Probably it would be better to not enable gfxoff for raven 1st generation for now.

Best Regards
Wolfgang
Comment 8 Alex Deucher 2019-08-27 17:59:20 UTC
Please attach your dmesg output from the 5.3 kernel and the content of /sys/kernel/debug/dri/0/amdgpu_firmware_info
Comment 9 Alex Deucher 2019-08-27 18:07:13 UTC
In the meantime, you can disable gfxoff at runtime by appending amdgpu.ppfeaturemask=0xffff3fff
to the kernel command line in grub.
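
To see which feature bits this mask turns off, here is a minimal Python sketch. It assumes the bit values of the amdgpu PP_FEATURE_MASK enum (PP_OVERDRIVE_MASK = 0x4000, PP_GFXOFF_MASK = 0x8000); these constants are not stated in this report and should be checked against amd_shared.h in the kernel tree you are running. The helper names are hypothetical.

```python
# Assumed bit values from the amdgpu PP_FEATURE_MASK enum
# (drivers/gpu/drm/amd/include/amd_shared.h); verify for your kernel.
PP_OVERDRIVE_MASK = 0x4000  # bit 14
PP_GFXOFF_MASK = 0x8000     # bit 15


def cleared_bits(mask, width=32):
    """Return the positions of all zero bits in a feature mask."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


def gfxoff_enabled(mask):
    """True if the (assumed) GFXOFF bit is set in the given mask."""
    return bool(mask & PP_GFXOFF_MASK)


workaround = 0xFFFF3FFF
print(cleared_bits(workaround))    # bit positions switched off by the mask
print(gfxoff_enabled(workaround))  # state of the GFXOFF bit in the mask
```

Under these assumptions the workaround mask clears exactly two bits, one of which is the GFXOFF bit, and leaves every other feature bit set.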
Comment 10 Huang Rui 2019-08-27 18:09:38 UTC
Hi Wolfgang,

May I know your firmware version? As Alex mentioned, could you please dump the firmware info for us with "cat /sys/kernel/debug/dri/0/amdgpu_firmware_info"? The GFXOFF feature relies on the RLC/SMU FW, so we need to check the versions on your platform.

Thanks,
Ray
Comment 11 wolfgang.gruenstern 2019-08-27 18:41:42 UTC
Created attachment 284639 [details]
dmesg

dmesg
Comment 12 wolfgang.gruenstern 2019-08-27 18:42:43 UTC
Created attachment 284641 [details]
amdgpu_firmware_info

amdgpu_firmware_info
Comment 13 wolfgang.gruenstern 2019-08-27 18:45:59 UTC
Dear Mr. Deucher, Dear Mr. Rui,
the dmesg output and amdgpu_firmware_info are attached above.
If more information is needed I will respond as soon as possible.
Thank you for the fast support.
Best Regards
Wolfgang
Comment 14 Huang Rui 2019-08-27 23:51:57 UTC
Hi Wolfgang,

SMC feature version: 0, firmware version: 0x00001e45

RLC feature version: 1, firmware version: 0x00000213
RLC SRLC feature version: 1, firmware version: 0x00000001
RLC SRLG feature version: 1, firmware version: 0x00000001
RLC SRLS feature version: 1, firmware version: 0x00000001

VBIOS version: 113-RAVEN-107 

Above are my related firmware versions. It looks like your SMC version is later than mine. The SMC FW is loaded by the SBIOS. Could you find a previous SBIOS whose SMC version is 0x1e45 and give it a try?

Thanks,
Ray
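
The version comparison discussed here can be sketched in Python. The sketch assumes the line format of amdgpu_firmware_info exactly as quoted in this comment (other kernels may print additional fields) and takes the minimums #531 and 0x1e45 from the bisected commit message. How the driver itself performs the check is not shown in this report, so the comparison below is an assumption and the function names are hypothetical.

```python
import re

# Minimum versions quoted in the bisected commit message. Assumption:
# a plain ">=" comparison on the raw version numbers.
MIN_RLC_VERSION = 531      # "#531", i.e. 0x213
MIN_SMC_VERSION = 0x1E45

# Assumed line format: "<name> feature version: N, firmware version: 0x..."
LINE_RE = re.compile(
    r"^(?P<name>.+?) feature version: (?P<feature>\d+), "
    r"firmware version: (?P<fw>0x[0-9a-fA-F]+)"
)


def parse_firmware_info(text):
    """Map each firmware block name (e.g. 'RLC') to its firmware version."""
    versions = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            versions[match.group("name")] = int(match.group("fw"), 16)
    return versions


def meets_gfxoff_minimum(versions):
    """Check the RLC/SMC minimums named in the commit message."""
    return (versions.get("RLC", 0) >= MIN_RLC_VERSION
            and versions.get("SMC", 0) >= MIN_SMC_VERSION)


SAMPLE = """\
SMC feature version: 0, firmware version: 0x00001e45
RLC feature version: 1, firmware version: 0x00000213
RLC SRLC feature version: 1, firmware version: 0x00000001
"""

versions = parse_firmware_info(SAMPLE)
print(versions)
print(meets_gfxoff_minimum(versions))
```

To check a real system, read /sys/kernel/debug/dri/0/amdgpu_firmware_info (usually as root) and pass its content to parse_firmware_info instead of SAMPLE.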
Comment 15 wolfgang.gruenstern 2019-08-28 12:53:23 UTC
Dear Mr. Huang,
before I try the wrong thing:
do you mean I should downgrade the motherboard BIOS until the number 0x00001e45 is displayed in amdgpu_firmware_info?
Comment 16 Roberto Malinverni 2019-11-13 11:13:46 UTC
(In reply to wolfgang.gruenstern from comment #0)

> First of all between Linux 4.19 and Linux 5.1.21 including the RAVEN RIDGE
> APU in my case RYZEN 3 2200g is working rock stable.
> ...
> since Linux 5.2 and afterwards there is a huge regression


Same here with 2400G and B450 motherboard. I'm using Arch Linux, the latest kernel is 5.3.10.

I have totally random freezes with absolutely no logs whatsoever. I can only use the reset switch (SSH access doesn't work).

By "random" I mean that the freeze occurs when:
- I have just booted the PC, or after a couple of hours of normal work
- I'm streaming a video
- I'm listening to a web radio
- I'm browsing the web
- I'm typing in a text editor
- I'm doing nothing and the screensaver turns on

No logs means that after a reboot there is no trace of the problem in the journal, dmesg or Xorg.log and I can't see anything if I watch "journalctl -f" and "dmesg -w" from another machine.

I tried booting with the rcu_nocbs=0-7 and idle=nomwait kernel parameters.
In UEFI, "Power Supply Idle Control" is set to "Typical Current Idle" and C6 states (both core and package) are disabled.
The freeze still occurs.
Using amdgpu.ppfeaturemask=0xffff3fff boots into a black screen.

The LTS kernel works just fine.

I tried to bisect between 5.1 and 5.2, but the first kernel built doesn't boot properly (with an error totally - I think - unrelated to this issue) so I'm not sure how to proceed from here.
Comment 17 wolfgang.gruenstern 2019-11-15 14:09:07 UTC
@Roberto Malinverni

In my case, there were two ways to achieve stability.

1. rcu_nocbs is not needed at all for the Raven APU, but amdgpu.ppfeaturemask=0xffff3fff helped on my PC.

2. You could also try installing an older firmware-amd-graphics package.
The older firmware files do not support GFXOFF, so you would possibly achieve stability.

Try to use/overwrite your files with the following firmware files.

https://packages.debian.org/buster/firmware-amd-graphics


Linux 5.1.21 and Linux 4.19 have a different codebase that does not use GFXOFF for Raven.
That's why they are stable.

In my opinion it is absurd to have to downgrade the BIOS version of the mainboard and not be able to use the most up-to-date version.
Comment 18 Roberto Malinverni 2019-11-18 12:52:35 UTC
(In reply to wolfgang.gruenstern from comment #17)

Thanks for your input

> amdgpu.ppfeaturemask=0xffff3fff

already tried: if I set this, I get a black screen :-/
 
> 2. Also you could try install older firmware-amd-graphics package.
> The older firmware file do not support GFXOFF so you would possibly achieve
> stability.

I tried downgrading the linux-firmware and its companion amd-ucode packages, but the freezes still occur.
Comment 19 Alex Deucher 2019-11-18 15:02:53 UTC
Does this patch help?
https://patchwork.freedesktop.org/patch/340983/
Comment 20 wolfgang.gruenstern 2019-11-18 18:12:06 UTC
Dear Mr. Deucher,
in my case, the patch works as desired.
It would be very nice if the patch would land in linux 5.4 in time.
Thank you for your efforts.
Comment 21 Roberto Malinverni 2019-11-19 05:57:59 UTC
Thanks for the patch.
I compiled kernel 5.3.11 with the patch, testing right now. I'll use it for a couple of days then I'll report back.
Comment 22 Roberto Malinverni 2019-11-21 07:23:30 UTC
So far, so good!
Comment 23 Haxk20 2019-11-23 18:36:33 UTC
(In reply to Alex Deucher from comment #19)
> Does this patch help?
> https://patchwork.freedesktop.org/patch/340983/

Is this a permanent solution? My R5 2500U doesn't suffer from this and, well, GFXOFF is nice to have.
Comment 24 Roberto Malinverni 2019-12-03 17:03:42 UTC
(In reply to wolfgang.gruenstern from comment #20)
> Dear Mr. Deucher,
> in my case, the patch works as desired.
> It would be very nice if the patch would land in linux 5.4 in time.
> Thank you for your efforts.

it's in kernel 5.4
https://www.lkml.org/lkml/2019/11/24/187
Comment 25 Changfeng.Zhu 2019-12-04 02:06:53 UTC
Created attachment 286171 [details]
raven_rlc.bin

Hi wolfgang,

Could you please use the attached raven_rlc.bin instead and test it?


BR,

Changfeng.
Comment 26 Huang Rui 2019-12-04 04:05:22 UTC
(In reply to wolfgang.gruenstern from comment #20)
> Dear Mr. Deucher,
> in my case, the patch works as desired.
> It would be very nice if the patch would land in linux 5.4 in time.
> Thank you for your efforts.

Hi Wolfgang,

We are from AMD as well and work with Alex on the same kernel team. Our Raven RLC ucode has now been upgraded. Would you mind helping with one more test, using the raven_rlc.bin (#568) that Changfeng provided and reverting Alex's patch so that gfxoff is not disabled?

Thanks,
Ray
Comment 27 wolfgang.gruenstern 2019-12-04 11:08:15 UTC
Dear Ray, Dear Changfeng, Dear Alex,
the new raven_rlc.bin does not disable gfxoff and therefore the computer freezes again.

A blacklist would probably be useful for certain devices, so that they are excluded from using gfxoff.

By the way, which changes does the new raven_rlc.bin contain?
Should it be a kind of blacklist/whitelist, or should it make gfxoff stable for all devices?

Do you need some additional information for the mainboard (ASRock Fatal1ty AB350 Gaming-ITX/ac)?

The link for the bios:
https://www.asrock.com/mb/AMD/Fatal1ty%20AB350%20Gaming-ITXac/index.asp#BIOS

Currently I use the bios number 5.70.

Thank you for your efforts.
Best Regards,
Wolfgang
Comment 28 Huang Rui 2019-12-10 04:23:08 UTC
(In reply to wolfgang.gruenstern from comment #27)
> Dear Ray, Dear Changfeng, Dear Alex,
> the new raven_rlc.bin does not disable gfxoff and therefore the computer
> freezes again.
> 
> Probably a blacklist would be useful for certain devices, so they are
> excluded to use gfxoff.
> 
> By the way, which changes does the new raven_rlc.bin contain?
> Should it be a kind of blacklist/whitelist, or should it make gfxoff stable
> for all devices?
> 
> Do you need some additional information for the mainboard (ASRock Fatal1ty
> AB350 Gaming-ITX/ac)?
> 
> The link for the bios:
> https://www.asrock.com/mb/AMD/Fatal1ty%20AB350%20Gaming-ITXac/index.asp#BIOS
> 
> Currently I use the bios number 5.70.
> 
> Thank you for your efforts.
> Best Regards,
> Wolfgang

Hi Wolfgang,

Thanks for the updates. For the moment we have to disable gfxoff on the Raven chip until we find a good firmware for gfxoff.

Thanks,
Ray
Comment 29 Roberto Malinverni 2019-12-15 15:24:52 UTC
Hello, another update for this bug: it seems that even Ryzen 3400G is affected, so not only "original Raven", but the 2nd generation too - I mean, I have the exact same behavior and freezes with the 3400G that I had with the 2400G.

A difference between 2400G and 3400G is that using 

amdgpu.ppfeaturemask=0xffff3fff


boots into a black screen with 2400G, and works with 3400G.

Anyway, even when using amdgpu.ppfeaturemask=0xffff3fff, there is another (I don't know if it is related) problem - 3400G freezes for a few secs and then resumes (when viewing a video: sudden still image, audio normally running, video resume in 2-5 secs).

The error message is:

[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
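
When collecting such messages from a longer dmesg or journal capture, a small filter can help. This is a minimal Python sketch that assumes error lines have the shape quoted above, "[drm:<function> [amdgpu]] *ERROR* <message>"; the function name amdgpu_errors is hypothetical.

```python
import re

# Matches kernel DRM error lines of the shape quoted in this comment:
# "[drm:<function> [amdgpu]] *ERROR* <message>"
ERROR_RE = re.compile(
    r"\[drm:(?P<func>\w+) \[amdgpu\]\] \*ERROR\* (?P<msg>.+)"
)


def amdgpu_errors(log_text):
    """Return (function, message) pairs for amdgpu *ERROR* lines."""
    return [(m.group("func"), m.group("msg").strip())
            for m in ERROR_RE.finditer(log_text)]


SAMPLE = """\
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
"""

for func, msg in amdgpu_errors(SAMPLE):
    print(f"{func}: {msg}")
```

Lines that do not match the assumed shape, such as timestamps without an amdgpu error, are simply skipped.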
Comment 30 Subsentient 2019-12-18 07:00:39 UTC
Managed to get the GPU working on my 2200G. Added "export AMD_DEBUG=nodcc" to my profile. You can also set it in /etc/environment on Red Hat systems.