Bug 213569

Summary: Amdgpu temperature reaching dangerous levels
Product: Drivers
Reporter: Martin (martin.tk)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: NEW
Severity: blocking
CC: mileikasjos, mrjameshennig
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 5.13
Subsystem:
Regression: No
Bisected commit-id:

Description Martin 2021-06-24 15:32:43 UTC
Ever since moving to kernel 5.11, and later 5.12, the fan speed on my Radeon RX 550 has been erratic, causing the temperature to reach dangerous levels.

sensors output:

amdgpu-pci-0100
Adapter: PCI adapter
vddgfx:      825.00 mV 
fan1:         200 RPM  (min =    0 RPM, max = 3500 RPM)
edge:         +69.0°C  (crit = +97.0°C, hyst = -273.1°C)
power1:        7.03 W  (cap =  36.00 W)
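
For anyone who wants to log these readings over time, the sensors text can be parsed with a few lines of Python. This is only a minimal sketch: the function names parse_sensors and parse_limit are hypothetical, the sample text is copied from this report, and the regular expressions assume the "label: value (limit = value, ...)" layout shown above.

```python
import re

# Sample text copied from the sensors output in this report.
SENSORS_OUTPUT = """\
amdgpu-pci-0100
Adapter: PCI adapter
vddgfx:      825.00 mV
fan1:         200 RPM  (min =    0 RPM, max = 3500 RPM)
edge:         +69.0°C  (crit = +97.0°C, hyst = -273.1°C)
power1:        7.03 W  (cap =  36.00 W)
"""

NUMBER = r"([+-]?\d+(?:\.\d+)?)"


def parse_sensors(text):
    """Return {label: first numeric value} for every 'label: number' line."""
    readings = {}
    for line in text.splitlines():
        match = re.match(r"^(\w+):\s+" + NUMBER, line)
        if match:
            readings[match.group(1)] = float(match.group(2))
    return readings


def parse_limit(text, label, limit):
    """Return a parenthesised limit (e.g. 'crit') for a label, or None."""
    pattern = rf"^{re.escape(label)}:.*\b{re.escape(limit)}\s*=\s*" + NUMBER
    match = re.search(pattern, text, re.MULTILINE)
    return float(match.group(1)) if match else None


readings = parse_sensors(SENSORS_OUTPUT)
print(readings["fan1"], readings["edge"])
print(parse_limit(SENSORS_OUTPUT, "edge", "crit"))
```

Lines without a numeric value, such as "Adapter: PCI adapter", are skipped rather than reported.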


I'm afraid it'll eventually kill my GPU.

I've already reported another bug for 5.11: 
https://bugzilla.kernel.org/show_bug.cgi?id=212107

From what I gather, there were changes to fan control in 5.11. Is it possible to disable those changes?
There were no issues on 5.10: the fan ran at roughly 1000 RPM, and the card was cool and quiet.

The behaviour from 5.11 onward is dangerous and can cause hardware damage.
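
To compare fan behaviour between kernels, the same values can be read directly from the driver's hwmon files instead of through sensors. The sketch below is an assumption-laden illustration, not something taken from this report: it assumes the usual amdgpu hwmon attribute names (fan1_input in RPM, temp1_input in millidegrees Celsius, pwm1_enable with 1 = manual and 2 = automatic fan control), and the /sys/class/drm/card0 path is hypothetical because card and hwmon indices differ per system. The helper names read_int and read_gpu_hwmon are likewise made up for this example. It only reads values and changes nothing.

```python
from pathlib import Path


def read_int(path):
    """Read one integer from a sysfs-style file, or None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def read_gpu_hwmon(hwmon_dir):
    """Collect fan speed, edge temperature and fan control mode."""
    d = Path(hwmon_dir)
    temp_mc = read_int(d / "temp1_input")  # assumed: millidegrees Celsius
    return {
        "fan_rpm": read_int(d / "fan1_input"),
        "edge_c": None if temp_mc is None else temp_mc / 1000.0,
        "pwm_mode": read_int(d / "pwm1_enable"),  # assumed: 1 manual, 2 auto
    }


if __name__ == "__main__":
    # Hypothetical location; adjust the card index for your system.
    base = Path("/sys/class/drm/card0/device/hwmon")
    if base.is_dir():
        for hwmon in sorted(base.glob("hwmon*")):
            print(hwmon, read_gpu_hwmon(hwmon))
```

Running this periodically on 5.10 and on 5.11 or later would show whether the fan speed really stays low while the edge temperature climbs.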
Comment 1 miloog 2021-06-24 20:58:53 UTC
I can confirm.

But in a different scenario. I'm using Debian bullseye with the LTS kernel and the latest amdgpu firmware. I haven't changed any fan control mechanism.

5.10.44 and 5.10.45 work fine, but on 5.10.46, if I only start sway (a Wayland window manager), my GPU usage sits at 100% without my doing anything.

It's a Vega 56.
Comment 2 Martin 2021-06-25 12:34:53 UTC
In my case it was watching a video that made the GPU reach 70°C.
Comment 3 James 2021-06-27 06:09:03 UTC
This is a legitimate bug which is present starting with 5.12.13, and it was reportedly fixed as of 5.13-rc8. I wanted to reassure you that a 70°C edge temperature cannot damage that GPU. Note "crit = +97.0°C", which is the throttle temperature.

The computer should shut down at the "emerg" temperature, which is not present in your sensors output but should be 5.0°C above "crit" for your GPU.
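
The margins described here can be worked out with simple arithmetic. The following is a minimal sketch, assuming (as stated in this comment) that "emerg" sits 5.0°C above "crit"; the function name thermal_headroom is hypothetical, and the input values are the ones from the sensors output in the description.

```python
def thermal_headroom(edge_c, crit_c, emerg_offset_c=5.0):
    """Return the distance from the current edge temperature to each limit.

    emerg_offset_c is an assumption taken from this comment, not a value
    read from the driver.
    """
    emerg_c = crit_c + emerg_offset_c
    return {
        "to_crit_c": crit_c - edge_c,
        "emerg_c": emerg_c,
        "to_emerg_c": emerg_c - edge_c,
    }


# Values from the reported sensors output: edge = +69.0°C, crit = +97.0°C.
print(thermal_headroom(edge_c=69.0, crit_c=97.0))
# {'to_crit_c': 28.0, 'emerg_c': 102.0, 'to_emerg_c': 33.0}
```

With the reported reading, the card is 28°C below the throttle point and 33°C below the assumed shutdown point.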
Comment 4 Frank Kruger 2021-06-27 07:14:50 UTC
(In reply to miloog from comment #1)
> I can confirm.
> 
> But in a different scenario. I'm using Debian bullseye with the LTS kernel
> and the latest amdgpu firmware. I haven't changed any fan control mechanism.
> 
> 5.10.44 and 5.10.45 work fine, but on 5.10.46, if I only start sway (a
> Wayland window manager), my GPU usage sits at 100% without my doing anything.
> 
> It's a Vega 56.

You have probably been hit by a recent regression introduced with kernels 5.10.46 and 5.12.13 (cf. https://bugzilla.kernel.org/show_bug.cgi?id=213561), for which patches are on their way (https://lists.freedesktop.org/archives/amd-gfx/2021-June/065612.html). I presume this is not related to the original bug report here.
Comment 5 Martin 2021-06-28 13:09:44 UTC
(In reply to James from comment #3)
> This is a legitimate bug which is present starting with 5.12.13, and it was
> reportedly fixed as of 5.13-rc8. I wanted to reassure you that a 70°C edge
> temperature cannot damage that GPU. Note "crit = +97.0°C", which is the
> throttle temperature.
> 
> The computer should shut down at the "emerg" temperature, which is not
> present in your sensors output but should be 5.0°C above "crit" for your
> GPU.

Thank you for the explanation. I've never seen 70°C on my GPU before, so it looked scary to me.

Before those changes landed in 5.11, the usual temperature of my GPU was around 40°C. The fan ran at around 1000 RPM, which on my GPU doesn't produce any perceptible sound.