Bug 213569 - Amdgpu temperature reaching dangerous levels
Summary: Amdgpu temperature reaching dangerous levels
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: x86-64 Linux
Importance: P1 blocking
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-06-24 15:32 UTC by Martin
Modified: 2021-07-17 11:52 UTC

See Also:
Kernel Version: 5.13
Subsystem:
Regression: No
Bisected commit-id:



Description Martin 2021-06-24 15:32:43 UTC
Ever since I moved to kernel 5.11, and later 5.12, the fan speed on my Radeon RX550 has been erratic, causing the temperature to reach dangerous levels.

sensors output:

amdgpu-pci-0100
Adapter: PCI adapter
vddgfx:      825.00 mV 
fan1:         200 RPM  (min =    0 RPM, max = 3500 RPM)
edge:         +69.0°C  (crit = +97.0°C, hyst = -273.1°C)
power1:        7.03 W  (cap =  36.00 W)
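
For anyone who wants to log these readings over time, here is a minimal sketch that pulls the fan speed, edge temperature and crit threshold out of `sensors` text in the format shown above. The function name `parse_sensors` and the `SAMPLE` constant are hypothetical names made up for this illustration, and the regular expressions assume the exact field labels (`fan1`, `edge`, `crit`) seen in this output; other GPUs or lm-sensors versions may print different labels.

```python
import re

# Sample text copied from the sensors output in this report.
SAMPLE = """\
amdgpu-pci-0100
Adapter: PCI adapter
vddgfx:      825.00 mV
fan1:         200 RPM  (min =    0 RPM, max = 3500 RPM)
edge:         +69.0°C  (crit = +97.0°C, hyst = -273.1°C)
power1:        7.03 W  (cap =  36.00 W)
"""


def parse_sensors(text):
    """Extract fan RPM, edge temperature and crit threshold (hypothetical helper)."""
    fan = re.search(r"^fan1:\s+(\d+)\s+RPM", text, re.MULTILINE)
    edge = re.search(
        r"^edge:\s+([+-]?[\d.]+)°C\s+\(crit = ([+-]?[\d.]+)°C",
        text,
        re.MULTILINE,
    )
    if fan is None or edge is None:
        raise ValueError("expected 'fan1' and 'edge' lines in sensors output")
    return {
        "fan_rpm": int(fan.group(1)),
        "edge_c": float(edge.group(1)),
        "crit_c": float(edge.group(2)),
    }


reading = parse_sensors(SAMPLE)
print(reading)
```

In practice the text would come from running `sensors` rather than from a constant; the constant is used here only so the sketch is self-contained.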


I'm afraid it'll eventually kill my gpu.

I've already reported another bug for 5.11: 
https://bugzilla.kernel.org/show_bug.cgi?id=212107

From what I gather there were changes in fan control in 5.11. Is it possible to disable those changes?
There were no issues on 5.10. The fan ran at roughly 1000 RPM, and the GPU stayed cool and quiet.

The behaviour from 5.11 onward is dangerous and can destroy hardware.
Comment 1 miloog 2021-06-24 20:58:53 UTC
I can confirm.

But in a different scenario. I'm using Debian bullseye with the LTS kernel and the latest amdgpu firmware. I haven't changed any fan control mechanism.

5.10.44 and 5.10.45 work fine, but with 5.10.46, if I only start sway (a Wayland window manager), my GPU usage is at 100% without doing anything.

It's a vega 56.
Comment 2 Martin 2021-06-25 12:34:53 UTC
In my case, it was watching a video that made the GPU reach 70°C.
Comment 3 James 2021-06-27 06:09:03 UTC
This is a legitimate bug that is present starting with 5.12.13, and it is said to have been fixed starting with 5.13-rc8. To reassure you: a 70°C edge temperature cannot damage that GPU. Note "crit = +97.0°C", which is the throttle temperature.

The computer should shut down at the "emerg" temperature which is not present in your sensors output, but should be +5.0°C over "crit" for your GPU.
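
To put numbers on that, here is a small worked sketch of the remaining headroom. It uses the 69.0°C edge and 97.0°C crit values from the sensors output in the description. The 5.0°C offset for "emerg" is taken from the statement above and is an assumption, not a value read from the hardware, and `thermal_headroom` is a hypothetical helper name.

```python
def thermal_headroom(edge_c, crit_c, emerg_offset_c=5.0):
    """Return the margin to the throttle and shutdown thresholds, in °C.

    emerg_offset_c is an assumption (crit + 5.0°C, as stated in the comment
    above); the real "emerg" value was not present in the sensors output.
    """
    emerg_c = crit_c + emerg_offset_c
    return {
        "to_crit_c": crit_c - edge_c,    # margin before throttling
        "emerg_c": emerg_c,              # assumed shutdown temperature
        "to_emerg_c": emerg_c - edge_c,  # margin before shutdown
    }


headroom = thermal_headroom(edge_c=69.0, crit_c=97.0)
print(headroom)
# → {'to_crit_c': 28.0, 'emerg_c': 102.0, 'to_emerg_c': 33.0}
```

Under these assumptions the reported reading is 28°C below the throttle point and 33°C below the assumed shutdown point.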
Comment 4 Frank Kruger 2021-06-27 07:14:50 UTC
(In reply to miloog from comment #1)
> I can confirm.
> 
> But in a different scenario. I'm using Debian bullseye with the LTS kernel
> and the latest amdgpu firmware. I haven't changed any fan control mechanism.
> 
> 5.10.44 and 5.10.45 work fine, but with 5.10.46, if I only start sway (a
> Wayland window manager), my GPU usage is at 100% without doing anything.
> 
> It's a vega 56.

You have probably been hit by a recent regression introduced in kernels 5.10.46 and 5.12.13 (cf. https://bugzilla.kernel.org/show_bug.cgi?id=213561), for which patches are on their way (https://lists.freedesktop.org/archives/amd-gfx/2021-June/065612.html). I presume this is not related to the original bug report here.
Comment 5 Martin 2021-06-28 13:09:44 UTC
(In reply to James from comment #3)
> This is a legitimate bug that is present starting with 5.12.13, and it is
> said to have been fixed starting with 5.13-rc8. To reassure you: a 70°C edge
> temperature cannot damage that GPU. Note "crit = +97.0°C", which is the
> throttle temperature.
> 
> The computer should shut down at the "emerg" temperature which is not
> present in your sensors output, but should be +5.0°C over "crit" for your
> GPU.

Thank you for the explanation. I had never seen 70°C on my GPU before, so it looked scary to me.

Before those changes landed in 5.11, the usual temperature on my GPU was around 40°C. The fan ran at around 1000 RPM, which on my GPU doesn't produce any perceivable sound.
