Bug 112491 (radeon_heat)

Summary: Radeon: HD 7400G / A4-4355M System overheats with active graphics card use.
Product: Drivers
Reporter: Dionisus Torimens (djtm)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: NEW
Severity: normal
CC: alexdeucher, szg00000, vedran
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.2
Subsystem:
Regression: No
Bisected commit-id:
Attachments: dmesg 4.2
dmesg 4.2 bapm=1 hang on radeon load.
Temperature Graph 1 (crash not heat related in this case)
Temperature Graph BAPM=1
Temperature Graph Hard off radeon.fastfb=1 radeon.pcie_gen2=1 radeon.audio=0 radeon.hard_reset=1
Temperature Graph with nomodeset radeon.modeset=1 and redshift
Temperature Graph with nomodeset radeon.modeset=1 without redshift

Description Dionisus Torimens 2016-02-15 15:56:11 UTC
I've already tried several radeon boot flags, such as DPM on and off, BAPM on and off, without success. The problem does not occur with the fglrx or Windows driver. I'm using glamor acceleration. I'm using an external 4k display in FullHD resolution, the internal laptop display is disabled.

I've read similar bugs here:
https://bugzilla.kernel.org/show_bug.cgi?id=63101
https://bugzilla.kernel.org/show_bug.cgi?id=68571
https://bugs.freedesktop.org/show_bug.cgi?id=73053

But I'm not sure which of the tips are still relevant in the current tree?

The results are:
Playing a simple car racing game, I can do:
dpm=0 -> 2 laps, then the system shuts down due to thermal zone overheating (a regular shutdown, not a sudden one).
dpm=-1 -> ~1 lap, the system suddenly turns off hard without warning.
bapm=1 -> ~1 lap, hard turn off
Comment 1 Dionisus Torimens 2016-02-15 15:58:05 UTC
Created attachment 203661 [details]
dmesg 4.2
Comment 2 Dionisus Torimens 2016-02-15 16:06:38 UTC
Created attachment 203671 [details]
dmesg 4.2 bapm=1 hang on radeon load.

This is from a boot where the system already hung during startup with BAPM active. That doesn't always happen, though.
Comment 3 Dionisus Torimens 2016-02-15 16:15:47 UTC
Also note I can use the system pretty much without problems if I don't use 3D.
Comment 4 Alex Deucher 2016-02-15 17:45:11 UTC
3D is used for everything (even "2D" apps).  Is the problem specific to this car racing game?  It might be a bug in the mesa driver which causes the GPU to lock up.  Also make sure the fan and heatsink are free of dust.
Comment 5 Dionisus Torimens 2016-02-16 06:04:37 UTC
No, the problem occurs with any 3D load, just usually not with 2D load. Maybe even glxgears is enough.

Yes I've cleaned the fan and heatsink area. Didn't help. :/
Comment 6 Dionisus Torimens 2016-02-16 15:15:38 UTC
Actually, I forgot to mention I used to have problems without active 3D use as well, but I'm now switching the DPM profile to battery during boot, and that mostly solved that part. Also, high CPU use by itself is not a problem. When not using DPM, I didn't switch to the battery/low power profile.
Comment 7 Dionisus Torimens 2016-02-17 12:40:34 UTC
Created attachment 203781 [details]
Temperature Graph 1 (crash not heat related in this case)

I'm having some doubts about the temperature hypothesis now. With DPM active, there seem to be higher temperatures during boot up than during the freezes or reboots. (The freezes and reboots are visible as gaps in the graph; measurements are taken at 5-second intervals.)

So with DPM there seems to be a different issue than without it. I've tried disabling hyperz, to no avail.
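
The gaps can also be located mechanically in a temperature log. This is a minimal sketch, assuming the log is available as (timestamp in seconds, temperature) samples taken nominally every 5 seconds. The name `find_gaps` and the tolerance factor are illustrative and not part of any tool used in this report.

```python
from typing import List, Tuple


def find_gaps(samples: List[Tuple[float, float]],
              interval: float = 5.0,
              tolerance: float = 1.5) -> List[Tuple[float, float, float]]:
    """Find pauses in a temperature log that are longer than expected.

    Returns one tuple per gap:
    (timestamp before the gap, timestamp after the gap,
     last temperature seen before the gap).
    """
    gaps = []
    # Walk over consecutive pairs of samples.
    for (t0, temp0), (t1, _temp1) in zip(samples, samples[1:]):
        # A freeze or reboot shows up as a pause well above the interval.
        if t1 - t0 > interval * tolerance:
            gaps.append((t0, t1, temp0))
    return gaps
```

The last temperature before each gap is the value to compare against the boot-time peak: if it is clearly lower, heat alone is an unlikely cause for that particular freeze.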
Comment 8 Dionisus Torimens 2016-02-17 14:31:50 UTC
Created attachment 203791 [details]
Temperature Graph BAPM=1

It looks like with BAPM it might more likely be overheating. But I can't reproduce the lockups/sudden reboots at all at the moment.
Comment 9 Dionisus Torimens 2016-02-17 14:37:31 UTC
Booting with radeon.hard_reset=1 I get this error at the point where I usually get a hang:
GPU lockup (current fence id 0x0000000000004aa1 last fence id 0x0000000000004aa8 on ring 0)
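
To compare several such lockup lines, the fence ids can be pulled out of dmesg output. This is a rough sketch, assuming the message format shown above and assuming that "current" is the last fence the GPU completed while "last" is the last one submitted. The helper name `parse_lockup` is made up for illustration.

```python
import re
from typing import Optional, Tuple

# Matches the radeon lockup line as quoted in this report.
LOCKUP_RE = re.compile(
    r"GPU lockup \(current fence id 0x([0-9a-fA-F]+) "
    r"last fence id 0x([0-9a-fA-F]+) on ring (\d+)\)"
)


def parse_lockup(line: str) -> Optional[Tuple[int, int, int]]:
    """Return (current_fence, last_fence, ring), or None if no match."""
    match = LOCKUP_RE.search(line)
    if match is None:
        return None
    current = int(match.group(1), 16)  # fence ids are printed in hex
    last = int(match.group(2), 16)
    ring = int(match.group(3))
    return current, last, ring


line = ("GPU lockup (current fence id 0x0000000000004aa1 "
        "last fence id 0x0000000000004aa8 on ring 0)")
current, last, ring = parse_lockup(line)
pending = last - current  # fences submitted but not yet completed
```

For the line above, the difference between the two ids is 7 on ring 0.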
Comment 10 Dionisus Torimens 2016-02-17 16:10:18 UTC
Created attachment 203801 [details]
Temperature Graph Hard off radeon.fastfb=1 radeon.pcie_gen2=1 radeon.audio=0 radeon.hard_reset=1

Ok, hard turning off reproduced. It reaches almost 110°.
Comment 11 Dionisus Torimens 2016-02-26 05:52:12 UTC
If you'd like any information, please let me know now, because it seems there is not much interest in finding the problem. So I'll have to switch back to the proprietary driver.

(It turned out that I used a different version of the game which crashed the card instead. I get the hard off also without any parameters btw.)
Comment 12 Dionisus Torimens 2016-02-27 14:42:33 UTC
[Wonderful, fglrx doesn't work at all with kernel 4.2... ...]

The fastest way to get the system to overheat is to
- enable redshift (or probably xgamma)
- disable vsync
- set dpm to performance*
echo performance | sudo tee /sys/class/drm/card0/device/power_dpm_state
- activate cpu turbo mode*
echo 1 | sudo tee /sys/devices/system/cpu/cpufreq/boost
- activate BAPM
- activate DRI3
- stay in the game menu (tested with blazrush, kotor)

* Here the effect is not that certain/serious.
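
For repeated test runs, the DPM state switch from the list above can be scripted. This is a minimal sketch, assuming the same sysfs path as in the `echo` command and the three state names the radeon driver accepts (battery, balanced, performance). The function names are illustrative, and writing to the real path requires root.

```python
from pathlib import Path

# Same sysfs file as in the echo command above; card0 is an assumption.
DPM_STATE_PATH = Path("/sys/class/drm/card0/device/power_dpm_state")
VALID_STATES = ("battery", "balanced", "performance")


def read_dpm_state(path=DPM_STATE_PATH) -> str:
    """Return the current DPM state as a plain string."""
    return Path(path).read_text().strip()


def write_dpm_state(state: str, path=DPM_STATE_PATH) -> None:
    """Write a new DPM state, rejecting misspelled names up front."""
    if state not in VALID_STATES:
        raise ValueError(
            f"unknown DPM state {state!r}, expected one of {VALID_STATES}"
        )
    Path(path).write_text(state + "\n")
```

Checking the name before writing avoids silently testing with the wrong profile when a state name is mistyped.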

Things that help to avoid overheating:
- boot with nomodeset radeon.modeset=1

#vblank_mode=0 glmark2 --run-forever 
does not cause a hang; some of the tests seem less demanding, so the temperature goes up and down.

GALLIUM_HUD=temperature is helpful to watch how fast the temperature climbs.
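
Outside of a running GL application, the same reading can usually be taken from the hwmon sysfs interface. This is a sketch only, assuming the GPU sensor appears under `/sys/class/drm/card0/device/hwmon/hwmon*/temp1_input` (the exact path may differ per system) and reports millidegrees Celsius as hwmon files normally do. The function names are illustrative.

```python
import glob
from pathlib import Path
from typing import Optional

# Assumed sensor location; adjust card0 and the hwmon index as needed.
HWMON_GLOB = "/sys/class/drm/card0/device/hwmon/hwmon*/temp1_input"


def millidegrees_to_celsius(raw: str) -> float:
    """Convert a hwmon temp*_input value (millidegrees C) to degrees C."""
    return int(raw.strip()) / 1000.0


def read_gpu_temp(pattern: str = HWMON_GLOB) -> Optional[float]:
    """Return the GPU temperature in degrees C, or None if no sensor file."""
    matches = sorted(glob.glob(pattern))
    if not matches:
        return None
    return millidegrees_to_celsius(Path(matches[0]).read_text())
```

Calling `read_gpu_temp()` in a loop with a fixed sleep gives a log that can be plotted like the attached graphs.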
Comment 13 Dionisus Torimens 2016-02-27 14:45:35 UTC
Created attachment 206301 [details]
Temperature Graph with nomodeset radeon.modeset=1 and redshift
Comment 14 Dionisus Torimens 2016-02-27 14:46:13 UTC
Created attachment 206311 [details]
Temperature Graph with nomodeset radeon.modeset=1 without redshift
Comment 15 Dionisus Torimens 2016-05-15 08:22:56 UTC
So the problems are worse as summer approaches. Still present in 4.6. The system also shuts down hard with vdpau video playback. The weird thing is that the hard shutdowns occur at a lower temperature if dpm is active than if it isn't. 

Any hints for debugging this?
Comment 16 Vedran Miletić 2016-05-15 11:20:02 UTC
So, OpenGL and VDPAU crash the GPU. Any chance you could also test OpenCL? Not sure if [1] works on R600 OpenCL, but [2] does.

[1] https://github.com/matszpk/clgpustress
[2] https://github.com/lachesis/scallion
Comment 17 Dionisus Torimens 2016-05-15 16:43:46 UTC
I think I've solved it. 
The kernel parameter 

radeon.runpm=0

seems to work around the issue. The performance is degraded, but the temperatures mostly stay below 80°C. Generally the system appears to stay much cooler.
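
When comparing boots with different options, it helps to confirm which radeon parameters were actually passed to the kernel. This is a minimal sketch, assuming the options are given as `radeon.<name>=<value>` on the kernel command line (readable from `/proc/cmdline`). The function name `module_params` is illustrative, and quoted values containing spaces are not handled.

```python
from typing import Dict


def module_params(cmdline: str, module: str = "radeon") -> Dict[str, str]:
    """Extract <module>.<name>=<value> options from a kernel command line."""
    prefix = module + "."
    params = {}
    for token in cmdline.split():
        # Only tokens such as radeon.runpm=0 are of interest here.
        if token.startswith(prefix) and "=" in token:
            key, value = token[len(prefix):].split("=", 1)
            params[key] = value
    return params


# Example command line; on a real system read it from /proc/cmdline.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 radeon.runpm=0 radeon.dpm=1 quiet"
params = module_params(example)
```

Recording this dictionary next to each temperature graph makes it clear which combination of options a given run used.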
Comment 18 Alex Deucher 2016-05-16 13:43:57 UTC
(In reply to Dionisus Torimens from comment #17)
> I think I've solved it. 
> The kernel parameter 
> 
> radeon.runpm=0
> 
> seems to work around the issue. The performance is degraded, but the
> temperatures mostly stay below 80°C. Generally the system appears to stay
> much cooler.

Is this a multi-GPU notebook?  That option only affects Hybrid laptops with multiple GPUs.
Comment 19 Dionisus Torimens 2016-05-30 09:07:19 UTC
Ok, true, the issue is still there. No, not multi-GPU.