Bug 207673

Summary: amdgpu/radeon: crash due to over temperature
Product: Drivers
Reporter: phileimer (phil)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: NEW
Severity: high
CC: phil
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 5.6.x, 5.7.x
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
kernel log + lspci + glxinfo + patch
radeon: lower the high temperature limit
amdgpu: lower the temperature limit to avoid kernel crash
amdgpu: kernel log when over temperature crash

Description phileimer 2020-05-10 13:43:24 UTC
Created attachment 289045
kernel log + lspci + glxinfo + patch

The radeon driver crashes when my AMD Cape Verde Pro graphics card overheats.

The card is not overclocked, and power management uses the default settings: power_method is dpm, power_dpm_state is balanced, and power_dpm_force_performance_level is auto.
The GPU is used for display and OpenCL computing.
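
For anyone trying to compare their own setup, here is a minimal sketch that reads the three settings named above from sysfs. The device directory `/sys/class/drm/card0/device` is an assumption (the card index differs between systems), and `read_dpm_settings` is a hypothetical helper name, not part of any driver tooling. A file that is missing (for example `power_method` when amdgpu is loaded instead of radeon) is reported as `None`.

```python
from pathlib import Path

# Assumption: the GPU is card0; adjust the index for your system.
DEFAULT_DEVICE_DIR = Path("/sys/class/drm/card0/device")

# Setting names as quoted in this report.
DPM_FILES = (
    "power_method",
    "power_dpm_state",
    "power_dpm_force_performance_level",
)


def read_dpm_settings(device_dir=DEFAULT_DEVICE_DIR):
    """Return a dict mapping each setting name to its value, or None if unreadable."""
    settings = {}
    for name in DPM_FILES:
        try:
            settings[name] = (Path(device_dir) / name).read_text().strip()
        except OSError:
            # File absent or not readable with the current driver/permissions.
            settings[name] = None
    return settings
```

Calling `read_dpm_settings()` and printing the result should show the same values as reading the files by hand with `cat`.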

The default over-temperature limit in r600_dpm is 120C, which seems too high for this chip/card.
After patching my system to use a 100C limit, I no longer get crashes.
(I also tried 110C, which is still too high.)
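
To watch how close the card gets to these limits, the temperature can be read from the hwmon interface. The sketch below is only an illustration: it assumes the GPU is `card0` and that its sensor is exposed as `temp1_input`, and the helper names are made up for this example. hwmon `temp*_input` files report millidegrees Celsius. The 100C default is the value from my patch, not a vendor specification.

```python
from pathlib import Path


def find_gpu_temp_inputs(device_dir="/sys/class/drm/card0/device"):
    """List temp1_input files under the device's hwmon directory (card0 is an assumption)."""
    return sorted(Path(device_dir).glob("hwmon/hwmon*/temp1_input"))


def read_temp_c(temp_input):
    """Read an hwmon temperature file (millidegrees Celsius) and return degrees Celsius."""
    return int(Path(temp_input).read_text().strip()) / 1000.0


def is_over_limit(temp_c, limit_c=100.0):
    """True when the temperature has reached the chosen limit (100C is the patched value)."""
    return temp_c >= limit_c
```

Polling `read_temp_c` once per second during an OpenCL job is enough to see the overshoot above the limit.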

Attached are the full kernel log of the crash event, the lspci and glxinfo output for the graphics card, and the proposed patch.
Comment 1 phileimer 2020-05-10 13:46:41 UTC
Created attachment 289047
radeon: lower the high temperature limit

Limit the chip temperature to 100C, instead of 120C.
Comment 2 phileimer 2020-05-13 12:20:57 UTC
I can give more information about the over-temperature problem:

* if I keep the 120C limit, the card runs at power level 3 until the driver crashes

* with a 100C limit, the driver decreases the power level to 2 after a small overshoot (the temperature reaches 103-104C)

* once at power level 2, the temperature stabilizes around 96C

* to test further, I decreased the case fan speed; even with the 100C limit, the card then keeps running at power level 2 until the driver crashes at around 112C

So there seem to be two problems:
* the default 120C limit is clearly too high, at least for this board/chip
* the temperature limit only triggers a drop from power level (PWL) 3 to PWL 2; there is no further drop to a lower PWL (1 or 0) as a safety measure
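
To make the second point concrete, here is a rough sketch of the stepped behaviour I would expect. This is not the driver's algorithm: the function name, the 5C hysteresis, and the one-step-per-call rule are all assumptions chosen for illustration. The idea is simply that the power level keeps dropping, down to 0, for as long as the temperature stays at or above the limit.

```python
def next_power_level(level, temp_c, limit_c=100.0, hysteresis_c=5.0, max_level=3):
    """Return the power level to use next, given the current level and temperature.

    Hypothetical policy (not taken from radeon/amdgpu):
    - at or above the limit, step down one level until level 0 is reached
    - at or below (limit - hysteresis), step back up one level
    - otherwise keep the current level
    """
    if temp_c >= limit_c and level > 0:
        return level - 1
    if temp_c <= limit_c - hysteresis_c and level < max_level:
        return level + 1
    return level
```

With these assumed defaults, 104C at level 3 gives level 2, and 96C at level 2 stays at level 2, which matches what I observe. The difference is the reduced-fan case: 112C at level 2 would give level 1 instead of staying at level 2 until the crash.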
Comment 3 phileimer 2020-06-22 14:14:48 UTC
Created attachment 289807
amdgpu: lower the temperature limit to avoid kernel crash
Comment 4 phileimer 2020-06-22 14:16:32 UTC
I modified my kernel configuration to use the new amdgpu driver for this SI chip, instead of the legacy radeon.
The same problem occurs: to avoid frequent kernel crashes, I must apply a patch to lower the maximum temperature allowed.
Comment 5 phileimer 2020-06-27 12:44:27 UTC
Created attachment 289897
amdgpu: kernel log when over temperature crash