Created attachment 289045 [details]
kernel log + lspci + glxinfo + patch
The radeon driver crashes when my AMD Cape Verde Pro graphics card overheats.
On my system, there's no overclocking, and the power management mode is the default one, with power_method dpm, power_dpm_state balanced and power_dpm_force_performance_level auto.
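For anyone who wants to compare settings, the three values above can be read from sysfs. This is a minimal sketch, assuming the GPU is exposed as card0 (the index may differ on other systems); the helper name read_radeon_pm_settings is only illustrative.

```python
from pathlib import Path

# Assumption: the GPU is card0; adjust the index on multi-GPU systems.
DEFAULT_DEVICE_DIR = Path("/sys/class/drm/card0/device")

# sysfs files named in the report above.
PM_FILES = (
    "power_method",
    "power_dpm_state",
    "power_dpm_force_performance_level",
)


def read_radeon_pm_settings(device_dir=DEFAULT_DEVICE_DIR):
    """Return a dict mapping each power management file to its value.

    Files that are missing or unreadable are reported as None.
    """
    settings = {}
    for name in PM_FILES:
        try:
            settings[name] = (Path(device_dir) / name).read_text().strip()
        except OSError:
            settings[name] = None
    return settings


if __name__ == "__main__":
    for key, value in read_radeon_pm_settings().items():
        print(f"{key}: {value}")
```

On a system configured as described, this would be expected to report dpm, balanced and auto respectively.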
This GPU is used for display and OpenCL computing.
The default over temperature value in r600_dpm is 120C, which seems to be too high for this chip/card.
I patched my system to use a 100C limit, and the crashes no longer occur.
(I also tried 110C, which is still too high.)
Attached are the full kernel log of the crash event, the lspci and glxinfo output for the graphics card, and the proposed patch.
Created attachment 289047 [details]
radeon: lower the high temperature limit
Limit the chip temperature to 100C, instead of 120C.
Here is more information about the over-temperature problem:
* if I keep the 120C limit, the card runs at power level 3 until the driver crashes
* with a 100C limit, the driver drops to power level 2 after a small overshoot, i.e. the temperature reaches 103-104C
* once at power level 2, the temperature stabilizes around 96C
* to test further, I decreased the case fan speed; then, even with the 100C limit, the card keeps running at power level 2 until the driver crashes at around 112C
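To reproduce observations like these, the GPU temperature can be sampled from the hwmon interface, which reports millidegrees Celsius. This is a rough sketch, assuming the card is card0 and that the sensor is exposed as temp1_input; the function name read_gpu_temp_celsius is hypothetical.

```python
from pathlib import Path


def read_gpu_temp_celsius(device_dir="/sys/class/drm/card0/device"):
    """Return the GPU temperature in degrees Celsius, or None if unavailable.

    Assumption: the driver exposes hwmon/hwmon*/temp1_input under the
    device directory, with the value in millidegrees Celsius.
    """
    for sensor in sorted(Path(device_dir).glob("hwmon/hwmon*/temp1_input")):
        try:
            return int(sensor.read_text().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # try the next hwmon entry, if any
    return None


if __name__ == "__main__":
    temp = read_gpu_temp_celsius()
    print("GPU temperature:", "unavailable" if temp is None else f"{temp:.1f} C")
```

Calling this once per second while the OpenCL load runs should be enough to see the overshoot and the plateau described above.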
So there seem to be two problems:
* the default 120C is clearly too high, at least for this board/chip
* the temperature limit triggers the drop from power level 3 to power level 2, but there is no further drop to a lower power level (1 or 0) as a safety measure
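To make the second point concrete, here is a small model of the behaviour I would expect. It is purely illustrative and is not the driver's logic: the function name, the one-level-per-step policy and the 5C hysteresis are all assumptions of mine.

```python
def next_power_level(current_level, temp_c, limit_c=100.0,
                     hysteresis_c=5.0, max_level=3):
    """Hypothetical throttling policy (not taken from radeon or amdgpu).

    Drop one power level whenever the temperature is at or above the
    limit, all the way down to level 0. Raise one level only once the
    temperature has fallen below the limit minus the hysteresis margin.
    """
    if temp_c >= limit_c:
        return max(current_level - 1, 0)
    if temp_c < limit_c - hysteresis_c:
        return min(current_level + 1, max_level)
    return current_level


if __name__ == "__main__":
    # Replay a reduced-fan scenario: the temperature keeps climbing.
    level = 3
    for temp in (104.0, 108.0, 112.0):
        level = next_power_level(level, temp)
        print(f"{temp:.0f} C -> power level {level}")
```

With such a policy, the reduced-fan case would continue stepping down to levels 1 and 0 instead of staying at level 2 until the crash, while the normal case would hold at level 2 around 96C.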
Created attachment 289807 [details]
amdgpu: lower the temperature limit to avoid kernel crash
I modified my kernel configuration to use the new amdgpu driver for this SI chip, instead of the legacy radeon driver.
The same problem occurs: to avoid frequent kernel crashes, I must apply a patch to lower the maximum temperature allowed.
Created attachment 289897 [details]
amdgpu: kernel log when over temperature crash