Bug 201539 - AMDGPU R9 390 automatic fan speed control in Linux 4.19/4.20/5.0
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_video-dri
Reported: 2018-10-27 12:32 UTC by Jan Ziak
Modified: 2022-02-05 21:59 UTC
CC List: 18 users

See Also:
Kernel Version: 4.19, 4.20, 5.0
Subsystem:
Regression: No
Bisected commit-id:


Attachments
pwm1-4.18.16 (1.97 KB, text/plain)
2018-10-27 12:34 UTC, Jan Ziak
pwm1-4.19.0 (5.48 KB, text/plain)
2018-10-27 12:35 UTC, Jan Ziak
patch to fix pwm1_enable being stuck to AUTO for some gpu smu7 and vega10 (1.43 KB, patch)
2020-12-01 21:57 UTC, fawz
possible fix (1.07 KB, patch)
2020-12-01 22:47 UTC, Alex Deucher
possible fix (5.77 KB, patch)
2020-12-02 16:21 UTC, Alex Deucher

Description Jan Ziak 2018-10-27 12:32:41 UTC
GPU: R9 390
Kernel module: amdgpu
Application (example): Rise of the Tomb Raider benchmark

In Linux 4.18 the fan speed of the GPU gradually adapts to GPU load. Maximum pwm is 155 during the benchmark.

In Linux 4.19 the overall fan speed is lower (quieter than 4.18, with a maximum pwm of 112), but the fan spikes to full speed (pwm=255) for 1-2 seconds, which is extremely loud.

The 4.18 and 4.19 kernels are using the same firmware.
Comment 1 Jan Ziak 2018-10-27 12:34:51 UTC
Created attachment 279205
pwm1-4.18.16

while true; do cat /sys/class/drm/card0/device/hwmon/hwmon2/pwm1; sleep 0.1s; done
Comment 2 Jan Ziak 2018-10-27 12:35:17 UTC
Created attachment 279207
pwm1-4.19.0

while true; do cat /sys/class/drm/card0/device/hwmon/hwmon2/pwm1; sleep 0.1s; done
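
Note that the hwmon index (hwmon2 here) is not fixed; other comments in this report show hwmon1 and hwmon3 for the same driver. A minimal sketch of the same sampling loop that looks the directory up through its `name` file instead of hardcoding it. The function names `find_amdgpu_hwmon` and `sample_pwm1` and the timestamped output format are illustrative choices, not part of any existing tool, and with several amdgpu cards only the first match is used.

```bash
#!/bin/bash
# Print the first hwmon directory under $1 whose "name" file reads "amdgpu".
# $1 defaults to /sys/class/hwmon. Returns 1 if none is found.
find_amdgpu_hwmon() {
  local base="${1:-/sys/class/hwmon}" dir
  for dir in "$base"/hwmon*; do
    if [ -r "$dir/name" ] && [ "$(cat "$dir/name")" = "amdgpu" ]; then
      echo "$dir"
      return 0
    fi
  done
  return 1
}

# Log pwm1 with a timestamp every 0.1 s until interrupted.
sample_pwm1() {
  local dir
  dir=$(find_amdgpu_hwmon) || { echo "no amdgpu hwmon found" >&2; return 1; }
  while true; do
    printf '%s %s\n' "$(date +%s.%N)" "$(cat "$dir/pwm1")"
    sleep 0.1
  done
}

# Usage: sample_pwm1 > pwm1.log
```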
Comment 3 Jan Ziak 2018-10-27 12:45:37 UTC
4.18.16:
$ cat /sys/class/drm/card0/device/hwmon/hwmon2/pwm1_enable
2

4.19.0:
$ cat /sys/class/drm/card0/device/hwmon/hwmon2/pwm1_enable
1
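
For readers unfamiliar with the values: the amdgpu hwmon documentation in the kernel tree describes pwm1_enable as 0 (no fan speed control), 1 (manual fan speed control through pwm1) and 2 (automatic fan speed control). So 4.18.16 reports automatic mode and 4.19.0 reports manual mode. A small sketch that decodes the value; the helper name `describe_pwm_mode` is made up for illustration.

```bash
#!/bin/bash
# Map a pwm1_enable value to the meaning given in the amdgpu hwmon docs.
describe_pwm_mode() {
  case "$1" in
    0) echo "none (no fan speed control)" ;;
    1) echo "manual" ;;
    2) echo "automatic" ;;
    *) echo "unknown ($1)" ;;
  esac
}

# Usage (path as in this comment; the hwmon index may differ on your system):
#   describe_pwm_mode "$(cat /sys/class/drm/card0/device/hwmon/hwmon2/pwm1_enable)"
```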
Comment 4 Sergey 2018-12-08 07:59:01 UTC
I have exactly the same problem with my R9 290X. I tested on the 4.19.6 kernel. The problem is still present and has not been solved. I have to use a 4.18.x kernel so as not to damage my graphics card.
Comment 5 Tony Chaveiro 2018-12-21 01:30:43 UTC
I experience the same issue.

4.18-20 kernel on Manjaro (amdgpu + opencl taken from amdgpu pro, on 3 x R9 390): everything works fine. Fan RPM increases with load (tested with stak-xmr miner).

4.19-8-2 kernel: fan RPM is always at the minimum setting. Once one of the GPUs hits 95C, sudden bursts of 100% RPM occur (under 1s duration). The application has to be suspended or hardware damage will occur.

Checked pwm1_enable, which displays "1". Changing it to "2" has no effect. pwmconfig drives the fans to 100% during configuration but is unable to provide control.

pwm1 reads different values (from 86 to 127) but the fans don't appear to change speed at all.
Comment 6 Tony Chaveiro 2018-12-21 02:16:14 UTC
kernel parameters: amdgpu.cik_support=1 amdgpu.dc=0
Comment 7 Alex Deucher 2018-12-21 15:06:16 UTC
Does setting amdgpu.dpm=1 on 4.19 help?  The default dpm implementation changed between 4.18 and 4.19.
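
When testing a parameter like this, it can help to first confirm what the running kernel was actually booted with (for example the `amdgpu.cik_support=1 amdgpu.dc=0` from comment 6). A minimal sketch that pulls one key out of the kernel command line; `cmdline_param` is a hypothetical helper name, and if a key is repeated the last occurrence is printed.

```bash
#!/bin/bash
# Print the value of key $1 (e.g. amdgpu.dpm) from a kernel command line.
# The command line is taken from $2 if given, otherwise from /proc/cmdline.
# Prints nothing and returns 1 if the key is not present.
cmdline_param() {
  local key="$1" line="${2:-$(cat /proc/cmdline)}" word value found=1
  for word in $line; do
    case "$word" in
      "$key"=*) value=${word#*=}; found=0 ;;
    esac
  done
  [ "$found" -eq 0 ] && echo "$value"
  return "$found"
}

# Usage: cmdline_param amdgpu.dpm || echo "amdgpu.dpm not set on the command line"
```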
Comment 8 Tony Chaveiro 2018-12-21 15:56:09 UTC
(In reply to Alex Deucher from comment #7)
> Does setting amdgpu.dpm=1 on 4.19 help?  The default dpm implementation
> changed between 4.18 and 4.19.

amdgpu.dpm=1 produces a black screen (unable to boot) in both 4.18 and 4.19. Using amdgpu.dc=1 produces an error and a freeze on boot.
Comment 9 Jan Ziak 2018-12-24 11:55:44 UTC
Linux 4.20 behaves the same as Linux 4.19.
Comment 10 danglingpointerexception@gmail.com 2018-12-31 09:06:06 UTC
This bug is likely related.  Mine's a R9-290X...

https://bugs.freedesktop.org/show_bug.cgi?id=108781

I'm stuck on kernel 4.18.20
Comment 11 danglingpointerexception@gmail.com 2019-01-05 13:41:54 UTC
I found a solution for my problem and detailed it in the previous link.
My R9-290X now fully functions on 4.20.0
Comment 12 Rudolf Kastl 2019-03-09 02:44:32 UTC
Comment 10 and 11 are not related to the reported fan control issue.

I am still seeing the same problem on a r290x with amdgpu driver (and the correct firmware available in the initramfs) on 5.1.0-0.rc0.git4.2
Comment 13 Alex Smith 2019-04-17 10:37:11 UTC
I am also able to reproduce the fan control issue on an R9 290X with Fedora's kernel 5.0.3-200.fc29.x86_64.

The fans on the card do not spin up until the temperature reaches 95C, at which point they jump straight to 100%.

Same as comment 8, amdgpu.dpm=1 just gives me a blank screen at boot.
Comment 14 Steffen Klee 2019-06-06 11:18:21 UTC
Reproducible on an R9 390 with kernel 5.1.6-arch1-1-ARCH using amdgpu driver.
Setting amdgpu.dpm=1 or changing amdgpu.dc does not have any effect.

Also, sensors does not report the current fan RPM:
amdgpu-pci-1d00
Adapter: PCI adapter
vddgfx:       +1.00 V  
fan1:             N/A  (min =    0 RPM, max =    0 RPM)
temp1:        +59.0°C  (crit = +104000.0°C, hyst = -273.1°C)
power1:       47.24 W  (cap = 230.00 W)
Comment 15 danglingpointerexception@gmail.com 2019-06-06 11:44:17 UTC
@Rudolf - Have you tried my solution in the link I provided above?  I'm on 5.1.6 mainline and have no issues whatsoever with R9-290X

@Alex Smith - I've got a liquid-cooled card so don't know if my solution solves your fan problem but try following my steps in the link.

@Steffen - Try my steps in the link mate, it may solve your problem.  Alex Deucher himself gave me the tip on fixing it.
Comment 16 Steffen Klee 2019-06-06 12:18:17 UTC
(In reply to danglingpointerexception@gmail.com from comment #15)
> @Steffen - Try my steps in the link mate, it may solve your problem.  Alex
> Ducher himself gave me the tip on fixing it.

As far as I understand, you had issues with firmware loading. However, firmware is loading fine on my end:
[drm] Found UVD firmware Version: 1.64 Family ID: 9
[drm] Found VCE firmware Version: 50.10 Binary ID: 2
Comment 17 artheg 2019-06-07 01:41:39 UTC
I'm having the very same issues with my R9 290x as well.
Arch Linux, 5.1.7-arch1-1-ARCH.

(In reply to danglingpointerexception@gmail.com from comment #15)
I've tried your solution, but unfortunately it didn't work for me.
Comment 18 Sean Birkholz 2019-06-18 17:42:56 UTC
I am not a kernel developer and haven't done much programming as of late, so I am not really in a position to actually test this hypothesis.  However - from the bit of research I've done trying to figure this problem out for myself I believe the following explains the overheating and burst of fan speed instead of proper cooling behavior.

Here is my sensors bit from kernel 4.18.x - I have the R9-290.

amdgpu-pci-0100
Adapter: PCI adapter
vddgfx:           N/A  
fan1:           0 RPM
temp1:        +65.0°C  (crit = +120.0°C, hyst = +90.0°C)

Take note that this displays the proper critical and hysteresis values for my card.  If you look at the post in comment 14, which shows how sensors displays the crit/hyst values for kernels beyond 4.18.x, you will notice that the critical value is about 19x the temperature of the surface of the sun and the hyst value is absolute zero.  These values are hard-coded in the kernel source in some file; forgive me, I do not recall where I saw the code snippet.  But I strongly believe that correcting the values in that file, or changing it to detect proper crit/hyst values based on the card, will correct this issue.  I simply do not have the means to do this, nor do I know how to submit kernel bug fixes, and I hope someone with more experience could give it a shot and see if the resulting kernel functions properly.
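
For anyone wanting to compare kernels: hwmon temperature files hold millidegrees Celsius and sensors divides by 1000 for display, so the "+104000.0°C" in comment 14 corresponds to a raw temp1_crit of 104000000, and "-273.1°C" to roughly -273150. A minimal sketch that reads the raw limits and flags implausible ones. The function names and the -40..150 °C plausibility window are assumptions made for illustration; the conversion truncates rather than rounds.

```bash
#!/bin/bash
# Convert an integer millidegree Celsius value (as found in hwmon temp*
# files) to degrees with one decimal place, using integer arithmetic only.
millic_to_c() {
  local m="$1" sign=""
  if [ "$m" -lt 0 ]; then sign="-"; m=$(( -m )); fi
  printf '%s%d.%d\n' "$sign" $(( m / 1000 )) $(( (m % 1000) / 100 ))
}

# Return 0 if a millidegree value lies inside a plausible GPU limit range.
# The -40..150 degC window is an arbitrary assumption for illustration.
plausible_limit() {
  [ "$1" -ge -40000 ] && [ "$1" -le 150000 ]
}

# Report the raw crit/hyst limits found in hwmon directory $1.
check_limits() {
  local dir="$1" f v
  for f in temp1_crit temp1_crit_hyst; do
    [ -r "$dir/$f" ] || continue
    v=$(cat "$dir/$f")
    if plausible_limit "$v"; then
      echo "$f: $(millic_to_c "$v") C"
    else
      echo "$f: $(millic_to_c "$v") C (implausible)"
    fi
  done
}

# Usage: check_limits /sys/class/drm/card0/device/hwmon/hwmon2
```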
Comment 19 Sean Birkholz 2019-06-24 02:18:44 UTC
I've done a bit of digging and I've managed to get a proper hysteresis value to appear in a 5.1.14 kernel built from source.

I now have this output from sensors:

amdgpu-pci-0100
Adapter: PCI adapter
vddgfx:       +1.00 V  
fan1:             N/A  (min =    0 RPM, max =    0 RPM)
temp1:        +66.0°C  (crit = +104000.0°C, hyst = +90.0°C)
power1:       29.02 W  (cap = 208.00 W)

I don't know why proper values are not set automatically because I've found the correct values in tons of source files but none of the #defines appear to be used?  And much of the source doesn't appear to differ between 5.1.14 and 4.18.x

I modified (kernel src)/drivers/gpu/drm/amd/powerplay/inc/pp_thermal.h and changed the values of -273150 to 90000.  This corrects the hysteresis value but I'm still searching for where the critical temp value is actually set.

I *think* fixing these values may fix the fan problem, because why would a fan spin up if it's nowhere near the critical or hysteresis values?  No need.  Except when the critical value is 19x the temp of the sun, the card gets so hot that it protects itself by maxing the fans for a short burst.  That is my theory anyway; I hope to be able to test it soon, but no promises.
Comment 20 artheg 2019-10-05 13:19:08 UTC
To anyone who's still struggling with this, perhaps this would be of help:
I'm using this script (https://github.com/grmat/amdgpu-fancontrol) as a service, with these params: 

TEMPS=( 65000 75000 80000 90000 )
PWMS=(      0   190   200   255 )

Perhaps someone could tweak this a little bit better, but this works for me.
My gpu still sounds like an airplane when I'm running a benchmark like Unigine-Heaven, but at least fans are spinning now.
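
To make the parameters above concrete: assuming the script interpolates linearly between neighbouring points (the version posted in comment 38 does), a reading of 70000 sits halfway between 65000 (pwm 0) and 75000 (pwm 190) and yields pwm 95. A pure-bash sketch of that mapping; `pwm_for_temp` is a hypothetical name, and integer division truncates.

```bash
#!/bin/bash
TEMPS=( 65000 75000 80000 90000 )
PWMS=(      0   190   200   255 )

# Print the PWM value for temperature $1 (millidegrees C) by linear
# interpolation between the two surrounding points, clamped at both ends.
pwm_for_temp() {
  local t="$1" i last=$(( ${#TEMPS[@]} - 1 ))
  if [ "$t" -le "${TEMPS[0]}" ]; then echo "${PWMS[0]}"; return; fi
  if [ "$t" -ge "${TEMPS[last]}" ]; then echo "${PWMS[last]}"; return; fi
  for (( i = 1; i <= last; i++ )); do
    if [ "$t" -le "${TEMPS[i]}" ]; then
      echo $(( PWMS[i-1] + (t - TEMPS[i-1]) * (PWMS[i] - PWMS[i-1]) / (TEMPS[i] - TEMPS[i-1]) ))
      return
    fi
  done
}

# Usage: pwm_for_temp 70000
```

With this curve the requested pwm is 0 at or below 65000 and jumps steeply between 65000 and 75000, which may explain why the fans are either idle or loud.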
Comment 21 MasterCATZ 2019-11-03 04:08:50 UTC
Having similar issues with 5.3.8-050308-generic

when it starts happening this is being spammed in dmesg

amdgpu: [powerplay] 
                failed to send message 282 ret is 254


I lose write access to
/sys/class/drm/card1/device/hwmon/hwmon1/pwm1_enable

It is stuck on 2 and runs the fans super low, at 20%, causing the GPU to reach thermal meltdown at 96 deg, at which point the fan does blips of 100%.
My BIOS was modded to have a minimum fan speed of 50%, and even this is being overridden.

/sys/class/drm/card1/device/hwmon/hwmon1/pwm1
also cannot be adjusted



GRUB_CMDLINE_LINUX_DEFAULT="amdgpu.ppfeaturemask=0xffffffff amdgpu.dc=1 amdgpu.gpu_recovery=1 amdgpu.cik_support=1 amdgpu.dpm=1 radeon.cik_support=0"


If I reboot it works for a little while, allowing me to change GPU speeds and fan speeds, then I lose fan speed control and cannot get it back off auto, which seems to be set up with fan speeds way too low.


 GL_RENDERER:   AMD Radeon R9 200 Series (HAWAII, DRM 3.33.0, 5.3.8-050308-generic, LLVM 9.0.0)
    GL_VERSION:    4.5 (Compatibility Profile) Mesa 19.3.0-devel (git-ff6e148 2019-10-29 bionic-oibaf-ppa)


If I disable amdgpu.dpm I can control the fans, but then I get neither automatic GPU speeds nor manual control of them.


My only guess is that the firmware loaded by the kernel is what contains the fan speed info?
Comment 22 MasterCATZ 2019-11-03 04:16:43 UTC

(In reply to Sean Birkholz from comment #19)
> I've done a bit of digging and I've managed to get a proper hysteresis value
> to appear in a 5.1.14 kernel built from source.


> I modified (kernel src)/drivers/gpu/drm/amd/powerplay/inc/pp_thermal.h and
> changed the values of -273150 to 90000.  This corrects the hysteresis value
> but I'm still searching for where the critical temp value is actually set.


I think you hit the nail on the head 

amdgpu-pci-0100
Adapter: PCI adapter
vddgfx:       +0.90 V  
fan1:             N/A  (min =    0 RPM, max =    0 RPM)
edge:         +50.0°C  (crit = +104000.0°C, hyst = -273.1°C)
power1:       11.03 W  (cap = 208.00 W)
Comment 23 MasterCATZ 2019-11-03 04:29:35 UTC
the numbers used in the 
linux/drivers/gpu/drm/amd/powerplay
are correct as they are the values the bios uses 
but Linux is reading/using the values differently ...
Comment 24 MasterCATZ 2019-11-03 04:37:50 UTC
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

guess one of them should be able to find the issue
Comment 25 MasterCATZ 2019-11-05 23:47:54 UTC
5.4.0-050400rc6-generic

24 hours and I still have fan control 

99% chance I have just jinxed my self now
Comment 26 MasterCATZ 2019-11-06 09:35:20 UTC
around 12 hrs later lost fan control again
Comment 27 MasterCATZ 2019-11-16 00:42:35 UTC
pwmconfig
seems to be the only thing that allows me to get manual mode back on.
I wonder if this is the actual program causing grief.
Comment 28 MasterCATZ 2019-11-16 00:57:34 UTC
hmm maybe not 
it lets me briefly access manual 


Found the following PWM controls:
   hwmon1/pwm1           current value: 68
hwmon1/pwm1 is currently setup for automatic speed control.
In general, automatic mode is preferred over manual mode, as
it is more efficient and it reacts faster. Are you sure that
you want to setup this output for manual control? (n) y
hwmon1/pwm1_enable stuck to 2
Manual control mode not supported, skipping hwmon1/pwm1.

wish I knew what the heck keeps locking pwm1_enable to auto @ low speeds :@
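
The "stuck to 2" condition can also be checked by hand: write the wanted mode, read the file back, and compare. A minimal sketch of that check; `try_set_mode` is a hypothetical helper, it needs root on a real sysfs path, and it only reports the symptom without explaining why the driver refuses the change.

```bash
#!/bin/bash
# Try to write fan mode $2 (0, 1 or 2) to the pwm1_enable file $1, then
# read it back. Returns 0 if the new mode stuck, 1 otherwise.
try_set_mode() {
  local file="$1" want="$2" got
  if ! { echo "$want" > "$file"; } 2>/dev/null; then
    echo "write of $want to $file failed" >&2
    return 1
  fi
  got=$(cat "$file")
  if [ "$got" != "$want" ]; then
    echo "$file stuck at $got (wanted $want)" >&2
    return 1
  fi
  echo "$file is now $got"
}

# Usage: try_set_mode /sys/class/drm/card1/device/hwmon/hwmon1/pwm1_enable 1
```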
Comment 29 MasterCATZ 2019-11-16 12:07:35 UTC
from what I can work out the only difference between the kernel versions 
was they added extra thermal readings to support the newer cards with thermal junction sensors 


{-273150,  99000},
{ 120000, 120000},

has been in there since Jan 2018 ...

looks like it's reading the max temp settings from the BIOS
I will confirm this tomorrow when I flash a custom BIOS

/torvalds/linux/blob/master/drivers/gpu/drm/amd/powerplay/inc/hwmgr.h


/* The temperature, in 0.01 centigrades, below which we just run at a minimal PWM. */


so maybe it is thinking it can do 1000C ? 


anyhow as I don't want to run an altered bios as that would force fan 100% on boot  , what I decided to do was rip out all of AMD's new thermal code ...
Comment 30 MasterCATZ 2019-11-16 13:36:01 UTC
found them hard coded here for the R9 290 hawaii / Sea Islands chip sets 

so that will be a dirty way to get it to go 100% throttle sooner I'll set mine to 85000 and see how it goes , hopefully the rest follows  


linux/drivers/gpu/drm/amd/amdgpu/ci_dpm.c


if (adev->asic_type == CHIP_HAWAII) {
		pi->thermal_temp_setting.temperature_low = 94500;
		pi->thermal_temp_setting.temperature_high = 95000;
		pi->thermal_temp_setting.temperature_shutdown = 104000;
	} else {
		pi->thermal_temp_setting.temperature_low = 99500;
		pi->thermal_temp_setting.temperature_high = 100000;
		pi->thermal_temp_setting.temperature_shutdown = 104000;
	}
Comment 31 MasterCATZ 2019-11-16 14:15:58 UTC
oh I love it they know the drivers file is crap 

anyhow it looks like the real issue is in the GPU driver 

fan speeds, temps, everything is in there; of course this would not be an issue if
pwm1_enable  was NOT STUCK  ON AUTO 


#if 0
		/* XXX: need to figure out how to handle this properly */
		tmp = RREG32_SMC(ixCG_THERMAL_CTRL);
		tmp &= DPM_EVENT_SRC_MASK;
		tmp |= DPM_EVENT_SRC(dpm_event_src);
		WREG32_SMC(ixCG_THERMAL_CTRL, tmp);
#endif
Comment 32 MasterCATZ 2019-11-16 22:35:07 UTC
apparently I was looking through kernel 4.7 code on my pc and not master
linux/drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c
looks like the new file name 

as they relocated ci_dpm.c to 
/home/aio/Programs/linux/drivers/gpu/drm/radeon/ci_dpm.c
Comment 33 MasterCATZ 2019-11-17 00:43:11 UTC
another plan 

drivers/gpu/drm/amd/powerplay/hwmgr/hwmgr.c 

hwmgr->dpm_level = AMD_DPM_FORCED_LEVEL_AUTO;
	hwmgr_init_default_caps(hwmgr);
	hwmgr_set_user_specify_caps(hwmgr);
	hwmgr->fan_ctrl_is_in_default_mode = true;

change to false to disable auto .. not like its going to be any worse for us 

then GPU's thermal system will run and you can actually manually run the fans 
but unsure if this will stop auto core speed power save features as well
Comment 34 MasterCATZ 2019-11-17 07:46:02 UTC
success 

drivers/gpu/drm/amd/powerplay/hwmgr/hwmgr.c 

hwmgr->fan_ctrl_is_in_default_mode = false;

it will now boot up in manual mode


finally I have fan control "AMD_DPM_FORCED_LEVEL_AUTO"  
I am wondering just how "FORCED" that "AUTO" is meant to be ....

However, once you put it back to "2" "AUTO" it takes control again ... and will overwrite your "0" card control and "1" manual

echo 2 >  /sys/class/drm/card1/device/hwmon/hwmon1/pwm1_enable 
don't do it unless you want to reboot with a hot GPU :P 



Also, the crit temp for "Sea Island" cards like my R9 290 is definitely being retrieved from
drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c 

thermal_temp_setting.temperature_shutdown


if (hwmgr->chip_id  == CHIP_HAWAII) {
		data->thermal_temp_setting.temperature_low = 74500;
		data->thermal_temp_setting.temperature_high = 80000;
		data->thermal_temp_setting.temperature_shutdown = 98000;




and the fans still spin slowly regardless of how low I set it, so something's broken. It looks like I will be building a custom kernel on every update for a while now to disable AUTO fan control.

And for some reason AMD devs feel 120 deg is NORMAL for a GPU and that users want quiet fans ... I give up ...
Comment 35 michael 2019-12-05 17:00:31 UTC
I discovered a workaround that works for my R9-290 and Debian 5.3.0 kernel:

  root@joyola:~# echo "2" >>/sys/class/drm/card0/device/hwmon/hwmon3/pwm1_enable 
  root@joyola:~# echo "0" >>/sys/class/drm/card0/device/hwmon/hwmon3/pwm1_enable 

pwm1_enable will still be 2 afterwards, but (after spinning the fans at max for a bit) automatic fan control works for me. I also have to do the same pwm1_enable prodding after resuming from suspend.

(If it matters, I boot with radeon.cik_support=0 amdgpu.cik_support=1 radeon.si_support=0 amdgpu.si_support=1 amdgpu.dc_log=1 amdgpu.gpu_recovery=1)

I still have the same brokenness as reported in comment 14 though.
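
Since the two writes have to be repeated after every resume, they could be wrapped in a small function. This is only a sketch of the workaround described in this comment; the name `prod_fan_control` and the optional pause between the writes are additions for illustration, and `>` is used instead of `>>`, which should make no difference on a sysfs attribute.

```bash
#!/bin/bash
# Apply the workaround to the pwm1_enable file given as $1:
# write 2 (automatic), then 0. Needs root on a real sysfs path.
# Optional $2 is a pause in seconds between the writes (default 0); the
# pause is an assumption, the original workaround does not use one.
prod_fan_control() {
  local file="$1" pause="${2:-0}"
  [ -w "$file" ] || { echo "cannot write $file" >&2; return 1; }
  echo 2 > "$file" || return 1
  sleep "$pause"
  echo 0 > "$file" || return 1
  echo "pwm1_enable now reads: $(cat "$file")"
}

# Usage: prod_fan_control /sys/class/drm/card0/device/hwmon/hwmon3/pwm1_enable
```

On the real device the final read may still show 2, as described above; the function just prints whatever the driver reports.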
Comment 36 MasterCATZ 2019-12-06 02:58:05 UTC
After having good fan control for a few weeks,
5.4.2-050402-generic is now having a meltdown, back to trying to run the cards at
 (crit = +104000.0°C, hyst = -273.1°C)

And this is what has me stumped: it seems to go auto when it hits a high temp (~70), then starts dropping the fan speed. I can exit a game and set a high fan speed, and it will sit there at 60 deg for a good 20 mins at ~60%. When I go back into the game it hits 70, and the fan speed keeps dropping until it is at 20% and blipping to 100% at 95 deg.

I am very close to going back to liquid cooling, or connecting the fan to a manual speed controller. If someone knows of a way I can keep the fan connected for driver control and monitoring, with a manual device override for PWM, I am all ears. Would it be safe for me to just use a thermostat to send voltage to the fan, i.e. 2x input power sources?



My guess is that base or asic is what it's reading now; about to hack away at those modules and try again.

/home/aio/Programs/linux/drivers/gpu/drm/i915/oa/i915_oa_tgl.c
/home/aio/Programs/linux/drivers/gpu/drm/amd/include/asic_reg/vce/vce_4_0_default.h
/home/aio/Programs/linux/drivers/gpu/drm/nouveau/nvkm/engine/ce/gf100.c
/home/aio/Programs/linux/drivers/gpu/drm/nouveau/nvkm/engine/ce/gt215.c
/home/aio/Programs/linux/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c
Comment 37 MasterCATZ 2019-12-06 03:04:09 UTC
well its neither of those modules 
I should have looked at the files after I scanned for files containing 104000


I can not even force run the cards in performance mode anymore with 100% fan speed stuck on 

if i could just find the setting to tell amdgpu / hwmon / powerplay what temp I call hot this would be solved
Comment 38 Robert M. Muncrief 2019-12-06 03:22:25 UTC
(In reply to MasterCATZ from comment #37)
> well its neither of those modules 
> I should have looked at the files after I scanned for files containing 104000
> 
> 
> I can not even force run the cards in performance mode anymore with 100% fan
> speed stuck on 
> 
> if i could just find the setting to tell amdgpu / hwmon / powerplay what
> temp I call hot this would be solved

Here is a slightly modified version of a fan control script, along with the service to run it, from the Arch Wiki. I don't know what distribution you use but hopefully this will at least get you started. Unfortunately it doesn't seem like the kernel devs are interested in fixing this, so after a long time I just had to use this kludgey solution.

1. Create a file with the following contents named "amdgpu-fancontrol" in "/usr/local/bin" and make it executable.

--------------- Start amdgpu-fancontrol ---------------

#!/bin/bash

HYSTERESIS=6000   # in mK
SLEEP_INTERVAL=1  # in s
DEBUG=true

# set temps (in degrees C * 1000) and corresponding pwm values in ascending order and with the same amount of values
TEMPS=( 40000  50000  65000 75000 80000 90000 )
PWMS=(      0  100     140   190   200   255 )

# hwmon paths, hardcoded for one amdgpu card, adjust as needed
HWMON=$(ls /sys/class/drm/card0/device/hwmon)
FILE_PWM=$(echo /sys/class/drm/card0/device/hwmon/$HWMON/pwm1)
FILE_FANMODE=$(echo /sys/class/drm/card0/device/hwmon/$HWMON/pwm1_enable)
FILE_TEMP=$(echo /sys/class/drm/card0/device/hwmon/$HWMON/temp1_input)
# might want to use this later
#FILE_TEMP_CRIT=$(echo /sys/class/hwmon/hwmon?/temp1_crit_hyst)
[[ -f "$FILE_PWM" && -f "$FILE_FANMODE" && -f "$FILE_TEMP" ]] || { echo "invalid hwmon files" ; exit 1; }

# load configuration file if present
[ -f /etc/amdgpu-fancontrol.cfg ] && . /etc/amdgpu-fancontrol.cfg

# check if amount of temps and pwm values match
if [ "${#TEMPS[@]}" -ne "${#PWMS[@]}" ]
then
  echo "Amount of temperature and pwm values does not match"
  exit 1
fi

# checking for privileges
if [ $UID -ne 0 ]
then
  echo "Writing to sysfs requires privileges, relaunch as root"
  exit 1
fi

function debug {
  if $DEBUG; then
    echo "$1"
  fi
}

# set fan mode to max(0), manual(1) or auto(2)
function set_fanmode {
  echo "setting fan mode to $1"
  echo "$1" > "$FILE_FANMODE"
}

function set_pwm {
  NEW_PWM=$1
  OLD_PWM=$(cat $FILE_PWM)

  echo "current pwm: $OLD_PWM, requested to set pwm to $NEW_PWM"
  debug "current pwm: $OLD_PWM, requested to set pwm to $NEW_PWM"
  if [ $(cat ${FILE_FANMODE}) -ne 1 ]
  then
    echo "Fanmode not set to manual."
    set_fanmode 1
  fi

  if [ "$NEW_PWM" -gt "$OLD_PWM" ] || [ -z "$TEMP_AT_LAST_PWM_CHANGE" ] || [ $(($(cat $FILE_TEMP) + HYSTERESIS)) -le "$TEMP_AT_LAST_PWM_CHANGE" ]; then
    $DEBUG || echo "current temp: $TEMP"
    echo "temp at last change was $TEMP_AT_LAST_PWM_CHANGE"
    echo "changing pwm to $NEW_PWM"
    echo "$NEW_PWM" > "$FILE_PWM"
    TEMP_AT_LAST_PWM_CHANGE=$(cat $FILE_TEMP)
  else
    debug "not changing pwm, we just did at $TEMP_AT_LAST_PWM_CHANGE, next change when below $((TEMP_AT_LAST_PWM_CHANGE - HYSTERESIS))"
  fi
}

function interpolate_pwm {
  i=0
  TEMP=$(cat $FILE_TEMP)

  debug "current temp: $TEMP"

  if [[ $TEMP -le ${TEMPS[0]} ]]; then
    # below first point in list, set to min speed
    set_pwm "${PWMS[i]}"
    return
  fi

  for i in "${!TEMPS[@]}"; do
    if [[ $i -eq $((${#TEMPS[@]}-1)) ]]; then
      # hit last point in list, set to max speed
      set_pwm "${PWMS[i]}"
      return
    elif [[ $TEMP -gt ${TEMPS[$i]} ]]; then
      continue
    fi

    # interpolate linearly
    LOWERTEMP=${TEMPS[i-1]}
    HIGHERTEMP=${TEMPS[i]}
    LOWERPWM=${PWMS[i-1]}
    HIGHERPWM=${PWMS[i]}
    PWM=$(echo "( ( $TEMP - $LOWERTEMP ) * ( $HIGHERPWM - $LOWERPWM ) / ( $HIGHERTEMP - $LOWERTEMP ) ) + $LOWERPWM" | bc)
    debug "interpolated pwm value for temperature $TEMP is: $PWM"
    set_pwm "$PWM"
    return
  done
}

function reset_on_fail {
  echo "exiting, resetting fan to auto control..."
  set_fanmode 2
  exit 1
}

# always try to reset fans on exit
trap "reset_on_fail" SIGINT SIGTERM

function run_daemon {
  while :; do
    interpolate_pwm
    debug
    sleep $SLEEP_INTERVAL
  done
}

# set fan control to manual
set_fanmode 1

# finally start the loop
run_daemon

--------------- End amdgpu-fancontrol ---------------


2. Create a file with the following contents named "amdgpu-fancontrol.service" in /etc/systemd/system.

--------------- Start amdgpu-fancontrol.service ---------------

[Unit]
Description=amdgpu-fancontrol

[Service]
Type=simple
ExecStart=/usr/local/bin/amdgpu-fancontrol

[Install]
WantedBy=multi-user.target

--------------- End amdgpu-fancontrol.service ---------------

3. Here's how to enable, start, and get the status of the fan control service:

sudo systemctl enable amdgpu-fancontrol
sudo systemctl start amdgpu-fancontrol
sudo systemctl status amdgpu-fancontrol
Comment 39 MasterCATZ 2019-12-06 04:30:52 UTC
will not work , /sys/class/drm/card1/device/hwmon/hwmon1/pwm1_enable 
is locked to Auto 




[28455.094113] manual fan speed control should be enabled first
[28473.077182] manual fan speed control should be enabled first
[28480.086754] manual fan speed control should be enabled first
[28498.073701] manual fan speed control should be enabled first
[28499.095753] manual fan speed control should be enabled first
[28512.086404] manual fan speed control should be enabled first
[28525.077255] manual fan speed control should be enabled first
[28529.080955] manual fan speed control should be enabled first
[28530.070058] manual fan speed control should be enabled first
[28839.107591] manual fan speed control should be enabled first
[28840.099633] manual fan speed control should be enabled first
[28842.083214] manual fan speed control should be enabled first
[28890.089742] manual fan speed control should be enabled first
[28896.099884] manual fan speed control should be enabled first
[28902.081972] manual fan speed control should be enabled first
[28909.093220] manual fan speed control should be enabled first
[28927.107978] manual fan speed control should be enabled first
[28950.085450] manual fan speed control should be enabled first
[28979.116690] manual fan speed control should be enabled first
[28982.086568] manual fan speed control should be enabled first
[29004.103327] manual fan speed control should be enabled first
[29040.104962] manual fan speed control should be enabled first
[29066.095979] manual fan speed control should be enabled first
[29077.113080] manual fan speed control should be enabled first
[29086.091060] manual fan speed control should be enabled first
[29096.113497] manual fan speed control should be enabled first
[29111.123447] manual fan speed control should be enabled first
[29123.117578] manual fan speed control should be enabled first
[29126.092675] manual fan speed control should be enabled first
[29148.109806] manual fan speed control should be enabled first
[29155.119475] manual fan speed control should be enabled first
[29168.111159] manual fan speed control should be enabled first
[29170.094539] manual fan speed control should be enabled first
[29187.119961] manual fan speed control should be enabled first
[29196.113113] manual fan speed control should be enabled first
[29199.119590] manual fan speed control should be enabled first
[29211.126157] manual fan speed control should be enabled first
[29214.098257] manual fan speed control should be enabled first
[29217.107755] manual fan speed control should be enabled first
[29229.115177] manual fan speed control should be enabled first
[29242.097319] manual fan speed control should be enabled first
[29325.114063] manual fan speed control should be enabled first
[29333.108686] manual fan speed control should be enabled first
[29449.116469] manual fan speed control should be enabled first
[29455.132518] manual fan speed control should be enabled first
[29471.129284] manual fan speed control should be enabled first
[29480.121633] manual fan speed control should be enabled first
[29640.125839] manual fan speed control should be enabled first
[29981.128248] manual fan speed control should be enabled first
[30199.151363] manual fan speed control should be enabled first
[30204.143080] manual fan speed control should be enabled first
[30211.154484] manual fan speed control should be enabled first
[30226.128368] manual fan speed control should be enabled first
[30228.145612] manual fan speed control should be enabled first
[30236.144778] manual fan speed control should be enabled first
[30243.149198] manual fan speed control should be enabled first
[30245.134568] manual fan speed control should be enabled first
[30248.140668] manual fan speed control should be enabled first
[30362.126900] manual fan speed control should be enabled first
[30909.144940] manual fan speed control should be enabled first
[30910.137533] manual fan speed control should be enabled first
[30920.163730] manual fan speed control should be enabled first
[30931.161975] manual fan speed control should be enabled first
[30932.158340] manual fan speed control should be enabled first
[30933.147783] manual fan speed control should be enabled first
[30944.159956] manual fan speed control should be enabled first
[30958.138767] manual fan speed control should be enabled first
[30977.151665] manual fan speed control should be enabled first
[30996.157518] manual fan speed control should be enabled first
[31025.147100] manual fan speed control should be enabled first
[31029.149391] manual fan speed control should be enabled first
[31030.148760] manual fan speed control should be enabled first


And the echo 0 >  /sys/class/drm/card1/device/hwmon/hwmon1/pwm1_enable
trick to get 100% only works in the low power state; as soon as core speeds go up, fan speeds drop ...
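
A log like the one above (72 lines spanning roughly 43 minutes) is easier to compare between runs when summarised. A minimal sketch that counts the message and reports the first and last timestamp; `summarise_fan_errors` is a hypothetical name, and it assumes the default dmesg format where the first decimal number on a line is the timestamp.

```bash
#!/bin/bash
# Read kernel log lines on stdin and summarise how often the
# "manual fan speed control should be enabled first" message appears,
# together with the first and last timestamp seen.
summarise_fan_errors() {
  awk '
    /manual fan speed control should be enabled first/ {
      ts = "?"
      if (match($0, /[0-9]+\.[0-9]+/)) ts = substr($0, RSTART, RLENGTH)
      if (n == 0) first = ts
      last = ts
      n++
    }
    END {
      if (n) printf "%d messages between %ss and %ss\n", n, first, last
      else   print "no messages"
    }'
}

# Usage: dmesg | summarise_fan_errors
```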
Comment 40 MasterCATZ 2019-12-06 04:32:46 UTC
aio@aio:~$ sudo pwmconfig
[sudo] password for aio: 
# pwmconfig revision $Revision$ ($Date$)
This program will search your sensors for pulse width modulation (pwm)
controls, and test each one to see if it controls a fan on
your motherboard. Note that many motherboards do not have pwm
circuitry installed, even if your sensor chip supports pwm.

We will attempt to briefly stop each fan using the pwm controls.
The program will attempt to restore each fan to full speed
after testing. However, it is ** very important ** that you
physically verify that the fans have been to full speed
after the program has completed.

Found the following devices:
   hwmon0 is acpitz
   hwmon1 is amdgpu
   hwmon2 is coretemp
   hwmon3 is it8620
   hwmon4 is it8792

Found the following PWM controls:
   hwmon1/pwm1           current value: 104
hwmon1/pwm1 is currently setup for automatic speed control.
In general, automatic mode is preferred over manual mode, as
it is more efficient and it reacts faster. Are you sure that
you want to setup this output for manual control? (n) y
hwmon1/pwm1_enable stuck to 2
Manual control mode not supported, skipping hwmon1/pwm1.
Comment 41 MasterCATZ 2019-12-06 04:45:45 UTC
aio@aio:/usr/local/bin$ sudo systemctl status amdgpu-fancontrol
● amdgpu-fancontrol.service - amdgpu-fancontrol
   Loaded: loaded (/etc/systemd/system/amdgpu-fancontrol.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2019-12-06 14:45:07 AEST; 3s ago
 Main PID: 23922 (amdgpu-fancontr)
    Tasks: 2 (limit: 4915)
   Memory: 3.3M
   CGroup: /system.slice/amdgpu-fancontrol.service
           ├─23922 /bin/bash /usr/local/bin/amdgpu-fancontrol
           └─23979 sleep 1

Dec 06 14:45:08 aio amdgpu-fancontrol[23922]: changing pwm to 175
Dec 06 14:45:08 aio amdgpu-fancontrol[23922]: /usr/local/bin/amdgpu-fancontrol: line 65: echo: write error: Invalid argument
Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: current temp: 62000
Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: interpolated pwm value for temperature 62000 is: 175
Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: current pwm: 104, requested to set pwm to 175
Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: Fanmode not set to manual.
Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: setting fan mode to 1
Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: temp at last change was 62000
Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: changing pwm to 175
Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: /usr/local/bin/amdgpu-fancontrol: line 65: echo: write error: Invalid argument
Comment 42 Robert M. Muncrief 2019-12-06 05:01:57 UTC
(In reply to MasterCATZ from comment #41)
> aio@aio:/usr/local/bin$ sudo systemctl status amdgpu-fancontrol
> ● amdgpu-fancontrol.service - amdgpu-fancontrol
>    Loaded: loaded (/etc/systemd/system/amdgpu-fancontrol.service; enabled;
> vendor preset: enabled)
>    Active: active (running) since Fri 2019-12-06 14:45:07 AEST; 3s ago
>  Main PID: 23922 (amdgpu-fancontr)
>     Tasks: 2 (limit: 4915)
>    Memory: 3.3M
>    CGroup: /system.slice/amdgpu-fancontrol.service
>            ├─23922 /bin/bash /usr/local/bin/amdgpu-fancontrol
>            └─23979 sleep 1
> 
> Dec 06 14:45:08 aio amdgpu-fancontrol[23922]: changing pwm to 175
> Dec 06 14:45:08 aio amdgpu-fancontrol[23922]:
> /usr/local/bin/amdgpu-fancontrol: line 65: echo: write error: Invalid
> argument
> Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: current temp: 62000
> Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: interpolated pwm value for
> temperature 62000 is: 175
> Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: current pwm: 104, requested to
> set pwm to 175
> Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: Fanmode not set to manual.
> Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: setting fan mode to 1
> Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: temp at last change was 62000
> Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: changing pwm to 175
> Dec 06 14:45:09 aio amdgpu-fancontrol[23922]:
> /usr/local/bin/amdgpu-fancontrol: line 65: echo: write error: Invalid
> argument

I was about to call it a day when I got your email notifications. The line it's talking about is:

echo "$NEW_PWM" > "$FILE_PWM"

So it looks like the "$FILE_PWM" variable is not valid. Remember, you have to change the variables under the comment "hwmon paths, hardcoded for one amdgpu card, adjust as needed" to whatever your system requires. To debug the variables I would execute the 4 lines that set HWMON, FILE_PWM, FILE_FANMODE, and FILE_TEMP from terminal and see where things are going wrong. I have to go now but I'll try to help you more tomorrow if you're still having problems. But once you have those variables set correctly the script should work. Here's what the service status output looks like on my system:

   Loaded: loaded (/etc/systemd/system/amdgpu-fancontrol.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2019-12-05 18:16:27 PST; 2h 28min ago
 Main PID: 836 (amdgpu-fancontr)
    Tasks: 2 (limit: 4915)
   Memory: 7.7M
   CGroup: /system.slice/amdgpu-fancontrol.service
           ├─  836 /bin/bash /usr/local/bin/amdgpu-fancontrol
           └─14235 sleep 1

Dec 05 20:44:46 Entropod amdgpu-fancontrol[836]: changing pwm to 80
Dec 05 20:44:47 Entropod amdgpu-fancontrol[836]: current temp: 49000
Dec 05 20:44:47 Entropod amdgpu-fancontrol[836]: interpolated pwm value for temperature 49000 is: 90
Dec 05 20:44:47 Entropod amdgpu-fancontrol[836]: current pwm: 76, requested to set pwm to 90
Dec 05 20:44:47 Entropod amdgpu-fancontrol[836]: temp at last change was 48000
Dec 05 20:44:47 Entropod amdgpu-fancontrol[836]: changing pwm to 90
Dec 05 20:44:48 Entropod amdgpu-fancontrol[836]: current temp: 48000
Dec 05 20:44:48 Entropod amdgpu-fancontrol[836]: interpolated pwm value for temperature 48000 is: 80
Dec 05 20:44:48 Entropod amdgpu-fancontrol[836]: current pwm: 86, requested to set pwm to 80
Dec 05 20:44:48 Entropod amdgpu-fancontrol[836]: not changing pwm, we just did at 49000, next change when below 43000
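
The variable check suggested above (run the lines that set HWMON, FILE_PWM, FILE_FANMODE, and FILE_TEMP, then see where things go wrong) could look roughly like this sketch. The helper name show_fan_paths is hypothetical and not part of amdgpu-fancontrol; it only prints each path with its current value, or flags it as missing.

```shell
# Hypothetical helper (not part of amdgpu-fancontrol): print every given
# path together with its current content, or report it as missing.
show_fan_paths() {
  local status=0 path
  for path in "$@"; do
    if [ -f "$path" ] && [ -r "$path" ]; then
      printf '%s = %s\n' "$path" "$(cat "$path")"
    else
      printf 'missing or unreadable: %s\n' "$path"
      status=1
    fi
  done
  return "$status"
}

# Example, after executing the script's HWMON/FILE_* lines in a terminal:
# show_fan_paths "$FILE_PWM" "$FILE_FANMODE" "$FILE_TEMP"
```

If all three paths print a value, the variables are fine and the write error has to come from somewhere else.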
Comment 43 MasterCATZ 2019-12-06 05:52:49 UTC
The file is correct, and you can tell because the script is reading values from it, e.g. "current pwm: 76".

The error occurs because nothing is allowed to change pwm1_enable. It is stuck on auto, so nothing can manually change pwm1.

But if there is an error in my adjustments, let me know.


# hwmon paths, hardcoded for one amdgpu card, adjust as needed
HWMON=$(ls /sys/class/drm/card1/device/hwmon/hwmon1)
FILE_PWM=$(echo /sys/class/drm/card1/device/hwmon/hwmon1/pwm1)
FILE_FANMODE=$(echo /sys/class/drm/card1/device/hwmon/hwmon1/pwm1_enable)
FILE_TEMP=$(echo /sys/class/drm/card1/device/hwmon/hwmon1/temp1_input)
Comment 44 MasterCATZ 2019-12-06 05:56:50 UTC
aio@aio:~$ ls /sys/class/drm/card1/device/hwmon/hwmon1
device       freq1_input  name            pwm1         temp1_crit_hyst
fan1_enable  freq1_label  power           pwm1_enable  temp1_input
fan1_input   freq2_input  power1_average  pwm1_max     temp1_label
fan1_max     freq2_label  power1_cap      pwm1_min     uevent
fan1_min     in0_input    power1_cap_max  subsystem
fan1_target  in0_label    power1_cap_min  temp1_crit
aio@aio:~$ cat  /sys/class/drm/card1/device/hwmon/hwmon1/pwm1
68
aio@aio:~$ cat  /sys/class/drm/card1/device/hwmon/hwmon1/pwm1_enable
2
aio@aio:~$ cat  /sys/class/drm/card1/device/hwmon/hwmon1/temp1_input
54000
aio@aio:~$
Comment 45 Robert M. Muncrief 2019-12-07 19:48:10 UTC
(In reply to MasterCATZ from comment #43)
> the file is correct .. and you can tell that because its reading the temp
> "current pwm: 76"
> 
> error is because NOTHING is being allowed to edit pwm1_enable it is stuck on
> auto so nothing can manually change pwm1
> 
> 
> 
> but if their is an error in my adjustments let me know 
> 
> 
> # hwmon paths, hardcoded for one amdgpu card, adjust as needed
> HWMON=$(ls /sys/class/drm/card1/device/hwmon/hwmon1)
> FILE_PWM=$(echo /sys/class/drm/card1/device/hwmon/hwmon1/pwm1)
> FILE_FANMODE=$(echo /sys/class/drm/card1/device/hwmon/hwmon1/pwm1_enable)
> FILE_TEMP=$(echo /sys/class/drm/card1/device/hwmon/hwmon1/temp1_input)

Your variables are set wrong. If your GPU is card1 they should be:

HWMON=$(ls /sys/class/drm/card1/device/hwmon)
FILE_PWM=$(echo /sys/class/drm/card1/device/hwmon/$HWMON/pwm1)
FILE_FANMODE=$(echo /sys/class/drm/card1/device/hwmon/$HWMON/pwm1_enable)
FILE_TEMP=$(echo /sys/class/drm/card1/device/hwmon/$HWMON/temp1_input)


The "HWMON" variable is there to determine which actual hardware monitor is being used because it can change whenever you boot. One time it could be hwmon1, the next time hwmon3, etc. So you can't hard-code it as you're doing. You have to use the $HWMON variable to set FILE_PWM, FILE_FANMODE, and FILE_TEMP.
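
Since the hwmon number can change between boots, another option is to pick the directory by the content of its "name" file instead of by number. This is only a sketch: find_hwmon_dir is a hypothetical helper, and it assumes the amdgpu hwmon directory contains a "name" file reading "amdgpu" (the directory listing in comment 44 does show a "name" file).

```shell
# Hypothetical helper: print the first BASE/hwmon* directory whose "name"
# file matches NAME; return 1 if none matches.
find_hwmon_dir() {
  local base="$1" want="$2" dir
  for dir in "$base"/hwmon*; do
    if [ -r "$dir/name" ] && [ "$(cat "$dir/name")" = "$want" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
  done
  return 1
}

# Example, assuming the GPU is card1 as in the comments above:
# HWMON_DIR=$(find_hwmon_dir /sys/class/drm/card1/device/hwmon amdgpu) || exit 1
# FILE_PWM=$HWMON_DIR/pwm1
# FILE_FANMODE=$HWMON_DIR/pwm1_enable
# FILE_TEMP=$HWMON_DIR/temp1_input
```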
Comment 46 Jan Ziak 2019-12-07 20:05:22 UTC
There is also the possibility to use question marks in the path:

/sys/class/drm/card?/device/hwmon/hwmon?
Comment 47 Robert M. Muncrief 2019-12-07 20:20:08 UTC
(In reply to Jan Ziak (http://atom-symbol.net) from comment #46)
> There is also the possibility to use question marks in the path:
> 
> /sys/class/drm/card?/device/hwmon/hwmon?

Thank you for mentioning that. If you only have one GPU that will indeed work. I have multiple GPUs, one Nvidia and one AMD, so I have to hard-code the card.
Comment 48 Jan Ziak 2019-12-07 20:46:34 UTC
(In reply to muncrief from comment #47)
> (In reply to Jan Ziak (http://atom-symbol.net) from comment #46)
> > There is also the possibility to use question marks in the path:
> > 
> > /sys/class/drm/card?/device/hwmon/hwmon?
> 
> Thank you for mentioning that. If you only have one GPU that will indeed
> work. I have multiple GPUs, one Nvidia and one AMD, so I have to hard-code
> the card.

Maybe you can use the PCI ID of the device:

FOUND=false
for CARD in /sys/class/drm/card?; do
  DEVICE="$(cat "$CARD/device/device")"
  if [[ "${DEVICE,,}" == 0x67b1 ]]; then
    FOUND=true
    break
  fi
done
$FOUND || exit 1
HWMON=$(echo "$CARD"/device/hwmon/hwmon?)
echo "$HWMON"
Comment 49 Robert M. Muncrief 2019-12-07 21:04:07 UTC
(In reply to Jan Ziak (http://atom-symbol.net) from comment #48)
> (In reply to muncrief from comment #47)
> > (In reply to Jan Ziak (http://atom-symbol.net) from comment #46)
> > > There is also the possibility to use question marks in the path:
> > > 
> > > /sys/class/drm/card?/device/hwmon/hwmon?
> > 
> > Thank you for mentioning that. If you only have one GPU that will indeed
> > work. I have multiple GPUs, one Nvidia and one AMD, so I have to hard-code
> > the card.
> 
> Maybe you can use the PCI ID of the device:
> 
> FOUND=false
> for CARD in /sys/class/drm/card?; do
>   DEVICE="$(cat "$CARD/device/device")"
>   if [[ "${DEVICE,,}" == 0x67b1 ]]; then
>     FOUND=true
>     break
>   fi
> done
> $FOUND || exit 1
> HWMON=$CARD/device/hwmon/hwmon?
> echo $HWMON

Well, my system works great the way it is and I don't really have time to do any further debugging or redesign. I'm just trying to help MasterCATZ get things going. However that's another great way to determine where a specific card is, thank you for the multiple great suggestions!

It's great to see so many people trying to help, we need more of that in Linux, especially with Arch and its derivative distros. It's very irritating and frustrating when I see experienced users simply tell others to "read the wiki", or expect them to use Linux for two years before they can have a usable installation.

In fact that kind of old, outdated, and downright mean attitude is one of the reasons Linux still has such a low share of the desktop market. So whenever I see someone who needs help I try to make it as easy as I can for them, and have even been insulted numerous times by the cruel people who are angered that I don't just tell others to get a PhD or something :)
Comment 50 MasterCATZ 2019-12-09 00:33:12 UTC
Thanks for the correction, I was unsure whether $HWMON knew to go to hwmon1.

It works until the GPU hits 70 deg. Then something forces pwm1_enable to auto and starts ramping the fan speed down until it is at 20% at 90+ deg, blipping to 100% at 95 deg.



For now, all I can do is run a custom BIOS with 800 memory speed and 850 core, keep toggling between standard and performance mode to reset the fan speed to 100% (and redo that every time it drops back below 40%), and set /sys/class/drm/card1/device/hwmon/hwmon1/power1_cap to under 140 W so the GPU does not cook.

So unless it is "Radeon Profile" doing something that gets me locked out, I have no idea.

its fan profile should be 
over 70 deg 1:1 ratio 
under 60 deg 50%
under 50 deg 10%
under 40 deg 5%
under 20 deg 0%

Is there any way to find out what is accessing pwm1_enable?
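
A polling loop cannot tell who writes to pwm1_enable, but it can at least timestamp the moment the value flips, so the change can be lined up with the temperature log or the kernel log. This is a minimal sketch; watch_value is a hypothetical helper and the path in the example is the one used in this comment.

```shell
# Hypothetical helper: read FILE COUNT times, INTERVAL seconds apart, and
# print a timestamped line whenever the value differs from the last read.
watch_value() {
  local file="$1" count="$2" interval="$3" last="" now i=0
  while [ "$i" -lt "$count" ]; do
    now=$(cat "$file")
    if [ "$now" != "$last" ]; then
      printf '%s %s -> %s\n' "$(date '+%H:%M:%S')" "${last:-?}" "$now"
      last="$now"
    fi
    i=$((i + 1))
    sleep "$interval"
  done
}

# Example: poll once per second for an hour.
# watch_value /sys/class/drm/card1/device/hwmon/hwmon1/pwm1_enable 3600 1
```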
Comment 51 MasterCATZ 2019-12-09 00:55:47 UTC
current temp: 61000
interpolated pwm value for temperature 61000 is: 170
current pwm: 165, requested to set pwm to 170
current pwm: 165, requested to set pwm to 170
temp at last change was 61000
changing pwm to 170

current temp: 71000
current pwm: 255, requested to set pwm to 255
current pwm: 255, requested to set pwm to 255
not changing pwm, we just did at 71000, next change when below 66000


current temp: 73000
current pwm: 68, requested to set pwm to 255
current pwm: 68, requested to set pwm to 255
Fanmode not set to manual.
setting fan mode to 1
temp at last change was 73000
changing pwm to 255
/usr/local/bin/amdgpu-fancontrol: line 65: echo: write error: Invalid argument




current temp: 87000
current pwm: 124, requested to set pwm to 255
current pwm: 124, requested to set pwm to 255
Fanmode not set to manual.
setting fan mode to 1
temp at last change was 87000
changing pwm to 255
/usr/local/bin/amdgpu-fancontrol: line 65: echo: write error: Invalid argument
Comment 52 Robert M. Muncrief 2019-12-09 21:26:54 UTC
(In reply to MasterCATZ from comment #51)
> current temp: 61000
> interpolated pwm value for temperature 61000 is: 170
> current pwm: 165, requested to set pwm to 170
> current pwm: 165, requested to set pwm to 170
> temp at last change was 61000
> changing pwm to 170
> 
> current temp: 71000
> current pwm: 255, requested to set pwm to 255
> current pwm: 255, requested to set pwm to 255
> not changing pwm, we just did at 71000, next change when below 66000
> 
> 
> current temp: 73000
> current pwm: 68, requested to set pwm to 255
> current pwm: 68, requested to set pwm to 255
> Fanmode not set to manual.
> setting fan mode to 1
> temp at last change was 73000
> changing pwm to 255
> /usr/local/bin/amdgpu-fancontrol: line 65: echo: write error: Invalid
> argument
> 
> 
> 
> 
> current temp: 87000
> current pwm: 124, requested to set pwm to 255
> current pwm: 124, requested to set pwm to 255
> Fanmode not set to manual.
> setting fan mode to 1
> temp at last change was 87000
> changing pwm to 255
> /usr/local/bin/amdgpu-fancontrol: line 65: echo: write error: Invalid
> argument

Well, that's certainly quite bizarre. I wish I could think of something else but I'm stumped. I've never experienced that problem on my system, and I don't know why yours isn't allowing the write. Is it possible there was some error in copying the script? It seems unlikely but that's all I can come up with at this point. If you have somewhere I can send my actual script and service files I'd be happy to send them to you. Otherwise I'm just out of ideas.
Comment 53 MasterCATZ 2019-12-10 01:03:57 UTC
It has been like this since the mid-4.x kernels. I just wish I knew what is locking that file: even root has no permission to change it, and the lock seems to kick in at 70 deg, which will be reached even with the fan at 100% unless I underclock.

amdgpu-pro just turns the PC into a paperweight, so I don't use it, and the radeon driver is poor for any gaming. amdgpu with Mesa is what I need to use, and it has been like this since powerplay was introduced.

I was on Ubuntu 18.04 and just upgraded to 19.10, with the same issues. I am currently using 5.4.2-050402-generic.

GRUB_CMDLINE_LINUX_DEFAULT="amdgpu.ppfeaturemask=0xfffd7fff amdgpu.ppfeaturemask=0xffffffff amdgpu.dc=1 amdgpu.cik_support=1 radeon.cik_support=0"

The feature masks seem to make no difference. Maybe I should re-add

radeon.si_support=0  amdgpu.si_support=1

since Radeon Profile is showing that radeonsi is in use? But I thought the R9 290 was Sea Islands, i.e. amdgpu.cik_support=1?
Comment 54 michael 2020-05-21 15:31:24 UTC
It seems that since kernel 5.6 (or at least Debian's version thereof), I no longer need to fiddle with /sys/class/drm/card0/device/hwmon/hwmon3/pwm1_enable. The default value (1) seems to do the right thing now. Progress!

Mind you, lmsensors is still unable to report fan speed, and gives nonsensical values for crit/hyst temperatures. I have a feeling that further improvements to power management may be possible too.

amdgpu-pci-0100
Adapter: PCI adapter
vddgfx:       +1.00 V  
fan1:             N/A  (min =    0 RPM, max =    0 RPM)
edge:         +73.0°C  (crit = +104000.0°C, hyst = -273.1°C)
power1:       58.21 W  (cap = 208.00 W)
Comment 55 vovad 2020-06-03 08:26:09 UTC
Tested on Ubuntu with kernel 5.6.15 - looks better now:

amdgpu-pci-0100
Adapter: PCI adapter
vddgfx:      1000.00 mV 
fan1:        1200 RPM  (min =    0 RPM, max = 6000 RPM)
edge:         +72.0°C  (crit = +104000.0°C, hyst = -273.1°C)
power1:      117.00 W  (cap = 216.00 W)
Comment 56 Sean Birkholz 2020-08-10 00:26:44 UTC
I was going to try and get a fancontrol script working, but I found the following as I started to play around:

I got my results on Arch Linux's 5.7.12-arch1-1 kernel.  

On boot pwm1_enable is set to 1 (manual mode).  On 4.18.x it is normally set to 2 (automatic mode) iirc.  Changing this value to 2 does essentially nothing for me and the fans do not spin up with increasing temp.  

However, I've found that running pwmconfig, without even answering the first question (i.e. I can just Ctrl+C out), causes the automatic temperature control to start functioning. So if you run pwmconfig and then change pwm1_enable to 2, everything works again. So far it appears that doing these two things gets me the same functionality I had on 4.18.x, and I can finally upgrade my kernel. No fan control script needed.

It is interesting to note that if you do these in the opposite order (set pwm1_enable to 2, then run pwmconfig), you must say yes to enabling manual mode on the GPU's fan before the fans start functioning properly. This also causes the fan to run at full speed (as if pwm1_enable were set to 0), and you will need to write 2 to pwm1_enable again.

I don't know what pwmconfig is modifying to cause pwm to work again. I wish I knew, so I could set that up with a script and not have to start it manually, but this is good enough for now as I rarely reboot. Maybe I can make a script to use with systemctl when I'm not being lazy.
Comment 57 fawz 2020-12-01 21:54:54 UTC
Hi all!

I'm running a Radeon R9 290 with amdgpu.

I've had the same issue of pwm1_enable being set to MANUAL on boot, and then being stuck to AUTO after switching to AUTO. I've had a quick browse of the code and have a fix that seems to work for me.

See the attached patch for my fix/work-around.

Thoughts and explanations follow.

Some comments and questions on the code. My card seems to use the smu7_* code for handling fan and power related functionality. I'm not sure if this is correct, but it seems that MANUAL is simply the default state for the card at boot, and the software mirrors this (maybe on purpose? it's unclear): there is a variable called fan_ctrl_enabled which is never explicitly initialized and is thus default-initialized to false. In the get_pwm1_enable() logic, false equates to MANUAL, which means you may set the fan speed manually.

For those who want to take a look themselves, this is roughly what happens when you write 2 (auto) to pwm1_enable:

> amdgpu_pm.c: amdgpu_hwmon_set_pwm1_enable()
> smu7_hwmgr.c: smu7_set_fan_control_mode()
> smu7_thermal.c: smu7_fan_ctrl_set_static_mode()
> smu7_thermal.c: smu7_fan_ctrl_start_smc_fan_control()
> 
> // Send PPSMC_StartFanControl with parameter FAN_CONTROL_TABLE
> smumgr.c: smum_send_msg_to_smc_with_parameter 
> smu7_thermal.c: hwmgr->fan_ctrl_enabled = true;

Note that fan_ctrl_enabled is now true. When reading pwm1_enable, this is the value that's checked.

Now, this is what happens when we try to write 1 (manual) to pwm1_enable again:

> amdgpu_pm.c: amdgpu_hwmon_set_pwm1_enable()
> smu7_hwmgr.c: smu7_set_fan_control_mode()
> smu7_hwmgr.c: smu7_fan_ctrl_stop_smc_fan_control
> 
> // Now, a so-called phm platform cap is checked
> // See hardwaremanager.h for its definition
> // Its description is simply "Fan is controlled by the SMC microcode."
> if (phm_cap_enabled(hwmgr->platform_descriptor.platformCaps,
>                       PHM_PlatformCaps_MicrocodeFanControl))
>               smu7_fan_ctrl_stop_smc_fan_control(hwmgr);

If the above check were to succeed, it would continue to send a smum message of PPSMC_StopFanControl and set fan_ctrl_enabled = false, and we would be back in MANUAL land. However, the PHM_PlatformCaps_MicrocodeFanControl cap is never set. AFAICT, this cap is only ever set for vega12 and vega20 cards, in vega20_processpptables.c and vega12_processpptables.c. It's checked in a bunch of places for smu7, but never in a way that explicitly prevents manual fan control once manual fan control is enabled, such as after boot.

Simply commenting out the check above fixed the problem for me, and I have seen no strange side-effects yet. This makes sense to me; after boot, setting fan speed manually works and the code responsible doesn't require the MicrocodeFanControl cap to be set for that. However, I don't know what the purpose of that cap is, whether the only reason for it being present in smu7 and elsewhere is a situation of copy-pasting skeleton code, or what.

From looking at vega10_hwmgr.c, it looks like vega10 (AMDGPU_FAMILY_AI, arctic islands?) cards should have the same problem and I assume the same fix should work, so I included it in the patch. It would be great if someone with an arctic islands card (RX 400 series?) could test and confirm this.

Comments and feedback are very welcome.
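
The stuck-on-AUTO behaviour described in this comment can be reproduced with a toy model. This is only a sketch of the logic as explained above, not driver code: the shell variables stand in for hwmgr->fan_ctrl_enabled and the PHM_PlatformCaps_MicrocodeFanControl cap, and the functions stand in for reading and writing pwm1_enable.

```shell
# Toy model, not driver code. 0 = false, 1 = true.
fan_ctrl_enabled=0           # never explicitly initialised, so false at boot
microcode_fan_control_cap=0  # per the comment, not set for this hardware

# Stand-in for reading pwm1_enable: 2 = auto, 1 = manual.
get_pwm1_enable() {
  if [ "$fan_ctrl_enabled" -eq 1 ]; then echo 2; else echo 1; fi
}

# Stand-in for writing pwm1_enable.
set_pwm1_enable() {
  case "$1" in
    2) fan_ctrl_enabled=1 ;;   # start SMC fan control
    1) if [ "$microcode_fan_control_cap" -eq 1 ]; then
         fan_ctrl_enabled=0    # stop SMC fan control, only behind the cap check
       fi ;;
  esac
}

get_pwm1_enable      # prints 1: manual at boot
set_pwm1_enable 2
get_pwm1_enable      # prints 2: auto
set_pwm1_enable 1
get_pwm1_enable      # prints 2: the cap check skipped the stop call
```

In this model, dropping the cap check in the "1" branch is what the attached patch amounts to.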
Comment 58 fawz 2020-12-01 21:57:13 UTC
Created attachment 293895 [details]
patch to fix pwm1_enable being stuck to AUTO for some gpu smu7 and vega10

Seems to work fine for smu7 (AMD Hawaii PRO Radeon R9 290), needs testing for vega10 (arctic islands).
Comment 59 Alex Deucher 2020-12-01 22:47:13 UTC
Created attachment 293903 [details]
possible fix

The attached patch should fix it.
Comment 60 MasterCATZ 2020-12-01 23:01:48 UTC
It seems a bit random for me on 5.8.17-050817-generic.

Sometimes I can go weeks with working fan control, then all of a sudden I find the GPU hitting 100 deg because the fan keeps spinning back down to the 20% range.

For me this seems to happen when I am using multiple monitors. It does not seem to happen with a single display, although that could be related to the memory actually idling.


I will give your code a go and put it into auto mode. It's summer here, so thermal limits are reached quickly. I have the GPU idling at 300 core / 100 memory with the fan manually set to 47.8%, which seems to keep it at the same temperature as the rest of the system, just so the dreaded auto mode is not triggered. Otherwise memory goes to 800 and the fan drops to the 20s, with blips to 100% at 95+ deg.
Comment 61 MasterCATZ 2020-12-02 01:33:15 UTC
Now it just runs at 100% with 300 MHz core and 100 MHz memory at ~60 deg.

aio@aio:/sys/class/drm/card0/device/hwmon/hwmon1$ sensors
k10temp-pci-00c3
Adapter: PCI adapter
Vcore:         1.38 V  
Vsoc:          1.08 V  
Tctl:         +79.8°C  
Tdie:         +79.8°C  
Tccd1:        +66.2°C  
Tccd2:        +61.0°C  
Icore:        32.00 A  
Isoc:         10.00 A  

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +16.8°C  (crit = +20.8°C)
temp2:        +16.8°C  (crit = +20.8°C)

amdgpu-pci-0b00
Adapter: PCI adapter
fan1:        2884 RPM  (min =    0 RPM, max = 6000 RPM)
edge:         +59.0°C  (crit = +104000.0°C, hyst = -273.1°C)
power1:       12.15 W  (cap = 208.00 W)

aio@aio:/sys/class/drm/card0/device/hwmon/hwmon1$ pwmconfig
You need to be root to run this script.
aio@aio:/sys/class/drm/card0/device/hwmon/hwmon1$ sudo pwmconfig
[sudo] password for aio: 
# pwmconfig version 3.6.0
This program will search your sensors for pulse width modulation (pwm)
controls, and test each one to see if it controls a fan on
your motherboard. Note that many motherboards do not have pwm
circuitry installed, even if your sensor chip supports pwm.

We will attempt to briefly stop each fan using the pwm controls.
The program will attempt to restore each fan to full speed
after testing. However, it is ** very important ** that you
physically verify that the fans have been to full speed
after the program has completed.

Found the following devices:
   hwmon0 is acpitz
   hwmon1 is amdgpu
   hwmon2 is k10temp
   hwmon3 is hidpp_battery_2

Found the following PWM controls:
   hwmon1/pwm1           current value: 122

Giving the fans some time to reach full speed...
Found the following fan sensors:
   hwmon1/fan1_input     current speed: 5499 RPM

Warning!!! This program will stop your fans, one at a time,
for approximately 5 seconds each!!!
This may cause your processor temperature to rise!!!
If you do not want to do this hit control-C now!!!
Hit return to continue: 

Testing pwm control hwmon1/pwm1 ...
  hwmon1/fan1_input ... speed was 5499 now 1120
    It appears that fan hwmon1/fan1_input
    is controlled by pwm hwmon1/pwm1
Would you like to generate a detailed correlation (y)? y
Note: If you had gnuplot installed, I could generate a graphical plot.
    PWM 255 FAN 5508
    PWM 240 FAN 5492
    PWM 225 FAN 5245
    PWM 210 FAN 4962
    PWM 195 FAN 4659
    PWM 180 FAN 4328
    PWM 165 FAN 3974
    PWM 150 FAN 3567
    PWM 135 FAN 3140
    PWM 120 FAN 2747
    PWM 105 FAN 2320
    PWM 90 FAN 1892
    PWM 75 FAN 1476
    PWM 60 FAN 1178
    PWM 45 FAN 1092
    PWM 30 FAN 1083
    PWM 28 FAN 1082
    PWM 26 FAN 1081
    PWM 24 FAN 1080
    PWM 22 FAN 1081
    PWM 20 FAN 1080
    PWM 18 FAN 1079
    PWM 16 FAN 1080
    PWM 14 FAN 1079
    PWM 12 FAN 1079
    PWM 10 FAN 1080
    PWM 8 FAN 1080
    PWM 6 FAN 1079
    PWM 4 FAN 1080
    PWM 2 FAN 1079
    PWM 0 FAN 1080


Testing is complete.
Please verify that all fans have returned to their normal speed.

The fancontrol script can automatically respond to temperature changes
of your system by changing fanspeeds.
Do you want to set up its configuration file now (y)? y
What should be the path to your fancontrol config file (/etc/fancontrol)? 

Select fan output to configure, or other action:
1) hwmon1/pwm1	       3) Just quit	      5) Show configuration
2) Change INTERVAL     4) Save and quit
select (1-n): 1

Devices:
hwmon0 is acpitz
hwmon1 is amdgpu
hwmon2 is k10temp
hwmon3 is hidpp_battery_2

Current temperature readings are as follows:
hwmon0/temp1_input	16
hwmon0/temp2_input	16
hwmon1/temp1_input	59
hwmon2/temp1_input	83
hwmon2/temp2_input	83
hwmon2/temp3_input	62
hwmon2/temp4_input	59

Select a temperature sensor as source for hwmon1/pwm1:
1) hwmon0/temp1_input
2) hwmon0/temp2_input
3) hwmon1/temp1_input
4) hwmon2/temp1_input
5) hwmon2/temp2_input
6) hwmon2/temp3_input
7) hwmon2/temp4_input
8) None (Do not affect this PWM output)
select (1-n): 3

Enter the low temperature (degree C)
below which the fan should spin at minimum speed (20): 30

Enter the high temperature (degree C)
over which the fan should spin at maximum speed (60): 70

Enter the PWM value (0-255) to use when the temperature
is over the high temperature limit (255): 250


Select fan output to configure, or other action:
1) hwmon1/pwm1	       3) Just quit	      5) Show configuration
2) Change INTERVAL     4) Save and quit
select (1-n): 4

Saving configuration to /etc/fancontrol...
Configuration saved
aio@aio:/sys/class/drm/card0/device/hwmon/hwmon1$
Comment 62 MasterCATZ 2020-12-02 03:30:09 UTC
HEAD is now at 1398820fee51 Linux 5.9.9
aio@aio:/SnapRaidArray/DATA/git/linux-stable$ git apply --stat /SnapRaidArray/DATA/Downloads/
Display all 583 possibilities? (y or n)
aio@aio:/SnapRaidArray/DATA/git/linux-stable$ git apply --stat /SnapRaidArray/DATA/Downloads/dont_check_microcodefancontrol_cap.patch
 .../gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c    |    4 +---
 .../gpu/drm/amd/pm/powerplay/hwmgr/vega10_hwmgr.c  |    3 +--
 2 files changed, 2 insertions(+), 5 deletions(-)
aio@aio:/SnapRaidArray/DATA/git/linux-stable$ git apply --check /SnapRaidArray/DATA/Downloads/dont_check_microcodefancontrol_cap.patch
error: drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c: No such file or directory
error: drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_hwmgr.c: No such file or directory



it seems the path is now 

drivers/gpu/drm/amd/powerplay/hwmgr/

no pm subfolder
Comment 63 Alex Deucher 2020-12-02 03:34:03 UTC
Yes, you'll need to adjust the path for pre-5.10 kernels.
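
One way to make that path adjustment is to rewrite the patch before applying it. This is a sketch based only on the two paths shown in comment 62; strip_pm_dir is a hypothetical helper name, and the output file name in the example is arbitrary.

```shell
# Hypothetical helper: read a patch on stdin and write it to stdout with the
# "pm/" directory removed from powerplay paths, as needed on pre-5.10 trees.
strip_pm_dir() {
  sed 's#drivers/gpu/drm/amd/pm/powerplay/#drivers/gpu/drm/amd/powerplay/#g'
}

# Example:
# strip_pm_dir < dont_check_microcodefancontrol_cap.patch > pre-5.10.patch
# git apply --check pre-5.10.patch
```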
Comment 64 fawz 2020-12-02 08:38:46 UTC
Of course, that makes sense! I should've realized that there must be
corresponding logic for non-vega12/20 hardware.

If this patch works, are you going to submit it or should I? After all,
you found it :)

On 01/12/2020 23.47, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=201539
>
> --- Comment #59 from Alex Deucher (alexdeucher@gmail.com) ---
> Created attachment 293903 [details]
>    --> https://bugzilla.kernel.org/attachment.cgi?id=293903&action=edit
> possible fix
>
> The attached patch should fix it.
>
Comment 65 fawz 2020-12-02 09:57:00 UTC
Unfortunately, your patch leads to a stuck boot. There's some minor 
"corruption" visible on the bottom of the screen while still booting up, 
and then it gets stuck.

I don't think I mentioned this in the previous posts, but I tried 
setting this cap myself, but in the thermal init function instead of in 
the process pp tables one, which had the same effect.

The boot seems to be stuck completely, since I can't ssh into the box 
either. Any suggestions for debugging this crash caused by enabling the 
MicrocodeFanControl cap are appreciated.

On 01/12/2020 23.47, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=201539
>
> --- Comment #59 from Alex Deucher (alexdeucher@gmail.com) ---
> Created attachment 293903 [details]
>    --> https://bugzilla.kernel.org/attachment.cgi?id=293903&action=edit
> possible fix
>
> The attached patch should fix it.
>
Comment 66 Alex Deucher 2020-12-02 16:21:39 UTC
Created attachment 293909 [details]
possible fix

I guess we need the fan control parameters.  How about this patch?
Comment 67 fawz 2020-12-02 19:31:37 UTC
> I guess we need the fan control parameters.  How about this patch?

After some quick testing, your latest patch seems to work great! And is it new code, i.e. something not just taken from the other families?

I'll try to get an RX 480 user to test this, but seems to work wonders for me. A welcome change with this patch is that the default fan control mode at boot is now 2/auto.

Thanks for your help with this! Very much appreciated.
Comment 68 Alex Deucher 2020-12-02 20:16:07 UTC
It's pretty similar to the code for other smu7 chips (tonga, polaris, etc.).  Note that this change is not relevant to newer smu7 chips (rx480, tonga, etc.).
Comment 69 fawz 2020-12-02 21:11:51 UTC
Well, I'll have a read! And thanks anyways, I'll run this going forward, post if there are issues and am looking forward to seeing this in mainline at some point :)
Comment 70 mirh 2020-12-03 01:04:02 UTC
So.. I was also testing this on my Sapphire R9 290 Tri-X OC. And it seems to work pretty good. 

I noticed an oddity though. The first time I tried it, when I switched to manual fan control, every time I wrote something to pwm1 after one second the thing seemed to reset to the default speed. On subsequent reboots this didn't seem to happen anymore.
Comment 71 mirh 2020-12-13 23:56:04 UTC
Cool! This landed in 5.10. 

By the way, I was wondering: is there any way to override the default minimum fan speed of 20~26%?
I see that MinimumPWMLimit and zero RPM only landed with later cards, but it seems crazy that I can't fix this even with a custom BIOS.
Comment 72 MasterCATZ 2020-12-14 23:05:24 UTC
Finally, this summer the R9 290 GPUs will be manageable.

It seems to be working. Now I just have to find the old settings I changed when trying to run it at higher RPM: at 60 deg it is doing 80%+ RPM. Possibly it is now following the BIOS settings from when I was trying to force higher RPM while it kept trying to run under 20%.

My manual settings seem to get overwritten a second after I set them, but at least I am not being locked out like before.


Now if someone could solve the issue where it uses more power when running multiple displays (monitors with exactly the same resolution and refresh rate): I can run the card with a single display at under 10 watts, but plug in another display and it idles at over 50 watts.
Comment 73 Alex Deucher 2020-12-14 23:26:04 UTC
(In reply to MasterCATZ from comment #72)
> 
> now if someone could solve the issue when it uses more power when running
> multiple displays ( exactly the same monitors res / hz  )
>  I can run the card single display under 10 watts plug in another display
> and its over 50 watts idle

You can enable mclk switching with identical monitors by setting amdgpu.dcfeaturemask=2
It's enabled by default in 5.11.
Comment 74 MasterCATZ 2020-12-14 23:44:45 UTC
GRUB_CMDLINE_LINUX_DEFAULT="usbcore.autosuspend=-1 amdgpu.dcfeaturemask=2 apparmor=0 amdgpu.ppfeaturemask=0xfffd7fff amdgpu.ppfeaturemask=0xffffffff amdgpu.dc=1 amdgpu.cik_support=1 radeon.cik_support=0 radeon.si_support=0 amdgpu.si_support=1"


I am already running amdgpu.dcfeaturemask=2.
Or are the other feature masks I tried causing issues now?
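
The command line above passes amdgpu.ppfeaturemask twice with different values, and a single parameter can only end up with one of them. A quick way to spot such duplicates is sketched here. dup_params is a hypothetical helper; it assumes parameters are separated by whitespace and does not handle quoted values containing spaces.

```shell
# Hypothetical helper: read a kernel command line on stdin and print each
# name=value parameter name that occurs more than once.
dup_params() {
  tr -s ' \t' '\n\n' | sed -n 's/=.*//p' | sort | uniq -d
}

# Example: check the command line of the running kernel.
# dup_params < /proc/cmdline
```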


Kernel 5.10 is working perfectly 


I turned off the fancontrol service and am using "marazmista/radeon-profile".

It is following my fan curve perfectly without being locked out.

It has been years since my R9 was not cooking from the 20% fan speed issue, even with the core set to 300 MHz and the memory to 100 MHz.
Comment 75 michael 2021-02-06 05:34:25 UTC
An update. Now on 5.10.0-2-amd64.

Fresh boot, with amdgpu.dc=1, everything is mostly fine: pwm1_enable=2. However, after resuming from suspend, pwm1_enable=1 and pwm1=255, resulting in maxed-out fans. Subsequently setting pwm1_enable=2 results in the old buggy behaviour (2000 RPM until 96C). However, if I suspend and resume again, it sometimes goes back to behaving!

amdgpu.dc=0 is a bit of a non-starter, as while fan speeds remain low, so does performance.

In all cases temp1_crit and temp1_crit_hyst still have crazy values (104000000 and -273).
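
For scale: hwmon temperature files hold millidegrees Celsius, so the reported temp1_crit of 104000000 reads as 104000 degrees, matching the "crit = +104000.0°C" in the sensors output in earlier comments. The arithmetic below is a small sketch. Reading it as 104 °C scaled by 1000 one time too many is only a guess, and millideg_to_c is a hypothetical helper.

```shell
# Hypothetical helper: convert a hwmon millidegree value to whole degrees C.
millideg_to_c() {
  echo $(( $1 / 1000 ))
}

millideg_to_c 54000       # prints 54: the temp1_input value from comment 44
millideg_to_c 104000000   # prints 104000: temp1_crit as reported here
millideg_to_c 104000      # prints 104: what a 104 degree limit would look like
```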
Comment 76 codebugs 2022-02-05 21:59:29 UTC
The later 390 series cards (Grenada Pro) were also affected by an inability to set the correct fan speeds in 5.4.0. Because the cards would not run their fans greater than 20% of their max RPMs under load, it did destroy at least one 390 when it ran without proper cooling at 105 C for an extended period of time.

This patch saved the other R9 390 card in the other machine. Thank you.

As of 5.11, it still reports temp1_crit and temp1_crit_hyst as 104000000 and -273.15.
