Bug 201539
Summary:            AMDGPU R9 390 automatic fan speed control in Linux 4.19/4.20/5.0
Product:            Drivers
Component:          Video (DRI - non Intel)
Status:             NEW
Severity:           normal
Priority:           P1
Hardware:           All
OS:                 Linux
Kernel Version:     4.19, 4.20, 5.0
Subsystem:
Regression:         No
Bisected commit-id:
Reporter:           Jan Ziak (0xe2.0x9a.0x9b)
Assignee:           drivers_video-dri
CC:                 alex, alexdeucher, artheg, che666, codebugs, danglingpointerexception, fawz, garretr, kernel.org, lkbugs, mastercatz, mirh, rmuncrief, serg.korobkov, steffen.klee, supasean, tchavei, v.j.dubov
Attachments:
  - pwm1-4.18.16
  - pwm1-4.19.0
  - patch to fix pwm1_enable being stuck to AUTO for some gpu smu7 and vega10
  - possible fix
  - possible fix
Description
Jan Ziak
2018-10-27 12:32:41 UTC
Created attachment 279205 [details]
pwm1-4.18.16
while true; do cat /sys/class/drm/card0/device/hwmon/hwmon2/pwm1; sleep 0.1s; done
Created attachment 279207 [details]
pwm1-4.19.0
while true; do cat /sys/class/drm/card0/device/hwmon/hwmon2/pwm1; sleep 0.1s; done
4.18.16:
$ cat /sys/class/drm/card0/device/hwmon/hwmon2/pwm1_enable
2

4.19.0:
$ cat /sys/class/drm/card0/device/hwmon/hwmon2/pwm1_enable
1

I have exactly the same problem with my R9 290X. I tested on the 4.19.6 kernel and the problem is still present, not solved. I have to stay on the 4.18.x kernel so as not to damage my graphics card.

I experience the same issue on Manjaro (amdgpu + OpenCL taken from amdgpu-pro, on 3 x R9 390). 4.18-20 kernel: everything works fine; fan rpm increases with load (tested with the stak-xmr miner). 4.19-8-2 kernel: fan rpm is always on the minimum setting. Once one of the GPUs hits 95C, sudden bursts of 100% rpm occur (under 1 s duration). The application has to be suspended or hardware damage will occur. I checked pwm1_enable, which displays "1"; changing it to "2" has no effect. pwmconfig drives the fans to 100% during configuration but is unable to provide control. pwm1 reads different values (from 86 to 127) but the fans don't appear to change speed at all. Kernel parameters: amdgpu.cik_support=1 amdgpu.dc=0

Does setting amdgpu.dpm=1 on 4.19 help? The default dpm implementation changed between 4.18 and 4.19.

(In reply to Alex Deucher from comment #7)
> Does setting amdgpu.dpm=1 on 4.19 help? The default dpm implementation
> changed between 4.18 and 4.19.

amdgpu.dpm=1 produces a black screen (unable to boot) on both 4.18 and 4.19. Using amdgpu.dc=1 produces an error and a freeze on boot.

Linux 4.20 behaves the same as Linux 4.19.

This bug is likely related. Mine's an R9 290X... https://bugs.freedesktop.org/show_bug.cgi?id=108781 I'm stuck on kernel 4.18.20.

I found a solution for my problem and detailed it in the previous link. My R9 290X now fully functions on 4.20.0.

Comments 10 and 11 are not related to the reported fan control issue.

I am still seeing the same problem on an R9 290X with the amdgpu driver (and the correct firmware available in the initramfs) on 5.1.0-0.rc0.git4.2.

I am also able to reproduce the fan control issue on an R9 290X with Fedora's kernel 5.0.3-200.fc29.x86_64. The fans on the card do not spin up until the temperature reaches 95C, at which point they jump straight to 100%. Same as comment 8, amdgpu.dpm=1 just gives me a blank screen at boot.

Reproducible on an R9 390 with kernel 5.1.6-arch1-1-ARCH using the amdgpu driver. Setting amdgpu.dpm=1 or changing amdgpu.dc does not have any effect. Also, sensors does not report the current fan RPM:

amdgpu-pci-1d00
Adapter: PCI adapter
vddgfx:       +1.00 V
fan1:             N/A  (min =    0 RPM, max =    0 RPM)
temp1:        +59.0°C  (crit = +104000.0°C, hyst = -273.1°C)
power1:       47.24 W  (cap = 230.00 W)

@Rudolf - Have you tried my solution in the link I provided above? I'm on 5.1.6 mainline and have no issues whatsoever with my R9 290X.
@Alex Smith - I've got a liquid-cooled card so I don't know if my solution solves your fan problem, but try following my steps in the link.
@Steffen - Try my steps in the link, mate, it may solve your problem. Alex Deucher himself gave me the tip on fixing it.

(In reply to danglingpointerexception@gmail.com from comment #15)
> @Steffen - Try my steps in the link mate, it may solve your problem. Alex
> Deucher himself gave me the tip on fixing it.

As far as I understand, you had issues with firmware loading. However, firmware is loading fine on my end:

[drm] Found UVD firmware Version: 1.64 Family ID: 9
[drm] Found VCE firmware Version: 50.10 Binary ID: 2

I'm having the very same issues with my R9 290X as well. Arch Linux, 5.1.7-arch1-1-ARCH.
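For anyone trying to reproduce the reports above on their own card, here is a minimal, hedged sketch of the sysfs checks being discussed. The card0 path and the 0/1/2 meaning of pwm1_enable are assumptions based on the comments in this thread; the hwmonN index varies between boots, hence the glob.

#!/bin/bash
# Hypothetical helper: dump the amdgpu fan-control state discussed in this report.
# Assumes the GPU is card0; adjust if your amdgpu device is a different cardN.
for hwmon in /sys/class/drm/card0/device/hwmon/hwmon*; do
    echo "== $hwmon =="
    for f in pwm1_enable pwm1 fan1_input temp1_input temp1_crit temp1_crit_hyst; do
        [ -r "$hwmon/$f" ] && printf '%-16s %s\n' "$f:" "$(cat "$hwmon/$f")"
    done
done
# pwm1_enable: 0 = no control (full speed), 1 = manual, 2 = automatic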
(In reply to danglingpointerexception@gmail.com from comment #15)

I've tried your solution, but unfortunately it didn't work for me.

I am not a kernel developer and haven't done much programming as of late, so I am not really in a position to actually test this hypothesis. However, from the bit of research I've done trying to figure this problem out for myself, I believe the following explains the overheating and the bursts of fan speed instead of proper cooling behavior. Here is my sensors output from kernel 4.18.x - I have the R9 290:

amdgpu-pci-0100
Adapter: PCI adapter
vddgfx:           N/A
fan1:           0 RPM
temp1:        +65.0°C  (crit = +120.0°C, hyst = +90.0°C)

Take note that this displays the proper critical and hysteresis values for my card. If you look at the post in comment 14, which is how sensors displays the crit/hyst values for kernels beyond 4.18.x, you notice the critical value is about 19x the temperature of the surface of the sun and the hyst value is absolute zero. These values are hard-coded into the kernel source code in some file; forgive me, as I do not recall where I saw the code snippet. But I strongly believe that correcting the values in that file, or changing it to detect proper crit/hyst values based on the card, will correct this issue. I simply do not have the means to do this, nor do I know how to submit kernel bug fixes, and I hope someone with more experience could give it a shot and see if the resulting kernel functions properly.

I've done a bit of digging and I've managed to get a proper hysteresis value to appear in a 5.1.14 kernel built from source. I now have this output from sensors:

amdgpu-pci-0100
Adapter: PCI adapter
vddgfx:       +1.00 V
fan1:             N/A  (min =    0 RPM, max =    0 RPM)
temp1:        +66.0°C  (crit = +104000.0°C, hyst = +90.0°C)
power1:       29.02 W  (cap = 208.00 W)

I don't know why proper values are not set automatically, because I've found the correct values in tons of source files, but none of the #defines appear to be used? And much of the source doesn't appear to differ between 5.1.14 and 4.18.x.

I modified (kernel src)/drivers/gpu/drm/amd/powerplay/inc/pp_thermal.h and changed the values of -273150 to 90000. This corrects the hysteresis value, but I'm still searching for where the critical temp value is actually set. I *think* fixing these values may fix the fan problem, because why would a fan spin up if it's nowhere near the critical or hysteresis values? No need. Except when the critical value is 19x the temp of the sun, the card gets so hot it protects itself by maxing the fans for a short burst. That is my theory anyway; I hope to be able to test it soon, but no promises.

To anyone who's still struggling with this, perhaps this would be of help: I'm using this script (https://github.com/grmat/amdgpu-fancontrol) as a service, with these params:

TEMPS=( 65000 75000 80000 90000 )
PWMS=(  0 190 200 255 )

Perhaps someone could tweak this a little bit better, but this works for me. My GPU still sounds like an airplane when I'm running a benchmark like Unigine Heaven, but at least the fans are spinning now.
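For reference, the grmat script linked above (and pasted in full later in this thread) optionally sources /etc/amdgpu-fancontrol.cfg, so a curve like the one quoted can be kept out of the script itself. A sketch of such a config, using the values from this comment:

# /etc/amdgpu-fancontrol.cfg -- sourced by the amdgpu-fancontrol script if present.
# Temperatures are in degrees C * 1000 and must pair one-to-one with PWM values (0-255).
TEMPS=( 65000 75000 80000 90000 )
PWMS=(      0   190   200   255 )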
Having similar issues with 5.3.8-050308-generic. When it starts happening, this is being spammed in dmesg:

amdgpu: [powerplay] failed to send message 282 ret is 254

I lose write access to /sys/class/drm/card1/device/hwmon/hwmon1/pwm1_enable; it is stuck on 2 and runs the fans super low @ 20%, causing the GPU to reach thermal meltdown at 96 deg, when the fan will do blips of 100%. My BIOS was modded to have a minimum fan speed of 50%, and even this is being overwritten. /sys/class/drm/card1/device/hwmon/hwmon1/pwm1 also can not be adjusted.

GRUB_CMDLINE_LINUX_DEFAULT="amdgpu.ppfeaturemask=0xffffffff amdgpu.dc=1 amdgpu.gpu_recovery=1 amdgpu.cik_support=1 amdgpu.dpm=1 radeon.cik_support=0"

If I reboot it works for a little while, allowing me to change GPU speeds and fan speeds, then I lose fan speed control and can not get it back off auto, which seems to be set up with fan speeds way too low.

GL_RENDERER: AMD Radeon R9 200 Series (HAWAII, DRM 3.33.0, 5.3.8-050308-generic, LLVM 9.0.0)
GL_VERSION: 4.5 (Compatibility Profile) Mesa 19.3.0-devel (git-ff6e148 2019-10-29 bionic-oibaf-ppa)

If I disable amdgpu.dpm I can control the fans, but then I can not do automatic GPU speeds and can not manually set my speeds either. My only guess is that the firmware being loaded by the kernel is the place containing the info for fan speeds?

(In reply to Sean Birkholz from comment #19)
> I've done a bit of digging and I've managed to get a proper hysteresis value
> to appear in a 5.1.14 kernel built from source.
> I modified (kernel src)/drivers/gpu/drm/amd/powerplay/inc/pp_thermal.h and
> changed the values of -273150 to 90000. This corrects the hysteresis value
> but I'm still searching for where the critical temp value is actually set.

I think you hit the nail on the head.

amdgpu-pci-0100
Adapter: PCI adapter
vddgfx:       +0.90 V
fan1:             N/A  (min =    0 RPM, max =    0 RPM)
edge:         +50.0°C  (crit = +104000.0°C, hyst = -273.1°C)
power1:       11.03 W  (cap = 208.00 W)

The numbers used in linux/drivers/gpu/drm/amd/powerplay are correct, as they are the values the BIOS uses, but Linux is reading/using the values differently ...

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

Guess one of them should be able to find the issue.

5.4.0-050400rc6-generic: 24 hours and I still have fan control. 99% chance I have just jinxed myself.

Now, around 12 hrs later, I lost fan control again. pwmconfig seems to be the only thing that allows me to get manual mode back on; I wonder if this is the actual program giving grief. Hmm, maybe not - it lets me briefly access manual:

Found the following PWM controls:
   hwmon1/pwm1           current value: 68
hwmon1/pwm1 is currently setup for automatic speed control.
In general, automatic mode is preferred over manual mode, as it is more
efficient and it reacts faster. Are you sure that you want to setup this
output for manual control? (n) y
hwmon1/pwm1_enable stuck to 2
Manual control mode not supported, skipping hwmon1/pwm1.

Wish I knew what the heck keeps locking pwm1_enable to auto @ low speeds :@

From what I can work out, the only difference between the kernel versions was that they added extra thermal readings to support the newer cards with thermal junction sensors.

{-273150,  99000},
{ 120000, 120000},

has been in there since Jan 2018 ... Looks like it's reading the max temp settings from the BIOS. I will confirm this tomorrow; I will flash a custom BIOS.

/torvalds/linux/blob/master/drivers/gpu/drm/amd/powerplay/inc/hwmgr.h

/* The temperature, in 0.01 centigrades, below which we just run at a minimal PWM. */
So maybe it is thinking it can do 1000C? Anyhow, as I don't want to run an altered BIOS (that would force the fan to 100% on boot), what I decided to do was rip out all of AMD's new thermal code ... I found the values hard-coded here for the R9 290 Hawaii / Sea Islands chipsets, so that will be a dirty way to get it to go to 100% throttle sooner. I'll set mine to 85000 and see how it goes; hopefully the rest follows.

linux/drivers/gpu/drm/amd/amdgpu/ci_dpm.c

	if (adev->asic_type == CHIP_HAWAII) {
		pi->thermal_temp_setting.temperature_low = 94500;
		pi->thermal_temp_setting.temperature_high = 95000;
		pi->thermal_temp_setting.temperature_shutdown = 104000;
	} else {
		pi->thermal_temp_setting.temperature_low = 99500;
		pi->thermal_temp_setting.temperature_high = 100000;
		pi->thermal_temp_setting.temperature_shutdown = 104000;
	}

Oh, I love it - they know the driver's file is crap. Anyhow, it looks like the real issue is in the GPU driver; fan speeds, temps, everything is in there. Of course this would not be an issue if pwm1_enable was NOT STUCK ON AUTO:

#if 0
	/* XXX: need to figure out how to handle this properly */
	tmp = RREG32_SMC(ixCG_THERMAL_CTRL);
	tmp &= DPM_EVENT_SRC_MASK;
	tmp |= DPM_EVENT_SRC(dpm_event_src);
	WREG32_SMC(ixCG_THERMAL_CTRL, tmp);
#endif

Apparently I was looking through kernel 4.7 code on my PC and not master. linux/drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c looks like the new file name, as they relocated ci_dpm.c to /home/aio/Programs/linux/drivers/gpu/drm/radeon/ci_dpm.c.

Another plan - drivers/gpu/drm/amd/powerplay/hwmgr/hwmgr.c:

	hwmgr->dpm_level = AMD_DPM_FORCED_LEVEL_AUTO;
	hwmgr_init_default_caps(hwmgr);
	hwmgr_set_user_specify_caps(hwmgr);
	hwmgr->fan_ctrl_is_in_default_mode = true;

Change it to false to disable auto .. not like it's going to be any worse for us; the GPU's thermal system will still run and you can actually run the fans manually, but I'm unsure if this will stop the automatic core speed / power-save features as well.

Success. drivers/gpu/drm/amd/powerplay/hwmgr/hwmgr.c:

	hwmgr->fan_ctrl_is_in_default_mode = false;

It will now boot up in manual mode - finally I have fan control. "AMD_DPM_FORCED_LEVEL_AUTO" - I am wondering just how "FORCED" that "AUTO" is meant to be .... However, once you put it back to "2" (AUTO) it takes control again ... and will overwrite your "0" (card control) and "1" (manual):

echo 2 > /sys/class/drm/card1/device/hwmon/hwmon1/pwm1_enable

Don't do it unless you want to reboot with a hot GPU :P

Also, the crit temp for "Sea Islands" cards like my R9 290 is definitely being retrieved from drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c, thermal_temp_setting.temperature_shutdown:

	if (hwmgr->chip_id == CHIP_HAWAII) {
		data->thermal_temp_setting.temperature_low = 74500;
		data->thermal_temp_setting.temperature_high = 80000;
		data->thermal_temp_setting.temperature_shutdown = 98000;

And the fans still spin slow regardless of how low I set it .. sooo .. something's broken ... So it looks like I will be doing a custom kernel on every update for a while now to disable AUTO fan control. And for some reason AMD devs feel 120 deg is NORMAL for a GPU and users want quiet fans ... I give up ...

I discovered a workaround that works for my R9 290 and the Debian 5.3.0 kernel:

root@joyola:~# echo "2" >>/sys/class/drm/card0/device/hwmon/hwmon3/pwm1_enable
root@joyola:~# echo "0" >>/sys/class/drm/card0/device/hwmon/hwmon3/pwm1_enable

pwm1_enable will still be 2 afterwards, but (after spinning the fans at max for a bit) automatic fan control works for me.
I also have to do the same pwm1_enable prodding after resuming from suspend. (If it matters, I boot with radeon.cik_support=0 amdgpu.cik_support=1 radeon.si_support=0 amdgpu.si_support=1 amdgpu.dc_log=1 amdgpu.gpu_recovery=1.) I still have the same brokenness as reported in comment 14, though.

After having good fan control for a few weeks, 5.4.2-050402-generic is now having a meltdown, back to trying to run the cards @ (crit = +104000.0°C, hyst = -273.1°C). And this is what's got me stumped: it seems to go auto when it hits a high temp (~70), then starts dropping the fan speed. I can exit a game, set a high fan speed, and it will sit there @ 60 deg for a good 20 mins at ~60%; I decide to go back into the game .. it hits 70 .. fan speeds keep dropping until it's at 20% and blipping 100% @ 95 deg.

I am very close to going back to liquid cooling ... or connecting the fan to a manual speed controller. (If someone knows of a way I can still have the fan connected for driver control and monitoring, with a manual device override for PWM, I am all ears. Would it be safe for me to just use a thermostat to send voltage to the fan, i.e. 2x input power sources?)

My guess is base or asic is what it's reading now. About to hack away at those modules and try again:

/home/aio/Programs/linux/drivers/gpu/drm/i915/oa/i915_oa_tgl.c
/home/aio/Programs/linux/drivers/gpu/drm/amd/include/asic_reg/vce/vce_4_0_default.h
/home/aio/Programs/linux/drivers/gpu/drm/nouveau/nvkm/engine/ce/gf100.c
/home/aio/Programs/linux/drivers/gpu/drm/nouveau/nvkm/engine/ce/gt215.c
/home/aio/Programs/linux/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c

Well, it's neither of those modules; I should have looked at the files after I scanned for files containing 104000.

I can not even force-run the cards in performance mode anymore with 100% fan speed stuck on. If I could just find the setting to tell amdgpu / hwmon / powerplay what temp I call hot, this would be solved.

(In reply to MasterCATZ from comment #37)
> well its neither of those modules
> I should have looked at the files after I scanned for files containing 104000
>
> I can not even force run the cards in performance mode anymore with 100% fan
> speed stuck on
>
> if i could just find the setting to tell amdgpu / hwmon / powerplay what
> temp I call hot this would be solved

Here is a slightly modified version of a fan control script, along with the service to run it, from the Arch Wiki. I don't know what distribution you use, but hopefully this will at least get you started. Unfortunately it doesn't seem like the kernel devs are interested in fixing this, so after a long time I just had to use this kludgey solution.

1. Create a file with the following contents named "amdgpu-fancontrol" in "/usr/local/bin" and make it executable.
--------------- Start amdgpu-fancontrol ---------------
#!/bin/bash

HYSTERESIS=6000      # in mK
SLEEP_INTERVAL=1     # in s
DEBUG=true

# set temps (in degrees C * 1000) and corresponding pwm values in ascending order and with the same amount of values
TEMPS=( 40000 50000 65000 75000 80000 90000 )
PWMS=(      0   100   140   190   200   255 )

# hwmon paths, hardcoded for one amdgpu card, adjust as needed
HWMON=$(ls /sys/class/drm/card0/device/hwmon)
FILE_PWM=$(echo /sys/class/drm/card0/device/hwmon/$HWMON/pwm1)
FILE_FANMODE=$(echo /sys/class/drm/card0/device/hwmon/$HWMON/pwm1_enable)
FILE_TEMP=$(echo /sys/class/drm/card0/device/hwmon/$HWMON/temp1_input)
# might want to use this later
#FILE_TEMP_CRIT=$(echo /sys/class/hwmon/hwmon?/temp1_crit_hyst)

[[ -f "$FILE_PWM" && -f "$FILE_FANMODE" && -f "$FILE_TEMP" ]] || { echo "invalid hwmon files" ; exit 1; }

# load configuration file if present
[ -f /etc/amdgpu-fancontrol.cfg ] && . /etc/amdgpu-fancontrol.cfg

# check if amount of temps and pwm values match
if [ "${#TEMPS[@]}" -ne "${#PWMS[@]}" ]
then
  echo "Amount of temperature and pwm values does not match"
  exit 1
fi

# checking for privileges
if [ $UID -ne 0 ]
then
  echo "Writing to sysfs requires privileges, relaunch as root"
  exit 1
fi

function debug {
  if $DEBUG; then
    echo $1
  fi
}

# set fan mode to max(0), manual(1) or auto(2)
function set_fanmode {
  echo "setting fan mode to $1"
  echo "$1" > "$FILE_FANMODE"
}

function set_pwm {
  NEW_PWM=$1
  OLD_PWM=$(cat $FILE_PWM)
  echo "current pwm: $OLD_PWM, requested to set pwm to $NEW_PWM"
  debug "current pwm: $OLD_PWM, requested to set pwm to $NEW_PWM"
  if [ $(cat ${FILE_FANMODE}) -ne 1 ]
  then
    echo "Fanmode not set to manual."
    set_fanmode 1
  fi

  if [ "$NEW_PWM" -gt "$OLD_PWM" ] || [ -z "$TEMP_AT_LAST_PWM_CHANGE" ] || [ $(($(cat $FILE_TEMP) + HYSTERESIS)) -le "$TEMP_AT_LAST_PWM_CHANGE" ]; then
    $DEBUG || echo "current temp: $TEMP"
    echo "temp at last change was $TEMP_AT_LAST_PWM_CHANGE"
    echo "changing pwm to $NEW_PWM"
    echo "$NEW_PWM" > "$FILE_PWM"
    TEMP_AT_LAST_PWM_CHANGE=$(cat $FILE_TEMP)
  else
    debug "not changing pwm, we just did at $TEMP_AT_LAST_PWM_CHANGE, next change when below $((TEMP_AT_LAST_PWM_CHANGE - HYSTERESIS))"
  fi
}

function interpolate_pwm {
  i=0
  TEMP=$(cat $FILE_TEMP)
  debug "current temp: $TEMP"

  if [[ $TEMP -le ${TEMPS[0]} ]]; then
    # below first point in list, set to min speed
    set_pwm "${PWMS[i]}"
    return
  fi

  for i in "${!TEMPS[@]}"; do
    if [[ $i -eq $((${#TEMPS[@]}-1)) ]]; then
      # hit last point in list, set to max speed
      set_pwm "${PWMS[i]}"
      return
    elif [[ $TEMP -gt ${TEMPS[$i]} ]]; then
      continue
    fi
    # interpolate linearly
    LOWERTEMP=${TEMPS[i-1]}
    HIGHERTEMP=${TEMPS[i]}
    LOWERPWM=${PWMS[i-1]}
    HIGHERPWM=${PWMS[i]}
    PWM=$(echo "( ( $TEMP - $LOWERTEMP ) * ( $HIGHERPWM - $LOWERPWM ) / ( $HIGHERTEMP - $LOWERTEMP ) ) + $LOWERPWM" | bc)
    debug "interpolated pwm value for temperature $TEMP is: $PWM"
    set_pwm "$PWM"
    return
  done
}

function reset_on_fail {
  echo "exiting, resetting fan to auto control..."
  set_fanmode 2
  exit 1
}

# always try to reset fans on exit
trap "reset_on_fail" SIGINT SIGTERM

function run_daemon {
  while :; do
    interpolate_pwm
    debug
    sleep $SLEEP_INTERVAL
  done
}

# set fan control to manual
set_fanmode 1

# finally start the loop
run_daemon
--------------- End amdgpu-fancontrol ---------------

2. Create a file with the following contents named "amdgpu-fancontrol.service" in /etc/systemd/system.
--------------- Start amdgpu-fancontrol.service ---------------
[Unit]
Description=amdgpu-fancontrol

[Service]
Type=simple
ExecStart=/usr/local/bin/amdgpu-fancontrol

[Install]
WantedBy=multi-user.target
--------------- End amdgpu-fancontrol.service ---------------

3. Here's how to enable, start, and get the status of the fan control service:

sudo systemctl enable amdgpu-fancontrol
sudo systemctl start amdgpu-fancontrol
sudo systemctl status amdgpu-fancontrol

Will not work; /sys/class/drm/card1/device/hwmon/hwmon1/pwm1_enable is locked to Auto:

[28455.094113] manual fan speed control should be enabled first
[28473.077182] manual fan speed control should be enabled first
[28480.086754] manual fan speed control should be enabled first
[28498.073701] manual fan speed control should be enabled first
[28499.095753] manual fan speed control should be enabled first
[28512.086404] manual fan speed control should be enabled first
[28525.077255] manual fan speed control should be enabled first
[28529.080955] manual fan speed control should be enabled first
[28530.070058] manual fan speed control should be enabled first
[28839.107591] manual fan speed control should be enabled first
[28840.099633] manual fan speed control should be enabled first
[28842.083214] manual fan speed control should be enabled first
[28890.089742] manual fan speed control should be enabled first
[28896.099884] manual fan speed control should be enabled first
[28902.081972] manual fan speed control should be enabled first
[28909.093220] manual fan speed control should be enabled first
[28927.107978] manual fan speed control should be enabled first
[28950.085450] manual fan speed control should be enabled first
[28979.116690] manual fan speed control should be enabled first
[28982.086568] manual fan speed control should be enabled first
[29004.103327] manual fan speed control should be enabled first
[29040.104962] manual fan speed control should be enabled first
[29066.095979] manual fan speed control should be enabled first
[29077.113080] manual fan speed control should be enabled first
[29086.091060] manual fan speed control should be enabled first
[29096.113497] manual fan speed control should be enabled first
[29111.123447] manual fan speed control should be enabled first
[29123.117578] manual fan speed control should be enabled first
[29126.092675] manual fan speed control should be enabled first
[29148.109806] manual fan speed control should be enabled first
[29155.119475] manual fan speed control should be enabled first
[29168.111159] manual fan speed control should be enabled first
[29170.094539] manual fan speed control should be enabled first
[29187.119961] manual fan speed control should be enabled first
[29196.113113] manual fan speed control should be enabled first
[29199.119590] manual fan speed control should be enabled first
[29211.126157] manual fan speed control should be enabled first
[29214.098257] manual fan speed control should be enabled first
[29217.107755] manual fan speed control should be enabled first
[29229.115177] manual fan speed control should be enabled first
[29242.097319] manual fan speed control should be enabled first
[29325.114063] manual fan speed control should be enabled first
[29333.108686] manual fan speed control should be enabled first
[29449.116469] manual fan speed control should be enabled first
[29455.132518] manual fan speed control should be enabled first
[29471.129284] manual fan speed control should be enabled first
[29480.121633] manual fan speed control should be enabled first
[29640.125839] manual fan speed control should be enabled first
[29981.128248] manual fan speed control should be enabled first
[30199.151363] manual fan speed control should be enabled first
[30204.143080] manual fan speed control should be enabled first
[30211.154484] manual fan speed control should be enabled first
[30226.128368] manual fan speed control should be enabled first
[30228.145612] manual fan speed control should be enabled first
[30236.144778] manual fan speed control should be enabled first
[30243.149198] manual fan speed control should be enabled first
[30245.134568] manual fan speed control should be enabled first
[30248.140668] manual fan speed control should be enabled first
[30362.126900] manual fan speed control should be enabled first
[30909.144940] manual fan speed control should be enabled first
[30910.137533] manual fan speed control should be enabled first
[30920.163730] manual fan speed control should be enabled first
[30931.161975] manual fan speed control should be enabled first
[30932.158340] manual fan speed control should be enabled first
[30933.147783] manual fan speed control should be enabled first
[30944.159956] manual fan speed control should be enabled first
[30958.138767] manual fan speed control should be enabled first
[30977.151665] manual fan speed control should be enabled first
[30996.157518] manual fan speed control should be enabled first
[31025.147100] manual fan speed control should be enabled first
[31029.149391] manual fan speed control should be enabled first
[31030.148760] manual fan speed control should be enabled first

And echo 0 > /sys/class/drm/card1/device/hwmon/hwmon1/pwm1_enable (100% fan) only works in a low power state; as soon as core speeds go up, fan speeds drop ...

aio@aio:~$ sudo pwmconfig
[sudo] password for aio:
# pwmconfig revision $Revision$ ($Date$)
This program will search your sensors for pulse width modulation (pwm)
controls, and test each one to see if it controls a fan on
your motherboard. Note that many motherboards do not have pwm
circuitry installed, even if your sensor chip supports pwm.

We will attempt to briefly stop each fan using the pwm controls.
The program will attempt to restore each fan to full speed after
testing. However, it is ** very important ** that you physically
verify that the fans have been to full speed after the program
has completed.

Found the following devices:
   hwmon0 is acpitz
   hwmon1 is amdgpu
   hwmon2 is coretemp
   hwmon3 is it8620
   hwmon4 is it8792

Found the following PWM controls:
   hwmon1/pwm1           current value: 104
hwmon1/pwm1 is currently setup for automatic speed control.
In general, automatic mode is preferred over manual mode, as it is more
efficient and it reacts faster. Are you sure that you want to setup this
output for manual control? (n) y
hwmon1/pwm1_enable stuck to 2
Manual control mode not supported, skipping hwmon1/pwm1.
aio@aio:/usr/local/bin$ sudo systemctl status amdgpu-fancontrol
● amdgpu-fancontrol.service - amdgpu-fancontrol
   Loaded: loaded (/etc/systemd/system/amdgpu-fancontrol.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2019-12-06 14:45:07 AEST; 3s ago
 Main PID: 23922 (amdgpu-fancontr)
    Tasks: 2 (limit: 4915)
   Memory: 3.3M
   CGroup: /system.slice/amdgpu-fancontrol.service
           ├─23922 /bin/bash /usr/local/bin/amdgpu-fancontrol
           └─23979 sleep 1

Dec 06 14:45:08 aio amdgpu-fancontrol[23922]: changing pwm to 175
Dec 06 14:45:08 aio amdgpu-fancontrol[23922]: /usr/local/bin/amdgpu-fancontrol: line 65: echo: write error: Invalid argument
Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: current temp: 62000
Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: interpolated pwm value for temperature 62000 is: 175
Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: current pwm: 104, requested to set pwm to 175
Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: Fanmode not set to manual.
Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: setting fan mode to 1
Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: temp at last change was 62000
Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: changing pwm to 175
Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: /usr/local/bin/amdgpu-fancontrol: line 65: echo: write error: Invalid argument

(In reply to MasterCATZ from comment #41)
> aio@aio:/usr/local/bin$ sudo systemctl status amdgpu-fancontrol
> ● amdgpu-fancontrol.service - amdgpu-fancontrol
>    Loaded: loaded (/etc/systemd/system/amdgpu-fancontrol.service; enabled; vendor preset: enabled)
>    Active: active (running) since Fri 2019-12-06 14:45:07 AEST; 3s ago
>  Main PID: 23922 (amdgpu-fancontr)
>     Tasks: 2 (limit: 4915)
>    Memory: 3.3M
>    CGroup: /system.slice/amdgpu-fancontrol.service
>            ├─23922 /bin/bash /usr/local/bin/amdgpu-fancontrol
>            └─23979 sleep 1
>
> Dec 06 14:45:08 aio amdgpu-fancontrol[23922]: changing pwm to 175
> Dec 06 14:45:08 aio amdgpu-fancontrol[23922]: /usr/local/bin/amdgpu-fancontrol: line 65: echo: write error: Invalid argument
> Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: current temp: 62000
> Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: interpolated pwm value for temperature 62000 is: 175
> Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: current pwm: 104, requested to set pwm to 175
> Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: Fanmode not set to manual.
> Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: setting fan mode to 1
> Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: temp at last change was 62000
> Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: changing pwm to 175
> Dec 06 14:45:09 aio amdgpu-fancontrol[23922]: /usr/local/bin/amdgpu-fancontrol: line 65: echo: write error: Invalid argument

I was about to call it a day when I got your email notifications. The line it's talking about is:

echo "$NEW_PWM" > "$FILE_PWM"

So it looks like the "$FILE_PWM" variable is not valid. Remember, you have to change the variables under the comment "hwmon paths, hardcoded for one amdgpu card, adjust as needed" to whatever your system requires. To debug the variables I would execute the 4 lines that set HWMON, FILE_PWM, FILE_FANMODE, and FILE_TEMP from the terminal and see where things are going wrong. I have to go now but I'll try to help you more tomorrow if you're still having problems. But once you have those variables set correctly the script should work.
Here's what the service status output looks like on my system:

   Loaded: loaded (/etc/systemd/system/amdgpu-fancontrol.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2019-12-05 18:16:27 PST; 2h 28min ago
 Main PID: 836 (amdgpu-fancontr)
    Tasks: 2 (limit: 4915)
   Memory: 7.7M
   CGroup: /system.slice/amdgpu-fancontrol.service
           ├─  836 /bin/bash /usr/local/bin/amdgpu-fancontrol
           └─14235 sleep 1

Dec 05 20:44:46 Entropod amdgpu-fancontrol[836]: changing pwm to 80
Dec 05 20:44:47 Entropod amdgpu-fancontrol[836]: current temp: 49000
Dec 05 20:44:47 Entropod amdgpu-fancontrol[836]: interpolated pwm value for temperature 49000 is: 90
Dec 05 20:44:47 Entropod amdgpu-fancontrol[836]: current pwm: 76, requested to set pwm to 90
Dec 05 20:44:47 Entropod amdgpu-fancontrol[836]: temp at last change was 48000
Dec 05 20:44:47 Entropod amdgpu-fancontrol[836]: changing pwm to 90
Dec 05 20:44:48 Entropod amdgpu-fancontrol[836]: current temp: 48000
Dec 05 20:44:48 Entropod amdgpu-fancontrol[836]: interpolated pwm value for temperature 48000 is: 80
Dec 05 20:44:48 Entropod amdgpu-fancontrol[836]: current pwm: 86, requested to set pwm to 80
Dec 05 20:44:48 Entropod amdgpu-fancontrol[836]: not changing pwm, we just did at 49000, next change when below 43000

The file is correct .. and you can tell that because it's reading the temp ("current pwm: 76").

The error is because NOTHING is being allowed to edit pwm1_enable; it is stuck on auto, so nothing can manually change pwm1. But if there is an error in my adjustments, let me know:

# hwmon paths, hardcoded for one amdgpu card, adjust as needed
HWMON=$(ls /sys/class/drm/card1/device/hwmon/hwmon1)
FILE_PWM=$(echo /sys/class/drm/card1/device/hwmon/hwmon1/pwm1)
FILE_FANMODE=$(echo /sys/class/drm/card1/device/hwmon/hwmon1/pwm1_enable)
FILE_TEMP=$(echo /sys/class/drm/card1/device/hwmon/hwmon1/temp1_input)

aio@aio:~$ ls /sys/class/drm/card1/device/hwmon/hwmon1
device       freq1_input  name            pwm1         temp1_crit_hyst
fan1_enable  freq1_label  power           pwm1_enable  temp1_input
fan1_input   freq2_input  power1_average  pwm1_max     temp1_label
fan1_max     freq2_label  power1_cap      pwm1_min     uevent
fan1_min     in0_input    power1_cap_max  subsystem
fan1_target  in0_label    power1_cap_min  temp1_crit
aio@aio:~$ cat /sys/class/drm/card1/device/hwmon/hwmon1/pwm1
68
aio@aio:~$ cat /sys/class/drm/card1/device/hwmon/hwmon1/pwm1_enable
2
aio@aio:~$ cat /sys/class/drm/card1/device/hwmon/hwmon1/temp1_input
54000
aio@aio:~$

(In reply to MasterCATZ from comment #43)
> the file is correct .. and you can tell that because its reading the temp
> "current pwm: 76"
>
> error is because NOTHING is being allowed to edit pwm1_enable it is stuck on
> auto so nothing can manually change pwm1
>
> but if their is an error in my adjustments let me know
>
> # hwmon paths, hardcoded for one amdgpu card, adjust as needed
> HWMON=$(ls /sys/class/drm/card1/device/hwmon/hwmon1)
> FILE_PWM=$(echo /sys/class/drm/card1/device/hwmon/hwmon1/pwm1)
> FILE_FANMODE=$(echo /sys/class/drm/card1/device/hwmon/hwmon1/pwm1_enable)
> FILE_TEMP=$(echo /sys/class/drm/card1/device/hwmon/hwmon1/temp1_input)

Your variables are set wrong. If your GPU is card1 they should be:

HWMON=$(ls /sys/class/drm/card1/device/hwmon)
FILE_PWM=$(echo /sys/class/drm/card1/device/hwmon/$HWMON/pwm1)
FILE_FANMODE=$(echo /sys/class/drm/card1/device/hwmon/$HWMON/pwm1_enable)
FILE_TEMP=$(echo /sys/class/drm/card1/device/hwmon/$HWMON/temp1_input)

The "HWMON" variable is there to determine which actual hardware monitor is being used, because it can change whenever you boot.
One time it could be hwmon1, the next time hwmon3, etc. So you can't hard-code it as you're doing. You have to use the $HWMON variable to set FILE_PWM, FILE_FANMODE, and FILE_TEMP.

There is also the possibility to use question marks in the path:

/sys/class/drm/card?/device/hwmon/hwmon?

(In reply to Jan Ziak (http://atom-symbol.net) from comment #46)
> There is also the possibility to use question marks in the path:
>
> /sys/class/drm/card?/device/hwmon/hwmon?

Thank you for mentioning that. If you only have one GPU that will indeed work. I have multiple GPUs, one Nvidia and one AMD, so I have to hard-code the card.

(In reply to muncrief from comment #47)
> (In reply to Jan Ziak (http://atom-symbol.net) from comment #46)
> > There is also the possibility to use question marks in the path:
> >
> > /sys/class/drm/card?/device/hwmon/hwmon?
>
> Thank you for mentioning that. If you only have one GPU that will indeed
> work. I have multiple GPUs, one Nvidia and one AMD, so I have to hard-code
> the card.

Maybe you can use the PCI ID of the device:

FOUND=false
for CARD in /sys/class/drm/card?; do
    DEVICE="$(cat "$CARD/device/device")"
    if [[ "${DEVICE,,}" == 0x67b1 ]]; then
        FOUND=true
        break
    fi
done
$FOUND || exit 1
HWMON=$CARD/device/hwmon/hwmon?
echo $HWMON

(In reply to Jan Ziak (http://atom-symbol.net) from comment #48)
> (In reply to muncrief from comment #47)
> > (In reply to Jan Ziak (http://atom-symbol.net) from comment #46)
> > > There is also the possibility to use question marks in the path:
> > >
> > > /sys/class/drm/card?/device/hwmon/hwmon?
> >
> > Thank you for mentioning that. If you only have one GPU that will indeed
> > work. I have multiple GPUs, one Nvidia and one AMD, so I have to hard-code
> > the card.
>
> Maybe you can use the PCI ID of the device:
>
> FOUND=false
> for CARD in /sys/class/drm/card?; do
>     DEVICE="$(cat "$CARD/device/device")"
>     if [[ "${DEVICE,,}" == 0x67b1 ]]; then
>         FOUND=true
>         break
>     fi
> done
> $FOUND || exit 1
> HWMON=$CARD/device/hwmon/hwmon?
> echo $HWMON

Well, my system works great the way it is and I don't really have time to do any further debugging or redesign. I'm just trying to help MasterCATZ get things going. However, that's another great way to determine where a specific card is - thank you for the multiple great suggestions!

It's great to see so many people trying to help; we need more of that in Linux, especially with Arch and its derivative distros. It's very irritating and frustrating when I see experienced users simply tell others to "read the wiki", or expect them to use Linux for two years before they can have a usable installation. In fact that kind of old, outdated, and downright mean attitude is one of the reasons Linux still has such a low share of the desktop market.
So whenever I see someone who needs help I try to make it as easy as I can for them, and have even been insulted numerous times by the cruel people who are angered that I don't just tell others to get a PhD or something :)

Thanks for the correction, I was unsure if $HWMON knew to go to hwmon1.

It works until the GPU hits 70 deg, then something forces "pwm1_enable" to auto and starts ramping the fan speed down until it's at 20% @ 90+ deg and blipping 100% @ 95 deg. For now all I can do is run a custom BIOS with 800 memory speed and 850 core, keep toggling between standard and performance mode to reset the fan speed to 100% (and redo that every time it drops back below 40%), and set /sys/class/drm/card1/device/hwmon/hwmon1/power1_cap to under 140 W so the GPU does not cook. So unless it's "Radeon Profile" doing something to get locked out, I have no idea. Its fan profile should be:

over 70 deg: 1:1 ratio
under 60 deg: 50%
under 50 deg: 10%
under 40 deg: 5%
under 20 deg: 0%

Any way to find out what is accessing pwm1_enable?

current temp: 61000
interpolated pwm value for temperature 61000 is: 170
current pwm: 165, requested to set pwm to 170
current pwm: 165, requested to set pwm to 170
temp at last change was 61000
changing pwm to 170

current temp: 71000
current pwm: 255, requested to set pwm to 255
current pwm: 255, requested to set pwm to 255
not changing pwm, we just did at 71000, next change when below 66000

current temp: 73000
current pwm: 68, requested to set pwm to 255
current pwm: 68, requested to set pwm to 255
Fanmode not set to manual.
setting fan mode to 1
temp at last change was 73000
changing pwm to 255
/usr/local/bin/amdgpu-fancontrol: line 65: echo: write error: Invalid argument

current temp: 87000
current pwm: 124, requested to set pwm to 255
current pwm: 124, requested to set pwm to 255
Fanmode not set to manual.
setting fan mode to 1
temp at last change was 87000
changing pwm to 255
/usr/local/bin/amdgpu-fancontrol: line 65: echo: write error: Invalid argument

(In reply to MasterCATZ from comment #51)
> current temp: 61000
> interpolated pwm value for temperature 61000 is: 170
> current pwm: 165, requested to set pwm to 170
> current pwm: 165, requested to set pwm to 170
> temp at last change was 61000
> changing pwm to 170
>
> current temp: 71000
> current pwm: 255, requested to set pwm to 255
> current pwm: 255, requested to set pwm to 255
> not changing pwm, we just did at 71000, next change when below 66000
>
> current temp: 73000
> current pwm: 68, requested to set pwm to 255
> current pwm: 68, requested to set pwm to 255
> Fanmode not set to manual.
> setting fan mode to 1
> temp at last change was 73000
> changing pwm to 255
> /usr/local/bin/amdgpu-fancontrol: line 65: echo: write error: Invalid
> argument
>
> current temp: 87000
> current pwm: 124, requested to set pwm to 255
> current pwm: 124, requested to set pwm to 255
> Fanmode not set to manual.
> setting fan mode to 1
> temp at last change was 87000
> changing pwm to 255
> /usr/local/bin/amdgpu-fancontrol: line 65: echo: write error: Invalid
> argument

Well, that's certainly quite bizarre. I wish I could think of something else but I'm stumped. I've never experienced that problem on my system, and I don't know why yours isn't allowing the write. Is it possible there was some error in copying the script? It seems unlikely, but that's all I can come up with at this point. If you have somewhere I can send my actual script and service files I'd be happy to send them to you. Otherwise I'm just out of ideas.
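There is no answer in this thread to what exactly keeps rewriting pwm1_enable, but a small poller can at least timestamp the moment the mode flips back to auto. A hedged sketch follows; the card1/hwmon1 path and the 1-second interval are assumptions, and it shows when the flip happens, not which process caused it.

#!/bin/bash
# Hypothetical watcher: log every change of pwm1_enable together with the pwm
# duty cycle and temperature at that moment. Adjust the card/hwmon path as needed.
H=/sys/class/drm/card1/device/hwmon/hwmon1
LAST=""
while sleep 1; do
    MODE=$(cat "$H/pwm1_enable")
    if [ "$MODE" != "$LAST" ]; then
        echo "$(date '+%F %T') pwm1_enable=$MODE pwm1=$(cat "$H/pwm1") temp1_input=$(cat "$H/temp1_input")"
        LAST="$MODE"
    fi
done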
It's been like this since mid kernel 4's. I just wish I knew what's locking that file - root has no write permission, and it seems to activate @ 70 deg, which will be reached even if I run the fan at 100%, unless I underclock. amdgpu-pro just turns the PC into a paperweight so I don't use that, and the radeon drivers suck for any gaming; amdgpu / mesa are what I need to use, and it's been like this since powerplay was introduced.

Ubuntu 18.04, just upgraded to 19.10 - same issues. Currently using 5.4.2-050402-generic.

GRUB_CMDLINE_LINUX_DEFAULT="amdgpu.ppfeaturemask=0xfffd7fff amdgpu.ppfeaturemask=0xffffffff amdgpu.dc=1 amdgpu.cik_support=1 radeon.cik_support=0"

Feature masks seem to make no difference. Maybe I should re-add radeon.si_support=0 amdgpu.si_support=1, as Radeon Profile is showing radeonsi is in use? But I thought the R9 290 was Sea Islands = amdgpu.cik_support=1?

It seems that since kernel 5.6 (or at least Debian's version thereof), I no longer need to fiddle with /sys/class/drm/card0/device/hwmon/hwmon3/pwm1_enable. The default value (1) seems to do the right thing now. Progress! Mind you, lm-sensors is still unable to report fan speed, and gives nonsensical values for crit/hyst temperatures. I have a feeling that further improvements to power management may be possible too.

amdgpu-pci-0100
Adapter: PCI adapter
vddgfx:       +1.00 V
fan1:             N/A  (min =    0 RPM, max =    0 RPM)
edge:         +73.0°C  (crit = +104000.0°C, hyst = -273.1°C)
power1:       58.21 W  (cap = 208.00 W)

Tested on Ubuntu with kernel 5.6.15 - looks better now:

amdgpu-pci-0100
Adapter: PCI adapter
vddgfx:      1000.00 mV
fan1:        1200 RPM  (min =    0 RPM, max = 6000 RPM)
edge:         +72.0°C  (crit = +104000.0°C, hyst = -273.1°C)
power1:      117.00 W  (cap = 216.00 W)

I was going to try and get a fancontrol script working, but I found the following as I started to play around. I got my results on Arch Linux's 5.7.12-arch1-1 kernel.

On boot, pwm1_enable is set to 1 (manual mode). On 4.18.x it is normally set to 2 (automatic mode), iirc. Changing this value to 2 does essentially nothing for me and the fans do not spin up with increasing temp. However, I've found that running pwmconfig and not even answering the first question (i.e. I can just ctrl+c out) causes the automatic temp control to start functioning. So after you run pwmconfig, then change pwm1_enable to 2, everything works again. So far it appears doing these two things gets me the same functionality I had on 4.18.x and I can finally upgrade my kernel. No fan control script needed.

It is interesting to note that if you do these in the opposite order - set pwm1_enable to 2 and then run pwmconfig - you must say yes to enabling manual mode on the GPU's fan before it starts functioning properly. This also causes the fan to run at full speed (like pwm1_enable is set to 0) and you will need to set 2 in pwm1_enable again. I don't know what pwmconfig is modifying to cause pwm to work again... I wish I knew so I could set that up with a script and not have to manually start it, but this is good enough for now as I reboot rarely. Maybe I can make a script to use with systemctl when I'm not lazy.

Hi all! I'm running a Radeon R9 290 with amdgpu. I've had the same issue of pwm1_enable being set to MANUAL on boot, and then being stuck on AUTO after switching to AUTO. I've had a quick browse of the code and have a fix that seems to work for me. See the attached patch for my fix/work-around. Thoughts and explanations follow.

Some comments and questions on the code. My card seems to use the smu7_* code for handling fan- and power-related functionality.
I'm not sure if this is correct, but it seems that MANUAL is simply the default state for the card at boot, and the software (maybe on purpose? it's unclear) mirrors this because there's a variable called fan_ctrl_enabled which is never explicitly initialized, and thus is default-initialized to false, which equates to MANUAL in the get_pwm1_enable() logic, which again means you may set the fan speed manually.

For those who want to take a look themselves, this is roughly what happens when you write 2 (auto) to pwm1_enable:

> amdgpu_pm.c: amdgpu_hwmon_set_pwm1()
> smu7_hwmgr.c: smu7_set_fan_control_mode()
> smu7_thermal.c: smu7_fan_ctrl_set_static_mode()
> smu7_thermal.c: smu7_fan_ctrl_start_smc_fan_control()
>
> // Send PPSMC_StartFanControl with parameter FAN_CONTROL_TABLE
> smumgr.c: smum_send_msg_to_smc_with_parameter
> smu7_thermal.c: hwmgr->fan_ctrl_enabled = true;

Note that fan_ctrl_enabled is now true. When reading pwm1_enable, this is the value that's checked.

Now, this happens when we try to write 1 (manual) to pwm1_enable again:

> amdgpu_pm.c: amdgpu_hwmon_set_pwm1_enable()
> smu7_hwmgr.c: smu7_set_fan_control_mode()
> smu7_hwmgr.c: smu7_fan_ctrl_stop_smc_fan_control
>
> // Now, a so-called phm platform cap is checked
> // See hardwaremanager.h for its definition
> // Its description is simply "Fan is controlled by the SMC microcode."
> if (phm_cap_enabled(hwmgr->platform_descriptor.platformCaps,
>         PHM_PlatformCaps_MicrocodeFanControl))
>     smu7_fan_ctrl_stop_smc_fan_control(hwmgr);

If the above check were to succeed, it would continue to send a smum message of PPSMC_StopFanControl and set fan_ctrl_enabled = false, and we would be back in MANUAL land.

However, the PHM_PlatformCaps_MicrocodeFanControl cap is never set. AFAICT, this cap is only ever set for vega12 and vega20 cards, in vega20_processpptables.c and vega12_processpptables.c. It's checked in a bunch of places for smu7, but never in a way that explicitly prevents manual fan control once manual fan control is enabled, such as after boot.

Simply commenting out the check above fixed the problem for me, and I have seen no strange side effects yet. This makes sense to me; after boot, setting the fan speed manually works, and the code responsible doesn't require the MicrocodeFanControl cap to be set for that. However, I don't know what the purpose of that cap is, or whether the only reason for it being present in smu7 and elsewhere is a case of copy-pasting skeleton code, or what.

From looking at vega10_hwmgr.c, it looks like vega10 (AMDGPU_FAMILY_AI, arctic islands?) cards should have the same problem and I assume the same fix should work, so I included it in the patch. It would be great if someone with an arctic islands card (RX 400 series?) could test and confirm this.

Comments and feedback are very welcome.

Created attachment 293895 [details]
patch to fix pwm1_enable being stuck to AUTO for some gpu smu7 and vega10
Seems to work fine for smu7 (AMD Hawaii PRO Radeon R9 290), needs testing for vega10 (arctic islands).
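For anyone wanting to try the attached patch, a quick, hedged way to check from userspace whether it helped is simply to toggle the mode back and forth and see whether the manual duty cycle sticks. The card0 path is an assumption; run as root.

#!/bin/bash
# Sketch: verify that pwm1_enable is no longer stuck on automatic after the patch.
H=$(echo /sys/class/drm/card0/device/hwmon/hwmon*)   # adjust cardN for your system
echo 2   > "$H/pwm1_enable"; cat "$H/pwm1_enable"    # 2 = automatic
echo 1   > "$H/pwm1_enable"; cat "$H/pwm1_enable"    # 1 = manual; was stuck at 2 before the patch
echo 128 > "$H/pwm1";        cat "$H/pwm1"           # ~50% duty cycle should now stick
echo 2   > "$H/pwm1_enable"                          # hand control back to the driver when done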
Created attachment 293903 [details]
possible fix
The attached patch should fix it.
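A hedged sketch of one way to fetch the attachment from this report and check it against a kernel tree; the attachment ID is the "possible fix" above, and the tree path is only an example.

# Sketch: download the attached patch and test-apply it to a kernel source tree.
cd ~/src/linux-stable
wget -O /tmp/possible-fix.patch 'https://bugzilla.kernel.org/attachment.cgi?id=293903'
git apply --stat  /tmp/possible-fix.patch
git apply --check /tmp/possible-fix.patch && git apply /tmp/possible-fix.patch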
Seems a bit random for me on 5.8.17-050817-generic. Sometimes I can spend weeks with fan control, then all of a sudden I find it hitting 100 deg because it keeps spinning back down to the 20% range. For me this seems to happen when I am using multiple monitors; it does not seem to happen when using a single display. However, this could be related to the memory actually idling. I will give your code a go and put it into auto mode - it's summer here, so thermals are reached quickly.

I have the GPU idling at 300 core / 100 memory @ 47.8% fan manually; this seems to keep it at the same temp as the rest of the system, just so the dreaded auto is not triggered, or else memory goes to 800 and the fan drops to the 20s with the 100% blips @ 95+ deg. Now it just runs 100% @ 300 MHz core, 100 MHz memory @ ~60 deg.

aio@aio:/sys/class/drm/card0/device/hwmon/hwmon1$ sensors
k10temp-pci-00c3
Adapter: PCI adapter
Vcore:         1.38 V
Vsoc:          1.08 V
Tctl:         +79.8°C
Tdie:         +79.8°C
Tccd1:        +66.2°C
Tccd2:        +61.0°C
Icore:        32.00 A
Isoc:         10.00 A

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +16.8°C  (crit = +20.8°C)
temp2:        +16.8°C  (crit = +20.8°C)

amdgpu-pci-0b00
Adapter: PCI adapter
fan1:        2884 RPM  (min =    0 RPM, max = 6000 RPM)
edge:         +59.0°C  (crit = +104000.0°C, hyst = -273.1°C)
power1:       12.15 W  (cap = 208.00 W)

aio@aio:/sys/class/drm/card0/device/hwmon/hwmon1$ pwmconfig
You need to be root to run this script.
aio@aio:/sys/class/drm/card0/device/hwmon/hwmon1$ sudo pwmconfig
[sudo] password for aio:
# pwmconfig version 3.6.0
This program will search your sensors for pulse width modulation (pwm)
controls, and test each one to see if it controls a fan on
your motherboard. Note that many motherboards do not have pwm
circuitry installed, even if your sensor chip supports pwm.

We will attempt to briefly stop each fan using the pwm controls.
The program will attempt to restore each fan to full speed after
testing. However, it is ** very important ** that you physically
verify that the fans have been to full speed after the program
has completed.

Found the following devices:
   hwmon0 is acpitz
   hwmon1 is amdgpu
   hwmon2 is k10temp
   hwmon3 is hidpp_battery_2

Found the following PWM controls:
   hwmon1/pwm1           current value: 122
Giving the fans some time to reach full speed...
Found the following fan sensors:
   hwmon1/fan1_input     current speed: 5499 RPM

Warning!!! This program will stop your fans, one at a time,
for approximately 5 seconds each!!!
This may cause your processor temperature to rise!!!
If you do not want to do this hit control-C now!!!
Hit return to continue:

Testing pwm control hwmon1/pwm1 ...
  hwmon1/fan1_input ... speed was 5499 now 1120
    It appears that fan hwmon1/fan1_input
    is controlled by pwm hwmon1/pwm1
Would you like to generate a detailed correlation (y)? y
    Note: If you had gnuplot installed, I could generate a graphical plot.

    PWM 255 FAN 5508
    PWM 240 FAN 5492
    PWM 225 FAN 5245
    PWM 210 FAN 4962
    PWM 195 FAN 4659
    PWM 180 FAN 4328
    PWM 165 FAN 3974
    PWM 150 FAN 3567
    PWM 135 FAN 3140
    PWM 120 FAN 2747
    PWM 105 FAN 2320
    PWM 90 FAN 1892
    PWM 75 FAN 1476
    PWM 60 FAN 1178
    PWM 45 FAN 1092
    PWM 30 FAN 1083
    PWM 28 FAN 1082
    PWM 26 FAN 1081
    PWM 24 FAN 1080
    PWM 22 FAN 1081
    PWM 20 FAN 1080
    PWM 18 FAN 1079
    PWM 16 FAN 1080
    PWM 14 FAN 1079
    PWM 12 FAN 1079
    PWM 10 FAN 1080
    PWM 8 FAN 1080
    PWM 6 FAN 1079
    PWM 4 FAN 1080
    PWM 2 FAN 1079
    PWM 0 FAN 1080

Testing is complete.
Please verify that all fans have returned to their normal speed.

The fancontrol script can automatically respond to temperature changes
of your system by changing fanspeeds.
Do you want to set up its configuration file now (y)?
y
What should be the path to your fancontrol config file (/etc/fancontrol)?

Select fan output to configure, or other action:
1) hwmon1/pwm1        3) Just quit          5) Show configuration
2) Change INTERVAL    4) Save and quit
select (1-n): 1

Devices:
hwmon0 is acpitz
hwmon1 is amdgpu
hwmon2 is k10temp
hwmon3 is hidpp_battery_2

Current temperature readings are as follows:
hwmon0/temp1_input    16
hwmon0/temp2_input    16
hwmon1/temp1_input    59
hwmon2/temp1_input    83
hwmon2/temp2_input    83
hwmon2/temp3_input    62
hwmon2/temp4_input    59

Select a temperature sensor as source for hwmon1/pwm1:
1) hwmon0/temp1_input
2) hwmon0/temp2_input
3) hwmon1/temp1_input
4) hwmon2/temp1_input
5) hwmon2/temp2_input
6) hwmon2/temp3_input
7) hwmon2/temp4_input
8) None (Do not affect this PWM output)
select (1-n): 3

Enter the low temperature (degree C)
below which the fan should spin at minimum speed (20): 30

Enter the high temperature (degree C)
over which the fan should spin at maximum speed (60): 70

Enter the PWM value (0-255) to use when the temperature
is over the high temperature limit (255): 250

Select fan output to configure, or other action:
1) hwmon1/pwm1        3) Just quit          5) Show configuration
2) Change INTERVAL    4) Save and quit
select (1-n): 4

Saving configuration to /etc/fancontrol...
Configuration saved
aio@aio:/sys/class/drm/card0/device/hwmon/hwmon1$

HEAD is now at 1398820fee51 Linux 5.9.9
aio@aio:/SnapRaidArray/DATA/git/linux-stable$ git apply --stat /SnapRaidArray/DATA/Downloads/
Display all 583 possibilities? (y or n)
aio@aio:/SnapRaidArray/DATA/git/linux-stable$ git apply --stat /SnapRaidArray/DATA/Downloads/dont_check_microcodefancontrol_cap.patch
 .../gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c    | 4 +---
 .../gpu/drm/amd/pm/powerplay/hwmgr/vega10_hwmgr.c  | 3 +--
 2 files changed, 2 insertions(+), 5 deletions(-)
aio@aio:/SnapRaidArray/DATA/git/linux-stable$ git apply --check /SnapRaidArray/DATA/Downloads/dont_check_microcodefancontrol_cap.patch
error: drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c: No such file or directory
error: drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_hwmgr.c: No such file or directory

It seems the path is now drivers/gpu/drm/amd/powerplay/hwmgr/ - no pm subfolder.

Yes, you'll need to adjust the path for pre-5.10 kernels.

Of course, that makes sense! Should've realized that there must be corresponding logic for non-vega12/20 hardware. If this patch works, are you going to submit it or should I? After all, you found it :)

On 01/12/2020 23.47, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=201539
>
> --- Comment #59 from Alex Deucher (alexdeucher@gmail.com) ---
> Created attachment 293903 [details]
>   --> https://bugzilla.kernel.org/attachment.cgi?id=293903&action=edit
> possible fix
>
> The attached patch should fix it.

Unfortunately, your patch leads to a stuck boot. There's some minor "corruption" visible at the bottom of the screen while still booting up, and then it gets stuck. I don't think I mentioned this in the previous posts, but I tried setting this cap myself, in the thermal init function instead of in the process pp tables one, which had the same effect. The boot seems to be stuck completely, since I can't ssh into the box either.

Any suggestions for debugging this crash caused by enabling the MicrocodeFanControl cap are appreciated.
Created attachment 293909 [details]
possible fix
I guess we need to set the fan control parameters. How about this patch?
> I guess we need to set the fan control parameters. How about this patch?
After some quick testing, your latest patch seems to work great! And is this new code, i.e. something not just taken from the other families?
I'll try to get an RX 480 user to test this, but it seems to work wonders for me. A welcome change with this patch is that the default fan control mode at boot is now 2/auto.
Thanks for your help with this! Very much appreciated.
It's pretty similar to the code for other smu7 chips (Tonga, Polaris, etc.). Note that this change is not relevant to newer smu7 chips (RX 480, Tonga, etc.).

Well, I'll have a read! And thanks anyway; I'll run this going forward, post if there are issues, and am looking forward to seeing this in mainline at some point :)

So.. I was also testing this on my Sapphire R9 290 Tri-X OC, and it seems to work pretty well. I noticed an oddity though. The first time I tried it, when I switched to manual fan control, every time I wrote something to pwm1, after one second the thing seemed to reset to the default speed. On subsequent reboots this didn't seem to happen anymore.

Cool! This landed in 5.10. By the way, I was wondering, is there any way to override the default 20~26% minimum speed value? I see that MinimumPWMLimit and zero rpm only landed with later cards, but it seems crazy that not even with a custom BIOS can I fix this.

Finally, this summer the R9 290 GPUs will be manageable. Seems to be working; now I just have to find the old settings I changed when trying to run it at higher rpm - @ 60 deg it's doing 80%+ RPM. Possibly it is now following my BIOS settings from when I was trying to force higher RPM when it kept trying to run under 20%. My manual settings seem to get overwritten a second after setting them, but at least I am not being locked out like before.

Now, if someone could solve the issue where it uses more power when running multiple displays (exactly the same monitors, res / Hz): I can run the card with a single display under 10 watts; plug in another display and it's over 50 watts idle.

(In reply to MasterCATZ from comment #72)
>
> now if someone could solve the issue when it uses more power when running
> multiple displays ( exactly the same monitors res / hz )
> I can run the card single display under 10 watts plug in another display
> and its over 50 watts idle

You can enable mclk switching with identical monitors by setting amdgpu.dcfeaturemask=2. It's enabled by default in 5.11.

GRUB_CMDLINE_LINUX_DEFAULT="usbcore.autosuspend=-1 amdgpu.dcfeaturemask=2 apparmor=0 amdgpu.ppfeaturemask=0xfffd7fff amdgpu.ppfeaturemask=0xffffffff amdgpu.dc=1 amdgpu.cik_support=1 radeon.cik_support=0 radeon.si_support=0 amdgpu.si_support=1"

I am running amdgpu.dcfeaturemask=2? Or are the other attempted feature masks causing issues now?

Kernel 5.10 is working perfectly. I turned off the fancontrol service and am using "marazmista/radeon-profile"; it is following my fan curve perfectly without being locked out. It has been years since my R9 was not cooking from the 20% fan speed issue, even with the core set @ 300 MHz / 100 MHz memory.

An update. Now on 5.10.0-2-amd64. Fresh boot, with amdgpu.dc=1, everything is mostly fine; pwm1_enable=2. Except that after resuming from suspend, pwm1_enable=1 and pwm1=255, resulting in maxed-out fans. Subsequently setting pwm1_enable=2 results in the old buggy behaviour (2000 RPM until 96C). However, if I suspend and resume again, it sometimes goes back to behaving! amdgpu.dc=0 is a bit of a non-starter, as while fan speeds remain low, so does performance. In all cases temp1_crit and temp1_crit_hyst still have crazy values (104000000 and -273).

The later 390 series cards (Grenada Pro) were also affected by an inability to set the correct fan speeds in 5.4.0. Because the cards would not run their fans at greater than 20% of their max RPM under load, it did destroy at least one 390 when it ran without proper cooling at 105 C for an extended period of time.
This patch saved the other R9 390 card in the other machine. Thank you. As of 5.11, it still reports temp1_crit and temp1_crit_hyst as 104000000 and -273.15.
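For anyone checking whether their kernel still exposes the bogus limits reported above, the raw values that sensors formats can be read straight from sysfs (millidegrees C). On 4.18 the same card reported values on the order of 120000 / 90000, while the buggy kernels report 104000000 and an absolute-zero hysteresis; the card0 path is an assumption.

# Sketch: read the raw critical/hysteresis limits behind the sensors output above.
for f in temp1_crit temp1_crit_hyst temp1_input; do
    printf '%-16s %s\n' "$f:" "$(cat /sys/class/drm/card0/device/hwmon/hwmon*/$f)"
done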