Bug 29842

Summary: Radeon runs very hot
Product: Drivers
Reporter: Phillip Susi (phill)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: NEEDINFO
Severity: normal
CC: akpm, alan, alexdeucher, igor, mjg59-kernel, mjmeehan, rjw
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.3
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: dmesg output for 3.0-rc5 kernel

Description Phillip Susi 2011-02-24 19:24:15 UTC
The newer kernels read the temperature of my Radeon card, and report that it is running at around 80 C on an idle desktop.  For comparison, my CPU is only 34 C.
Comment 1 Phillip Susi 2011-02-25 14:54:00 UTC
To make sure something in the desktop wasn't causing it, I booted the kernel with init=/bin/bash and the temperature still rose to 83 C.  My guess is this is a defect in the firmware, or the driver interface to it, and it is running in an infinite loop instead of going idle.
Comment 2 Andrew Morton 2011-03-01 00:40:31 UTC
What kernel versions were OK?  2.6.37?
Comment 3 Phillip Susi 2011-03-01 03:19:07 UTC
It isn't a regression.  Older kernels did not report the temperature at all.  I noticed an odd pattern today as well.  After a cold boot, the temperature runs up to 80+, but after suspending and resuming, it remains around 66.
Comment 4 Andrew Morton 2011-03-01 03:43:47 UTC
Is this the same as bug #29572?
Comment 5 Phillip Susi 2011-03-01 03:49:19 UTC
No.
Comment 6 Alex Deucher 2011-03-01 21:06:17 UTC
Rafael, why is this marked as a regression?  The reporter explicitly stated it was not.
Comment 7 Rafael J. Wysocki 2011-03-01 21:09:18 UTC
Presumably by mistake.  Sorry.
Comment 8 Mike Meehan 2011-04-12 01:22:03 UTC
My system is also impacted; I have noticed it since upgrading to Ubuntu Natty. Kernel version 2.6.38-8.41-generic. Under no load my video card is reporting 82 degrees Celsius. The graphics card is an ATI Technologies Inc Barts PRO [ATI Radeon HD 6800 Series]. I'm using the radeon kernel module with the radeondrmfb frame buffer device.

I think it's related to putting the console in framebuffer mode; the card is quiet in text mode.
Comment 9 Mike Meehan 2011-04-12 02:26:49 UTC
# echo low > /sys/class/drm/card0/device/power_profile 
"resolves" the issue. Default power management settings for KMS put the card in high performance mode on AC power.

# echo dynpm > /sys/class/drm/card0/device/power_method
Dynamic frequency scaling may work for you, though the screen flashes when power levels change. Still seems to run too hot. I'm sticking to low for most purposes.

This page was very helpful: https://wiki.archlinux.org/index.php?title=ATI&oldid=135045
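
The two sysfs writes above can be wrapped in a small helper. This is only a sketch: the function names are made up for illustration, and the sets of accepted values ("dynpm"/"profile" and "default"/"auto"/"low"/"mid"/"high") are an assumption based on the radeon KMS power management documentation, not something confirmed in this report. It also assumes that power_profile only takes effect while power_method is "profile". Writing to the real files needs root.

```python
from pathlib import Path

# Assumed sets of values accepted by the radeon KMS sysfs files.
POWER_METHODS = {"dynpm", "profile"}
POWER_PROFILES = {"default", "auto", "low", "mid", "high"}


def set_radeon_power(device_dir, method, profile=None):
    """Write power_method (and optionally power_profile) under device_dir.

    device_dir would be /sys/class/drm/card0/device on the systems
    described in this report; any directory works for a dry run.
    """
    if method not in POWER_METHODS:
        raise ValueError(f"unknown power method: {method!r}")
    if profile is not None:
        if method != "profile":
            raise ValueError("a profile can only be combined with method 'profile'")
        if profile not in POWER_PROFILES:
            raise ValueError(f"unknown power profile: {profile!r}")

    device = Path(device_dir)
    (device / "power_method").write_text(method + "\n")
    if profile is not None:
        (device / "power_profile").write_text(profile + "\n")


def get_radeon_power(device_dir):
    """Return the current contents of both files as a dict."""
    device = Path(device_dir)
    return {
        name: (device / name).read_text().strip()
        for name in ("power_method", "power_profile")
    }
```

For example, set_radeon_power("/sys/class/drm/card0/device", "profile", "low") would correspond to the "echo low" workaround, provided the assumptions above hold.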
Comment 10 Igor Rudchenko 2011-04-12 16:35:47 UTC
This seems to be a regression in my case.

Mobile FireGL V5250, temperature reading from thinkpad-acpi:

2.6.37.2, KMS, profile=default/high - temperature=67
2.6.37.2, KMS, profile=mid - temperature=64

2.6.38.2, KMS, profile=default/high - temperature=71
2.6.38.2, KMS, profile=mid - temperature=64

Some older kernels and Windows with default clocks for GPU - temperature=67
Comment 11 Igor Rudchenko 2011-06-27 12:44:42 UTC
High temperature of mobile radeon is back to normal with pcie_aspm=force.
Comment 12 Rafael J. Wysocki 2011-06-27 19:12:07 UTC
Can you attach dmesg output from your system with the 3.0-rc4 kernel, please?
Comment 13 Igor Rudchenko 2011-07-02 11:57:02 UTC
Created attachment 64452
dmesg output for 3.0-rc5 kernel
Comment 14 Igor Rudchenko 2012-02-22 16:56:23 UTC
Commit "PCI: Rework ASPM disable code", added in 3.0.20 and 3.2.5, has worsened the situation. I can no longer enable ASPM on the ThinkPad T60, even with the "pcie_aspm=force" kernel parameter, so the Radeon always runs hot now.


3.2.4 kernel:

# dmesg | grep ASPM
[    0.000000] PCIe ASPM is forcibly enabled
[    0.161612] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
[    3.612673] e1000e 0000:02:00.0: Disabling ASPM L0s L1

# lspci -vv -s 01:00.0 | grep ASPM
LnkCap:	Port #0, Speed 2.5GT/s, Width x16, ASPM L0s L1, Latency L0 <64ns, L1 <1us
LnkCtl:	ASPM L0s L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+

# cat /proc/acpi/ibm/thermal
temperatures:	49 41 37 68 36 -128 33 -128 42 54 55 -128 -128 -128 -128 -128


3.2.5 kernel:

# dmesg | grep ASPM
[    0.000000] PCIe ASPM is forcibly enabled
[    0.161614] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
[    3.523647] e1000e 0000:02:00.0: Disabling ASPM L0s L1

# lspci -vv -s 01:00.0 | grep ASPM
LnkCap:	Port #0, Speed 2.5GT/s, Width x16, ASPM L0s L1, Latency L0 <64ns, L1 <1us
LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+

# cat /proc/acpi/ibm/thermal
temperatures:	51 41 37 72 36 -128 33 -128 43 55 59 -128 -128 -128 -128 -128


I have also tested kernels 3.2.7 and 3.3-rc4: same problem.
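
The difference between the two kernels is visible in the LnkCtl line of the lspci output. A rough sketch of how that line could be checked automatically follows; the function name is hypothetical, and the regular expression only covers the "ASPM ... Enabled" and "ASPM Disabled" forms seen in the output above.

```python
import re

# Matches the ASPM field of an lspci "LnkCtl:" line, in the two forms
# shown in this report: "ASPM Disabled" or "ASPM L0s L1 Enabled".
_ASPM_RE = re.compile(
    r"ASPM\s+(Disabled|(?:L0s|L1)(?:\s+(?:L0s|L1))*\s+Enabled)"
)


def aspm_from_lnkctl(line):
    """Return a tuple of enabled ASPM states, or None if the line has none.

    () means ASPM is disabled; ("L0s", "L1") means both are enabled.
    LnkCap lines do not match, because they list capabilities without
    the trailing "Enabled" keyword.
    """
    match = _ASPM_RE.search(line)
    if match is None:
        return None
    field = match.group(1)
    if field == "Disabled":
        return ()
    # Drop the trailing "Enabled" word and keep the state names.
    return tuple(field.split()[:-1])
```

Fed the two LnkCtl lines above, this gives ("L0s", "L1") for 3.2.4 and () for 3.2.5.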
Comment 15 Igor Rudchenko 2012-03-19 10:50:43 UTC
I tested the 3.3.0 kernel today and nothing has changed, so I looked deeper into the ASPM registers:

----

Windows XP and Linux kernels prior to 2.6.38:

root complex
00:01.0 0xB0 == 0x03   (L1 and L0s)

video card
01:00.0 0x68 == 0x43   (L1 and L0s)

----

Linux 3.2.4:

00:01.0 0xB0 == 0x00   (L0 only)
01:00.0 0x68 == 0x40   (L0 only)

with pcie_aspm=force:

00:01.0 0xB0 == 0x43   (L1 and L0s)
01:00.0 0x68 == 0x43   (L1 and L0s)

----

Linux 3.2.5 and 3.3.0:

00:01.0 0xB0 == 0x40   (L0 only)
01:00.0 0x68 == 0x40   (L0 only)

with pcie_aspm=force:

00:01.0 0xB0 == 0x40   (L0 only)
01:00.0 0x68 == 0x40   (L0 only)

----
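
For reading the values above: in the PCIe Link Control register, bit 0 enables L0s, bit 1 enables L1, and bit 6 is the common clock configuration bit (the "CommClk" flag in lspci). The offsets 0xB0 and 0x68 are specific to these two devices. A small decoder, with names chosen here only for illustration, might look like this:

```python
# Bit positions in the low byte of the PCIe Link Control register.
ASPM_L0S = 0x01      # bit 0: L0s entry enabled
ASPM_L1 = 0x02       # bit 1: L1 entry enabled
COMMON_CLOCK = 0x40  # bit 6: common clock configuration


def decode_link_control(value):
    """Decode the ASPM and common-clock bits of a Link Control byte."""
    return {
        "L0s": bool(value & ASPM_L0S),
        "L1": bool(value & ASPM_L1),
        "CommClk": bool(value & COMMON_CLOCK),
    }
```

So 0x43 means L0s and L1 enabled with common clock set, 0x40 means common clock only (no ASPM), and 0x03 means L0s and L1 enabled without the common clock bit.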

ASPM also works for my network devices (ethernet and wireless) with Windows XP and with kernels prior to 2.6.38. Since 2.6.38, ASPM doesn't turn on for the network devices even with the force option. And after another rework of the ASPM code in the 3.0.20, 3.2.5 and 3.3 kernels, ASPM doesn't turn on for the video card either, despite the force option.

I can enable ASPM on my devices with setpci:

setpci -s 00:01.0 0xB0.b=0x3:3
setpci -s 01:00.0 0x68.b=0x3:3

It works without problems, just as it did before the 2.6.38 kernel. But, in my opinion, the ASPM handling code in Linux definitely needs another rework.
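
A note on the "0x3:3" syntax in the setpci commands above: setpci takes value:mask, and only the bits set in the mask are changed, so the common clock bit (0x40) is left alone. The arithmetic can be sketched as follows (the function name is made up for this illustration):

```python
def masked_write(old, value, mask):
    """Return the byte that a setpci 'value:mask' write would store.

    Bits set in mask are taken from value; all other bits keep
    their old contents. Everything is limited to one byte (.b).
    """
    return ((old & ~mask) | (value & mask)) & 0xFF
```

With the register at 0x40, masked_write(0x40, 0x3, 0x3) gives 0x43, which matches the value seen on the kernels where ASPM works.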
Comment 16 Matthew Garrett 2012-03-19 14:01:30 UTC
Your network driver is explicitly turning off ASPM, so that's completely unrelated to the core ASPM handling code. pcie_aspm=force will only enable ASPM handling, it won't change the policy. If your BIOS didn't enable L1 and you want L1 enabled, you have to set the policy to powersave.
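
For reference, the policy mentioned here is exposed at /sys/module/pcie_aspm/parameters/policy, which lists the available policies with the active one in square brackets; writing a policy name such as powersave to that file (as root) selects it. A minimal sketch of reading that format, with a hypothetical function name:

```python
def parse_aspm_policy(text):
    """Parse the contents of the pcie_aspm 'policy' parameter file.

    The file holds a space-separated list of policy names with the
    active one wrapped in brackets, e.g.
    "[default] performance powersave".
    Returns (active_policy, list_of_all_policies).
    """
    active = None
    policies = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        policies.append(token)
    return active, policies
```

This only reports which policy the kernel has selected; whether the Link Control registers actually change still has to be checked with lspci or setpci.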
Comment 17 Igor Rudchenko 2012-03-19 16:35:11 UTC
To Matthew Garrett:

I agree about the network cards, but the current situation with the video card worries me a great deal. Before the 3.0.20, 3.2.5 and 3.3 kernels, ThinkPad T60 users could simply add "pcie_aspm=force" to the kernel command line and get ASPM working for their Radeon card. After your last patch to the ASPM code, we can't get ASPM working through kernel parameters or sysfs. "pcie_aspm=force" does nothing except allow the policy to be changed, and changing the policy does nothing either! I tried switching to powersave and then checked the registers: I got the same 0x40 values. So writing the registers directly with setpci is now our only way to get ASPM working, and that is not how it should be.