Bug 29572

Summary: Radeon card reports wrong temperature when switched off
Product: Drivers Reporter: Marco Trevisan (Treviño) (mail)
Component: Video(DRI - non Intel)Assignee: drivers_video-dri
Status: RESOLVED OBSOLETE    
Severity: normal CC: akpm, alan, alexdeucher, mail, oldium.pro
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.38-rc5 Subsystem:
Regression: Yes Bisected commit-id:

Description Marco Trevisan (Treviño) 2011-02-21 10:15:54 UTC
In previous versions of the kernel (I've tested the 2.6.37 version) when the I switched off the radeon card using vga_switcheroo libsensors was correctly reporting that the radeon temperature was 0° (or invalid).

This doesn't happen any more using the latest kernel, in fact (after the commit 20d391d72519527d2266a0166490118b40ff998d, I figure) when my radeon card has been switched off (or after a suspend/resume cycle) sensors indicates:

radeon-pci-0100
Adapter: PCI adapter
temp1:      +2147355.6°C

This is obiouvsly impossible.
Comment 1 Alex Deucher 2011-02-21 15:38:00 UTC
When the GPU is powered down, the temperature is undefined as the hw sensor only works when the GPU is powered up and the mmio bar is mapped.  0°C or +2147355.6°C are both wrong.
Comment 2 Marco Trevisan (Treviño) 2011-02-21 16:24:51 UTC
Ok, both are wrong... But I'd prefer that 0° would be shown (also as a confirmation that the card is OFF) instead of an invalid value...
Comment 3 Andrew Morton 2011-02-24 00:13:17 UTC
Retaining the old behavior is desirable.
Comment 4 Alex Deucher 2011-02-24 04:36:28 UTC
The old behavior was wrong.  The temperature value in the register was interpreted incorrectly prior to my recent patch (improper handling of signed values).  Also, if the card is disabled, the value of the mmio registers is undefined.
Comment 5 Andrew Morton 2011-02-24 05:14:11 UTC
(In reply to comment #4)
> The old behavior was wrong.

Don't care really. We shouldn't change interfaces.
 
>  The temperature value in the register was
> interpreted incorrectly prior to my recent patch (improper handling of signed
> values).

That seems unrelated.

>  Also, if the card is disabled, the value of the mmio registers is
> undefined.

So reads should have returned -EINVAL from day one.  Too late to fix that.  The best thing to do now would be to detect this situation and to return zero, preserving the API.
Comment 6 Alex Deucher 2011-02-24 06:19:12 UTC
The previous behavior was undefined; it just happened to be 0 for one user.  It's reading back a register from an MMIO aperture on a disabled PCI device.  It might read back as 50 for someone else in the same situation.
Comment 7 Alex Deucher 2011-02-24 06:21:01 UTC
None of the temperatures potentially returned are accurate when the device is disabled.
Comment 8 Andrew Morton 2011-02-24 08:14:46 UTC
The interface changed.  Why is this so hard to understand?  Change it back!  It's two lines of code, I expect.

I bet everyone's machine was previously reading zero.  Now it's reading random crap.  Random crap which can lead userspace to think that the machine is overheating, which could have fairly serious consequences.

We should never have made that temperature readable when the hardware is disabled.  Now we have done so, we should return a safe and predictable result.  Not random non-back-compatible crap!
Comment 9 Oldřich Jedlička 2011-02-24 10:54:32 UTC
I'm personally not convinced that number 0 is the way to go, because it is quite close to normal temperatures. It looks like a bug in interface and that it should return -EINVAL. I know that the interface shouldn't change whenever possible, but this looks like a real interface bug. I also guess it is nowhere defined that 0°C means "OFF".

And, by the way, think about the cooling with liquid nitrogen (on some advanced gamer PCs). Is 0°C out of range? Another, maybe more important, question - is 0°C out of the valid value range of the sensor?

Just my 2 cents, ignore me if I said something wrong :-)
Comment 10 Alex Deucher 2011-02-24 17:36:13 UTC
(In reply to comment #8)
> I bet everyone's machine was previously reading zero.  Now it's reading
> random
> crap.

It was always reading back random random crap, that's my point!  There was never a special case to return 0 when the card is disabled.  Now we could return some fixed value when the card is disabled, but as Oldřich noted, is 0 really a reasonable value?
Comment 11 Andrew Morton 2011-02-24 18:59:33 UTC
(In reply to comment #10)
> is 0 really a reasonable value?

Well no, not really.  I assume that a machine will work OK in -30C ambient, in which case the chip might actually be running at 0C.  That doesn't seem terribly harmful though.
Comment 12 Oldřich Jedlička 2011-02-25 06:17:11 UTC
If you use the value 0°C as "OFF" while it is really a valid value, then you pass the decision whether the card is off outside of the kernel to the software reading the temperature. So that the software could not trust the value 0 which would have double meaning and it would have to verify it from other sources.