Kernel Bug Tracker – Bug 29572
Radeon card reports wrong temperature when switched off
Last modified: 2012-08-16 11:03:31 UTC
In previous kernel versions (I tested 2.6.37), when I switched off the radeon card using vga_switcheroo, libsensors correctly reported the radeon temperature as 0° (or invalid).
This no longer happens with the latest kernel. Since commit 20d391d72519527d2266a0166490118b40ff998d (I figure), when my radeon card has been switched off (or after a suspend/resume cycle), sensors reports:
Adapter: PCI adapter
This is obviously impossible.
When the GPU is powered down, the temperature is undefined as the hw sensor only works when the GPU is powered up and the mmio bar is mapped. 0°C or +2147355.6°C are both wrong.
Ok, both are wrong... But I'd prefer that 0° would be shown (also as a confirmation that the card is OFF) instead of an invalid value...
Retaining the old behavior is desirable.
The old behavior was wrong. The temperature value in the register was interpreted incorrectly prior to my recent patch (improper handling of signed values). Also, if the card is disabled, the value of the mmio registers is undefined.
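To illustrate the kind of signed-value handling mentioned here, the following is a minimal sketch, not the actual radeon code. The 9-bit field width, the whole-degree unit, and all function names are assumptions made up for the example; the real register layout is not given in this report.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: a 9-bit two's-complement reading in whole degrees Celsius,
 * stored in the low bits of a 32-bit register. Hypothetical layout only. */
#define TEMP_FIELD_BITS 9u

/* Sign-extend the low `bits` bits of `raw` to a full int32_t. */
static int32_t sign_extend_field(uint32_t raw, unsigned bits)
{
    assert(bits > 0 && bits < 32);
    uint32_t mask = (1u << bits) - 1u;
    uint32_t sign = 1u << (bits - 1);
    uint32_t v = raw & mask;
    /* (v ^ sign) - sign turns values with the sign bit set into negatives. */
    return (int32_t)(v ^ sign) - (int32_t)sign;
}

/* Correct interpretation: signed field, result in millidegrees Celsius. */
static int32_t temp_millicelsius(uint32_t reg)
{
    return sign_extend_field(reg, TEMP_FIELD_BITS) * 1000;
}

/* The bug class: the same field treated as unsigned. */
static int32_t temp_millicelsius_unsigned(uint32_t reg)
{
    return (int32_t)(reg & ((1u << TEMP_FIELD_BITS) - 1u)) * 1000;
}
```

With this assumed layout, a raw field of 0x1FF decodes to -1°C when sign-extended but to 511°C when read as unsigned. Neither interpretation yields a meaningful result if the register contents are themselves undefined.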
(In reply to comment #4)
> The old behavior was wrong.
Don't care really. We shouldn't change interfaces.
> The temperature value in the register was
> interpreted incorrectly prior to my recent patch (improper handling of signed
> values).
That seems unrelated.
> Also, if the card is disabled, the value of the mmio registers is
> undefined.
So reads should have returned -EINVAL from day one. Too late to fix that. The best thing to do now would be to detect this situation and to return zero, preserving the API.
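The two options under discussion (fail the read with -EINVAL, or detect the powered-off state and return zero) can be sketched roughly as follows. This is a self-contained illustration with assumptions throughout: `struct fake_gpu`, its fields, and both function names are hypothetical and do not come from the radeon driver, and the register is assumed to hold whole degrees Celsius.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for driver state; not the radeon driver's types. */
struct fake_gpu {
    bool powered_on;    /* false once the card has been switched off */
    uint32_t temp_reg;  /* pretend register contents; garbage when off */
};

/* Option A: refuse to report a temperature while the card is off.
 * Returns 0 on success, -EINVAL if no valid reading exists. */
static int read_temp_or_error(const struct fake_gpu *gpu, long *millic)
{
    assert(gpu != NULL && millic != NULL);
    if (!gpu->powered_on)
        return -EINVAL;          /* never touch the register when off */
    *millic = (long)gpu->temp_reg * 1000;
    return 0;
}

/* Option B: always succeed, reporting a fixed 0 while the card is off. */
static long read_temp_or_zero(const struct fake_gpu *gpu)
{
    assert(gpu != NULL);
    if (!gpu->powered_on)
        return 0;                /* fixed, predictable value */
    return (long)gpu->temp_reg * 1000;
}
```

Both variants avoid interpreting the register while the card is off. They differ only in what the caller sees: an explicit error or a value that looks like a reading.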
The previous behavior was undefined; it just happened to be 0 for one user. It's reading back a register from an MMIO aperture on a disabled PCI device. It might read back as 50 for someone else in the same situation.
None of the temperatures potentially returned are accurate when the device is disabled.
The interface changed. Why is this so hard to understand? Change it back! It's two lines of code, I expect.
I bet everyone's machine was previously reading zero. Now it's reading random crap. Random crap which can lead userspace to think that the machine is overheating, which could have fairly serious consequences.
We should never have made that temperature readable when the hardware is disabled. Now we have done so, we should return a safe and predictable result. Not random non-back-compatible crap!
I'm personally not convinced that 0 is the way to go, because it is quite close to normal temperatures. This looks like a bug in the interface, which should return -EINVAL. I know that the interface shouldn't change whenever possible, but this looks like a real interface bug. I also guess that it is not defined anywhere that 0°C means "OFF".
And, by the way, think about the cooling with liquid nitrogen (on some advanced gamer PCs). Is 0°C out of range? Another, maybe more important, question - is 0°C out of the valid value range of the sensor?
Just my 2 cents, ignore me if I said something wrong :-)
(In reply to comment #8)
> I bet everyone's machine was previously reading zero. Now it's reading random
> crap.
It was always reading back random crap, that's my point! There was never a special case to return 0 when the card is disabled. Now we could return some fixed value when the card is disabled, but as Oldřich noted, is 0 really a reasonable value?
(In reply to comment #10)
> is 0 really a reasonable value?
Well no, not really. I assume that a machine will work OK in -30C ambient, in which case the chip might actually be running at 0C. That doesn't seem terribly harmful though.
If you use the value 0°C to mean "OFF" while it is also a valid reading, then you pass the decision of whether the card is off out of the kernel and onto the software reading the temperature. That software could not trust the value 0, which would have a double meaning, and it would have to verify the card's state from other sources.
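Seen from the software reading the temperature, the double meaning looks roughly like this. It is only a sketch: both conventions and both function names are hypothetical, and values are assumed to be in millidegrees Celsius.

```c
#include <assert.h>
#include <stdbool.h>

/* Convention 1 (hypothetical): the driver reports 0 when the card is off.
 * The reader has only the value to go on. */
static bool card_looks_off_sentinel(long millicelsius)
{
    return millicelsius == 0;
}

/* Convention 2 (hypothetical): the read itself fails with a negative
 * error code when no valid temperature exists. */
static bool reading_unavailable(int read_status)
{
    assert(read_status <= 0);
    return read_status < 0;
}
```

Under the first convention, a powered-on chip that genuinely sits at 0°C is indistinguishable from a card that is off, so the reader has to consult another source. Under the second, the value and the status travel separately and a reading of 0 stays a plain reading.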