Bug 8866

Summary: k8temp sensor displays wrong temperature
Product: Drivers Reporter: Andrey Panov (panov)
Component: Hardware Monitoring Assignee: Rudolf Marek (r.marek)
Status: RESOLVED CODE_FIX    
Severity: normal CC: alan, drescherjm, herrmann.der.user, jdelvare, just.for.lkml, w.kless
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.22.1 Subsystem:
Regression: --- Bisected commit-id:

Description Andrey Panov 2007-08-08 16:35:44 UTC
I have an Athlon 64 X2 processor:

cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 107
model name      : AMD Athlon(tm) 64 X2 Dual Core Processor 4400+
stepping        : 1
cpu MHz         : 2300.000
cache size      : 512 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy misalignsse
bogomips        : 4602.11
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc 100mhzsteps


The k8temp sensor reports CPU temperatures that are too low (below room temperature):

sensors
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp:  +17C
Core0 Temp:   +3C
Core1 Temp:  +21C
Core1 Temp:   +5C


At the same time, the sensor on the Asus M2V motherboard shows something more reasonable:

it8712-isa-0d00
Adapter: ISA adapter
CPU Temp:    +41C  (low  =    -1C, high =  +127C)   sensor = thermistor

I suspect that temperatures from k8temp should be increased by 30 degrees.
I use slamd64-current linux distribution (64 bit), libsensors and config from lm_sensors-2.10.4.
Comment 1 Torsten Kaiser 2008-04-13 09:03:09 UTC
I have the same problem with a BE-2400.

Even under load, the k8temp reading stays at ~24..26C, even though I'm using a passive cooler that definitely feels warmer than 37C to the touch.

I looked for documentation of register 0xE4 of PCI device 0x1103 and found the PDFs at amd.com:

#32559 (I don't know the URL; I downloaded it nearly a year ago).
On pages 175..177, section 4.6.23, it shows that bits 24 to 28 hold a "TjOffset" used to calculate the real temperature as Tcontrol = CurTmp - TjOffset*2 - 49, where CurTmp is the field currently used by k8temp.
Since bits 24..28 are all zero on my CPU, this did not help.
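The #32559 formula can be written out as a minimal userspace sketch. It assumes CurTmp is bits 23:16 of register 0xE4 (the bits k8temp currently reads) and TjOffset is the 5-bit field in bits 28:24; the function names are hypothetical and not taken from k8temp.

```c
#include <stdint.h>

/* CurTmp as k8temp currently extracts it: bits 23:16 of register 0xE4. */
static int curtmp_from_reg(uint32_t val)
{
	return (int)((val >> 16) & 0xff);
}

/* TjOffset: bits 28:24 (5 bits), as described in #32559 section 4.6.23. */
static int tjoffset_from_reg(uint32_t val)
{
	return (int)((val >> 24) & 0x1f);
}

/* Tcontrol = CurTmp - TjOffset*2 - 49, in degrees C. */
static int tcontrol_from_reg(uint32_t val)
{
	return curtmp_from_reg(val) - tjoffset_from_reg(val) * 2 - 49;
}
```

With all TjOffset bits zero, as on this CPU, the result is just CurTmp - 49, i.e. the conversion k8temp already uses.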

#31116 http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/31116.pdf

It says about 0xE4 on page 242:
14:8 DiodeOffset. Read-only. Reset: value varies by product. This field is used to specify the correction value applied to thermal diode measurements. See also section 2.10.2 [Thermal Diode] on page 110.
     It is encoded as follows:
       00h is undefined.
       01h to 3Fh: correction = +11C - DiodeOffset, or {01h to 3Fh} = {+10C to -52C}.
       40h to 7Fh: undefined.
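The encoding above can be decoded as in this minimal sketch. It follows the #31116 (family 10h) layout with the field in bits 14:8; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* DiodeOffset: bits 14:8 of register 0xE4 in the #31116 layout.
 * (#32559 for K8 does not include bit 14 in this field.) */
static unsigned int diode_offset_field(uint32_t val)
{
	return (val >> 8) & 0x7f;
}

/* Decode the field into a correction in degrees C.
 * Returns false for the undefined encodings 00h and 40h..7Fh. */
static bool diode_offset_correction(unsigned int field, int *correction)
{
	if (field == 0x00 || field >= 0x40)
		return false;
	*correction = 11 - (int)field; /* 01h -> +10C ... 3Fh -> -52C */
	return true;
}
```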

DiodeOffset is also mentioned in #32559, but that PDF is a little unclear about it, and in that version bit 14 is not part of the field.

If I change this line in k8temp from:
#define TEMP_FROM_REG(val)        (((((val) >> 16) & 0xff) - 49) * 1000)
to:
#define TEMP_FROM_REG(val)        (((((val) >> 16) & 0xff) + 11 - (((val)>>8) & 0x3f) ) * 1000)

I get much better values:
temp1_input:56000
temp1_input_raw:4e613a
temp2_input:47000
temp2_input_raw:45617a
temp3_input:53000
temp3_input_raw:4b613e
temp4_input:46000
temp4_input_raw:44a17e

(Values read directly from sysfs; the _raw entries are the register values as read with pci_read_config_dword.)
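The two macros can be checked against the raw values listed above. This is a minimal userspace sketch that rewrites them as functions over uint32_t (the function names are hypothetical); CurTmp is cast to int before subtracting so readings below 49 do not wrap around.

```c
#include <stdint.h>

/* Original k8temp conversion: CurTmp (bits 23:16) - 49, in millidegrees C. */
static int temp_from_reg_orig(uint32_t val)
{
	return ((int)((val >> 16) & 0xff) - 49) * 1000;
}

/* Patched conversion: CurTmp + 11 - DiodeOffset (bits 13:8), in millidegrees C. */
static int temp_from_reg_patched(uint32_t val)
{
	int curtmp = (int)((val >> 16) & 0xff);
	int diode_offset = (int)((val >> 8) & 0x3f);

	return (curtmp + 11 - diode_offset) * 1000;
}
```

For all four raw values the 6-bit field is 0x21 (33), so the patched formula applies a constant 11 - 33 = -22 instead of -49 and reads 27 degrees higher: 4e613a gives 29000 with the original macro and 56000 with the patched one.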

The above change is obviously wrong for DiodeOffset == 0 and might be wrong for the Barcelona / Phenom cores if bit 14 is set, but at least for a significant number of current K8 cores, such an adjustment in the DiodeOffset != 0 case would make a lot of sense.
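A conversion that applies the adjustment only when DiodeOffset != 0 might look like this hypothetical sketch (not k8temp code). It keeps the 6-bit K8 field (bits 13:8) and falls back to the existing CurTmp - 49 when the field is zero; it does not address the bit 14 question for family 10h.

```c
#include <stdint.h>

/* Hypothetical guarded conversion, result in millidegrees C:
 * DiodeOffset != 0 -> CurTmp + 11 - DiodeOffset
 * DiodeOffset == 0 -> CurTmp - 49 (the existing k8temp formula) */
static int temp_from_reg_guarded(uint32_t val)
{
	int curtmp = (int)((val >> 16) & 0xff);
	int diode_offset = (int)((val >> 8) & 0x3f);

	if (diode_offset == 0)
		return (curtmp - 49) * 1000;
	return (curtmp + 11 - diode_offset) * 1000;
}
```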

There is even a bold note on http://www.lm-sensors.org/wiki/Devices saying that k8temp is partly broken, so I suspect a large number of people run into this.

my /proc/cpuinfo:
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 107
model name      : AMD Athlon(tm) X2 Dual Core Processor BE-2400
stepping        : 2
cpu MHz         : 2300.000
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy 3dnowprefetch
bogomips        : 4727.07
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc 100mhzsteps

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 107
model name      : AMD Athlon(tm) X2 Dual Core Processor BE-2400
stepping        : 2
cpu MHz         : 2300.000
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy 3dnowprefetch
bogomips        : 4727.07
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc 100mhzsteps

Just ask if more information is needed.
Comment 2 Rudolf Marek 2008-04-13 09:15:36 UTC
Hi,

Well, it seems that applying the diode offset may bring the values back into a reasonable range; the question is whether this correction applies to the digital thermal sensor. I have never managed to get a clear answer from AMD.

Another problem is that the internal connection to the sensor is broken in the processor, and this is true for all revF CPUs so far (and even Barcelonas). Please check Erratum 141. In the meantime I will try to contact AMD again and get some feedback.

Rudolf
Comment 3 Rudolf Marek 2008-05-21 15:42:43 UTC
Hi again,

I tried to get more information out of AMD: there is no reliable way to fix this with the diode offset. At least I will change the driver to print a warning.
Rudolf
Comment 4 Jean Delvare 2008-09-24 06:49:31 UTC
Code fix? Where? I can't see any code change upstream.
Comment 5 Alan 2008-09-24 07:41:43 UTC
Sorry, I was under the impression the driver warning had been implemented.
Comment 6 Jean Delvare 2008-09-25 05:49:14 UTC
I am not aware of any such patch. If it exists, it's not in mainline. Please reopen this bug.
Comment 7 Rudolf Marek 2008-10-01 14:08:25 UTC
I'm working on that. Expect this soon. Rudolf
Comment 8 John M. Drescher 2009-02-02 12:38:34 UTC
Any progress on this? I saw this today on a 2.6.26 kernel with an Asus M2N and an X2 5600+ CPU (65 W version).

 # sensors
k8temp-pci-00c3
Adapter: PCI adapter
core0 temp:   +6.0 C
temp2:        +0.0 C
core1 temp:   +7.0 C
temp4:        -9.0 C

it8716-isa-0290
Adapter: ISA adapter
VCore:            +0.99 V  (min =  +0.00 V, max =  +4.08 V)
VDDR:             +3.17 V  (min =  +0.00 V, max =  +4.08 V)
+5V:              +4.70 V  (min =  +0.00 V, max =  +6.85 V)
+12V:            +11.71 V  (min =  +0.00 V, max = +16.32 V)
5VSB:             +4.70 V  (min =  +0.00 V, max =  +6.85 V)
VBat:             +2.98 V
CPU Fan:         2566 RPM  (min =    0 RPM)
Chassis Fan 1:      0 RPM  (min =    0 RPM)
Power Supply Fan:   0 RPM  (min =    0 RPM)
CPU Temp:         +25.0 C  (low  =  -1.0 C, high = +127.0 C)  sensor = thermal diode
MB Temp:          +38.0 C  (low  =  -1.0 C, high = +127.0 C)  sensor = transistor
MB Temp:           -6.0 C  (low  =  -1.0 C, high = +127.0 C)  sensor = transistor
cpu0_vid:        +1.100 V
Comment 9 Jean Delvare 2009-02-02 12:51:39 UTC
The k8temp patches that went into 2.6.29 should make the temperature readings look slightly better. Nevertheless, the thermal sensors are unreliable on these CPUs; this is a hardware problem and no software workaround is possible.
Comment 10 Torsten Kaiser 2009-02-13 11:40:17 UTC
Looking at k8temp.c from 2.6.29-rc4 I see a fixed temp_offset of 21000, aka 21 °C.

My system is model 107 == 0x6b and would trigger this fix:
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 107
model name      : AMD Athlon(tm) X2 Dual Core Processor BE-2400
stepping        : 2

With my modified formula from comment #1 I see what I think are correct readings:
f71882fg-isa-0600
Adapter: ISA adapter
CPU:         +34 C  (high =   +85 C, hyst =   +81 C)
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp:             +42 C
Core0 Temp:             +33 C
Core1 Temp:             +38 C
Core1 Temp:             +29 C

The second temperature from Core0 seems to match the CPU temperature from the external sensor, although it changes somewhat more slowly.

But the raw value of the PCI register was 37217a at the point where the reading was 33°C. That means my CPU/board would need an adjustment of 27000, not 21000.
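The 27000 figure follows from the raw value. This sketch assumes, as described above, that the mainline reading is CurTmp (bits 23:16) - 49 plus the fixed temp_offset in millidegrees; the function names are hypothetical.

```c
#include <stdint.h>

/* Mainline-style reading with a fixed offset: (CurTmp - 49) * 1000 + offset. */
static int mainline_temp_mC(uint32_t val, int temp_offset_mC)
{
	int curtmp = (int)((val >> 16) & 0xff);

	return (curtmp - 49) * 1000 + temp_offset_mC;
}

/* Fixed offset (millidegrees) that would make the reading equal target_C. */
static int required_offset_mC(uint32_t val, int target_C)
{
	return target_C * 1000 - mainline_temp_mC(val, 0);
}
```

For 37217a, CurTmp is 0x37 = 55, so the offset of 21000 gives 27000 (27°C), while reaching 33°C needs an offset of 27000.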

I would still strongly suggest reading the adjustment from the PCI register instead of hardcoding it.

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/31116.pdf

Page 242 gives the formula I'm using, including the fact that the offset is read from the same PCI register as the raw temperature values.

(I would assume this offset is board-specific, and that my MSI K9AG has a different offset than the board of Andreas Herrmann, the author of the patch that added the offset.)
Comment 11 Jean Delvare 2009-02-15 02:37:18 UTC
You did notice that Andreas Herrmann works for AMD, didn't you? I bet he knows what he's doing.

Anyway, the bottom line is that these CPU sensors are unreliable. So you can apply pretty much any offset you like to have the readings you want to have, the numbers don't really matter.
Comment 12 herrmann.der.user 2009-02-16 06:58:19 UTC
FYI, DiodeOffset is used
"to correct the measurement made by an external temperature sensor."

There is an on-die thermal diode which can be connected to an external sensor
via 2 chip pins.

To get a correct relation between this external temperature sensor
value and TcontrolMax you need to add DiodeOffset.

Some notes regarding comment #10:

(1) I don't think that the values are correct.
You have
CPU:         +34 C  (high =   +85 C, hyst =   +81 C)
and
Core0 Temp:             +42 C

Using the fix that is in mainline kernel you would have seen
Core0 Temp:             +36 C
which is much closer to the other CPU temperature value reported by an
external sensor (I guess this sensor is connected to the on-die diode,
but I am not sure.)
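Both readings are CurTmp plus a constant, so they always differ by the same amount. This sketch assumes the 6-bit DiodeOffset field is 0x21, as in the raw values from comment #1, and the fixed 21°C mainline offset; the function names are hypothetical. A reading of +42 C from the modified formula corresponds to CurTmp = 64, which mainline reports as +36 C.

```c
/* Modified formula from comment #1: CurTmp + 11 - DiodeOffset (0x21 = 33). */
static int patched_temp_C(int curtmp)
{
	return curtmp + 11 - 0x21; /* constant -22 */
}

/* Mainline with the fixed offset: CurTmp - 49 + 21. */
static int mainline_temp_C(int curtmp)
{
	return curtmp - 49 + 21; /* constant -28 */
}
```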

(2) The referenced document is the BKDG for family 10h CPUs (e.g. Phenom).
AFAIK your CPU is family 0xf (K8).

(3)
Your DiodeOffset is -22 degC.

I guess TcontrolMax for your part is 70 degC. (see
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/33954.pdf)

This means, when your mainboard has an external sensor connected to the
on-die diode and it reports 92degC then you have reached TcontrolMax. (TcontrolMax
"represents the maximum allowed TCONTROL value for the processor to
be within its functional temperature specification.")
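The reasoning in (3) can be sketched as follows. It assumes, as stated above, that the DiodeOffset correction is added to the external sensor reading to relate it to TcontrolMax; it uses the field value 0x21 (bits 13:8) from the raw register values in comment #1 and the guessed TcontrolMax of 70 degC. The function names are hypothetical.

```c
/* DiodeOffset correction in degrees C: 01h..3Fh -> +10C..-52C. */
static int diode_correction_C(unsigned int field)
{
	return 11 - (int)field;
}

/* External sensor reading at which TcontrolMax is reached:
 * Tcontrol = T_external + correction  =>  T_external = TcontrolMax - correction */
static int external_reading_at_tcontrolmax(int tcontrol_max_C, unsigned int field)
{
	return tcontrol_max_C - diode_correction_C(field);
}
```

With field 0x21 the correction is 11 - 33 = -22 degC, so an external reading of 70 - (-22) = 92 degC corresponds to TcontrolMax.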