Bug 200731

Summary: k10temp erratum #319 blacklist is incomplete for Phenom II X4 B99
Product: Drivers
Component: Hardware Monitoring
Status: CLOSED INVALID
Severity: normal
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: Any
Subsystem:
Regression: No
Bisected commit-id:
Reporter: Ryan Underwood (nemesis)
Assignee: Jean Delvare (jdelvare)
CC: clemens, linux

Description Ryan Underwood 2018-08-04 20:46:41 UTC
I have this Phenom II X4 B99 (p/n HDXB99WFK4DGM) which demonstrates the erratum.  It appears to be missed because it's stepping C3.  It's an AM3 CPU running in an AM2+ socket with DDR2 memory.

$ sensors
atk0110-acpi-0
Adapter: ACPI interface
Vcore Voltage:      +1.41 V  (min =  +0.85 V, max =  +1.60 V)
+12V Voltage:      +11.90 V  (min = +10.20 V, max = +13.80 V)
+5V Voltage:        +4.84 V  (min =  +4.50 V, max =  +5.50 V)
+3.3V Voltage:      +3.25 V  (min =  +2.97 V, max =  +3.63 V)
CPU FAN Speed:     1985 RPM  (min =  800 RPM, max = 7200 RPM)
Chassis FAN Speed: 2156 RPM  (min =  800 RPM, max = 7200 RPM)
Power Fan Speed:      0 RPM  (min =  800 RPM, max = 7200 RPM)
CPU Temperature:    +58.0°C  (high = +65.0°C, crit = +95.0°C)
MB Temperature:     +31.0°C  (high = +45.0°C, crit = +95.0°C)

k10temp-pci-00c3
Adapter: PCI adapter
temp1:        +48.9°C  (high = +70.0°C)
                       (crit = +70.0°C, hyst = +68.0°C)

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            16
Model:                 4
Model name:            AMD Phenom(tm) II X4 B99 Processor
Stepping:              3
CPU MHz:               3300.000
CPU max MHz:           3300.0000
CPU min MHz:           800.0000
BogoMIPS:              6600.40
Virtualization:        AMD-V
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
L3 cache:              6144K
NUMA node0 CPU(s):     0-3
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save
Comment 1 Jean Delvare 2018-09-18 15:24:48 UTC
My understanding is that 100% reliable detection is not possible because there is an overlap of CPU model+stepping between AM2 and AM3 CPU models. If the system is running DDR3 then we are sure we have an AM3 CPU, but when running DDR2 (which is your case) we can't tell an AM2 CPU from an AM3 CPU.

That being said, you have an AM3 CPU, and AM3 CPUs are not affected by the erratum. And the driver is working for you. So I do not understand what you are complaining about.
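
For illustration, the detection logic described in this comment can be sketched roughly as follows. This is Python rather than the driver's C code, the function name is hypothetical, and the model/stepping cut-off is an assumption that should be checked against the k10temp source. Only the family 16 / model 4 / stepping 3 / DDR2 inputs come from this report.

```python
# Sketch only: the function name and the model/stepping cut-off are
# assumptions for illustration, not copied from the k10temp driver.
def may_have_erratum_319(family, model, stepping, ddr3):
    """Return True if the sensor would be treated as possibly unreliable."""
    if family != 0x10:
        # The erratum is documented for family 10h (decimal 16) only.
        return False
    if ddr3:
        # DDR3 memory implies an AM3 CPU, which is not affected.
        return False
    # With DDR2 an AM3 CPU cannot be told apart from an AM2+ one, so fall
    # back on model and stepping (assumed cut-off: anything before C3).
    return model < 4 or (model == 4 and stepping <= 2)


# Values from the lscpu output in the description, running DDR2.
print(may_have_erratum_319(family=16, model=4, stepping=3, ddr3=False))
```

Under these assumptions the reporter's CPU is not flagged, and an earlier stepping of the same model with DDR2 would be.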
Comment 2 Ryan Underwood 2018-09-18 15:31:00 UTC
The driver is "working", but the internal core temperature reports 10+ degrees cooler than the external thermal diode connected to the SB.  As a result the CPU will overheat and crash before throttling ever commences.
Comment 3 Guenter Roeck 2018-09-18 16:17:05 UTC
I must admit that I am kind of puzzled by comment #2. What does the temperature reported by k10temp have to do with CPU overheating? An overheating CPU suggests that thermal control in the system is not configured correctly. That doesn't have anything to do with the temperature reported by k10temp or, for that matter, by any other driver. If the temperature reported by k10temp is 10 degrees too low, and that temperature is used for thermal control, thermal control temperature limits need to be adjusted accordingly. Even better would be not to use the temperature reported by k10temp for thermal control in the first place and to use the temperature reported by atk0110-acpi-0 instead.
Also, I am not sure I understand what the suggested remedy is supposed to be. Blacklist the CPU because the temperature it provides is inaccurate? If we go down that route, we would have to blacklist almost all CPUs, since both Intel and AMD CPUs, especially older ones, are notoriously bad in temperature accuracy. I don't really see the point of doing that. Also, it isn't exactly news that those temperatures are only rough guidelines and far from accurate.
Comment 4 Ryan Underwood 2018-09-18 16:45:41 UTC
Sorry, it's been a while since I looked at this.  You're right that the thermal throttling on Phenom is internal and not ACPI thermal-zone based.  However, this CPU does seem to suffer from the erratum and is evading the blacklist.  If there is no feasible way to improve this, feel free to close.
Comment 5 Guenter Roeck 2018-09-18 17:26:41 UTC
Maybe I am misreading it, but the revision guide for this processor (https://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf) seems to suggest that stepping C3 should not be affected (see Table 27).
Comment 6 Ryan Underwood 2018-09-18 18:13:17 UTC
Perhaps I am misinterpreting the anomalous k10temp output as corresponding to the erratum then?  That is, I do not know if it is expected that the core temperature would read ~10C below the external thermal diode temperature, as well as below ambient room temperature at idle.  However, I do not at present have a known-buggy stepping to compare with - maybe the erratum is actually a much worse situation of some kind?
Comment 7 Guenter Roeck 2018-09-18 19:22:35 UTC
A 10-degrees temperature difference is by itself not anomalous. As mentioned before, inaccurate temperatures are quite common, especially for older CPUs.
Comment 8 Clemens Ladisch 2018-09-18 19:51:29 UTC
AMD's manual says:
> Tctl is the processor temperature control value, used by the platform to
> control cooling systems. Tctl is a non-physical temperature on an
> arbitrary scale measured in degrees. It does _not_ represent an actual
> physical temperature like die or case temperature. Instead, it specifies
> the processor temperature relative to the point at which the system must
> supply the maximum cooling for the processor's specified maximum case
> temperature and maximum thermal power dissipation.

In other words: at the moment shown above, your CPU is 29.1 °C below the point at which it would throttle itself.

The "°C" labels on the reported numbers are not really correct, but the hwmon interface has no mechanism to remove them.
Comment 9 Ryan Underwood 2018-09-18 22:22:31 UTC
>A 10-degrees temperature difference is by itself not anomalous. As mentioned
>before, inaccurate temperatures are quite common, especially for older CPUs.

Okay, then maybe this should be closed if this is not an instance of the erratum.

> In other words: at the moment shown above, your CPU is 29.1 °C below the
> point at which it would throttle itself.

I appreciate the clarification, but help me understand then why would that number rise with core utilization?  If I understand your explanation correctly (i.e. that the value is a delta rather than an absolute temperature), that should mean the value would drop as utilization increased towards the critical temperature, right?
Comment 10 Clemens Ladisch 2018-09-19 06:26:28 UTC
> If I understand your explanation correctly (i.e. that the value is a delta
> rather than an absolute temperature)

Sorry, those were two different explanations.

The "Tctl" mentioned above is the reported value (48.9 in your case). It is an absolute value (but on a scale that might be different from °C).

I've calculated the difference between Tctl and 70 (wrongly, it's actually 21.1 °) because that is the only exact information you get.
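
As a quick check of that arithmetic, here is a minimal sketch using the two values from the sensors output above (temp1 = 48.9, crit = 70.0). The function name is hypothetical, and the result is on the Tctl scale, which may not be °C.

```python
def margin_to_limit(tctl, limit):
    """Distance between the reported Tctl value and the limit, same scale."""
    # Round to one decimal to match the precision printed by sensors.
    return round(limit - tctl, 1)


print(margin_to_limit(48.9, 70.0))  # prints 21.1
```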
Comment 11 Jean Delvare 2018-09-19 09:41:57 UTC
It is pretty old by now and I can't find a post with a sample "sensors" output for one of the CPUs affected by the erratum, but from what I seem to remember, yes it was much worse than the offset you are seeing. An offset is pretty common for digital temperature sensors integrated in modern CPUs and we don't blacklist CPUs just for that.

As Clemens quoted from the AMD documentation (and the same applies to Intel CPUs), the digital thermal sensors integrated in CPUs report a (negative) _margin_ to the maximum temperature, in _arbitrary_ units. We convert that to an _absolute_ temperature in _°C_ because that's the only thing supported by the hwmon kernel/user-space interface, but by doing that we are lying to the user, twice. Which admittedly causes some confusion.
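
To make the scaling step concrete, here is a rough sketch of how a raw reading could end up as the 48.9 shown by sensors. The register layout (an 11-bit field in bits 31:21, in steps of 0.125) is an assumption to be checked against the AMD documentation and the driver source, and the raw value is hypothetical, chosen only so that it matches the output above.

```python
# Assumptions for illustration: the current-temperature field occupies
# bits 31:21 of a 32-bit register and counts in steps of 0.125.
CUR_TMP_SHIFT = 21
MILLIDEGREES_PER_STEP = 125


def tctl_millidegrees(regval):
    """Scale the raw field to the millidegree integers that hwmon exposes."""
    return (regval >> CUR_TMP_SHIFT) * MILLIDEGREES_PER_STEP


raw = 391 << CUR_TMP_SHIFT  # hypothetical register content
millideg = tctl_millidegrees(raw)
print(millideg)                    # prints 48875
print(f"{millideg / 1000:.1f}")    # prints 48.9
```

Nothing in this scaling knows whether the number is a physical temperature; the "°C" is attached only because the interface has no other unit.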

Still, I don't see any actual bug here that we could fix, so closing this bug report as invalid.
Comment 12 Ryan Underwood 2018-09-19 15:15:05 UTC
Thanks for investigating!