Bug 200731
Summary: | k10temp erratum #319 blacklist is incomplete for Phenom II X4 B99 | ||
---|---|---|---|
Product: | Drivers | Reporter: | Ryan Underwood (nemesis) |
Component: | Hardware Monitoring | Assignee: | Jean Delvare (jdelvare) |
Status: | CLOSED INVALID | ||
Severity: | normal | CC: | clemens, linux |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | Any | Subsystem: | |
Regression: | No | Bisected commit-id: |
Description
Ryan Underwood
2018-08-04 20:46:41 UTC
My understanding is that 100% reliable detection is not possible because there is an overlap of CPU model and stepping between AM2 and AM3 CPU models. If the system is running DDR3 then we are sure we have an AM3 CPU, but when running DDR2 (which is your case) we can't tell an AM2 CPU from an AM3 CPU. That being said, you have an AM3 CPU, and AM3 CPUs are not affected by the erratum. And the driver is working for you. So I do not understand what you are complaining about?

The driver is "working", but the internal core temperature reports 10+ degrees cooler than the external thermal diode connected to the SB. As a result the CPU will overheat and crash before throttling ever commences.

I must admit that I am kind of puzzled by comment #2. What does the temperature reported by k10temp have to do with CPU overheating? An overheating CPU suggests that thermal control in the system is not configured correctly. That doesn't have anything to do with the temperature reported by k10temp or, for that matter, by any other driver. If the temperature reported by k10temp is 10 degrees too low, and that temperature is used for thermal control, thermal control temperature limits need to be adjusted accordingly. Even better would be not to use the temperature reported by k10temp for thermal control in the first place and to use the temperature reported by atk0110-acpi-0 instead.

Also, I am not sure I understand what the suggested remedy is supposed to be. Blacklist the CPU because the temperature it provides is inaccurate? If we go along that route, we would have to blacklist almost all CPUs, since both Intel and AMD CPUs, especially older ones, are notoriously bad in temperature accuracy. I don't really see the point of doing that. Also, it isn't exactly news that those temperatures are only rough guidelines and far from accurate.

Sorry, it's been a while since I looked at this. You're right that the thermal throttling on Phenom is internal and not ACPI thermal-zone based. However, this CPU does seem to suffer from the erratum and is evading the blacklist. If there is no feasible way to improve this, feel free to close.

Maybe I am misreading it, but the revision guide for this processor (https://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf) seems to suggest that stepping C3 should not be affected (see Table 27).

Perhaps I am misinterpreting the anomalous k10temp output as corresponding to the erratum then? That is, I do not know if it is expected that the core temperature would read ~10C below the external thermal diode temperature, as well as below ambient room temperature at idle. However, I do not at present have a known-buggy stepping to compare with - maybe the erratum is actually a much worse situation of some kind?

A 10-degrees temperature difference is by itself not anomalous. As mentioned before, inaccurate temperatures are quite common, especially for older CPUs.

AMD's manual says:
> Tctl is the processor temperature control value, used by the platform to
> control cooling systems. Tctl is a non-physical temperature on an
> arbitrary scale measured in degrees. It does _not_ represent an actual
> physical temperature like die or case temperature. Instead, it specifies
> the processor temperature relative to the point at which the system must
> supply the maximum cooling for the processor's specified maximum case
> temperature and maximum thermal power dissipation.
In other words: at the moment shown above, your CPU is 29.1 °C below the point at which it would throttle itself.
The "°C" labels on the reported numbers are not really correct, but the hwmon interface has no mechanism to remove them.
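As a rough illustration of how such a value ends up with a "°C" label, the sketch below converts a raw register value into the millidegree number that hwmon exposes. It assumes the family 10h layout in which the current-temperature field sits in bits 31:21 and counts in steps of 0.125; the function names and the example register value are hypothetical and not taken from this report.

```python
# Sketch only: the field position and step size are assumptions about the
# family 10h "reported temperature" register, not copied from the driver.
CUR_TMP_SHIFT = 21            # assumed: CurTmp occupies bits 31:21
MILLIDEGREES_PER_STEP = 125   # assumed: one step is 0.125 on the Tctl scale


def raw_to_millidegrees(regval: int) -> int:
    """Convert a raw 32-bit register value to hwmon-style millidegrees."""
    return ((regval & 0xFFFFFFFF) >> CUR_TMP_SHIFT) * MILLIDEGREES_PER_STEP


def format_tctl(millidegrees: int) -> str:
    """Format the value without pretending it is a physical temperature."""
    return f"Tctl = {millidegrees / 1000:.1f} (arbitrary scale, shown as °C)"


# Hypothetical raw value whose field is 391 steps, i.e. 48.875 on the Tctl scale.
example_regval = 391 << CUR_TMP_SHIFT
print(format_tctl(raw_to_millidegrees(example_regval)))
```

Nothing in this conversion ties the number to a die or case temperature; the unit is attached only because the interface reports temperatures in millidegrees Celsius.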
> A 10-degrees temperature difference is by itself not anomalous. As mentioned
> before, inaccurate temperatures are quite common, especially for older CPUs.

Okay, then maybe this should be closed if this is not an instance of the erratum.

> In other words: at the moment shown above, your CPU is 29.1 °C below the
> point at which it would throttle itself.

I appreciate the clarification, but help me understand then why would that number rise with core utilization? If I understand your explanation correctly (i.e. that the value is a delta rather than an absolute temperature), that should mean the value would drop as utilization increased towards the critical temperature, right?

> If I understand your explanation correctly (i.e. that the value is a delta
> rather than an absolute temperature)
Sorry, those were two different explanations.
The "Tctl" mentioned above is the reported value (48.9 in your case). It is an absolute value (but on a scale that might be different from °C).
I've calculated the difference between Tctl and 70 (wrongly, it's actually 21.1 °) because that is the only exact information you get.
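The corrected arithmetic can be checked with a minimal sketch. It takes the limit of 70 and the reading of 48.9 from the comments above; the function name is hypothetical.

```python
# Sketch only: 70 is the limit quoted in this thread, on the same
# arbitrary Tctl scale as the reading itself.
TCTL_MAX = 70.0


def margin_below_max(tctl: float, tctl_max: float = TCTL_MAX) -> float:
    """Return how far a Tctl reading sits below the maximum (negative if above)."""
    return round(tctl_max - tctl, 1)


print(margin_below_max(48.9))  # 21.1
```

The margin shrinks as Tctl rises under load, which is why the reported value goes up, not down, with core utilization.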
It is pretty old by now and I can't find a post with a sample "sensors" output for one of the CPUs affected by the erratum, but from what I seem to remember, yes, it was much worse than the offset you are seeing. An offset is pretty common for digital temperature sensors integrated in modern CPUs, and we don't blacklist CPUs just for that.

As Clemens quoted from the AMD documentation (and the same applies to Intel CPUs), the digital thermal sensors integrated in CPUs report a (negative) _margin_ to the maximum temperature, in an _arbitrary_ unit. We convert that to an _absolute_ temperature in _°C_ because that's the only thing supported by the hwmon kernel/user-space interface, but by doing that we are lying to the user, twice. Which admittedly causes some confusion.

Still, I don't see any actual bug here that we could fix, so closing this bug report as invalid.

Thanks for investigating!