Bug 210121 - [bisected] k10temp on zen 2 cpus (mis)reports dangerously high voltage at idle
Summary: [bisected] k10temp on zen 2 cpus (mis)reports dangerously high voltage at idle
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Hardware Monitoring (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Guenter Roeck
URL:
Keywords:
Depends on:
Blocks: 210685
  Show dependency tree
 
Reported: 2020-11-09 18:12 UTC by Linux_Chemist
Modified: 2020-12-26 00:15 UTC (History)
4 users (show)

See Also:
Kernel Version: 5.10 release candidates
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
unusual sensor output after the commit (9.51 KB, image/png)
2020-11-09 18:12 UTC, Linux_Chemist
Details
typical sensor output from before the commit (9.46 KB, image/png)
2020-11-09 18:13 UTC, Linux_Chemist
Details
Patch to fix for Ryzen9 (1014 bytes, patch)
2020-12-20 20:14 UTC, Malcolm Lashley
Details | Diff

Description Linux_Chemist 2020-11-09 18:12:10 UTC
Created attachment 293595 [details]
unusual sensor output after the commit

Hello there, this issue was brought to light on Phoronix by user mumar1
https://www.phoronix.com/forums/forum/software/general-linux-open-source/1216655-linux-5-10-rc2-released-following-the-intel-mic-removal

The issue affects all kernel 5.10 release candidates and wasn't present in 5.9.0.
It may affect all zen 2 cpus as it affects mumar1's 3900x and my 3700x.


Issue itself:

Using lm-sensors, k10temp-pci-00c3 reports alarmingly high Vcore voltage of 1.55V even at idle. Note that the report of VCore and Vsoc also swap around after this issue, but even so 1.55V on Vsoc is equally as alarming.

It is either misreporting a high voltage or is accurately reporting a damagingly high voltage that will (under prolonged usage of the kernel), degrade the cpu.


I have bisected the issue to:

d6144a40041a70ecce410a07d3a2b6906f6e7da9 is the first bad commit
commit d6144a40041a70ecce410a07d3a2b6906f6e7da9
Author: Wei Huang <wei.huang2@amd.com>
Date:   Thu Aug 27 00:42:42 2020 -0500

    hwmon: (k10temp) Define SVI telemetry and current factors for Zen2 CPUs
    
    The voltage telemetry registers for Zen2 are different from Zen1. Also
    the factors of CPU current values are changed on Zen2. Add new definitions
    for these register.
    
    Signed-off-by: Wei Huang <wei.huang2@amd.com>
    Link: https://lore.kernel.org/r/20200827054242.2347-2-wei.huang2@amd.com
    Signed-off-by: Guenter Roeck <linux@roeck-us.net>


I will attach some pics of idle sensor readings before and after the commit for comparison and would recommend the patch is reverted immediately if it IS damaging zen 2 cpus that test out a 5.10 release candidate kernel in testing. 

Thank you for your time
Comment 1 Linux_Chemist 2020-11-09 18:13:50 UTC
Created attachment 293597 [details]
typical sensor output from before the commit

Note the arrangement of Vcore and Vsoc in addition to the values.
Comment 2 Linux_Chemist 2020-11-09 18:15:53 UTC
*Correction, they are in the same order, not sure why I made a note that they weren't. 

Hope there's enough to go on here!
Comment 3 Linux_Chemist 2020-11-09 18:34:01 UTC
Could it be as simple as due to: 

data->svi_addr[0] = F17H_M01H_SVI_TEL_PLANE1;
data->svi_addr[1] = F17H_M01H_SVI_TEL_PLANE0;

becoming 

data->svi_addr[0] = F17H_M31H_SVI_TEL_PLANE0;
data->svi_addr[1] = F17H_M31H_SVI_TEL_PLANE1;

instead of

data->svi_addr[0] = F17H_M31H_SVI_TEL_PLANE1;
data->svi_addr[1] = F17H_M31H_SVI_TEL_PLANE0;


In which case the misreporting is harmless? ;)
Comment 4 Linux_Chemist 2020-11-09 19:00:07 UTC
That revert alone returns Vcore to a sensible number (.931V) but then VSoc is giving the nonsensical 1.55V instead, as you would expect. 

My takeaway from this is that the voltage reporting is just being miscalculated and isn't indicative of the true voltage, thus there is no physical damage being done to the chip but perhaps there would be considerations to throttling/power management? 

In any case, I am content to watch and take note.
Comment 5 Guenter Roeck 2020-11-18 03:18:25 UTC
Thanks a lot for reporting this problem. I must admit that it is quite news to me that just _reporting_ a wrong voltage might damage a CPU. That makes me quite curious - how is this supposed to happen ?
As for the specific problem, unfortunately I am physically away from my systems and won't be able to test/verify this anytime soon. If you can confirm that the change in #3 fixes the problem, I would suggest to submit a patch to implement it. Please make sure to copy the author of the offending patch.
Comment 6 Linux_Chemist 2020-11-18 12:57:02 UTC
Not being familiar with the ins and outs, it's possible to initially think that sensor readings are accurate (dare the innocent user doubt the sensor? lol) and that they are reporting a deeper problem (e.g. that something else is overvolting the cpu or stressing it to an extreme degree even outside of what is set in the bios. 1.5V is recommended only to be hit transiently on ryzen, so 1.55V on idle is quite scary-looking.) - many things are possible, I don't know lol. 
It's only natural to have a hypothesis when one goes looking to solve a problem and for it to fall by the wayside when more data is gathered. All part of the learning process :)

As per comment #4, if you tweak back only the changes in comment #3, you get a sensible number for VCore but VSoc then gets the 1.55V no matter what. It doesn't fix the problem - the offending patch itself rather drastically changed elements of the calculation for zen2. I'd be willing to bet you get the 1.55V reading on any zen2 cpu.
I found myself wondering why the change was needed in the first place if the readings are OK before the patch because it's not like the new factors help converge on similar readings to pre-patch (or for example adding the odd 100mv or something): the new readings are _wildly_ different and it's impossible to follow where the e.g. new "ZEN_SVI_BASE + 0x14" came from to make an educated comment. 

However, if the change in the patch was important, then I would imagine the author would prefer to alter the calculation themself minorly rather than revert the entire patch and I would prefer to give them that option, so I'll probably e-mail them directly as soon as I can. Cheers!
Comment 7 Guenter Roeck 2020-11-18 15:03:35 UTC
#6 - agreed, we should give the author the opportunity to respond before taking any action.
Comment 8 Wei Huang 2020-11-18 18:37:29 UTC
Thanks. There are two changes brought by this specific commit:
  1) The planes reading from different addresses
  2) The factor changed from 0.25A to 0.31A

I think the constant value is caused by (1). It is possible that server part might have a different power plane configuration from desktop parts. Let me dig out the documentation and do a comparison. 

BTW I don't think this commit will damage any parts by reporting incorrect info, unless some of service has dependency on the readings (which I am not aware of).

-Wei
Comment 9 Wei Huang 2020-11-19 06:18:29 UTC
I haven't got hold of a desktop part yet. But I did go back to a Zen2 (Rome, CPU model 0x31) server and did some testing. 

If we look at the code of drivers/hwmon/k10temp.c, AMD 3700x (model 0x71) shares the same code path as Rome 0x31. Since it is an EPYC server, I had to instrument the code to force k10temp to report voltages/currents:

diff --git a/drivers/hwmon/k10temp.c b/drivers/hwmon/k10temp.c
index a250481b5a97..0b6887b7c0a8 100644
--- a/drivers/hwmon/k10temp.c
+++ b/drivers/hwmon/k10temp.c
@@ -526,7 +526,7 @@ static int k10temp_probe(struct pci_dev *pdev, const struct pci_device_id *id)
                        break;
                case 0x31:      /* Zen2 Threadripper */
                case 0x71:      /* Zen2 */
-                       data->show_current = !is_threadripper() && !is_epyc();
+                       data->show_current = true;
                        data->cfactor[0] = F17H_M31H_CFACTOR_ICORE;
                        data->cfactor[1] = F17H_M31H_CFACTOR_ISOC;
                        data->svi_addr[0] = F17H_M31H_SVI_TEL_PLANE0;

Then I ran the "sensors" command with and without "make -j256" of Linux kernel compilation. Here are the results:

=== Without Kernel Compilation ===
k10temp-pci-00c3
Adapter: PCI adapter
Vcore:         1.16 V  
Vsoc:        931.00 mV 
Tctl:         +27.5°C  
Tdie:         +27.5°C  
Tccd1:        +24.2°C  
Tccd2:        +26.2°C  
Tccd3:        +28.0°C  
Tccd4:        +25.5°C  
Tccd5:        +23.2°C  
Tccd6:        +26.5°C  
Tccd7:        +24.8°C  
Tccd8:        +26.2°C  
Icore:         8.00 A  
Isoc:         21.70 A  

k10temp-pci-00cb
Adapter: PCI adapter
Vcore:         1.18 V  
Vsoc:        938.00 mV 
Tctl:         +27.0°C  
Tdie:         +27.0°C  
Tccd1:        +23.8°C  
Tccd2:        +26.5°C  
Tccd3:        +27.2°C  
Tccd4:        +26.5°C  
Tccd5:        +26.2°C  
Tccd6:        +26.8°C  
Tccd7:        +24.8°C  
Tccd8:        +25.2°C  
Icore:         6.00 A  
Isoc:         20.77 A  

=== With Kernel Compilation ===
30s after compilations started

k10temp-pci-00c3
Adapter: PCI adapter
Vcore:         1.15 V  
Vsoc:        931.00 mV 
Tctl:         +42.4°C  
Tdie:         +42.4°C  
Tccd1:        +39.5°C  
Tccd2:        +41.8°C  
Tccd3:        +42.8°C  
Tccd4:        +38.2°C  
Tccd5:        +36.8°C  
Tccd6:        +39.8°C  
Tccd7:        +38.2°C  
Tccd8:        +39.8°C  
Icore:        76.00 A  
Isoc:         25.73 A  

k10temp-pci-00cb
Adapter: PCI adapter
Vcore:         1.14 V  
Vsoc:        938.00 mV 
Tctl:         +44.4°C  
Tdie:         +44.4°C  
Tccd1:        +39.5°C  
Tccd2:        +40.8°C  
Tccd3:        +43.2°C  
Tccd4:        +43.8°C  
Tccd5:        +43.0°C  
Tccd6:        +42.0°C  
Tccd7:        +41.5°C  
Tccd8:        +41.8°C  
Icore:        91.00 A  
Isoc:         25.73 A  

The Vcore values are reasonable (< 1.55) and other readings (temperature, current) increase along with the system load. I think the values look reasonable.

Anyway this is just FYI. I need to find to a desktop part for the same testing.
Comment 10 Malcolm Lashley 2020-12-20 20:11:28 UTC
One other defect that I am seeing on 3950X (and that is also present in the original-reporters link) is that Icore is 0 with the 5.10 version of k10temp.

I don't have processor manuals, but I can only assume that Model=71 actually *does* use the same addrs as F17H_M01H.

Here's the output of the 5.10.1 src hwmon, followed by my patched version.

/usr/src/linux-5.10.1-gentoo # modprobe k10temp ; sensors -u k10temp-* ; rmmod k10temp ; echo "Patched" ; insmod drivers/hwmon/k10temp.ko ; sensors -u k10temp-* ; rmmod k10temp
k10temp-pci-00c3
Adapter: PCI adapter
Vcore:
  in0_input: 1.550
Vsoc:
  in1_input: 1.450
Tctl:
  temp1_input: 37.000
Tdie:
  temp2_input: 37.000
Tccd1:
  temp3_input: 33.500
Tccd2:
  temp4_input: 32.000
Icore:
  curr1_input: 0.000
Isoc:
  curr2_input: 11.780

Patched
======
k10temp-pci-00c3
Adapter: PCI adapter
Vcore:
  in0_input: 1.438
Vsoc:
  in1_input: 1.069
Tctl:
  temp1_input: 37.000
Tdie:
  temp2_input: 37.000
Tccd1:
  temp3_input: 35.500
Tccd2:
  temp4_input: 31.000
Icore:
  curr1_input: 33.000
Isoc:
  curr2_input: 10.500


# lscpu | grep -A3 ^Vendor
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           113
Model name:                      AMD Ryzen 9 3950X 16-Core Processor

I patched to split model 31/71 and return the latter to the old/working behaviour as below:

--- /mnt/btrfs/duality-backup/root.20201220/usr/src/linux-5.10.1-gentoo/drivers/hwmon/k10temp.c 2020-12-13 22:41:30.000000000 +0000
+++ drivers/hwmon/k10temp.c     2020-12-20 19:58:00.838631684 +0000
@@ -525,7 +525,6 @@
                        k10temp_get_ccd_support(pdev, data, 4);
                        break;
                case 0x31:      /* Zen2 Threadripper */
-               case 0x71:      /* Zen2 */
                        data->show_current = !is_threadripper() && !is_epyc();
                        data->cfactor[0] = F17H_M31H_CFACTOR_ICORE;
                        data->cfactor[1] = F17H_M31H_CFACTOR_ISOC;
@@ -533,6 +532,14 @@
                        data->svi_addr[1] = F17H_M31H_SVI_TEL_PLANE1;
                        k10temp_get_ccd_support(pdev, data, 8);
                        break;
+               case 0x71:      /* Zen2 */
+                       data->show_current = !is_threadripper() && !is_epyc();
+                       data->cfactor[0] = F17H_M01H_CFACTOR_ICORE;
+                       data->cfactor[1] = F17H_M01H_CFACTOR_ISOC;
+                       data->svi_addr[0] = F17H_M01H_SVI_TEL_PLANE1;
+                       data->svi_addr[1] = F17H_M01H_SVI_TEL_PLANE0;
+                       k10temp_get_ccd_support(pdev, data, 8);
+                       break;
                }
        } else if (boot_cpu_data.x86 == 0x19) {
                data->temp_adjust_mask = ZEN_CUR_TEMP_RANGE_SEL_MASK;


Maybe others with desktop parts can give this a try and report back?
Comment 11 Malcolm Lashley 2020-12-20 20:14:42 UTC
Created attachment 294243 [details]
Patch to fix for Ryzen9
Comment 12 Guenter Roeck 2020-12-20 22:26:19 UTC
Support for displaying currents and voltages for AMD Ryzen CPUs is, due to lack of documentation, unsupportable and will be removed either with Linux v5.11 or v5.12.
Comment 13 Linux_Chemist 2020-12-21 23:18:28 UTC
I'm very sorry to hear that and I hope AMD release the documentation as soon as they can in 2021.
At any rate, I wish you all a merry xmas and thank you for your efforts!
Comment 14 Duncan Roe 2020-12-23 23:30:51 UTC
I have a similar problem with `sensors` on my AMD Ryzen 7 3700X 8-Core Processor.
At **v5.10** `Icore` shows 0.00 A.
Changing `F17H_M31H_SVI_TEL_PLANE0` from `ZEN_SVI_BASE + 0x14` back to `ZEN_SVI_BASE + 0xc` gives a nonzero value which looks to be nonsense: 25.00 A idle and 29.00 A during kernel `make -j17`. Values at **v5.9.y** looked sensible.
Comment 15 Guenter Roeck 2020-12-23 23:58:36 UTC
Starting with Linux kernel v5.11, the k10temp driver will no longer report voltage and current values. I'll likely request a backport to v5.10 to avoid persisting reports that the reported values are inaccurate.
Comment 16 Duncan Roe 2020-12-26 00:15:04 UTC
I reverted to v5.9.y k10temp and the numbers looked entirely reasonable: 51A during kernel make -j17 reverting to 11A afterwards. I'll stick with that for now.

Note You need to log in before you can comment on or make changes to this bug.