Bug 210121
Summary: | [bisected] k10temp on zen 2 cpus (mis)reports dangerously high voltage at idle | ||
---|---|---|---|
Product: | Drivers | Reporter: | Linux_Chemist (untaintableangel) |
Component: | Hardware Monitoring | Assignee: | Guenter Roeck (linux) |
Status: | NEW --- | ||
Severity: | normal | CC: | duncan_roe, jdelvare, mlashley, wei.huang2 |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 5.10 release candidates | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 210685 | ||
Attachments: |
unusual sensor output after the commit
typical sensor output from before the commit Patch to fix for Ryzen9 |
Description
Linux_Chemist
2020-11-09 18:12:10 UTC
Created attachment 293597 [details]
typical sensor output from before the commit
Note the arrangement of Vcore and Vsoc in addition to the values.
*Correction, they are in the same order, not sure why I made a note that they weren't. Hope there's enough to go on here! Could it be as simple as due to: data->svi_addr[0] = F17H_M01H_SVI_TEL_PLANE1; data->svi_addr[1] = F17H_M01H_SVI_TEL_PLANE0; becoming data->svi_addr[0] = F17H_M31H_SVI_TEL_PLANE0; data->svi_addr[1] = F17H_M31H_SVI_TEL_PLANE1; instead of data->svi_addr[0] = F17H_M31H_SVI_TEL_PLANE1; data->svi_addr[1] = F17H_M31H_SVI_TEL_PLANE0; In which case the misreporting is harmless? ;) That revert alone returns Vcore to a sensible number (.931V) but then VSoc is giving the nonsensical 1.55V instead, as you would expect. My takeaway from this is that the voltage reporting is just being miscalculated and isn't indicative of the true voltage, thus there is no physical damage being done to the chip but perhaps there would be considerations to throttling/power management? In any case, I am content to watch and take note. Thanks a lot for reporting this problem. I must admit that it is quite news to me that just _reporting_ a wrong voltage might damage a CPU. That makes me quite curious - how is this supposed to happen ? As for the specific problem, unfortunately I am physically away from my systems and won't be able to test/verify this anytime soon. If you can confirm that the change in #3 fixes the problem, I would suggest to submit a patch to implement it. Please make sure to copy the author of the offending patch. Not being familiar with the ins and outs, it's possible to initially think that sensor readings are accurate (dare the innocent user doubt the sensor? lol) and that they are reporting a deeper problem (e.g. that something else is overvolting the cpu or stressing it to an extreme degree even outside of what is set in the bios. 1.5V is recommended only to be hit transiently on ryzen, so 1.55V on idle is quite scary-looking.) - many things are possible, I don't know lol. It's only natural to have a hypothesis when one goes looking to solve a problem and for it to fall by the wayside when more data is gathered. All part of the learning process :) As per comment #4, if you tweak back only the changes in comment #3, you get a sensible number for VCore but VSoc then gets the 1.55V no matter what. It doesn't fix the problem - the offending patch itself rather drastically changed elements of the calculation for zen2. I'd be willing to bet you get the 1.55V reading on any zen2 cpu. I found myself wondering why the change was needed in the first place if the readings are OK before the patch because it's not like the new factors help converge on similar readings to pre-patch (or for example adding the odd 100mv or something): the new readings are _wildly_ different and it's impossible to follow where the e.g. new "ZEN_SVI_BASE + 0x14" came from to make an educated comment. However, if the change in the patch was important, then I would imagine the author would prefer to alter the calculation themself minorly rather than revert the entire patch and I would prefer to give them that option, so I'll probably e-mail them directly as soon as I can. Cheers! #6 - agreed, we should give the author the opportunity to respond before taking any action. Thanks. There are two changes brought by this specific commit: 1) The planes reading from different addresses 2) The factor changed from 0.25A to 0.31A I think the constant value is caused by (1). It is possible that server part might have a different power plane configuration from desktop parts. Let me dig out the documentation and do a comparison. BTW I don't think this commit will damage any parts by reporting incorrect info, unless some of service has dependency on the readings (which I am not aware of). -Wei I haven't got hold of a desktop part yet. But I did go back to a Zen2 (Rome, CPU model 0x31) server and did some testing. If we look at the code of drivers/hwmon/k10temp.c, AMD 3700x (model 0x71) shares the same code path as Rome 0x31. Since it is an EPYC server, I had to instrument the code to force k10temp to report voltages/currents: diff --git a/drivers/hwmon/k10temp.c b/drivers/hwmon/k10temp.c index a250481b5a97..0b6887b7c0a8 100644 --- a/drivers/hwmon/k10temp.c +++ b/drivers/hwmon/k10temp.c @@ -526,7 +526,7 @@ static int k10temp_probe(struct pci_dev *pdev, const struct pci_device_id *id) break; case 0x31: /* Zen2 Threadripper */ case 0x71: /* Zen2 */ - data->show_current = !is_threadripper() && !is_epyc(); + data->show_current = true; data->cfactor[0] = F17H_M31H_CFACTOR_ICORE; data->cfactor[1] = F17H_M31H_CFACTOR_ISOC; data->svi_addr[0] = F17H_M31H_SVI_TEL_PLANE0; Then I ran the "sensors" command with and without "make -j256" of Linux kernel compilation. Here are the results: === Without Kernel Compilation === k10temp-pci-00c3 Adapter: PCI adapter Vcore: 1.16 V Vsoc: 931.00 mV Tctl: +27.5°C Tdie: +27.5°C Tccd1: +24.2°C Tccd2: +26.2°C Tccd3: +28.0°C Tccd4: +25.5°C Tccd5: +23.2°C Tccd6: +26.5°C Tccd7: +24.8°C Tccd8: +26.2°C Icore: 8.00 A Isoc: 21.70 A k10temp-pci-00cb Adapter: PCI adapter Vcore: 1.18 V Vsoc: 938.00 mV Tctl: +27.0°C Tdie: +27.0°C Tccd1: +23.8°C Tccd2: +26.5°C Tccd3: +27.2°C Tccd4: +26.5°C Tccd5: +26.2°C Tccd6: +26.8°C Tccd7: +24.8°C Tccd8: +25.2°C Icore: 6.00 A Isoc: 20.77 A === With Kernel Compilation === 30s after compilations started k10temp-pci-00c3 Adapter: PCI adapter Vcore: 1.15 V Vsoc: 931.00 mV Tctl: +42.4°C Tdie: +42.4°C Tccd1: +39.5°C Tccd2: +41.8°C Tccd3: +42.8°C Tccd4: +38.2°C Tccd5: +36.8°C Tccd6: +39.8°C Tccd7: +38.2°C Tccd8: +39.8°C Icore: 76.00 A Isoc: 25.73 A k10temp-pci-00cb Adapter: PCI adapter Vcore: 1.14 V Vsoc: 938.00 mV Tctl: +44.4°C Tdie: +44.4°C Tccd1: +39.5°C Tccd2: +40.8°C Tccd3: +43.2°C Tccd4: +43.8°C Tccd5: +43.0°C Tccd6: +42.0°C Tccd7: +41.5°C Tccd8: +41.8°C Icore: 91.00 A Isoc: 25.73 A The Vcore values are reasonable (< 1.55) and other readings (temperature, current) increase along with the system load. I think the values look reasonable. Anyway this is just FYI. I need to find to a desktop part for the same testing. One other defect that I am seeing on 3950X (and that is also present in the original-reporters link) is that Icore is 0 with the 5.10 version of k10temp. I don't have processor manuals, but I can only assume that Model=71 actually *does* use the same addrs as F17H_M01H. Here's the output of the 5.10.1 src hwmon, followed by my patched version. /usr/src/linux-5.10.1-gentoo # modprobe k10temp ; sensors -u k10temp-* ; rmmod k10temp ; echo "Patched" ; insmod drivers/hwmon/k10temp.ko ; sensors -u k10temp-* ; rmmod k10temp k10temp-pci-00c3 Adapter: PCI adapter Vcore: in0_input: 1.550 Vsoc: in1_input: 1.450 Tctl: temp1_input: 37.000 Tdie: temp2_input: 37.000 Tccd1: temp3_input: 33.500 Tccd2: temp4_input: 32.000 Icore: curr1_input: 0.000 Isoc: curr2_input: 11.780 Patched ====== k10temp-pci-00c3 Adapter: PCI adapter Vcore: in0_input: 1.438 Vsoc: in1_input: 1.069 Tctl: temp1_input: 37.000 Tdie: temp2_input: 37.000 Tccd1: temp3_input: 35.500 Tccd2: temp4_input: 31.000 Icore: curr1_input: 33.000 Isoc: curr2_input: 10.500 # lscpu | grep -A3 ^Vendor Vendor ID: AuthenticAMD CPU family: 23 Model: 113 Model name: AMD Ryzen 9 3950X 16-Core Processor I patched to split model 31/71 and return the latter to the old/working behaviour as below: --- /mnt/btrfs/duality-backup/root.20201220/usr/src/linux-5.10.1-gentoo/drivers/hwmon/k10temp.c 2020-12-13 22:41:30.000000000 +0000 +++ drivers/hwmon/k10temp.c 2020-12-20 19:58:00.838631684 +0000 @@ -525,7 +525,6 @@ k10temp_get_ccd_support(pdev, data, 4); break; case 0x31: /* Zen2 Threadripper */ - case 0x71: /* Zen2 */ data->show_current = !is_threadripper() && !is_epyc(); data->cfactor[0] = F17H_M31H_CFACTOR_ICORE; data->cfactor[1] = F17H_M31H_CFACTOR_ISOC; @@ -533,6 +532,14 @@ data->svi_addr[1] = F17H_M31H_SVI_TEL_PLANE1; k10temp_get_ccd_support(pdev, data, 8); break; + case 0x71: /* Zen2 */ + data->show_current = !is_threadripper() && !is_epyc(); + data->cfactor[0] = F17H_M01H_CFACTOR_ICORE; + data->cfactor[1] = F17H_M01H_CFACTOR_ISOC; + data->svi_addr[0] = F17H_M01H_SVI_TEL_PLANE1; + data->svi_addr[1] = F17H_M01H_SVI_TEL_PLANE0; + k10temp_get_ccd_support(pdev, data, 8); + break; } } else if (boot_cpu_data.x86 == 0x19) { data->temp_adjust_mask = ZEN_CUR_TEMP_RANGE_SEL_MASK; Maybe others with desktop parts can give this a try and report back? Created attachment 294243 [details]
Patch to fix for Ryzen9
Support for displaying currents and voltages for AMD Ryzen CPUs is, due to lack of documentation, unsupportable and will be removed either with Linux v5.11 or v5.12. I'm very sorry to hear that and I hope AMD release the documentation as soon as they can in 2021. At any rate, I wish you all a merry xmas and thank you for your efforts! I have a similar problem with `sensors` on my AMD Ryzen 7 3700X 8-Core Processor. At **v5.10** `Icore` shows 0.00 A. Changing `F17H_M31H_SVI_TEL_PLANE0` from `ZEN_SVI_BASE + 0x14` back to `ZEN_SVI_BASE + 0xc` gives a nonzero value which looks to be nonsense: 25.00 A idle and 29.00 A during kernel `make -j17`. Values at **v5.9.y** looked sensible. Starting with Linux kernel v5.11, the k10temp driver will no longer report voltage and current values. I'll likely request a backport to v5.10 to avoid persisting reports that the reported values are inaccurate. I reverted to v5.9.y k10temp and the numbers looked entirely reasonable: 51A during kernel make -j17 reverting to 11A afterwards. I'll stick with that for now. |