Bug 219148 - k10temp: Tccd missing on Zen 4, possible incorrect valid bit check?
Status: RESOLVED DOCUMENTED
Alias: None
Product: Drivers
Classification: Unclassified
Component: Hardware Monitoring
Hardware: All Linux
Importance: P3 normal
Assignee: Mario Limonciello (AMD)
URL: https://community.frame.work/t/missin...
Keywords:
Depends on:
Blocks:
 
Reported: 2024-08-11 06:00 UTC by Colin S
Modified: 2024-08-21 20:39 UTC
CC List: 3 users

See Also:
Kernel Version: 6.10
Subsystem:
Regression: No
Bisected commit-id:



Description Colin S 2024-08-11 06:00:54 UTC
Dear maintainer,

While procrastinating, I wondered why k10temp does not show Tccd temperatures for the 7840HS (Zen 4, family 19h, model 74h) on Linux, when HWiNFO on Windows seems to give plausible values.

Digging through source history I discovered the CCD register addresses for Zen were originally determined empirically, and then the existence of a valid bit was assumed by deduction from symbols in an amdgpu header <https://github.com/torvalds/linux/commit/fd8bdb23b91876ac1e624337bb88dc1dcc21d67e>. The newest revision of this GPU header (14.0.2) doesn’t seem to contain these symbols any more.

After I presented these findings to Mario Limonciello at <https://community.frame.work/t/missing-per-core-cpu-temperatures-from-k10temp/55833>, he asked me to open a ticket here and assign it to him so that he can check the internal documentation when he returns to the office.

I am not able to set the assignee on a new ticket, so if I am unable to change it myself after submission, I would appreciate it very much if you could reassign this ticket to him.

Thank you kindly,
Comment 1 Jean Delvare 2024-08-12 08:04:02 UTC
Done, thanks for the report.
Comment 2 Mario Limonciello (AMD) 2024-08-12 17:37:36 UTC
I'll CC Perry who might be able to look at this before I'm back. Otherwise I'll look when I'm back.
Comment 3 Mario Limonciello (AMD) 2024-08-21 20:39:30 UTC
FYI - You were referring to the wrong IP version for your product.

https://docs.kernel.org/gpu/amdgpu/driver-misc.html#gpu-product-information
It's 13.0.4 and 13.0.11 for 7x40 devices.

Anyway, I double-checked the internal documentation and, yes, the use of the valid bit in k10temp is still correct for this product.

When valid is 0 you can't get a THM_DIE*_TEMP reading.
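
For readers following along, here is a minimal sketch of what a valid-bit check on a per-CCD temperature register looks like. It is modeled on my reading of the ZEN_CCD_TEMP_* handling in mainline k10temp and is not taken from AMD documentation: the bit positions (valid in bit 11, raw value in bits 10:0) and the 0.125 degree C step with a 49 degree C offset are assumptions, and the names CCD_TEMP_VALID, CCD_TEMP_MASK and ccd_temp_decode() are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed register layout, modeled on k10temp's ZEN_CCD_TEMP_* macros. */
#define CCD_TEMP_VALID (1u << 11)  /* assumed: bit 11 flags a usable reading */
#define CCD_TEMP_MASK  0x7ffu      /* assumed: bits 10:0 hold the raw value */

/*
 * Hypothetical helper: decode one raw CCD temperature register value.
 * Returns false when the valid bit is clear, in which case no reading
 * is available and *millideg is left untouched. Otherwise stores the
 * temperature in millidegrees Celsius and returns true.
 */
static bool ccd_temp_decode(uint32_t regval, long *millideg)
{
    if (!(regval & CCD_TEMP_VALID))
        return false;

    /* Assumed scaling: 0.125 degree C per step, offset by -49 degree C. */
    *millideg = (long)(regval & CCD_TEMP_MASK) * 125 - 49000;
    return true;
}
```

Under these assumptions, a register whose valid bit is clear yields no Tccd channel at all, which would match the missing Tccd entries on this part.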
