Bug 216161

Summary: TSC marked unstable on AMD Ryzen 3500U
Product: Platform Specific/Hardware
Reporter: Nelson G (konoha02)
Component: x86-64
Assignee: platform_x86_64 (platform_x86_64)
Status: RESOLVED WILL_NOT_FIX    
Severity: normal
CC: konoha02, mario.limonciello, samy
Priority: P1    
Hardware: AMD   
OS: Linux   
Kernel Version: 5.18.5
Subsystem:
Regression: No
Bisected commit-id:
Attachments: dmesg-linux5-18-5
amdgpu_firmware_info
amdgpu_firmware_info_ryzen7-3700U
dmesg_ryzen_7_3700U_T495
6.11-dmesg-3500u-defaults
amdgpu_firmware_info-6.11-rc6

Description Nelson G 2022-06-21 17:36:03 UTC
Created attachment 301248 [details]
dmesg-linux5-18-5

Hello,
Following on from https://bugzilla.kernel.org/show_bug.cgi?id=202525 and the suggestion to make new BZ entries for different generations of Ryzen

Machine:
Thinkpad E495
AMD Ryzen 3500U
BIOS 1.24 (latest)

TSC is marked as unstable and the machine switches to HPET. I tried cold boots, warm boots, and several reboots, and I have never seen TSC working, not even occasionally.

From 5.18.5: 
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] tsc: Detected 2096.045 MHz processor
[    0.013688] ACPI: HPET 0x00000000B931A000 000038 (v01 LENOVO TP-R11   00001240 PTEC 00000002)
[    0.013744] ACPI: Reserving HPET table memory at [mem 0xb931a000-0xb931a037]
[    0.063717] ACPI: HPET id: 0x43538210 base: 0xfed00000
[    0.063803] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    0.133984] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
[    0.141015] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1e36987b600, max_idle_ns: 440795285743 ns
[    0.252018] TSC synchronization [CPU#0 -> CPU#1]:
[    0.252018] Measured 7370423391 cycles TSC warp between CPUs, turning off TSC clock.
[    0.252018] tsc: Marking TSC unstable due to check_tsc_sync_source failed
[    0.258444] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[    0.305329] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
[    0.305329] hpet0: 3 comparators, 32-bit 14.318180 MHz counter
[    0.308082] clocksource: Switched to clocksource hpet
[    0.328797] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    0.777144] rtc_cmos 00:01: alarms up to one month, y3k, 114 bytes nvram, hpet irqs
[   35.970878] SVM: TSC scaling supported


dmesg attached.
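
To put the warp figure from the log in perspective, the cycle count can be converted to time using the frequency the same log reports. This is only a rough sketch: the function name is hypothetical, and it assumes the TSC ticks at the calibrated 2096.045 MHz shown in the "Detected ... MHz processor" line.

```python
def warp_seconds(cycles: int, tsc_mhz: float) -> float:
    """Convert a TSC warp measured in cycles to seconds.

    Assumes the TSC runs at the calibrated frequency (in MHz) from dmesg.
    """
    return cycles / (tsc_mhz * 1_000_000)


# Values copied from the 5.18.5 dmesg excerpt above.
warp = warp_seconds(7370423391, 2096.045)
print(f"{warp:.2f} s")  # about 3.52 s
```

Under that assumption the reported warp corresponds to roughly 3.5 seconds' worth of cycles between CPU#0 and CPU#1.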
Comment 1 Mario Limonciello (AMD) 2022-06-23 14:53:24 UTC
Can you please correlate this to cold/warm boot?  Does it only happen in one scenario?

Also, can you please share /sys/kernel/debug/dri/0/amdgpu_firmware_info?
Comment 2 Mario Limonciello (AMD) 2022-06-23 14:54:28 UTC
Sorry; I see you did confirm it's both cold and warm boot, so this is definitely a different issue from the others.
Comment 3 Nelson G 2022-06-24 08:45:54 UTC
Created attachment 301266 [details]
amdgpu_firmware_info

Here is my amdgpu_firmware_info.  I hope it helps.
If you need anything else, please let me know.
Comment 4 Mario Limonciello (AMD) 2022-06-24 16:03:02 UTC
> SMC feature version: 0, program: 0, firmware version: 0x00041e2a (4.30.42)

Thanks.  I'll share this with the right folks.

I want to set expectations on the path forward because it's not a short path. AMD may try to reproduce this on reference hardware.  If the team can't reproduce it, the only ones that could fix it are Lenovo.  

If the team can repro and comes up with a fix it would need to be provided to OEMs like Lenovo.  OEMs would need to do their own testing with it and eventually could roll it out.  

---

If any other users with Ryzen 3000 encounter this as well, please provide the same details Nelson provided:
1) Add a dmesg confirming the failure.
2) Provide amdgpu_firmware_info from your failing system.
Comment 5 Lahfa Samy 2022-09-13 05:13:15 UTC
I'm running a Lenovo T495 with an AMD Ryzen 7 3700U. Here is my amdgpu_firmware_info. I just upgraded my kernel, rebooted, and noticed this issue.
Comment 6 Lahfa Samy 2022-09-13 05:14:25 UTC
Created attachment 301798 [details]
amdgpu_firmware_info_ryzen7-3700U

Ryzen 7 3700U Zen+, VBIOS version: 113-PICASSO-114
Comment 7 Lahfa Samy 2022-09-13 05:18:56 UTC
Created attachment 301799 [details]
dmesg_ryzen_7_3700U_T495

dmesg log from boot until a USB device is connected. I have a USB-C ThinkPad dock that is also connected at startup, with one external screen and a few USB devices attached to it.
Comment 8 Mario Limonciello (AMD) 2022-12-05 22:47:26 UTC
An AMD internal team checked on reference hardware, and this can't be reproduced with SMU 0x00041e57 (4.30.87).  This will need to be fixed by the OEM upgrading the BIOS to include newer firmware.
Comment 9 Lahfa Samy 2023-03-29 20:46:39 UTC
Hi @Mario, there has been a recent BIOS update, but nothing is mentioned in the changelog other than some Windows-specific fixes; is there any way I can check that my firmware has been put up to date ?? Beside being unable to reproduce this bug immediately right now ?

BIOS version: R12ET62W (1.32)
Linux changelog: https://download.lenovo.com/pccbbs/mobiles/r12ul62w.txt
Release date: 27 February 2023

Here is the version of the SMC I have at the moment:
sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info|grep -i smc
SMC feature version: 0, program: 0, firmware version: 0x00041e2f (4.30.47)
Comment 10 Mario Limonciello (AMD) 2023-03-29 21:03:01 UTC
> is there any way I can check that my firmware has been put up to date ??

It's up to the OEMs to decide whether to pick up anything from AMD AGESA.  So "up to date" isn't a term that really makes sense in this context.

> Beside being unable to reproduce this bug immediately right now ?

You can try both cold and warm boot.  If it's similar to the Ryzen 5000 issue, that one was tied specifically to warm boot.

> SMC feature version: 0, program: 0, firmware version: 0x00041e2f (4.30.47)

This is older than what AMD validated, but that doesn't mean it will have the issue.
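
The hex values and dotted versions quoted in this thread (0x00041e2a = 4.30.42, 0x00041e2f = 4.30.47, 0x00041e57 = 4.30.87) suggest the SMC firmware version is packed one byte per field. The following is a minimal sketch built on that inferred layout, not on any documented format, and both function names are hypothetical. It decodes the line from amdgpu_firmware_info and compares it against the version AMD validated in comment 8.

```python
import re


def decode_smc_version(raw: int) -> tuple[int, int, int]:
    """Split a packed SMC version into (major, minor, patch).

    Assumption: one byte per field, as inferred from the values in this thread.
    """
    return (raw >> 16) & 0xFF, (raw >> 8) & 0xFF, raw & 0xFF


def parse_smc_line(line: str) -> tuple[int, int, int]:
    """Extract and decode the hex firmware version from an SMC line."""
    match = re.search(r"firmware version: (0x[0-9a-fA-F]+)", line)
    if match is None:
        raise ValueError("no firmware version found in line")
    return decode_smc_version(int(match.group(1), 16))


line = "SMC feature version: 0, program: 0, firmware version: 0x00041e2f (4.30.47)"
installed = parse_smc_line(line)
validated = decode_smc_version(0x00041E57)  # version AMD tested (comment 8)
print(installed, validated, installed < validated)
```

As noted above, an older version than the validated one does not by itself mean the system will show the issue; the comparison only tells you which side of the tested version the firmware is on.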
Comment 11 Lahfa Samy 2023-04-15 23:20:46 UTC
It seems this issue is fixed for me, I haven't encountered it again.
Comment 12 Nelson G 2023-04-16 02:04:27 UTC
(In reply to Lahfa Samy from comment #11)
> It seems this issue is fixed for me, I haven't encountered it again.

Just curious, you're still using BIOS 1.32, right?  Any idea how it got fixed for you?  It seems that the BIOS hasn't fixed this yet.
Comment 13 Lahfa Samy 2023-05-08 09:56:56 UTC
Yes, I am using BIOS 1.32, and recently I've seen the issue happen again. You are correct, it wasn't fixed.
Comment 14 Nelson G 2023-06-18 06:43:13 UTC
Hi Mario, today I updated to BIOS 1.27 (ThinkPad E495) and the output of cat /sys/kernel/debug/dri/0/amdgpu_firmware_info|grep -i smc is:
SMC feature version: 0, program: 0, firmware version: 0x00041e57 (4.30.87)

but it is still running clocksource hpet,  I tried cold boot and warm boot.  Could you please clarify to me if this is still Lenovo OEM's faulty BIOS?
Comment 15 Mario Limonciello (AMD) 2023-06-19 16:34:59 UTC
The version tested by AMD on reference hardware was also 4.30.87 and it passed.  So if you're still seeing the warp, this should be an OEM BIOS issue at this point.
Comment 16 Lahfa Samy 2023-07-07 09:24:43 UTC
Hi @Mario,

Nothing against you here, just feedback for AMD as a whole: it is a problem if AMD cannot push/incentivize OEMs to ship the latest validated firmware (mostly AGESA, I believe), because I'm still hitting this bug and even sound playback is sometimes affected after one BIOS update.

My SMC version is still the old one.
> sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info|grep -i smc
> SMC feature version: 0, program: 0, firmware version: 0x00041e2f (4.30.47)
I've seen this : 
> [ 8668.176055] hpet: Lost 3094 RTC interrupts
Not sure if it's related to this issue.

I'll start recommending people to buy other devices than with AMD CPUs, Intel seems much more capable of at least having firmware fixes pushed by OEMs (in my case Lenovo) or stable/durable firmware for Linux, even if the performance is less impressive than AMD, I can understand paying more for stability and durability of firmware and device as a whole.

I'll also see, if someone with more reach can make people more aware of this specific issue, as there is really no solution AMD can provide (I understand, that they've done their best so far) however OEMs are too slow to provide updates especially BIOS updates and we both know how much they care about their Linux customer base, so basically, I just need to throw my "old" device and just a buy new Intel one and settle for that, that's the real fix to this bug.
Comment 17 Mario Limonciello (AMD) 2023-07-07 17:00:08 UTC
>  AMD cannot push/incentivize OEMs to ship the latest validated firmware
>  (mostly AGESA, I believe)

Unfortunately, when AMD looked at it, many of the TSC issues reported in Linux are also present in Windows but a lot harder to detect.

As with all things in the software world, latest doesn't always translate to greatest.  There is always risk with taking updates and the most important thing is stability.

As many models ship with Windows you need to look at the equation from a business perspective.  If OEMs don't get a complaint for a problem from "their" customers, why should they go through the effort to take a new AGESA update from AMD that could introduce risk? 

> but it is still running clocksource hpet,  I tried cold boot and warm boot. 
> Could you please clarify to me if this is still Lenovo OEM's faulty BIOS?

AMD tested reference hardware with the same APU running a modern kernel and the same AGESA and didn't have problems with cold or warm boot.  It has to be an OEM BIOS issue at this point.

> I'll start recommending people to buy other devices than with AMD CPUs, Intel
> seems much more capable of at least having firmware fixes pushed by OEMs (in
> my case Lenovo) or stable/durable firmware for Linux, even if the performance
> is less impressive than AMD, I can understand paying more for stability and
> durability of firmware and device as a whole.
> I'll also see, if someone with more reach can make people more aware of this
> specific issue, as there is really no solution AMD can provide (I understand,
> that they've done their best so far) however OEMs are too slow to provide
> updates especially BIOS updates and we both know how much they care about
> their Linux customer base, so basically, I just need to throw my "old" device
> and just a buy new Intel one and settle for that, that's the real fix to this
> bug.

In the case of Lenovo, they have certain systems in a "Linux program".  These systems are sold with and regularly tested with Linux, and issues found are fixed through BIOS updates.  You might want to check with them if your system is part of this program.
Comment 18 Lahfa Samy 2023-07-09 12:25:01 UTC
I see, thanks for your insightful input on this issue. Maybe if AMD could make those bugs more impactful for Windows users, OEMs would be forced to fix them (an artificial incentive, but one that could hurt AMD's business). I understand it would be plain weird to do such a thing.

It seems the T495 is indeed a Linux certified device: https://support.lenovo.com/my/en/solutions/pd500343-linux-certification-thinkpad-t495-20njz4krus (maybe only specific models are ?) but digging deeper, I believe it is certified to run Ubuntu.

I think, I'm just bound to wait when the OEM wants to fix it or not, I'll just avoid Lenovo (and many more) as OEMs for now.

One last question, if you can answer that, which OEM implements AMD validated firmware in a timely fashion (unlike Lenovo in this case) ? That is probably dependent as well on laptop's release dates (the newer the laptop, the more chance they upgrade its BIOS) ?
Comment 19 Mario Limonciello (AMD) 2023-07-10 01:04:33 UTC
> I think, I'm just bound to wait when the OEM wants to fix it or not, I'll
> just avoid Lenovo (and many more) as OEMs for now.

I suggest you reach out and voice your opinion to your system manufacturer.  As I mentioned above unless they're hearing from their customers they may not prioritize their updates.

> One last question, if you can answer that

I'm sorry; I can't help with that question.
Comment 20 Nelson G 2023-07-10 03:14:30 UTC
Hey, Lahfa Samy, try posting your issue in this forum: https://forums.lenovo.com/t5/Linux-Operating-Systems/ct-p/lx_en  It's worth trying. Try reaching MarkRHPearson; at least he 'tries' to get the message to the BIOS team.
Comment 21 Lahfa Samy 2023-07-10 14:19:56 UTC
Hi, Mario, 

I've already voiced my opinion to the official client support, showing them this link and letting them know that you, from AMD, have confirmed that a BIOS update is needed.

They told me it was coming at the end of 2022, and one BIOS update did indeed come in February 2023.
The firmware inside still seems older than the version AMD validated as free of the issue (comment 8 in this bug report).
In the changelog, no Linux bug was mentioned, only a Windows fix.

Hey, Nelson G, I've opened a post on the forum and tagged MarkRHPearson, will see how it goes, but I'm still pretty skeptical.

If it ever gets fixed for me, I'll mention it here.
Comment 22 Lahfa Samy 2023-07-20 22:41:58 UTC
Hi, concerning the T495: I reached out to MarkRHPearson and received an answer that it is being worked on and is on the roadmap; however, more recent ThinkPads are currently being updated. As such, the T495 will not receive the BIOS update soon. Here is the post I created: https://forums.lenovo.com/t5/Other-Linux-Discussions/Lenovo-Thinkpad-T495-AMD-TSC-marked-unstable-kernel-bug-needs-a-BIOS-update-for-a-fix/m-p/5237197?page=1#6037331
Comment 23 Nelson G 2024-09-04 02:34:54 UTC
Just wanted to say for the record that I just tested, on a few distros (Debian, Ubuntu, Fedora, and Arch), the boot parameters: tsc=reliable clocksource=tsc nohpet
Nowadays they enable TSC just fine on my E495, with an important performance improvement (especially latency, which drops dramatically compared to HPET). No BIOS update introduced any change on this topic, so something somewhere in Linux 6.1 and beyond seems to have improved or stabilized this?
TSC still has to be "forced" with parameters, though; otherwise it is still marked unstable. But the result, again, is TSC being stable and performing just great. It is important to say that when I reported this, forcing TSC was buggy, very buggy: the system would suspend randomly and constantly, for example, among a lot of other bugs that made it IMPOSSIBLE to even consider enabling it. Now I have been using TSC for days. So, thanks to whoever made this possible (perhaps by accident?).
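
For anyone repeating this test, one way to confirm which clocksource the kernel actually selected after booting with those parameters is to read the standard sysfs clocksource files. A minimal sketch follows; the function name is hypothetical, and it only reports what sysfs says, without judging whether TSC is trustworthy on a given machine.

```python
from pathlib import Path

# Standard Linux sysfs location for the system clocksource.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base: Path = CLOCKSOURCE_DIR) -> tuple[str, list[str]]:
    """Return (current clocksource, list of available clocksources)."""
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


try:
    current, available = read_clocksources()
    print(f"current: {current}, available: {', '.join(available)}")
except OSError as exc:
    # Not running on Linux, or sysfs is not mounted here.
    print(f"could not read clocksource info: {exc}")
```

If the parameters took effect, the current entry should read tsc; on a default boot of the affected machines it would be expected to read hpet, as in the logs in this report.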
Comment 25 Mario Limonciello (AMD) 2024-09-04 19:33:17 UTC
Can you please share an updated kernel log with 6.11-rc6 with no parameters related to TSC applied?
Comment 26 Nelson G 2024-09-05 17:52:18 UTC
Created attachment 306823 [details]
6.11-dmesg-3500u-defaults

here is dmesg, is that enough?
Comment 27 Nelson G 2024-09-05 18:09:18 UTC
Created attachment 306824 [details]
amdgpu_firmware_info-6.11-rc6

just in case
Comment 28 Mario Limonciello (AMD) 2024-09-05 18:10:29 UTC
Thanks, so you're still seeing a TSC warp issue between CPUs.

[    0.286359] TSC synchronization [CPU#0 -> CPU#2]:
[    0.286363] Measured 9282405873 cycles TSC warp between CPUs, turning off TSC clock.
[    0.287708] tsc: Marking TSC unstable due to check_tsc_sync_source failed
[    0.287851]  #1 #3 #5 #7
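
For others comparing logs across kernels, the warp lines can be pulled out of a saved dmesg mechanically. This is a minimal sketch with hypothetical names; it assumes the message wording stays exactly as in the excerpts quoted in this report.

```python
import re

SYNC_RE = re.compile(r"TSC synchronization \[CPU#(\d+) -> CPU#(\d+)\]")
WARP_RE = re.compile(r"Measured (\d+) cycles TSC warp between CPUs")


def find_tsc_warps(log: str) -> list[tuple[int, int, int]]:
    """Return (source CPU, target CPU, warp cycles) for each reported warp."""
    warps = []
    pair = None
    for line in log.splitlines():
        sync = SYNC_RE.search(line)
        if sync:
            # Remember which CPU pair the next "Measured" line belongs to.
            pair = (int(sync.group(1)), int(sync.group(2)))
            continue
        warp = WARP_RE.search(line)
        if warp and pair is not None:
            warps.append((pair[0], pair[1], int(warp.group(1))))
            pair = None
    return warps


# Lines copied from the 6.11-rc6 excerpt above.
sample = """\
[    0.286359] TSC synchronization [CPU#0 -> CPU#2]:
[    0.286363] Measured 9282405873 cycles TSC warp between CPUs, turning off TSC clock.
[    0.287708] tsc: Marking TSC unstable due to check_tsc_sync_source failed
"""
print(find_tsc_warps(sample))  # [(0, 2, 9282405873)]
```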