Bug 202525

Summary:	TSC marked unstable on AMD Ryzen 2000
Product:	Platform Specific/Hardware	Reporter:	Steven Noonan (steven)
Component:	x86-64	Assignee:	platform_x86_64 (platform_x86_64)
Status:	RESOLVED INVALID
Severity:	normal	CC:	alexdeucher, amonakov, belegdol, bp, bugzilla.kernel.org, e595, feng.tang, gabemarcano, james, jonas.zeiger, js, konoha02, marcel, marcel, mario.limonciello, marius.andreiana, maximlevitsky, mpearson-lenovo, nix.sasl, pgnet.dev, pmenzel+bugzilla.kernel.org, postix, rafael.ristovski, rubinov.alexander, samy, seb.szymanski, stefanspr94, t.neish, tglx, tomenglund26, tworaz, usama.anjum, vovochka13, wgh, whenov
Priority:	P1
Hardware:	x86-64
OS:	Linux
Kernel Version:	4.19.20	Subsystem:
Regression:	No	Bisected commit-id:
Bug Depends on:
Bug Blocks:	215037
Attachments:	4.19.20 kernel configuration; 4.19.20 kernel log; 4.19.20 TSC synchronization via MSR_IA32_TSC; Fancier patch for TSC synchronization; gross hack

Description Steven Noonan 2019-02-08 09:44:51 UTC

Created attachment 281057 [details]
4.19.20 kernel configuration

This is happening for every boot on my Lenovo ThinkPad A485, which has an AMD Ryzen 7 PRO 2700U CPU:

$ journalctl -k -b -2 -o short-monotonic | grep -i tsc
[    0.000000] quasit kernel: tsc: Fast TSC calibration using PIT
[    0.000000] quasit kernel: tsc: Detected 2195.873 MHz processor
[    0.494637] quasit kernel: clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1fa6f82c460, max_idle_ns: 440795248846 ns
[    0.544639] quasit kernel: TSC synchronization [CPU#0 -> CPU#1]:
[    0.544639] quasit kernel: Measured 3542962214 cycles TSC warp between CPUs, turning off TSC clock.
[    0.544639] quasit kernel: tsc: Marking TSC unstable due to check_tsc_sync_source failed

Is there some way to fix this? Firmware/microcode/something?

I'd really rather not use the HPET clocksource if at all possible because it's stupidly expensive to read compared to the TSC.

Attaching kernel config and log.

Comment 1 Steven Noonan 2019-02-08 09:45:17 UTC

Created attachment 281059 [details]
4.19.20 kernel log

Comment 2 Steven Noonan 2019-02-08 10:35:13 UTC

I used a simple tool to watch the TSC values for each CPU, and it looks like CPUs 1-7 are synchronized, but CPU 0 has a large offset compared to the others. They're all advancing at the same frequency (2195MHz) though:

     1028:     -8   1606   1606   1606   1606   1606   1606   1606  3695718 KHz
     2066:     -9   1605   1605   1605   1605   1605   1605   1605  2941100 KHz
     3103:     -8   1606   1606   1606   1606   1606   1606   1606  2692467 KHz
     4141:     -8   1606   1606   1606   1606   1605   1606   1606  2567740 KHz
     5178:     -7   1607   1607   1607   1607   1607   1607   1607  2493565 KHz
     6216:     -7   1607   1607   1607   1607   1607   1607   1607  2443642 KHz
     7253:     -6   1608   1607   1608   1607   1607   1608   1608  2408297 KHz
     8290:     -6   1609   1608   1608   1608   1608   1608   1608  2381780 KHz
     9328:     -6   1608   1608   1608   1608   1608   1609   1608  2361023 KHz
    10365:     -5   1609   1609   1609   1609   1609   1609   1609  2344619 KHz

To explain this a bit... This tool spawns a thread per CPU which reads from the TSC every millisecond. The main thread sleeps for a second, then reads from clock_gettime(CLOCK_MONOTONIC) and reads all the TSC values from the threads. Given a known (or measured) TSC frequency, we can see offsets or drift behavior.

In the output above, each line is prefixed with a millisecond timestamp. It's then followed by 8 values, one per CPU. Each value is the delta between the actual and expected TSC values for that sample. e.g. if we sleep for 1 second, we expect (1 second * TSC frequency) ticks to pass. This offset is expressed in milliseconds. The last value on the line is an estimation of the TSC frequency in KHz, which gets more accurate over time (as long as clock_gettime isn't drifting, that is).

The interesting and incriminating part in the output above is that CPU0 is about -1600ms offset from the TSCs on the other CPUs.

Comment 3 Steven Noonan 2019-02-08 14:45:09 UTC

Created attachment 281063 [details]
4.19.20 TSC synchronization via MSR_IA32_TSC

I wrote the attached patch to write to MSR_IA32_TSC directly if MSR_IA32_TSC_ADJUST is unavailable. This fixes the TSC on boot for my ThinkPad A485. I haven't tested the patch with suspend/resume or anything fancy yet though.

I've been told by some kernel developers that this approach has been tried before and is unreliable, but it worked for my case. I also note that writing to MSR_IA32_TSC is how the FreeBSD kernel accomplishes TSC synchronization by default.

Even if we aren't confident enough to take this path by default, it would be nice to have it available as a tsc= command line option or something.

The ideal solution here, of course, is for Lenovo to unbreak the firmware so this change is rendered unnecessary to begin with.

Comment 4 Ricardo 2019-07-06 23:40:50 UTC

I am experiencing the same issue on a Lenovo ThinkPad T495 with an AMD Ryzen 7 PRO 3700U CPU.

I will test the attached patch and see if it works for me.

Comment 5 Steven Noonan 2019-07-06 23:51:48 UTC

I've improved the patch slightly since my original post:

https://git.uplinklabs.net/steven/projects/archlinux/ec2/ec2-packages.git/tree/linux-hsw/0006-tsc-allow-directly-synchronizing-TSC-if-TSC_ADJUST-i.patch

Also, Lenovo is aware of the issue and has told me they're working on it. Though who knows how long it takes to qualify a new BIOS revision.

Comment 6 Ricardo 2019-07-07 12:19:32 UTC

I've tried your latest patch and (after adding "tsc=directsync" to the kernel parameters) it seems to have worked:

Before:

Jul 07 13:24:27 wizylap kernel: tsc: Fast TSC calibration using PIT
Jul 07 13:24:27 wizylap kernel: tsc: Detected 2295.648 MHz processor
Jul 07 13:24:27 wizylap kernel: clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x211726228a3, max_idle_ns: 440795287600 ns
Jul 07 13:24:27 wizylap kernel: TSC synchronization [CPU#0 -> CPU#1]:
Jul 07 13:24:27 wizylap kernel: Measured 3684618515 cycles TSC warp between CPUs, turning off TSC clock.
Jul 07 13:24:27 wizylap kernel: tsc: Marking TSC unstable due to check_tsc_sync_source failed

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource 
hpet

After:

Jul 07 13:34:57 wizylap kernel: tsc: Fast TSC calibration using PIT
Jul 07 13:34:57 wizylap kernel: tsc: Detected 2295.789 MHz processor
Jul 07 13:34:57 wizylap kernel: clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2117ab45c4c, max_idle_ns: 440795239981 ns
Jul 07 13:34:57 wizylap kernel: TSC direct sync: CPU1 observed -3684794074 warp. Overhead: 0
Jul 07 13:34:57 wizylap kernel: TSC direct sync: CPU1 observed 46 warp. Overhead: 299
Jul 07 13:34:57 wizylap kernel: TSC direct sync: CPU1 observed -69 warp. Overhead: 322
Jul 07 13:34:57 wizylap kernel: TSC direct sync: CPU1 observed -184 warp. Overhead: 667
Jul 07 13:34:57 wizylap kernel: TSC direct sync: CPU1 observed -529 warp. Overhead: 138
Jul 07 13:34:57 wizylap kernel: TSC direct sync: CPU2 observed -3684794005 warp. Overhead: 0
Jul 07 13:34:57 wizylap kernel: TSC direct sync: CPU3 observed -3684794028 warp. Overhead: 0
Jul 07 13:34:57 wizylap kernel: TSC direct sync: CPU4 observed -3684794005 warp. Overhead: 0
Jul 07 13:34:57 wizylap kernel: TSC direct sync: CPU5 observed -3684794005 warp. Overhead: 0
Jul 07 13:34:57 wizylap kernel: TSC direct sync: CPU6 observed -3684794005 warp. Overhead: 0
Jul 07 13:34:57 wizylap kernel: TSC direct sync: CPU7 observed -3684794005 warp. Overhead: 0
Jul 07 13:34:57 wizylap kernel: clocksource: Switched to clocksource tsc-early
Jul 07 13:34:57 wizylap kernel: tsc: Refined TSC clocksource calibration: 2295.687 MHz
Jul 07 13:34:57 wizylap kernel: clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x21174ac26ec, max_idle_ns: 440795209452 ns
Jul 07 13:34:57 wizylap kernel: clocksource: Switched to clocksource tsc

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource 
tsc
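
The same check can be scripted. This is a minimal sketch, assuming the standard clocksource sysfs directory used in the commands above; the function name and the `base` parameter are made up for illustration.

```python
from pathlib import Path

SYSFS_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base: Path = SYSFS_DIR) -> tuple[str, list[str]]:
    """Return (current clocksource, list of available clocksources)."""
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


if __name__ == "__main__":
    if SYSFS_DIR.is_dir():
        current, available = read_clocksources()
        print(f"current={current} available={available}")
        if current != "tsc":
            print("TSC is not in use; the kernel may have marked it unstable")
    else:
        print("no clocksource sysfs directory on this system")
```

If "tsc" is missing from the available list as well, the kernel has most likely marked it unstable earlier in the boot, and the kernel log should say why.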

That said, I'm still worried about the other kernel developers' comment that this approach is unreliable...

I have created a service request asking Lenovo to fix the T495 BIOS regarding this issue, although I'm not even sure that my request will reach the firmware/BIOS developers...

Thanks!

Comment 7 Ricardo 2019-07-09 02:34:28 UTC

I've been running with this patch and I haven't noticed any visible issues so far, except when suspending/resuming:

[  151.767845] ACPI: Low-level resume complete
[  151.767888] ACPI: EC: EC started
[  151.767889] PM: Restoring platform NVS memory
[  151.857270] Enabling non-boot CPUs ...
[  151.857306] x86: Booting SMP configuration:
[  151.857307] smpboot: Booting Node 0 Processor 1 APIC 0x1
[  152.011707] microcode: CPU1: patch_level=0x08108102
[  152.041720] TSC direct sync: CPU1 observed -353928071 warp. Overhead: 0
[  151.917551] TSC direct sync: CPU1 observed 46 warp. Overhead: 345
[  151.947553] TSC direct sync: CPU1 observed -23 warp. Overhead: 184
[  151.977555] TSC direct sync: CPU1 observed -23 warp. Overhead: 184
[  152.007556] TSC direct sync: CPU1 observed -23 warp. Overhead: 184
[  152.037557] TSC direct sync: CPU1 observed -23 warp. Overhead: 184
[  152.067559] TSC direct sync: CPU1 observed -23 warp. Overhead: 184
[  152.097560] TSC direct sync: CPU1 observed -23 warp. Overhead: 161
[  152.127562] TSC direct sync: CPU1 observed -23 warp. Overhead: 161
[  152.157564] TSC synchronization [CPU#0 -> CPU#1]:
[  152.157564] Measured 23 cycles TSC warp between CPUs, turning off TSC clock.
[  152.157577] tsc: Marking TSC unstable due to check_tsc_sync_source failed
[  152.157670] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[  152.157671] sched_clock: Marking unstable (151620901633, 446894724)<-(152824405350, -666736506)
[  152.165230] clocksource: Switched to clocksource hpet
[  152.068796] CPU1 is up
(...)

As you can see, after resuming, the clocksource switches to hpet again.

On another note, Lenovo's support department called me back and told me very clearly that they do not support Linux and that as long as Windows works on my laptop, then they can't do anything to help me with my issue. I insisted that they pass the information I provided to their BIOS development team because it's a BIOS problem (not a Linux one), but I really doubt they listened to me. So there's that.

Comment 8 Steven Noonan 2019-07-09 02:37:01 UTC

Lenovo is aware of it, and working on it. I'll ping my contact over there and ask if they've made progress.

Comment 9 Borislav Petkov 2019-07-10 12:42:07 UTC

(In reply to bugzilla.kernel.org from comment #7)
> On another note, Lenovo's support department called me back and told me very
> clearly that they do not support Linux and that as long as Windows works on
> my laptop, then they can't do anything to help me with my issue.

I guess the support department hasn't been updated yet: www.lenovo.com/linux

Comment 10 Alexander Monakov 2020-05-03 18:24:54 UTC

Steven, your updated version of the patch at git.uplinklabs.net seems to be no longer accessible; can you upload it as an attachment here?

Did Lenovo get around to fixing this in firmware?

(FWIW, I'm hitting a similar issue on a newer laptop with Ryzen 4500U, though I'm seeing a much smaller TSC warp, ~3600 cycles; but OTOH, sometimes it survives the check_tsc_sync_source sanity check and fails later anyway when the timekeeping watchdog validates TSC against the HPET over a period of 0.5 seconds)

Comment 11 Steven Noonan 2020-05-03 20:25:21 UTC

(In reply to Alexander Monakov from comment #10)
> Steven, your updated version of the patch at git.uplinklabs.net seems to be
> no longer accessible; can you upload it as an attachment here?

I removed the changeset as it was buggy and firmware fixes solve the problem better. But if you want the patch you can dig through the git history:

https://git.uplinklabs.net/steven/projects/archlinux/ec2/ec2-packages.git/commit/linux-hsw?id=e5768993437a3106eaa7b518c79665447c4923bc


> Did Lenovo get around to fixing this in firmware?

Yes, they fixed it for the A485 in the 1.28 BIOS update, with this comment in their changelog:

"(Fix) Fixed TSC synchronization [CPU#0 -> CPU#1] under linux."


> (FWIW, I'm hitting a similar issue on a newer laptop with Ryzen 4500U,
> though I'm seeing a much smaller TSC warp, ~3600 cycles; but OTOH, sometimes
> it survives the check_tsc_sync_source sanity check and fails later anyway
> when the timekeeping watchdog validates TSC against the HPET over a period
> of 0.5 seconds)

Weird, that sounds like a different issue. Are you running the latest BIOS version?

Comment 12 Alexander Monakov 2020-05-03 20:41:13 UTC

For me the entire git.uplinklabs.net host is inaccessible; can you upload the patch please? Not just for my own sake, but also so that future readers can easily look at it.

> Weird, that sounds like a different issue. Are you running the latest BIOS
> version?

I'm seeing it on a 4500U which was just recently released, so even though I'm running the latest BIOS version (from Acer/Insyde), it's also the first public version.

Comment 13 Steven Noonan 2020-05-03 21:09:34 UTC

(In reply to Alexander Monakov from comment #12)
> For me the entire git.uplinklabs.net host is inaccessible; can you upload
> the patch please? Not just for my own sake, but also so that future readers
> can easily look at it.

Should be accessible now. It has a country-based blacklist because I was getting a lot of attacks against the system. I turned that off for the moment.

Comment 14 Alexander Monakov 2020-05-03 21:21:52 UTC

Created attachment 288879 [details]
Fancier patch for TSC synchronization

Uploading Steven's patch for future reference.

(thank you, but was adjusting the blacklist really easier than just attaching the patch?..)

Comment 15 Steven Noonan 2020-05-03 21:33:38 UTC

(In reply to Alexander Monakov from comment #14)
> Created attachment 288879 [details]
> Fancier patch for TSC synchronization
> 
> Uploading Steven's patch for future reference.

Note that I don't recommend use of the patch because it's flakey, especially around suspend/resume. But it should be enough to get the clock synced early on for a single boot.

> (thank you, but was adjusting the blacklist really easier than just
> attaching the patch?..)

Yes, actually. It's a one-line change.

Comment 16 Julian Stecklina 2020-10-05 19:18:08 UTC

The Thinkpad L14 Gen 1 also has this problem.

BIOS Information
        Vendor: LENOVO
        Version: R19ET28W (1.12 )
        Release Date: 08/12/2020

Comment 17 Alexander Monakov 2020-10-05 20:27:00 UTC

I managed to report this problem to AMD back in May; they said they were investigating, but likely the fix would be via a firmware update (i.e. via Acer in my case).

Here's more or less what my report to AMD said: https://gist.github.com/amonakov/c65b633f97e5b301f691563ea2f8c636

Comment 18 Julian Stecklina 2020-10-05 20:34:41 UTC

Interesting. If AMD has to fix their reference code that would still have to trickle to customers via the device vendors. This is going to take forever. :(

For visibility, here is also a Lenovo forum post describing the same issue: https://forums.lenovo.com/t5/Other-Linux-Discussions/Clocksource-falling-back-to-HPET-bec-TSC-is-unstable-ThinkPad-E14-Gen-2-maybe-others/m-p/5036464

Comment 19 Steven Noonan 2020-10-05 20:38:00 UTC

(In reply to Alexander Monakov from comment #17)
> I managed to report this problem to AMD back in May; they said they were
> investigating, but likely the fix would be via a firmware update (i.e. via
> Acer in my case).
> 
> Here's more or less what my report to AMD said:
> https://gist.github.com/amonakov/c65b633f97e5b301f691563ea2f8c636

I've been seeing that particular kind of behavior on my Zephyrus G14 (Ryzen 9 4900HS) with BIOS 217, but only on a warm reboot. If I do a cold boot (power off, wait a few seconds, power on), the TSC stays stable. I haven't tested whether it stays stable across suspend/resume, but I know reboots cause it to desync like that.

It'd be nice if AMD would implement the IA32_TSC_ADJUST MSR in the future, so that the TSC can be trivially resynced by the kernel (as long as the TSC frequencies match, which they seem to in my case).

Comment 20 Borislav Petkov 2020-10-05 20:57:35 UTC

(In reply to Julian Stecklina from comment #18)
> Interesting. If AMD has to fix their reference code that would still have to
> trickle to customers via the device vendors. This is going to take forever.

I doubt this is an AGESA problem. From past experience this is more likely the OEM BIOS doing monkey business with the TSCs, stuff it should not be doing at all. If it were an AGESA problem, we'd be seeing this left and right, and on AMD reference platforms too.

Comment 21 Maxim Levitsky 2021-03-31 10:32:03 UTC

I have the same issue on Thinkpad X13
(https://psref.lenovo.com/Detail/ThinkPad/ThinkPad_X13_Gen_1_AMD?M=20UF001CUS)

I updated it to latest bios via fwupd (0.1.30), no change.

Comment 22 Maxim Levitsky 2021-04-05 20:53:02 UTC

Created attachment 296245 [details]
gross hack

FYI, I hacked the patch in comment 14 so that it seems to work across suspends.

I did 2 horrible hacks:

1. Disable the TSC watchdog (possible with tsc=nowatchdog, but the tsc= parser cannot enable both this and directsync, so I just enabled it on directsync).
I think it barks at the fact that the TSC sync moves it by a relatively large value.
It is probably possible to reset it somehow, but since the whole thing is a horrible hack anyway, I guess this is good enough.

2. Ignore small warps (up to 1000 cycles), although the warp is usually far smaller than that;
warps are usually close to the TSC read overhead.
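
The second hack can be pictured as a simple threshold on the measured warp. This is not the code from the attachment, only a sketch of the idea: the function name and the use of an absolute value are assumptions, while the 1000-cycle limit is the figure quoted above.

```python
SMALL_WARP_CYCLES = 1000  # limit quoted in the comment above


def warp_needs_resync(warp_cycles: int, limit: int = SMALL_WARP_CYCLES) -> bool:
    """Return True if a measured warp is too large to be ignored."""
    # Warps near the cost of reading the TSC are treated as measurement noise.
    return abs(warp_cycles) > limit


# Warp values reported by the "TSC direct sync" lines in comment 7.
for warp in (-353928071, 46, -23):
    print(warp, "resync" if warp_needs_resync(warp) else "ignore")
```

With such a threshold, the leftover 23-cycle warp that caused the TSC to be marked unstable after resume in comment 7 would be ignored.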

Seems to survive suspends, my VMs work with TSC.
Also, when doing 'sudo rdmsr 0x10 -a', only the last 24 bits vary, similar to my 3970X, on which the TSC is stable (and it looks like it's 2021 and that still can't be taken for granted).

Comment 23 Julian Stecklina 2021-04-11 15:36:54 UTC

I used to have this issue on my Thinkpad L14 (see above), but it seems some firmware update fixed it. I don't see TSC being disabled with the current firmware anymore.

[    0.000000] DMI: LENOVO 20U50001GE/20U50001GE, BIOS R19ET33W (1.17 ) 03/10/2021

Comment 24 Sébastien Szymanski 2021-04-12 08:13:58 UTC

I see this issue too on my Tuxedo Computers PULSE 15 (Ryzen 7 4800H)

[    0.000000] DMI: TUXEDO TUXEDO Pulse 15 Gen1/PULSE1501, BIOS N.1.07.A02 12/08/2020

Comment 25 Gabriel Marcano 2021-04-18 23:39:50 UTC

I also seem to have this problem with my Dell G5 SE (Ryzen 9 4900H):
[    0.000000] DMI: Dell Inc. G5 5505/0M8C1F, BIOS 1.5.0 10/27/2020

And related TSC lines from the kernel:
# dmesg | egrep 'TSC|tsc'
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] tsc: Detected 3293.843 MHz processor
[    0.087060] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2f7a945ce7b, max_idle_ns: 440795356303 ns
[    0.217278] TSC synchronization [CPU#0 -> CPU#8]:
[    0.217279] Measured 5082 cycles TSC warp between CPUs, turning off TSC clock.
[    0.217280] tsc: Marking TSC unstable due to check_tsc_sync_source failed

I found something online that might be related (and I might end up trying as well):
https://www.dell.com/community/Linux-General/Installing-Ubuntu-18-04-on-Inspiron-5575/td-p/7350323

tl;dr, some people are reporting that changing some NVMe parameters fixes their timing issue. This hack seems unrelated to me, but I'm unfamiliar with NVMe and if/how it could possibly cause issues.

Comment 26 Gabriel Marcano 2021-04-19 00:36:19 UTC

Sadly, the directions from the Dell link did not seem to have an effect on my machine, even though I also have a WD NVMe drive. I will attempt to contact Dell and see if they have anything to say about this, since this does look like a hardware/UEFI problem to me.

Comment 27 Maxim Levitsky 2021-06-21 06:10:06 UTC

FYI I am still using my gross hack, and it still works across suspends. I still have no new firmware available via fwupd sadly.

Comment 28 Jonas Zeiger 2021-07-01 09:14:29 UTC

Problem was fixed on my Thinkpad T14 Gen1 AMD (Ryzen 7) with System Firmware 1.32:

https://support.lenovo.com/us/en/downloads/ds544977-bios-update-utility-bootable-cd-for-windows-10-64-bit-thinkpad-t14-gen-1-types-20ud-20ue

Apparently no more TSC issues:
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource 
tsc

Comment 29 Jonas Zeiger 2021-07-05 07:08:31 UTC

The problem still sporadically reappears with firmware 1.32 and Linux 5.12.14 (TSC unstable, clocksource switched to HPET) after cold boots, but is generally gone after a reboot.

Comment 30 Marcel Sackermann 2021-07-23 15:52:39 UTC

(In reply to Jonas Zeiger from comment #29)
> The problem still sporadically reappears with firmware 1.32 and Linux
> 5.12.14 (TSC unstable, clocksource switched to HPET) after cold boots, but
> is generally gone after a reboot.

This. Very interesting. The described behavior is reproducible on my T14 Gen1 AMD (Firmware 1.34, Linux 5.13.4) as well.

Cold boot -> tsc-early unstable; clocksource = hpet
reboot    -> clocksource = tsc-early

Comment 31 Nelson G 2021-07-26 07:33:24 UTC

Thinkpad E495

 

$ dmesg | egrep -i 'tsc|hpet|clocksource'
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] tsc: Detected 2096.096 MHz processor
[    0.004968] ACPI: HPET 0x00000000B90A3000 000038 (v01 LENOVO TP-R11   00001210 PTEC 00000002)
[    0.005018] ACPI: Reserving HPET table memory at [mem 0xb90a3000-0xb90a3037]
[    0.014115] ACPI: HPET id: 0x43538210 base: 0xfed00000
[    0.014183] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
[    0.059588] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
[    0.079620] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1e36c89f085, max_idle_ns: 440795235573 ns
[    0.199622] TSC synchronization [CPU#0 -> CPU#1]:
[    0.199622] Measured 5479288407 cycles TSC warp between CPUs, turning off TSC clock.
[    0.199622] tsc: Marking TSC unstable due to check_tsc_sync_source failed
[    0.216551] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.373390] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
[    0.373393] hpet0: 3 comparators, 32-bit 14.318180 MHz counter
[    0.375642] clocksource: Switched to clocksource hpet
[    0.392540] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    1.137978] rtc_cmos 00:01: alarms up to one month, y3k, 114 bytes nvram, hpet irqs

Comment 32 Jonas Zeiger 2021-08-11 07:30:18 UTC

I keep wondering why my Lenovo T14 Gen1 AMD shows this symptom only sometimes and tried some things.

It could be a series of flukes, but it seems like TSC stays stable when the Ethernet cable is *NOT* plugged during cold-boot and marked as unstable if cold-booting with Ethernet cable plugged.

Assuming this observation is correct: what cares about the Ethernet link during early boot? What does that have to do with an unstable TSC?

Comment 33 Alex Deucher 2021-08-20 15:15:53 UTC

Does this patch help?
https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/3108085

Comment 34 Alexander Monakov 2021-08-20 16:14:08 UTC

(In reply to Alex Deucher from comment #33)
> Does this patch help?
> https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/
> 3108085

This patch touches a different functionality. There are two separate TSC validation mechanisms referenced in this bug:

1) early check implemented in arch/x86/kernel/tsc_sync.c (check_tsc_sync_source); it validates TSCs on different cores against each other;

2) "timekeeping watchdog" that validates each TSC against HPET over a period of 0.5s (not 1 second as in the patched code); it is implemented in kernel/time/clocksource.c (clocksource_watchdog).

Both of these were observed to fail. For me, the first reliably fails after a soft reboot, and the second occasionally fails on cold boots.

As far as I can tell the patch does not touch the first one at all, and should not have an effect on the second one either.

Comment 35 Jonas Zeiger 2021-08-30 08:06:16 UTC

> Does this patch help?
>
> https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/3108085

I tried v5.12.19 with the patch applied: no change (tsc still randomly marked unstable on cold-boot)

Comment 36 Jonas Zeiger 2021-08-30 10:54:29 UTC

On Lenovo T14 AMD Gen1, Firmware 1.35, Linux 5.12.19 the problem is still reproducible.

Comment 37 Mik 2021-08-30 11:50:51 UTC

ThinkPad E585 FHD

Same mobo, same CPU, same steady reading (HPET). The RJ-45 cable has no effect, nor does a cold boot. There were no side effects during the entire Kernel 5.14 cycle. Debian or Arch, Legacy or UEFI.

Since the correction for the iommu=soft issue was applied in BIOS 1.54 (a painful journey of work), we have not modified any parameters or made any manual corrections, and we have no intention to do so. The latest BIOS is 1.58 (Dec. 2019) and there won't be any new one.

The Kernel is 5.14-rc7/MBR/TPM 2.0 is OFF.

Comment 38 Tom Englund 2021-09-13 09:14:44 UTC

Acer Nitro 5 AN515-45 ryzen 5800H

Seeing TSC issues as well; tried both patches but I'm still getting this at boot.

Adding tsc=reliable to the cmdline makes it run just fine, though. I'm on the latest BIOS with Linux 5.14.1 and microcode updates applied. Not sure if it's related or a different issue.

[    0.040743] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 3821924579961850 ns
[    0.088695] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
[    0.098718] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2e0abdfedbc, max_idle_ns: 440795208922 ns
[    0.248947] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 3822520892550000 ns
[    0.282750] clocksource: Switched to clocksource tsc-early
[    0.288743] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    1.302783] tsc: Refined TSC clocksource calibration: 3193.998 MHz
[    1.302794] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x2e0a244aeba, max_idle_ns: 440795290469 ns
[    1.302855] clocksource: Switched to clocksource tsc
[    2.102955] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
[    2.102977] clocksource:                       'hpet' wd_nsec: 488105680 wd_now: 1bc2226 wd_last: 1517e35 mask: ffffffff
[    2.102983] clocksource:                       'tsc' cs_nsec: 495853791 cs_now: 92b9b0aa0 cs_last: 8cd34d780 mask: ffffffffffffffff
[    2.102987] clocksource:                       'tsc' is current clocksource.
[    2.102993] tsc: Marking TSC unstable due to clocksource watchdog
[    2.103482] clocksource: Checking clocksource tsc synchronization from CPU 6 to CPUs 0,2,4-5,8,10,12.
[    2.103800] clocksource: Switched to clocksource hpet

Comment 39 Alexander Monakov 2021-09-13 11:14:07 UTC

The 5.14 release made timekeeping watchdog validation (TSCs against HPET) much tighter in this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/kernel/time/clocksource.c?id=2e27e793e280ff12cb5c202a1214c08b0d3a0f26

In comment #38 it's firing on a difference of 7 milliseconds over a 0.5-second period. Looks like more people will see the issue now.
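
For reference, the skew can be read straight off the two watchdog lines quoted in comment #38. A small sketch (the function name is made up; the log text is copied from that comment):

```python
import re

LOG = """\
clocksource:                       'hpet' wd_nsec: 488105680 wd_now: 1bc2226 wd_last: 1517e35 mask: ffffffff
clocksource:                       'tsc' cs_nsec: 495853791 cs_now: 92b9b0aa0 cs_last: 8cd34d780 mask: ffffffffffffffff
"""


def watchdog_skew_ns(log: str) -> int:
    """Return cs_nsec - wd_nsec from a clocksource watchdog report."""
    wd = int(re.search(r"wd_nsec: (\d+)", log).group(1))
    cs = int(re.search(r"cs_nsec: (\d+)", log).group(1))
    return cs - wd


skew = watchdog_skew_ns(LOG)
print(f"skew = {skew} ns ({skew / 1e6:.2f} ms)")
```

The difference is 7748111 ns, i.e. about 7.75 ms over the interval.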

Comment 40 Tom Englund 2021-09-14 12:59:44 UTC

(In reply to Alexander Monakov from comment #39)
> The 5.14 release made timekeeping watchdog validation (TSCs against HPET)
> much tighter in this commit:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> kernel/time/clocksource.c?id=2e27e793e280ff12cb5c202a1214c08b0d3a0f26
> 
> In comment #38 it's firing on a difference of 7 milliseconds over a
> 0.5-second period. Looks like more people will see the issue now.

Hm, looking at my numbers, isn't my hpet reporting too big of a skew, and hence it's marking the tsc as wrong because hpet is supposedly more accurate?

'hpet' wd_nsec: 488105680

'tsc' cs_nsec: 495853791

Isn't that supposed to be as close to 500* as possible? Meaning the tsc isn't skewed by that much, but the hpet is?

Comment 41 Alexander Monakov 2021-09-17 20:51:49 UTC

(In reply to Tom Englund from comment #40)
> Hm, looking at my numbers, isn't my hpet reporting too big of a skew, and
> hence it's marking the tsc as wrong because hpet is supposedly more accurate?
> 
> 'hpet' wd_nsec: 488105680
> 
> 'tsc' cs_nsec: 495853791
> 
> Isn't that supposed to be as close to 500* as possible? Meaning the tsc isn't
> skewed by that much, but the hpet is?

As I understand it, the timekeeping watchdog arms a 500-millisecond timer on CPU #0 using the currently selected clocksource. As your log says, the selected clocksource was TSC. So, according to the TSC on CPU #0, approximately 500 milliseconds should have passed, but in that time the HPET counted only 488 milliseconds, and the TSC on CPU #3 counted almost 496 milliseconds.

The TSC millisecond count (cs_nsec value) is closer to 500 just because the kernel was using the TSC to time 0.5 seconds in the first place.
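
The two intervals can also be reconstructed from the raw counter values in that log. This is a sketch under stated assumptions: the TSC frequency is the refined calibration printed in comment 38 (3193.998 MHz), while the HPET frequency of 14.318180 MHz is not printed in that log and is assumed to match the value other machines in this thread report. The function name is invented for illustration.

```python
HPET_HZ = 14_318_180    # assumed; "14.318180 MHz counter" on other machines here
TSC_HZ = 3_193_998_000  # "Refined TSC clocksource calibration: 3193.998 MHz"


def interval_ns(now_hex: str, last_hex: str, freq_hz: int, mask: int) -> float:
    """Nanoseconds elapsed between two raw counter readings."""
    # Masking the difference keeps it correct if the counter wrapped around.
    delta = (int(now_hex, 16) - int(last_hex, 16)) & mask
    return delta * 1e9 / freq_hz


# wd_now/wd_last and cs_now/cs_last from the watchdog report in comment 38.
wd = interval_ns("1bc2226", "1517e35", HPET_HZ, 0xFFFFFFFF)
cs = interval_ns("92b9b0aa0", "8cd34d780", TSC_HZ, 0xFFFFFFFFFFFFFFFF)
print(f"hpet: {wd / 1e6:.1f} ms, tsc: {cs / 1e6:.1f} ms")
```

Both results land within a few tens of nanoseconds of the wd_nsec and cs_nsec values the kernel printed, which suggests the reported intervals follow directly from the counter deltas rather than from any separate measurement.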

Comment 42 James Ettle 2021-09-19 00:03:39 UTC

Seen with a Ryzen 5 3400G, MSI Mortar Max B450, AGESA 1.2.0.2:

[    0.000000] tsc: Detected 3693.390 MHz processor
[    0.033049] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 3821924579961850 ns
[    0.077149] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
[    0.087181] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x6a79e303a22, max_idle_ns: 881590710719 ns
[    0.213408] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 3822520892550000 ns
[    0.246818] PTP clock support registered
[    0.251987] hpet0: 3 comparators, 32-bit 14.318180 MHz counter
[    0.253430] clocksource: Switched to clocksource tsc-early
[    0.261649] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    0.318820] rtc_cmos 00:02: setting system clock to 2021-09-18T23:20:26 UTC (1632007226)
[    0.696293] sched_clock: Marking stable (695958175, 322521)->(699829946, -3549250)
[    1.275199] tsc: Refined TSC clocksource calibration: 3693.061 MHz
[    1.275207] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x6a7777116fa, max_idle_ns: 881590883556 ns
[    1.303201] clocksource: Switched to clocksource tsc
[    1.794259] [drm] DM_PPLIB: values for F clock
[    1.794262] [drm] DM_PPLIB: values for DCF clock
[   22.092420] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
[   22.092440] clocksource:                       'hpet' wd_nsec: 496346043 wd_now: 12ca74a7 wd_last: 125e03d3 mask: ffffffff
[   22.092446] clocksource:                       'tsc' cs_nsec: 497230355 cs_now: 1ffde0ac9c cs_last: 1f906ced06 mask: ffffffffffffffff
[   22.092451] clocksource:                       'tsc' is current clocksource.
[   22.092458] tsc: Marking TSC unstable due to clocksource watchdog
[   22.092482] sched_clock: Marking unstable (22092156166, 323186)<-(22096028445, -3549250)
[   22.093065] clocksource: Checking clocksource tsc synchronization from CPU 4 to CPUs 0,7.
[   22.093139] clocksource: Switched to clocksource hpet

Comment 43 stefanspr94 2021-09-20 19:27:50 UTC

I seem to be affected as well. It started with either kernel 5.13.14 or 5.14.5 for me. 5.13.9 is fine and didn't show a single clocksource watchdog error in over two weeks of continuous uptime.


Ryzen 3900X
Gigabyte X570 Aorus Pro r1.0 F34 Agesa 1.2.0.3B
Kernel 5.14.5, 500Hz, Voluntary Preemption, Dynticks Idle


dmesg | grep clocksource


[    0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 3821924579961850 ns
[    0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
[    0.000003] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x6d588d6a09c, max_idle_ns: 881590727049 ns
[    0.158996] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 3822520892550000 ns
[    0.614064] clocksource: Switched to clocksource tsc-early
[    0.622232] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    1.646013] tsc: Refined TSC clocksource calibration: 3792.875 MHz
[    1.646022] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x6d581b92771, max_idle_ns: 881590605997 ns
[    1.646229] clocksource: Switched to clocksource tsc
[23985.411015] clocksource: timekeeping watchdog on CPU19: hpet retried 2 times before success
[44127.896993] clocksource: timekeeping watchdog on CPU9: hpet retried 2 times before success
[44160.393834] clocksource: timekeeping watchdog on CPU2: hpet retried 2 times before success
[44408.897615] clocksource: timekeeping watchdog on CPU19: hpet retried 3 times before success
[44885.381791] clocksource: timekeeping watchdog on CPU12: hpet retried 2 times before success
[45650.867725] clocksource: timekeeping watchdog on CPU7: hpet retried 3 times before success
[45797.855669] clocksource: timekeeping watchdog on CPU13: hpet retried 2 times before success
[46858.836896] clocksource: timekeeping watchdog on CPU23: hpet retried 2 times before success
[46876.836454] clocksource: timekeeping watchdog on CPU11: hpet retried 2 times before success
[48052.293244] clocksource: timekeeping watchdog on CPU10: hpet retried 3 times before success
[49087.274432] clocksource: timekeeping watchdog on CPU16: hpet retried 2 times before success
[49690.251364] clocksource: timekeeping watchdog on CPU22: hpet retried 2 times before success
[51695.622996] clocksource: timekeeping watchdog on CPU1: hpet retried 2 times before success
[52318.784130] clocksource: timekeeping watchdog on CPU23: hpet retried 3 times before success
[52445.343626] clocksource: timekeeping watchdog on CPU12: hpet retried 2 times before success
[52622.222711] clocksource: timekeeping watchdog on CPU6: hpet retried 2 times before success
[52658.169431] clocksource: timekeeping watchdog on CPU6: hpet retried 3 times before success
[52732.656197] clocksource: timekeeping watchdog on CPU11: hpet read-back delay of 52101ns, attempt 4, marking unstable
[52732.656970] tsc: Marking TSC unstable due to clocksource watchdog
[52732.658074] clocksource: Checking clocksource tsc synchronization from CPU 6 to CPUs 0,13,17-18,20,23.
[52732.658252] clocksource: Switched to clocksource hpet

dmesg | grep tsc

[    0.016000] tsc: PIT calibration matches HPET. 1 loops
[    0.016000] tsc: Detected 3792.936 MHz processor
[    0.000003] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x6d588d6a09c, max_idle_ns: 881590727049 ns
[    0.614064] clocksource: Switched to clocksource tsc-early
[    1.646013] tsc: Refined TSC clocksource calibration: 3792.875 MHz
[    1.646022] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x6d581b92771, max_idle_ns: 881590605997 ns
[    1.646229] clocksource: Switched to clocksource tsc
[52732.656970] tsc: Marking TSC unstable due to clocksource watchdog
[52732.657097] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[52732.658074] clocksource: Checking clocksource tsc synchronization from CPU 6 to CPUs 0,13,17-18,20,23.

So in my case, the HPET sometimes takes its sweet time to respond, triggering the conditions introduced with the following patch:
clocksource: Retry clock read if long delays detected
https://github.com/torvalds/linux/commit/db3a34e17433de2390eb80d436970edcebd0ca3e

If I understand it correctly, the reason it even tries comparing with the HPET is the stricter rules for what's an acceptable skew introduced in this patch:
clocksource: Reduce clocksource-skew threshold
https://github.com/torvalds/linux/commit/2e27e793e280ff12cb5c202a1214c08b0d3a0f26

I don't know whether this is relevant, but this old blog post (commenting on https://lkml.org/lkml/2008/9/25/451) explains why there has to be some slack in the first place (does this also imply the wiggle room should increase in proportion to the number of CPUs present?): https://www.chromium.org/chromium-os/how-tos-and-troubleshooting/tsc-resynchronization

Is something wrong with the hardware, or could it be that the new rules are just too tight? What are the side effects of switching to HPET as a clocksource in a desktop/workstation/multimedia system? Should I expect performance degradation, glitches, or my VMs suddenly breaking?
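For anyone who wants to see how often the HPET read-back is retried before the watchdog gives up, a small sketch like the following could summarize saved dmesg output. The regular expressions are written against the message wording shown in the log above and are an assumption for other kernel versions; the function name is hypothetical.

```python
import re
from collections import Counter

# Patterns modelled on the two message shapes in the dmesg output above.
RETRY_RE = re.compile(
    r"timekeeping watchdog on CPU(\d+): (\w+) retried (\d+) times before success")
GIVEUP_RE = re.compile(
    r"timekeeping watchdog on CPU(\d+): (\w+) read-back delay of (\d+)ns, "
    r"attempt (\d+), marking unstable")


def summarize_watchdog(log_text: str):
    """Return (Counter of retry counts, list of (cpu, delay_ns) give-ups)."""
    retries = Counter()
    giveups = []
    for line in log_text.splitlines():
        m = RETRY_RE.search(line)
        if m:
            retries[int(m.group(3))] += 1
            continue
        m = GIVEUP_RE.search(line)
        if m:
            giveups.append((int(m.group(1)), int(m.group(3))))
    return retries, giveups


SAMPLE = """\
[52622.222711] clocksource: timekeeping watchdog on CPU6: hpet retried 2 times before success
[52658.169431] clocksource: timekeeping watchdog on CPU6: hpet retried 3 times before success
[52732.656197] clocksource: timekeeping watchdog on CPU11: hpet read-back delay of 52101ns, attempt 4, marking unstable
"""

retries, giveups = summarize_watchdog(SAMPLE)
print("retry histogram:", dict(retries))
print("give-ups (cpu, delay in ns):", giveups)
```

Feeding it the full output of `dmesg | grep clocksource` instead of the three sample lines gives a quick picture of how frequently the slow reads occur before the final failure.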

Comment 44 Frank Kruger 2021-09-26 06:54:29 UTC

(In reply to bugzilla.kernel.org from comment #7)
> On another note, Lenovo's support department called me back and told me very
> clearly that they do not support Linux and that as long as Windows works on
> my laptop, then they can't do anything to help me with my issue. I insisted
> that they pass the information I provided to their BIOS development team
> because it's a BIOS problem (not a Linux one), but I really doubt they
> listened to me. So there's that.

JFYI: Mark RH Pearson (markpearson@lenovo.com) is Lenovo's lead technical engineer for the Linux PC team (https://forums.lenovo.com/user/viewprofilepage/user-id/1942528)

Comment 45 pgnd 2021-10-24 12:40:07 UTC

here, similarly

	cpu:    Ryzen 5 5600G
	mobo:   ASRockRack X470D4U
	bios:   vP4.20, 04/14/2021
	kernel: 5.14.13-200.fc34.x86_64 x86_64

dmesg | egrep "clocksource|hpet"
	[    0.095228] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
	[    0.292549] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
	[    0.298572] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x706dfa647dc, max_idle_ns: 881591068053 ns
	[    0.424848] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
	[    0.541944] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
	[    0.541944] hpet0: 3 comparators, 32-bit 14.318180 MHz counter
	[    0.543599] clocksource: Switched to clocksource tsc-early
	[    0.555786] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
	[    0.662499] rtc_cmos 00:02: alarms up to one month, y3k, 114 bytes nvram, hpet irqs
	[    1.603619] tsc: Refined TSC clocksource calibration: 3927.246 MHz
	[    1.603630] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x7137c868d1b, max_idle_ns: 881590679990 ns
	[    1.603697] clocksource: Switched to clocksource tsc
	[    2.299865] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
	[    2.301411] clocksource:                       'hpet' wd_nsec: 499726501 wd_now: 1c3308a wd_last: 15602a4 mask: ffffffff
	[    2.302991] clocksource:                       'tsc' cs_nsec: 496259389 cs_now: 1c92eacd6e cs_last: 1c1ec07430 mask: ffffffffffffffff
	[    2.304613] clocksource:                       'tsc' is current clocksource.
	[    2.305446] tsc: Marking TSC unstable due to clocksource watchdog
	[    2.306501] clocksource: Checking clocksource tsc synchronization from CPU 7 to CPUs 0,3-4,9.
	[    2.307437] clocksource: Switched to clocksource hpet

adding to cmd line

	clocksource=tsc clocksource_failover=tsc tsc=reliable force_tsc_stable=1

reboot

dmesg | egrep "clocksource|hpet"
	[    0.296329] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
	[    0.302353] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x706e8b70714, max_idle_ns: 881590462336 ns
	[    0.418791] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
	[    0.536550] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
	[    0.536550] hpet0: 3 comparators, 32-bit 14.318180 MHz counter
	[    0.538384] clocksource: Switched to clocksource tsc-early
	[    0.551081] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
	[    0.653393] rtc_cmos 00:02: alarms up to one month, y3k, 114 bytes nvram, hpet irqs
	[    1.607671] tsc: Refined TSC clocksource calibration: 3926.815 MHz
	[    1.607681] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x71349ad6649, max_idle_ns: 881590462736 ns
	[    1.607759] clocksource: Switched to clocksource tsc
	[    4.424455]     clocksource_failover=tsc

Is it recommended to allow the switch to hpet, or to force the tsc?
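Independently of which choice is recommended, the clocksource actually in use after boot can be read from sysfs rather than inferred from dmesg. The sketch below assumes the standard `/sys/devices/system/clocksource/clocksource0` directory is present; the helper name is made up for illustration.

```python
from pathlib import Path

# Standard sysfs location of the clocksource files on Linux.
SYSFS_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(sysfs_dir: Path = SYSFS_DIR):
    """Return (current clocksource, list of available clocksources)."""
    current = (sysfs_dir / "current_clocksource").read_text().strip()
    available = (sysfs_dir / "available_clocksource").read_text().split()
    return current, available
```

Calling `read_clocksources()` on a running system returns something like the pair `("tsc", ["tsc", "hpet", "acpi_pm"])` when the TSC survived, and `tsc` drops out of use once the watchdog has marked it unstable.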

Comment 46 Julian Sikorski 2021-11-11 17:34:03 UTC

I have recently experienced an onslaught of these messages on an Asus Zenbook UM425IA with Ryzen 5 4500U, in correlation with s0ix resume failures:

$ dmesg | egrep "clocksource|hpet"
[    0.066342] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    0.138579] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
[    0.144599] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x222bfdb946d, max_idle_ns: 440795315613 ns
[    0.263301] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[    0.402660] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
[    0.402662] hpet0: 3 comparators, 32-bit 14.318180 MHz counter
[    0.404885] clocksource: Switched to clocksource tsc-early
[    1.404546] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    1.456907] rtc_cmos 00:01: alarms up to one month, y3k, 114 bytes nvram, hpet irqs
[    2.475617] tsc: Refined TSC clocksource calibration: 2370.544 MHz
[    2.475627] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x222b856c74e, max_idle_ns: 440795290389 ns
[    2.475705] clocksource: Switched to clocksource tsc
[   26.650613] clocksource: timekeeping watchdog on CPU4: Marking clocksource 'tsc' as unstable because the skew is too large:
[   26.650621] clocksource:                       'hpet' wd_nsec: 227874841 wd_now: 1665386d wd_last: 16336f4c mask: ffffffff
[   26.650625] clocksource:                       'tsc' cs_nsec: 503973359 cs_now: 13bea89949 cs_last: 1377730fbf mask: ffffffffffffffff
[   26.650628] clocksource:                       'tsc' is current clocksource.
[   26.650634] tsc: Marking TSC unstable due to clocksource watchdog
[   26.651389] clocksource: Checking clocksource tsc synchronization from CPU 5 to CPUs 0-1,4.
[   26.651471] clocksource: Switched to clocksource hpet

Comment 47 Marcel Ziswiler 2021-11-17 01:06:34 UTC

first generation Lenovo ThinkPad T14 AMD (Ryzen 7 PRO 4750U based) running 5.14.17-301.fc35.x86_64:

Nov 16 15:16:16 fedora kernel: clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
Nov 16 15:16:16 fedora kernel: clocksource:                       'hpet' wd_nsec: 500868965 wd_now: 1ba34ef wd_last: 14cc723 mask: ffffffff
Nov 16 15:16:16 fedora kernel: clocksource:                       'tsc' cs_nsec: 497009832 cs_now: 8bb86f4c0 cs_last: 888decbe6 mask: ffffffffffffffff
Nov 16 15:16:16 fedora kernel: clocksource:                       'tsc' is current clocksource.
Nov 16 15:16:16 fedora kernel: tsc: Marking TSC unstable due to clocksource watchdog
Nov 16 15:16:16 fedora kernel: TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
Nov 16 15:16:16 fedora kernel: sched_clock: Marking unstable (2218464864, 376657)<-(2237226433, -18385478)
Nov 16 15:16:16 fedora kernel: clocksource: Checking clocksource tsc synchronization from CPU 6 to CPUs 0-1,4-5,12,15.
Nov 16 15:16:16 fedora kernel: clocksource: Switched to clocksource hpet

Comment 48 Nix\ 2021-11-26 02:45:07 UTC

Same bug on a Lenovo IdeaPad L340-15API with a Ryzen 3 3200U.
I found this thread while googling the dmesg warning.

Kernel 5.15.5 Manjaro.

dmesg | egrep -i 'tsc|hpet|clocksource'
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] tsc: Detected 2595.105 MHz processor
[    0.005826] ACPI: HPET 0x00000000B9576000 000038 (v01 LENOVO CB-01    00000001 PTEC 00000002)
[    0.005881] ACPI: Reserving HPET table memory at [mem 0xb9576000-0xb9576037]
[    0.023566] ACPI: HPET id: 0x43538210 base: 0xfed00000
[    0.023651] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370452778343963 ns
[    0.085358] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
[    0.102060] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x25682bdad31, max_idle_ns: 440795225592 ns
[    0.218732] TSC synchronization [CPU#0 -> CPU#1]:
[    0.218732] Measured 3733953964 cycles TSC warp between CPUs, turning off TSC clock.
[    0.218732] tsc: Marking TSC unstable due to check_tsc_sync_source failed
[    0.219775] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370867519511994 ns
[    0.254175] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
[    0.254175] hpet0: 3 comparators, 32-bit 14.318180 MHz counter
[    0.257491] clocksource: Switched to clocksource hpet
[    0.270560] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    0.411400] rtc_cmos 00:01: alarms up to one month, y3k, 114 bytes nvram, hpet irqs
[   17.689050] vboxdrv: TSC mode is Invariant, tentative frequency 2595121321 Hz

Comment 49 WGH 2021-11-29 14:09:24 UTC

Someone above said that some BIOS update fixed it for ThinkPad L14 Gen 1, but I'm currently having this on 

DMI: LENOVO 20U5003NRT/20U5003NRT, BIOS R19ET36W (1.20 ) 07/12/2021.

Comment 50 Nix\ 2021-12-19 23:59:03 UTC

With acpi=off clocksourcec=tsc on the kernel command line, the kernel selects tsc-early and does not mark it unstable, but the 4 cores of the Ryzen 3 3200U disappear and only one shows up.
Kernel 5.15.10 and 5.16-rc5

Comment 51 Julian Stecklina 2021-12-26 21:20:42 UTC

I got it again on my Thinkpad L14 Gen2 with 5.16-rc6:

[    2.154325] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
[    2.154329] clocksource:                       'hpet' wd_nsec: 495907650 wd_now: 1ba864b wd_last: 14e2dfc mask: ffffffff
[    2.154333] clocksource:                       'tsc' cs_nsec: 503650490 cs_now: 84c6e1957 cs_last: 8197df12b mask: ffffffffffffffff
[    2.154336] clocksource:                       'tsc' is current clocksource.
[    2.154342] tsc: Marking TSC unstable due to clocksource watchdog
[    2.154356] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[    2.154358] sched_clock: Marking unstable (2154008130, 346704)<-(2172981241, -18626378)
[    2.154907] clocksource: Checking clocksource tsc synchronization from CPU 13 to CPUs 0,2,4,7,9.

This is with:
[    0.000000] DMI: LENOVO 20U50001GE/20U50001GE, BIOS R19ET36W (1.20 ) 07/12/2021

This might be a regression from earlier BIOS versions. I was the one that reported success with BIOS version 1.17.

@Nix: acpi=off is not a viable option these days. You need it for various reasons.

Comment 52 Nix\ 2022-01-12 11:08:25 UTC

Well, I don't know whether this information can help the developers or AMD fix it.
Yesterday I compiled a vanilla 5.16 kernel and forgot to enable SMP support, so the kernel started with a single core.
The TSC clock was stable.

Later I compiled it again with SMP support enabled, and the TSC is unstable. With the default 5.16 kernel from Manjaro, the TSC is unstable too.

Link to picture 5.16 custom no SMP: https://imgur.com/a0ciCrp

I am not a developer and I really can't understand how disabling SMP makes the TSC clock stable.

And this is now with the stock Manjaro 5.16 kernel:

[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] tsc: Detected 2595.144 MHz processor
[    0.103950] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x256850c0173, max_idle_ns: 440795271144 ns
[    0.220621] TSC synchronization [CPU#0 -> CPU#1]:
[    0.220621] Measured 4143265984 cycles TSC warp between CPUs, turning off TSC clock.
[    0.220621] tsc: Marking TSC unstable due to check_tsc_sync_source failed
[   18.124359] vboxdrv: TSC mode is Invariant, tentative frequency 2595145458 Hz
[   30.385511] SVM: TSC scaling supported
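On the question of why a non-SMP kernel does not complain: the "TSC warp between CPUs" message comes from comparing counter readings taken on different CPUs, so with a single CPU there is nothing to compare against. The following is a deliberately simplified toy model of that idea, not the kernel's check_tsc_sync_source code; the per-CPU offsets, tick size, and function names are all assumptions for illustration.

```python
def interleaved_readings(offsets, steps=4, tick=100):
    """Simulate CPUs taking turns reading their own counter.

    offsets[i] is a hypothetical constant offset of CPU i's counter
    relative to a shared time base that advances by `tick` per read.
    """
    readings = []
    t = 0
    for _ in range(steps):
        for cpu, offset in enumerate(offsets):
            t += tick
            readings.append((cpu, t + offset))
    return readings


def max_warp(readings):
    """Largest backwards jump between consecutive readings (0 if none)."""
    worst = 0
    last = None
    for _cpu, tsc in readings:
        if last is not None and tsc < last:
            worst = max(worst, last - tsc)
        last = tsc
    return worst


print("one CPU:          ", max_warp(interleaved_readings([0])))
print("two CPUs, in sync:", max_warp(interleaved_readings([0, 0])))
print("two CPUs, offset: ", max_warp(interleaved_readings([1000, 0])))
```

In this model a single CPU can never observe time going backwards, while two CPUs whose counters start from different values do so on every hand-over, which is the kind of situation the boot-time message reports.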

Comment 53 James Ettle 2022-01-23 18:28:46 UTC

I've just changed to a 5700G and get this too:

[    2.081370] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
[    2.081371] clocksource:                       'hpet' wd_nsec: 499140044 wd_now: 1b971c7 wd_last: 14c64ae mask: ffffffff
[    2.081373] clocksource:                       'tsc' cs_nsec: 495716395 cs_now: c41661756 cs_last: bd08e8802 mask: ffffffffffffffff
[    2.081374] clocksource:                       'tsc' is current clocksource.
[    2.081380] tsc: Marking TSC unstable due to clocksource watchdog
[    2.081386] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[    2.081387] sched_clock: Marking unstable (2081136487, 249626)<-(2097250056, -15864482)
[    2.081545] clocksource: Checking clocksource tsc synchronization from CPU 5 to CPUs 0-1,6,9,14.
[    2.081573] clocksource: Switched to clocksource hpet

However the fix posted by Thomas Gleixner for bug 208887 (delay the calibration) works here too. Not sure what I'd report to AMD/MSI on this one.

Comment 54 James Ettle 2022-02-12 15:33:51 UTC

I take that back, just tried 5.16.9 with the patch - tsc marked unstable. Seems like a warm reboot has an influence.

Comment 55 Paul Menzel 2022-02-26 21:33:47 UTC

No idea if it is related, but on some Ryzen systems fast TSC calibration fails, and *[PATCH v2] x86/tsc: Allow quick PIT calibration despite interruptions* [1] fixes it.

[1]: https://lore.kernel.org/all/20190214214608.8672-1-jan@schnhrr.de/

Comment 56 Jonas Zeiger 2022-05-31 09:32:55 UTC

I again tried my luck with Jan H. Schönherr's patch with v5.15.44 (applied without problems) on a "Thinkpad T14 AMD Gen1 20UD" with latest EFI Firmware v1.40:

[    0.000000] DMI: LENOVO 20UD0013GE/20UD0013GE, BIOS R1BET71W(1.40 ) 04/05/2022

After a cold boot TSC is usually unusable:

May 31 10:53:55 tsc: Fast TSC calibration using PIT
May 31 10:53:55 tsc: Detected 1696.898 MHz processor
May 31 10:53:55 clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1875b4ef64c, max_idle_ns: 440795203028 ns
May 31 10:53:55 clocksource: Switched to clocksource tsc-early
May 31 10:53:55 tsc: Refined TSC clocksource calibration: 1709.706 MHz
May 31 10:53:55 clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x18a4f82559b, max_idle_ns: 440795270331 ns
May 31 10:53:55 clocksource: Switched to clocksource tsc
May 31 10:53:55 clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
May 31 10:53:55 clocksource:                       'tsc' cs_nsec: 496003006 cs_now: 7e747fcff cs_last: 7b4bc3db0 mask: ffffffffffffffff
May 31 10:53:55 clocksource:                       'tsc' is current clocksource.
May 31 10:53:55 tsc: Marking TSC unstable due to clocksource watchdog
May 31 10:53:55 clocksource: Checking clocksource tsc synchronization from CPU 0 to CPUs 1-3,5,7,9,15.


After a warm start (reboot) TSC is available:

May 31 10:54:24 tsc: Fast TSC calibration using PIT
May 31 10:54:24 tsc: Detected 1696.656 MHz processor
May 31 10:54:24 clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1874d03af40, max_idle_ns: 440795221980 ns
May 31 10:54:25 clocksource: Switched to clocksource tsc-early
May 31 10:54:25 tsc: Refined TSC clocksource calibration: 1696.819 MHz
May 31 10:54:25 clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x18756a4cab8, max_idle_ns: 440795267109 ns
May 31 10:54:25 clocksource: Switched to clocksource tsc


Neither firmware v1.40 nor the patch changed the status quo (warmstart required).

There seems to be a noticeable performance difference between TSC enabled and TSC unavailable.
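One thing that stands out in the two logs above is how far the early ("Detected") and refined calibration values are apart on the cold boot compared with the warm start. A minimal sketch of that arithmetic, using only the MHz values from the logs (the function name is made up, and no claim is made here about the cause):

```python
def calibration_diff_ppm(early_mhz: float, refined_mhz: float) -> float:
    """Difference between refined and early TSC calibration, in ppm."""
    return (refined_mhz - early_mhz) / early_mhz * 1_000_000


# "Detected ... MHz" vs. "Refined TSC clocksource calibration" from the logs.
cold = calibration_diff_ppm(1696.898, 1709.706)
warm = calibration_diff_ppm(1696.656, 1696.819)

print(f"cold boot: {cold:.0f} ppm apart")
print(f"warm boot: {warm:.0f} ppm apart")
```

On the cold boot the two calibrations disagree by several thousand ppm, while after the warm start they agree to within about a hundred ppm.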

Comment 57 Paul Menzel 2022-05-31 12:04:32 UTC

Mario, could AMD please look into this, and improve Jan’s patches? Mark, as also Lenovo Thinkpads are affected, could your engineers please reproduce the issue, and help to analyze and hopefully fix it?

Comment 58 Mark Pearson 2022-05-31 15:11:07 UTC

Thanks Paul for letting me know about this ticket.

Yes - this is already on our radar and I've been working with Mario on it and we have plans for fixing it but no ETA available yet.

Mark

Comment 59 James Ettle 2022-06-05 13:50:06 UTC

(In reply to Paul Menzel from comment #55)
> No idea if related, on some Ryzen system fast TSC calibration fails, and
> *[PATCH v2] x86/tsc: Allow quick PIT calibration despite interruptions* [1]
> fixes it.
> 
> [1]: https://lore.kernel.org/all/20190214214608.8672-1-jan@schnhrr.de/

Didn't observe any improvement on the 5700G. I'm now trying tsc=reliable and seeing what the long-term impact is.

Comment 60 Paul Menzel 2022-06-06 11:17:24 UTC

The bug has *AMD Ryzen 7 PRO 2700U* in its title, and Steven reported:

> Yes, they fixed it for the A485 in the 1.28 BIOS update, with this comment in
> their changelog:
> 
> "(Fix) Fixed TSC synchronization [CPU#0 -> CPU#1] under linux."

Should the bug title be renamed, or marked as solved, and a new issue be opened for the other reports?

Comment 61 Steven Noonan 2022-06-06 12:54:43 UTC

(In reply to Paul Menzel from comment #60)
> The bug has *AMD Ryzen 7 PRO 2700U* in it’s title, and Steven reported:
> 
> > Yes, they fixed it for the A485 in the 1.28 BIOS update, with this comment in
> > their changelog:
> > 
> > "(Fix) Fixed TSC synchronization [CPU#0 -> CPU#1] under linux."
> 
> Should the bug title be renamed, or marked as solved, and a new issue be
> opened for the other reports?

I'll just rename the issue.

The problem is a broader AMD problem than just affecting the ThinkPad A485. And many people in this thread are talking about a wide variety of devices. I think moving this to a new issue would be a mistake, as people on the CC list who are monitoring this issue would lose visibility, and I suspect the broader issue would fall through the cracks as a result.

Comment 62 Alexander Monakov 2022-06-06 21:08:20 UTC

Thank you Steven.

In my communication with AMD, they eventually said they found a root cause in SBIOS and issued an AGESA update (disproving comment #20), and were investigating a similar problem that on linux-5.15 affected "platforms supporting Modern Standby" (where the fix would also be via a firmware update).

Comment 63 James Ettle 2022-06-12 14:22:59 UTC

(In reply to James Ettle from comment #59)
> Didn't observe any improvement on the 5700G. I'm now trying tsc=reliable and
> seeing what the long-term impact is.

Ah, 7000 ppm slow following a reboot... bad idea.
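For scale, here is a rough conversion of that figure into wall-clock drift, assuming the 7000 ppm error stays constant over the whole day (an assumption; the helper name is made up):

```python
SECONDS_PER_DAY = 86_400


def drift_seconds_per_day(ppm: float) -> float:
    """Clock drift per day for a constant frequency error given in ppm."""
    return ppm * 1e-6 * SECONDS_PER_DAY


print(f"{drift_seconds_per_day(7000):.1f} s/day")
```

That works out to about ten minutes of drift per day.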

(In reply to Alexander Monakov from comment #62)
> In my communication with AMD, they eventually said they found a root cause
> in SBIOS and issued an AGESA update (disproving comment #20), and were
> investigating a similar problem that on linux-5.15 affected "platforms
> supporting Modern Standby" (where the fix would also be via a firmware
> update).

Is the fixed AGESA version number known?

Comment 64 Alexander Monakov 2022-06-12 14:38:01 UTC

(In reply to James Ettle from comment #63)
> 
> Is the fixed AGESA version number known?

1.0.0.6 for the initial issue I asked about, no idea about the follow-up issue related to linux-5.15 they were investigating (as of September 2021).

Comment 65 Mario Limonciello (AMD) 2022-06-13 15:14:26 UTC

> The problem is a broader AMD problem than just affecting the ThinkPad A485.
> And many people in this thread are talking about a wide variety of devices. I
> think moving this to a new issue would be a mistake, as people on the CC list
> who are monitoring this issue would lose visibility, and I suspect the
> broader issue would fall through the cracks as a result.


I think it's a mistake to mix this up across different generations. Even if it's an issue that can be reproduced on AMD's reference designs, it should be tracked for one family of APUs/CPUs at a time. So please let's keep this issue on Ryzen 2000. If someone has problems with Ryzen 3000/4000/5000/6000, let's have separate bugs.

Comment 66 Mario Limonciello (AMD) 2022-06-13 15:26:42 UTC

As comments #60 and #61 indicate, this is fixed by the OEM BIOS for Ryzen 2000, so I will close *this* issue to indicate that this was not a bug to be addressed by the kernel, but rather a BIOS bug.

Comment 67 Steven Noonan 2022-06-13 15:28:01 UTC

(In reply to Mario Limonciello (AMD) from comment #65)
> So please let's keep this issue on Ryzen 2000.  If someone has problems
> with Ryzen 3000/4000/5000/6000 lets have separate bugs.

If we must, but I am concerned those issues would not get the same attention this issue now has. It took a very long time for enough users to chime in on this bug for the right people to get CC'd and for AMD to notice this issue.

Every AMD laptop I've had since filing this in 2019 has had TSC sync issues (most often after warm reboots). So that includes the 2700U, 4900HS, and 5900HS CPUs.

Comment 68 Mario Limonciello (AMD) 2022-06-13 15:48:40 UTC

Although they may "look/fail" the same to the kernel, they need to be root-caused and addressed individually, as they may be OEM-specific or generation-specific issues. Lumping them all together is a good way for nothing to get solved. Bugs are cheap and can easily be duped if we're wrong and it's the same root cause across OEMs or generations.

I've personally confirmed on a variety of Ryzen 6000 systems both cold and warm boot are working properly.

Comment 69 James Ettle 2022-06-18 14:45:31 UTC

(In reply to Mario Limonciello (AMD) from comment #65)

> So please let's keep this
> issue on Ryzen 2000.  If someone has problems with Ryzen 3000/4000/5000/6000
> lets have separate bugs.

Created bug 216146 for 5700G.

Comment 70 Alexander 2022-07-07 08:30:22 UTC

Ryzen 6000 series is affected as well, see a review of the Lenovo ThinkPad T14 Gen 3 with BIOS 1.17 [1].

1: https://www.reddit.com/r/thinkpad/comments/voxsr0/

Comment 71 Mario Limonciello (AMD) 2022-07-07 12:00:30 UTC

> Ryzen 6000 series is affected as well, see a review of the Lenovo ThinkPad
> T14 Gen 3 with BIOS 1.17 [1].

This is not the same problem; it's a warp between CPU cores. It should have its own bug report.