Created attachment 281057 [details]
4.19.20 kernel configuration

This is happening for every boot on my Lenovo ThinkPad A485, which has an AMD Ryzen 7 PRO 2700U CPU:

$ journalctl -k -b -2 -o short-monotonic | grep -i tsc
[ 0.000000] quasit kernel: tsc: Fast TSC calibration using PIT
[ 0.000000] quasit kernel: tsc: Detected 2195.873 MHz processor
[ 0.494637] quasit kernel: clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1fa6f82c460, max_idle_ns: 440795248846 ns
[ 0.544639] quasit kernel: TSC synchronization [CPU#0 -> CPU#1]:
[ 0.544639] quasit kernel: Measured 3542962214 cycles TSC warp between CPUs, turning off TSC clock.
[ 0.544639] quasit kernel: tsc: Marking TSC unstable due to check_tsc_sync_source failed

Is there some way to fix this? Firmware/microcode/something? I'd really rather not use the HPET clocksource if at all possible because it's stupidly expensive to read compared to the TSC.

Attaching kernel config and log.
Created attachment 281059 [details] 4.19.20 kernel log
I used a simple tool to watch the TSC values for each CPU, and it looks like CPUs 1-7 are synchronized, but CPU 0 has a large offset compared to the others. They're all advancing at the same frequency (2195MHz) though:

1028: -8 1606 1606 1606 1606 1606 1606 1606 3695718 KHz
2066: -9 1605 1605 1605 1605 1605 1605 1605 2941100 KHz
3103: -8 1606 1606 1606 1606 1606 1606 1606 2692467 KHz
4141: -8 1606 1606 1606 1606 1605 1606 1606 2567740 KHz
5178: -7 1607 1607 1607 1607 1607 1607 1607 2493565 KHz
6216: -7 1607 1607 1607 1607 1607 1607 1607 2443642 KHz
7253: -6 1608 1607 1608 1607 1607 1608 1608 2408297 KHz
8290: -6 1609 1608 1608 1608 1608 1608 1608 2381780 KHz
9328: -6 1608 1608 1608 1608 1608 1609 1608 2361023 KHz
10365: -5 1609 1609 1609 1609 1609 1609 1609 2344619 KHz

To explain this a bit... This tool spawns a thread per CPU which reads from the TSC every millisecond. The main thread sleeps for a second, then reads from clock_gettime(CLOCK_MONOTONIC) and reads all the TSC values from the threads. Given a known (or measured) TSC frequency, we can see offsets or drift behavior.

In the output above, each line is prefixed with a millisecond timestamp. It's then followed by 8 values, one per CPU. Each value is the delta between the actual and expected TSC values for that sample. e.g. if we sleep for 1 second, we expect (1 second * TSC frequency) ticks to pass. This offset is expressed in milliseconds. The last value on the line is an estimation of the TSC frequency in KHz, which gets more accurate over time (as long as clock_gettime isn't drifting, that is).

The interesting and incriminating part in the output above is that CPU0 is about -1600ms offset from the TSCs on the other CPUs.
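The arithmetic behind those per-CPU columns can be sketched roughly as follows. This is not the tool itself (its source is not part of this report); the function names, the common starting value, and the sample readings are all made up for illustration, and the 2195 MHz frequency is just the rounded figure quoted above.

```python
def tsc_offsets_ms(tsc_base, tsc_now_per_cpu, elapsed_s, tsc_hz):
    """Per-CPU difference between the actual and the expected TSC value, in ms.

    tsc_base is a hypothetical common starting value; elapsed_s would come
    from clock_gettime(CLOCK_MONOTONIC) in a real measurement.
    """
    expected = tsc_base + elapsed_s * tsc_hz
    return [(now - expected) / tsc_hz * 1000.0 for now in tsc_now_per_cpu]


def estimate_khz(tsc_ticks, elapsed_s):
    """Estimate the TSC frequency in kHz from ticks counted over elapsed_s."""
    return tsc_ticks / elapsed_s / 1000.0


TSC_HZ = 2_195_000_000  # assumed, rounded from the report

# Made-up readings after 1 second: one CPU slightly behind, one far ahead.
readings = [2_177_440_000, 5_720_170_000]
offsets = tsc_offsets_ms(0, readings, 1.0, TSC_HZ)
print([round(x) for x in offsets])
```

A constant offset like this shows up as a fixed number in each column, whereas a frequency mismatch would show up as a column that keeps growing or shrinking from one sample to the next.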
Created attachment 281063 [details] 4.19.20 TSC synchronization via MSR_IA32_TSC I wrote the attached patch to write to MSR_IA32_TSC directly if MSR_IA32_TSC_ADJUST is unavailable. This fixes the TSC on boot for my ThinkPad A485. I haven't tested the patch with suspend/resume or anything fancy yet though. I've been told by some kernel developers that this approach has been tried before and is unreliable, but it worked for my case. I also note that writing to MSR_IA32_TSC is how the FreeBSD kernel accomplishes TSC synchronization by default. Even if we aren't confident enough to take this path by default, it would be nice to have it available as a tsc= command line option or something. The ideal solution here, of course, is for Lenovo to unbreak the firmware so this change is rendered unnecessary to begin with.
I am experiencing the same issue on a Lenovo ThinkPad T495 with an AMD Ryzen 7 PRO 3700U CPU. I will test the attached patch and see if it works for me.
I've improved the patch slightly since my original post: https://git.uplinklabs.net/steven/projects/archlinux/ec2/ec2-packages.git/tree/linux-hsw/0006-tsc-allow-directly-synchronizing-TSC-if-TSC_ADJUST-i.patch Also, Lenovo is aware of the issue and has told me they're working on it. Though who knows how long it takes to qualify a new BIOS revision.
I've tried your latest patch and (after adding "tsc=directsync" to kernel parameters) it seems to have worked, apparently:

Before:

Jul 07 13:24:27 wizylap kernel: tsc: Fast TSC calibration using PIT
Jul 07 13:24:27 wizylap kernel: tsc: Detected 2295.648 MHz processor
Jul 07 13:24:27 wizylap kernel: clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x211726228a3, max_idle_ns: 440795287600 ns
Jul 07 13:24:27 wizylap kernel: TSC synchronization [CPU#0 -> CPU#1]:
Jul 07 13:24:27 wizylap kernel: Measured 3684618515 cycles TSC warp between CPUs, turning off TSC clock.
Jul 07 13:24:27 wizylap kernel: tsc: Marking TSC unstable due to check_tsc_sync_source failed

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet

After:

Jul 07 13:34:57 wizylap kernel: tsc: Fast TSC calibration using PIT
Jul 07 13:34:57 wizylap kernel: tsc: Detected 2295.789 MHz processor
Jul 07 13:34:57 wizylap kernel: clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2117ab45c4c, max_idle_ns: 440795239981 ns
Jul 07 13:34:57 wizylap kernel: TSC direct sync: CPU1 observed -3684794074 warp. Overhead: 0
Jul 07 13:34:57 wizylap kernel: TSC direct sync: CPU1 observed 46 warp. Overhead: 299
Jul 07 13:34:57 wizylap kernel: TSC direct sync: CPU1 observed -69 warp. Overhead: 322
Jul 07 13:34:57 wizylap kernel: TSC direct sync: CPU1 observed -184 warp. Overhead: 667
Jul 07 13:34:57 wizylap kernel: TSC direct sync: CPU1 observed -529 warp. Overhead: 138
Jul 07 13:34:57 wizylap kernel: TSC direct sync: CPU2 observed -3684794005 warp. Overhead: 0
Jul 07 13:34:57 wizylap kernel: TSC direct sync: CPU3 observed -3684794028 warp. Overhead: 0
Jul 07 13:34:57 wizylap kernel: TSC direct sync: CPU4 observed -3684794005 warp. Overhead: 0
Jul 07 13:34:57 wizylap kernel: TSC direct sync: CPU5 observed -3684794005 warp. Overhead: 0
Jul 07 13:34:57 wizylap kernel: TSC direct sync: CPU6 observed -3684794005 warp. Overhead: 0
Jul 07 13:34:57 wizylap kernel: TSC direct sync: CPU7 observed -3684794005 warp. Overhead: 0
Jul 07 13:34:57 wizylap kernel: clocksource: Switched to clocksource tsc-early
Jul 07 13:34:57 wizylap kernel: tsc: Refined TSC clocksource calibration: 2295.687 MHz
Jul 07 13:34:57 wizylap kernel: clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x21174ac26ec, max_idle_ns: 440795209452 ns
Jul 07 13:34:57 wizylap kernel: clocksource: Switched to clocksource tsc

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

That said, I'm still worried about the other kernel developers' comment that this approach is unreliable... I have created a service request asking Lenovo to fix the T495 BIOS regarding this issue, although I'm not even sure that my request will reach the firmware/BIOS developers... Thanks!
I've been running with this patch and I haven't noticed any visible issues so far, except when suspending/resuming:

[ 151.767845] ACPI: Low-level resume complete
[ 151.767888] ACPI: EC: EC started
[ 151.767889] PM: Restoring platform NVS memory
[ 151.857270] Enabling non-boot CPUs ...
[ 151.857306] x86: Booting SMP configuration:
[ 151.857307] smpboot: Booting Node 0 Processor 1 APIC 0x1
[ 152.011707] microcode: CPU1: patch_level=0x08108102
[ 152.041720] TSC direct sync: CPU1 observed -353928071 warp. Overhead: 0
[ 151.917551] TSC direct sync: CPU1 observed 46 warp. Overhead: 345
[ 151.947553] TSC direct sync: CPU1 observed -23 warp. Overhead: 184
[ 151.977555] TSC direct sync: CPU1 observed -23 warp. Overhead: 184
[ 152.007556] TSC direct sync: CPU1 observed -23 warp. Overhead: 184
[ 152.037557] TSC direct sync: CPU1 observed -23 warp. Overhead: 184
[ 152.067559] TSC direct sync: CPU1 observed -23 warp. Overhead: 184
[ 152.097560] TSC direct sync: CPU1 observed -23 warp. Overhead: 161
[ 152.127562] TSC direct sync: CPU1 observed -23 warp. Overhead: 161
[ 152.157564] TSC synchronization [CPU#0 -> CPU#1]:
[ 152.157564] Measured 23 cycles TSC warp between CPUs, turning off TSC clock.
[ 152.157577] tsc: Marking TSC unstable due to check_tsc_sync_source failed
[ 152.157670] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[ 152.157671] sched_clock: Marking unstable (151620901633, 446894724)<-(152824405350, -666736506)
[ 152.165230] clocksource: Switched to clocksource hpet
[ 152.068796] CPU1 is up
(...)

As you can see, after resuming, the clocksource switches to hpet again.

On another note, Lenovo's support department called me back and told me very clearly that they do not support Linux and that as long as Windows works on my laptop, then they can't do anything to help me with my issue. I insisted that they pass the information I provided to their BIOS development team because it's a BIOS problem (not a Linux one), but I really doubt they listened to me. So there's that.
Lenovo is aware of it, and working on it. I'll ping my contact over there and ask if they've made progress.
(In reply to bugzilla.kernel.org from comment #7) > On another note, Lenovo's support department called me back and told me very > clearly that they do not support Linux and that as long as Windows works on > my laptop, then they can't do anything to help me with my issue. I guess the support department hasn't been updated yet: www.lenovo.com/linux
Steven, your updated version of the patch at git.uplinklabs.net seems to be no longer accessible; can you upload it as an attachment here? Did Lenovo get around to fixing this in firmware? (FWIW, I'm hitting a similar issue on a newer laptop with Ryzen 4500U, though I'm seeing a much smaller TSC warp, ~3600 cycles; but OTOH, sometimes it survives the check_tsc_sync_source sanity check and fails later anyway when the timekeeping watchdog validates TSC against the HPET over a period of 0.5 seconds)
(In reply to Alexander Monakov from comment #10) > Steven, your updated version of the patch at git.uplinklabs.net seems to > longer accessible; can you upload it as attachment here? I removed the changeset as it was buggy and firmware fixes solve the problem better. But if you want the patch you can dig through the git history: https://git.uplinklabs.net/steven/projects/archlinux/ec2/ec2-packages.git/commit/linux-hsw?id=e5768993437a3106eaa7b518c79665447c4923bc > Did Lenovo get around to fix this in firmware? Yes, they fixed it for the A485 in the 1.28 BIOS update, with this comment in their changelog: "(Fix) Fixed TSC synchronization [CPU#0 -> CPU#1] under linux." > (FWIW, I'm hitting a similar issue on a newer laptop with Ryzen 4500U, > though I'm seeing a much smaller TSC warp, ~3600 cycles; but OTOH, sometimes > it survives the check_tsc_sync_source sanity check and fails later anyway > when the timekeeping watchdog validates TSC against the HPET over a period > of 0.5 seconds) Weird, that sounds like a different issue. Are you running the latest BIOS version?
For me the entire git.uplinklabs.net host is inaccessible; can you upload the patch please? Not just for my own sake, but also so that future readers can easily look at it. > Weird, that sounds like a different issue. Are you running the latest BIOS > version? I'm seeing it on a 4500U which was just recently released, so even though I'm running the latest BIOS version (from Acer/Insyde), it's also the first public version.
(In reply to Alexander Monakov from comment #12) > For me the entire git.uplinklabs.net host is inaccessible; can you upload > the patch please? Not just for my own sake, but also so that future readers > can easily look at it. Should be accessible now. It has a country-based blacklist because I was getting a lot of attacks against the system. I turned that off for the moment.
Created attachment 288879 [details] Fancier patch for TSC synchronization Uploading Steven's patch for future reference. (thank you, but was adjusting the blacklist really easier than just attaching the patch?..)
(In reply to Alexander Monakov from comment #14) > Created attachment 288879 [details] > Fancier patch for TSC synchronization > > Uploading Steven's patch for future reference. Note that I don't recommend use of the patch because it's flakey, especially around suspend/resume. But it should be enough to get the clock synced early on for a single boot. > (thank you, but was adjusting the blacklist really easier than just > attaching the patch?..) Yes, actually. It's a one-line change.
The Thinkpad L14 Gen 1 also has this problem. BIOS Information Vendor: LENOVO Version: R19ET28W (1.12 ) Release Date: 08/12/2020
I managed to report this problem to AMD back in May; they said they were investigating, but likely the fix would be via a firmware update (i.e. via Acer in my case). Here's more or less what my report to AMD said: https://gist.github.com/amonakov/c65b633f97e5b301f691563ea2f8c636
Interesting. If AMD has to fix their reference code that would still have to trickle to customers via the device vendors. This is going to take forever. :( For visibility, here is also a Lenovo forum post describing the same issue: https://forums.lenovo.com/t5/Other-Linux-Discussions/Clocksource-falling-back-to-HPET-bec-TSC-is-unstable-ThinkPad-E14-Gen-2-maybe-others/m-p/5036464
(In reply to Alexander Monakov from comment #17) > I managed to report this problem to AMD back in May; they said they were > investigating, but likely the fix would be via a firmware update (i.e. via > Acer in my case). > > Here's more or less what my report to AMD said: > https://gist.github.com/amonakov/c65b633f97e5b301f691563ea2f8c636 I've been seeing that particular kind of behavior on my Zephyrus G14 (Ryzen 9 4900HS) with BIOS 217, but only on a warm reboot. If I do a cold boot (power off, wait a few seconds, power on), the TSC stays stable. I haven't tested whether it stays stable across suspend/resume, but I know reboots cause it to desync like that. It'd be nice if AMD would implement the IA32_TSC_ADJUST MSR in the future, so that the TSC can be trivially resynced by the kernel (as long as the TSC frequencies match, which they seem to in my case).
(In reply to Julian Stecklina from comment #18) > Interesting. If AMD has to fix their reference code that would still have to > trickle to customers via the device vendors. This is going to take forever. I doubt this is an agesa problem - from past experience this is more like OEM BIOS doing monkey business with the TSCs. Stuff it should not be doing at all. If it were agesa problem, we'd be seeing this left and right and that on AMD reference platforms too.
I have the same issue on Thinkpad X13 (https://psref.lenovo.com/Detail/ThinkPad/ThinkPad_X13_Gen_1_AMD?M=20UF001CUS) I updated it to latest bios via fwupd (0.1.30), no change.
Created attachment 296245 [details]
gross hack

FYI, I hacked the patch in comment 14 so that it seems to work across suspends. I did 2 horrible hacks:

1. Disable the TSC watchdog (possible with tsc=nowatchdog, but there is no parser to enable both this and directsync, so I just enabled it on directsync). I think it barks at the fact that the TSC sync moves the TSC by a relatively large value. It is probably possible to reset it somehow, but since the whole thing is a horrible hack anyway, I guess this is good enough.
2. Ignore small warps (up to 1000 cycles). The warps are usually far smaller than that, close to the TSC read overhead.

Seems to survive suspends, and my VMs work with TSC. Also, when doing 'sudo rdmsr 0x10 -a', only the last 24 bits vary, similar to my 3970X, on which the TSC is stable (and it looks like it's 2021 and that still can't be taken for granted).
I used to have this issue on my Thinkpad L14 (see above), but it seems some firmware update fixed it. I don't see TSC being disabled with the current firmware anymore. [ 0.000000] DMI: LENOVO 20U50001GE/20U50001GE, BIOS R19ET33W (1.17 ) 03/10/2021
I see this issue too on my Tuxedo Computers PULSE 15 (Ryzen 7 4800H) [ 0.000000] DMI: TUXEDO TUXEDO Pulse 15 Gen1/PULSE1501, BIOS N.1.07.A02 12/08/2020
I also seem to have this problem with my Dell G5 SE (Ryzen 9 4900H):

[ 0.000000] DMI: Dell Inc. G5 5505/0M8C1F, BIOS 1.5.0 10/27/2020

And related TSC lines from the kernel:

# dmesg | egrep 'TSC|tsc'
[ 0.000000] tsc: Fast TSC calibration using PIT
[ 0.000000] tsc: Detected 3293.843 MHz processor
[ 0.087060] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2f7a945ce7b, max_idle_ns: 440795356303 ns
[ 0.217278] TSC synchronization [CPU#0 -> CPU#8]:
[ 0.217279] Measured 5082 cycles TSC warp between CPUs, turning off TSC clock.
[ 0.217280] tsc: Marking TSC unstable due to check_tsc_sync_source failed

I found something online that might be related (and I might end up trying as well): https://www.dell.com/community/Linux-General/Installing-Ubuntu-18-04-on-Inspiron-5575/td-p/7350323

tl;dr, some people are reporting that changing some NVME parameters fix their timing issue. This hack seems unrelated to me, but I'm unfamiliar with NVME and if/how it could possibly cause issues.
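To put the reported warp into time units, the cycle count can be divided by the detected frequency. A small sketch, using only the two log lines quoted above; the function name and the regular expressions are mine, not something the kernel provides:

```python
import re

# Two lines copied from the dmesg output in this comment.
LOG = """\
[ 0.000000] tsc: Detected 3293.843 MHz processor
[ 0.217279] Measured 5082 cycles TSC warp between CPUs, turning off TSC clock.
"""


def warp_duration_us(log):
    """Convert the 'Measured N cycles TSC warp' figure into microseconds."""
    mhz = float(re.search(r"Detected ([\d.]+) MHz processor", log).group(1))
    cycles = int(re.search(r"Measured (\d+) cycles TSC warp", log).group(1))
    # MHz is cycles per microsecond, so cycles / MHz gives microseconds.
    return cycles / mhz


print(f"{warp_duration_us(LOG):.2f} us")
```

By the same arithmetic, the 3542962214-cycle warp at 2195.873 MHz from the original report works out to roughly 1.6 seconds, which lines up with the ~1600 ms offset measured there. The warp here is several orders of magnitude smaller.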
Sadly, the directions from the Dell link did not seem to have an effect on my machine, even though I also have a WD NVME drive. I will attempt to contact Dell and see if they have anything to say about this, since this does look like a hardware/UEFI problem to me.
FYI I am still using my gross hack, and it still works across suspends. I still have no new firmware available via fwupd sadly.
Problem was fixed on my Thinkpad T14 Gen1 AMD (Ryzen 7) with System Firmware 1.32: https://support.lenovo.com/us/en/downloads/ds544977-bios-update-utility-bootable-cd-for-windows-10-64-bit-thinkpad-t14-gen-1-types-20ud-20ue Apparently no more TSC issues: $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource tsc
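For anyone who wants to check this from a script rather than with cat, here is a minimal sketch. The sysfs directory is the one used above; the helper name is made up, and reading available_clocksource from the same directory is an addition of mine.

```python
from pathlib import Path

# Directory taken from the cat command above.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource_attr(name, base=CLOCKSOURCE_DIR):
    """Return the stripped contents of one sysfs attribute, e.g. 'tsc' or 'hpet'."""
    return (Path(base) / name).read_text().strip()


def is_using_tsc(base=CLOCKSOURCE_DIR):
    """True if the kernel is currently using the TSC as its clocksource."""
    return read_clocksource_attr("current_clocksource", base) == "tsc"
```

On an affected machine, `read_clocksource_attr("current_clocksource")` would be expected to return `hpet` after the TSC has been marked unstable, and `read_clocksource_attr("available_clocksource")` lists the alternatives the kernel still considers usable.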
The problem still sporadically reappears with firmware 1.32 and Linux 5.12.14 (TSC unstable, clocksource switched to HPET) after cold boots, but is generally gone after a reboot.
(In reply to Jonas Zeiger from comment #29) > The problem still sporadically reappears with firmware 1.32 and Linux > 5.12.14 (TSC unstable, clocksource switched to HPET) after cold boots, but > is generally gone after a reboot. This. Very interesting. The described behavior is reproducible on my T14 Gen1 AMD (Firmware 1.34, Linux 5.13.4) as well. Cold boot -> tsc-early unstable; clocksource = hpet reboot -> clocksource = tsc-early
Thinkpad E495

$ dmesg | egrep -i 'tsc|hpet|clocksource'
[ 0.000000] tsc: Fast TSC calibration using PIT
[ 0.000000] tsc: Detected 2096.096 MHz processor
[ 0.004968] ACPI: HPET 0x00000000B90A3000 000038 (v01 LENOVO TP-R11 00001210 PTEC 00000002)
[ 0.005018] ACPI: Reserving HPET table memory at [mem 0xb90a3000-0xb90a3037]
[ 0.014115] ACPI: HPET id: 0x43538210 base: 0xfed00000
[ 0.014183] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
[ 0.059588] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
[ 0.079620] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1e36c89f085, max_idle_ns: 440795235573 ns
[ 0.199622] TSC synchronization [CPU#0 -> CPU#1]:
[ 0.199622] Measured 5479288407 cycles TSC warp between CPUs, turning off TSC clock.
[ 0.199622] tsc: Marking TSC unstable due to check_tsc_sync_source failed
[ 0.216551] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[ 0.373390] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
[ 0.373393] hpet0: 3 comparators, 32-bit 14.318180 MHz counter
[ 0.375642] clocksource: Switched to clocksource hpet
[ 0.392540] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[ 1.137978] rtc_cmos 00:01: alarms up to one month, y3k, 114 bytes nvram, hpet irqs
I keep wondering why my Lenovo T14 Gen1 AMD shows this symptom only sometimes and tried some things. It could be a series of flukes, but it seems like TSC stays stable when the Ethernet cable is *NOT* plugged during cold-boot and marked as unstable if cold-booting with Ethernet cable plugged. Assuming this observation is correct: what is caring about ethernet link during early boot? What does that have to do with an unstable TSC?
Does this patch help? https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/3108085
(In reply to Alex Deucher from comment #33) > Does this patch help? > https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/ > 3108085 This patch touches a different functionality. There are two separate TSC validation mechanisms referenced in this bug: 1) early check implemented in arch/x86/kernel/tsc_sync.c (check_tsc_sync_source); it validates TSCs on different cores against each other; 2) "timekeeping watchdog" that validates each TSC against HPET over a period of 0.5s (not 1 second as in the patched code); it is implemented in kernel/time/clocksource.c (clocksource_watchdog). Both of these were observed to fail. For me, the first reliably fails after a soft reboot, and the second occasionally fails on cold boots. As far as I can tell the patch does not touch the first one at all, and should not have an effect on the second one either.
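As a rough illustration of what the first mechanism looks for, here is a toy model. It is not the kernel code: the real check runs concurrently on two CPUs with proper locking, while this sketch just walks a made-up list of readings in the order they were taken and reports the largest step backwards.

```python
def max_warp(readings):
    """Largest backwards step seen in a sequence of TSC readings.

    `readings` holds values in the order they were taken, alternating
    between CPUs. If the TSCs were in sync, each reading would be at
    least as large as the one before it.
    """
    worst = 0
    prev = None
    for now in readings:
        if prev is not None and now < prev:
            # Time appears to go backwards when moving between CPUs.
            worst = max(worst, prev - now)
        prev = now
    return worst


# Made-up example: the second CPU's TSC runs about 500 cycles ahead.
print(max_warp([1000, 1510, 1020, 1530]))
```

The second mechanism is different in kind: it compares elapsed time on one clocksource against elapsed time on another, so it can also catch a TSC that ticks at the wrong rate, not just one with a constant per-CPU offset.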
> Does this patch help? > > https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/3108085 I tried v5.12.19 with the patch applied: no change (tsc still randomly marked unstable on cold-boot)
On Lenovo T14 AMD Gen1, Firmware 1.35, Linux 5.12.19 the problem is still reproducible.
ThinkPad E585 FHD: same mobo, same CPU, same steady reading (HPET). The RJ-45 cable has no effect, nor does a cold boot. There were no side effects during the entire kernel 5.14 cycle, on Debian or Arch, legacy or UEFI. Since the correction for the iommu=soft issue was applied in BIOS 1.54 (a painful journey of work), we have not modified any parameters or made any manual corrections, and we have no intention to do so. The latest BIOS is 1.58 (Dec. 2019) and there won't be a new one. The kernel is 5.14-rc7, MBR; TPM 2.0 is OFF.
Acer Nitro 5 AN515-45 (Ryzen 5800H) seeing TSC issues as well. I tried both patches but I'm still getting this at boot. Adding tsc=reliable to the cmdline makes it run just fine, though. I'm on the latest BIOS with Linux 5.14.1 and microcode applied. Not sure if it's related or a different issue.

[ 0.040743] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 3821924579961850 ns
[ 0.088695] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
[ 0.098718] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2e0abdfedbc, max_idle_ns: 440795208922 ns
[ 0.248947] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 3822520892550000 ns
[ 0.282750] clocksource: Switched to clocksource tsc-early
[ 0.288743] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[ 1.302783] tsc: Refined TSC clocksource calibration: 3193.998 MHz
[ 1.302794] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x2e0a244aeba, max_idle_ns: 440795290469 ns
[ 1.302855] clocksource: Switched to clocksource tsc
[ 2.102955] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 2.102977] clocksource: 'hpet' wd_nsec: 488105680 wd_now: 1bc2226 wd_last: 1517e35 mask: ffffffff
[ 2.102983] clocksource: 'tsc' cs_nsec: 495853791 cs_now: 92b9b0aa0 cs_last: 8cd34d780 mask: ffffffffffffffff
[ 2.102987] clocksource: 'tsc' is current clocksource.
[ 2.102993] tsc: Marking TSC unstable due to clocksource watchdog
[ 2.103482] clocksource: Checking clocksource tsc synchronization from CPU 6 to CPUs 0,2,4-5,8,10,12.
[ 2.103800] clocksource: Switched to clocksource hpet
The 5.14 release made timekeeping watchdog validation (TSCs against HPET) much tighter in this commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/kernel/time/clocksource.c?id=2e27e793e280ff12cb5c202a1214c08b0d3a0f26 In comment #38 it's firing on a difference of 7 milliseconds over a 0.5-second period. Looks like more people will see the issue now.
(In reply to Alexander Monakov from comment #39) > The 5.14 release made timekeeping watchdog validation (TSCs against HPET) > much tighter in this commit: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/ > kernel/time/clocksource.c?id=2e27e793e280ff12cb5c202a1214c08b0d3a0f26 > > In comment #38 it's firing on a difference of 7 milliseconds over a > 0.5-second period. Looks like more people will see the issue now. Hm, looking at my numbers, isn't my HPET the one reporting too big of a skew, and hence marking the TSC as wrong, because the HPET is supposedly more accurate? 'hpet' wd_nsec: 488105680 'tsc' cs_nsec: 495853791 Isn't that supposed to be as close to 500 ms as possible? Meaning the TSC isn't skewed by that much, but the HPET is?
(In reply to Tom Englund from comment #40) > hm by looking at my numbers isnt my hpet reporting to big of a skew and > hence its marking the tsc as wrong. because hpet is supposedly more accurate? > > 'hpet' wd_nsec: 488105680 > > 'tsc' cs_nsec: 495853791 > > isnt that supposed to be as close to 500* as possible? meaning tsc isnt > skewed? by that much but hpet is? As I understand, it arms a 500-millisecond timer on CPU #0 using the currently selected clocksource. As your log says, the selected clocksource was TSC. So, according to the TSC on the CPU #0, approximately 500 milliseconds should have passed, but in that time HPET counted only 488 milliseconds, and TSC on CPU #3 counted almost 496 milliseconds. The TSC millisecond count (cs_nsec value) is closer to 500 just because the kernel was using the TSC to time 0.5 seconds in the first place.
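The size of the disagreement can be read straight off the two quoted values. A small sketch of that arithmetic; the function name is made up, and the kernel's own acceptance threshold is deliberately left out because it is not quoted in this thread:

```python
WD_NSEC = 488_105_680  # 'hpet' wd_nsec from the quoted log
CS_NSEC = 495_853_791  # 'tsc' cs_nsec from the quoted log


def skew_ns(cs_nsec, wd_nsec):
    """Absolute disagreement between clocksource and watchdog intervals."""
    return abs(cs_nsec - wd_nsec)


skew = skew_ns(CS_NSEC, WD_NSEC)
print(f"skew: {skew} ns, {skew / 1e6:.2f} ms, {100 * skew / WD_NSEC:.2f}% of the interval")
```

This is the "7 milliseconds over a 0.5-second period" mentioned earlier. Note that the calculation only shows that the two clocks disagree; on its own it cannot tell which of them is the inaccurate one.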
Seen with a Ryzen 5 3400G, MSI Mortar Max B450, AGESA 1.2.0.2:

[ 0.000000] tsc: Detected 3693.390 MHz processor
[ 0.033049] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 3821924579961850 ns
[ 0.077149] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
[ 0.087181] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x6a79e303a22, max_idle_ns: 881590710719 ns
[ 0.213408] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 3822520892550000 ns
[ 0.246818] PTP clock support registered
[ 0.251987] hpet0: 3 comparators, 32-bit 14.318180 MHz counter
[ 0.253430] clocksource: Switched to clocksource tsc-early
[ 0.261649] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[ 0.318820] rtc_cmos 00:02: setting system clock to 2021-09-18T23:20:26 UTC (1632007226)
[ 0.696293] sched_clock: Marking stable (695958175, 322521)->(699829946, -3549250)
[ 1.275199] tsc: Refined TSC clocksource calibration: 3693.061 MHz
[ 1.275207] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x6a7777116fa, max_idle_ns: 881590883556 ns
[ 1.303201] clocksource: Switched to clocksource tsc
[ 1.794259] [drm] DM_PPLIB: values for F clock
[ 1.794262] [drm] DM_PPLIB: values for DCF clock
[ 22.092420] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 22.092440] clocksource: 'hpet' wd_nsec: 496346043 wd_now: 12ca74a7 wd_last: 125e03d3 mask: ffffffff
[ 22.092446] clocksource: 'tsc' cs_nsec: 497230355 cs_now: 1ffde0ac9c cs_last: 1f906ced06 mask: ffffffffffffffff
[ 22.092451] clocksource: 'tsc' is current clocksource.
[ 22.092458] tsc: Marking TSC unstable due to clocksource watchdog
[ 22.092482] sched_clock: Marking unstable (22092156166, 323186)<-(22096028445, -3549250)
[ 22.093065] clocksource: Checking clocksource tsc synchronization from CPU 4 to CPUs 0,7.
[ 22.093139] clocksource: Switched to clocksource hpet
I seem to be affected as well. It started either with Kernel 5.13.14 or 5.14.5 for me. 5.13.9 is fine and doesn't have a single clocksource watchdog error in over two weeks of continuous uptime.

Ryzen 3900X Gigabyte X570 Aorus Pro r1.0 F34 Agesa 1.2.0.3B Kernel 5.14.5, 500Hz, Voluntary Preemption, Dynticks Idle

dmesg | grep clocksource
[ 0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 3821924579961850 ns
[ 0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
[ 0.000003] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x6d588d6a09c, max_idle_ns: 881590727049 ns
[ 0.158996] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 3822520892550000 ns
[ 0.614064] clocksource: Switched to clocksource tsc-early
[ 0.622232] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[ 1.646013] tsc: Refined TSC clocksource calibration: 3792.875 MHz
[ 1.646022] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x6d581b92771, max_idle_ns: 881590605997 ns
[ 1.646229] clocksource: Switched to clocksource tsc
[23985.411015] clocksource: timekeeping watchdog on CPU19: hpet retried 2 times before success
[44127.896993] clocksource: timekeeping watchdog on CPU9: hpet retried 2 times before success
[44160.393834] clocksource: timekeeping watchdog on CPU2: hpet retried 2 times before success
[44408.897615] clocksource: timekeeping watchdog on CPU19: hpet retried 3 times before success
[44885.381791] clocksource: timekeeping watchdog on CPU12: hpet retried 2 times before success
[45650.867725] clocksource: timekeeping watchdog on CPU7: hpet retried 3 times before success
[45797.855669] clocksource: timekeeping watchdog on CPU13: hpet retried 2 times before success
[46858.836896] clocksource: timekeeping watchdog on CPU23: hpet retried 2 times before success
[46876.836454] clocksource: timekeeping watchdog on CPU11: hpet retried 2 times before success
[48052.293244] clocksource: timekeeping watchdog on CPU10: hpet retried 3 times before success
[49087.274432] clocksource: timekeeping watchdog on CPU16: hpet retried 2 times before success
[49690.251364] clocksource: timekeeping watchdog on CPU22: hpet retried 2 times before success
[51695.622996] clocksource: timekeeping watchdog on CPU1: hpet retried 2 times before success
[52318.784130] clocksource: timekeeping watchdog on CPU23: hpet retried 3 times before success
[52445.343626] clocksource: timekeeping watchdog on CPU12: hpet retried 2 times before success
[52622.222711] clocksource: timekeeping watchdog on CPU6: hpet retried 2 times before success
[52658.169431] clocksource: timekeeping watchdog on CPU6: hpet retried 3 times before success
[52732.656197] clocksource: timekeeping watchdog on CPU11: hpet read-back delay of 52101ns, attempt 4, marking unstable
[52732.656970] tsc: Marking TSC unstable due to clocksource watchdog
[52732.658074] clocksource: Checking clocksource tsc synchronization from CPU 6 to CPUs 0,13,17-18,20,23.
[52732.658252] clocksource: Switched to clocksource hpet

dmesg | grep tsc
[ 0.016000] tsc: PIT calibration matches HPET. 1 loops
[ 0.016000] tsc: Detected 3792.936 MHz processor
[ 0.000003] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x6d588d6a09c, max_idle_ns: 881590727049 ns
[ 0.614064] clocksource: Switched to clocksource tsc-early
[ 1.646013] tsc: Refined TSC clocksource calibration: 3792.875 MHz
[ 1.646022] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x6d581b92771, max_idle_ns: 881590605997 ns
[ 1.646229] clocksource: Switched to clocksource tsc
[52732.656970] tsc: Marking TSC unstable due to clocksource watchdog
[52732.657097] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[52732.658074] clocksource: Checking clocksource tsc synchronization from CPU 6 to CPUs 0,13,17-18,20,23.
So in my case, the HPET sometimes takes its sweet time to respond, triggering the conditions introduced with the following patch:

clocksource: Retry clock read if long delays detected
https://github.com/torvalds/linux/commit/db3a34e17433de2390eb80d436970edcebd0ca3e

If I understand it correctly, the reason it even tries comparing with the HPET is the stricter rules for what's an acceptable skew introduced in this patch:

clocksource: Reduce clocksource-skew threshold
https://github.com/torvalds/linux/commit/2e27e793e280ff12cb5c202a1214c08b0d3a0f26

I don't know whether this is relevant, but this old blog post (commenting on https://lkml.org/lkml/2008/9/25/451) mentions why there has to be some slack in the first place (does this also imply the wiggle room should increase in proportion to the number of CPUs present?):
https://www.chromium.org/chromium-os/how-tos-and-troubleshooting/tsc-resynchronization

Is something wrong with the hardware, or could it be that the new rules are just too tight? What are the side effects of switching to HPET as a clocksource in a desktop/workstation/multimedia system? Should I expect performance degradation, glitches or my VMs suddenly breaking?
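My reading of the retry behaviour, written out as a toy model. This is not the kernel implementation: the function name, the callback interface, and treating all readings as nanoseconds are simplifications of mine, and the delay limit and retry count are left as parameters because the actual kernel values are not quoted in this thread.

```python
def read_with_retry(read_watchdog, read_clocksource, max_delay_ns, max_retries):
    """Read watchdog, clocksource, watchdog again; retry if the pair took too long.

    Returns (wd_start, cs_now, retries_used) on success, or None if every
    attempt exceeded max_delay_ns (the case the log reports as
    "marking unstable").
    """
    for attempt in range(1, max_retries + 2):
        wd_start = read_watchdog()
        cs_now = read_clocksource()
        wd_end = read_watchdog()
        # A long gap between the two watchdog reads means the clocksource
        # sample in between cannot be placed precisely in time.
        if wd_end - wd_start <= max_delay_ns:
            return wd_start, cs_now, attempt - 1
    return None
```

With a retry limit of 3, a failure on the fourth attempt would match the "attempt 4, marking unstable" line in my log, and the "retried 2 times before success" lines would correspond to a return value whose last element is 2. If that reading is right, the TSC is being penalised here for a slow HPET read rather than for anything the TSC itself did.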
(In reply to bugzilla.kernel.org from comment #7) > On another note, Lenovo's support department called me back and told me very > clearly that they do not support Linux and that as long as Windows works on > my laptop, then they can't do anything to help me with my issue. I insisted > that they pass the information I provided to their BIOS development team > because it's a BIOS problem (not a Linux one), but I really doubt they > listened to me. So there's that. JFYI: Mark RH Pearson (markpearson@lenovo.com) is Lenovo's lead technical engineer for the Linux PC team (https://forums.lenovo.com/user/viewprofilepage/user-id/1942528)
here, similarly cpu: Ryzen 5 5600G mobo: ASRockRack X470D4U bios: vP4.20, 04/14/2021 kernel: 5.14.13-200.fc34.x86_64 x86_64 dmesg | egrep "clocksource|hpet" [ 0.095228] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns [ 0.292549] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns [ 0.298572] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x706dfa647dc, max_idle_ns: 881591068053 ns [ 0.424848] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns [ 0.541944] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0 [ 0.541944] hpet0: 3 comparators, 32-bit 14.318180 MHz counter [ 0.543599] clocksource: Switched to clocksource tsc-early [ 0.555786] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns [ 0.662499] rtc_cmos 00:02: alarms up to one month, y3k, 114 bytes nvram, hpet irqs [ 1.603619] tsc: Refined TSC clocksource calibration: 3927.246 MHz [ 1.603630] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x7137c868d1b, max_idle_ns: 881590679990 ns [ 1.603697] clocksource: Switched to clocksource tsc [ 2.299865] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large: [ 2.301411] clocksource: 'hpet' wd_nsec: 499726501 wd_now: 1c3308a wd_last: 15602a4 mask: ffffffff [ 2.302991] clocksource: 'tsc' cs_nsec: 496259389 cs_now: 1c92eacd6e cs_last: 1c1ec07430 mask: ffffffffffffffff [ 2.304613] clocksource: 'tsc' is current clocksource. [ 2.305446] tsc: Marking TSC unstable due to clocksource watchdog [ 2.306501] clocksource: Checking clocksource tsc synchronization from CPU 7 to CPUs 0,3-4,9. 
[ 2.307437] clocksource: Switched to clocksource hpet adding to cmd line clocksource=tsc clocksource_failover=tsc tsc=reliable force_tsc_stable=1 reboot dmesg | egrep "clocksource|hpet" [ 0.296329] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns [ 0.302353] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x706e8b70714, max_idle_ns: 881590462336 ns [ 0.418791] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns [ 0.536550] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0 [ 0.536550] hpet0: 3 comparators, 32-bit 14.318180 MHz counter [ 0.538384] clocksource: Switched to clocksource tsc-early [ 0.551081] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns [ 0.653393] rtc_cmos 00:02: alarms up to one month, y3k, 114 bytes nvram, hpet irqs [ 1.607671] tsc: Refined TSC clocksource calibration: 3926.815 MHz [ 1.607681] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x71349ad6649, max_idle_ns: 881590462736 ns [ 1.607759] clocksource: Switched to clocksource tsc [ 4.424455] clocksource_failover=tsc is it recommended to allow the switch to hpet? or force the tsc?
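For anyone comparing before/after results of such command-line changes, it can help to check what the running kernel actually ended up using rather than relying on dmesg alone. A minimal sketch, assuming the standard Linux sysfs files `current_clocksource` and `available_clocksource` are present; the helper name `read_clocksources` is made up for this example:

```python
from pathlib import Path

# Standard sysfs location on Linux; passed as a parameter so the helper
# can also be pointed at a copy of the directory.
SYSFS_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(sysfs_dir=SYSFS_DIR):
    """Return (current, available) clocksource names read from sysfs."""
    sysfs_dir = Path(sysfs_dir)
    current = (sysfs_dir / "current_clocksource").read_text().strip()
    available = (sysfs_dir / "available_clocksource").read_text().split()
    return current, available


if __name__ == "__main__":
    if SYSFS_DIR.is_dir():
        current, available = read_clocksources()
        print(f"current:   {current}")
        print(f"available: {' '.join(available)}")
    else:
        print("no clocksource sysfs directory found (not Linux?)")
```

If `tsc` is missing from the available list, the kernel has marked it unstable for this boot, which matches the "Switched to clocksource hpet" lines in the logs.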
I have recently experienced an onslaught of these messages on an Asus Zenbook UM425IA with Ryzen 5 4500U, in correlation with s0ix resume failures: $ dmesg | egrep "clocksource|hpet" [ 0.066342] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns [ 0.138579] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns [ 0.144599] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x222bfdb946d, max_idle_ns: 440795315613 ns [ 0.263301] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns [ 0.402660] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0 [ 0.402662] hpet0: 3 comparators, 32-bit 14.318180 MHz counter [ 0.404885] clocksource: Switched to clocksource tsc-early [ 1.404546] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns [ 1.456907] rtc_cmos 00:01: alarms up to one month, y3k, 114 bytes nvram, hpet irqs [ 2.475617] tsc: Refined TSC clocksource calibration: 2370.544 MHz [ 2.475627] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x222b856c74e, max_idle_ns: 440795290389 ns [ 2.475705] clocksource: Switched to clocksource tsc [ 26.650613] clocksource: timekeeping watchdog on CPU4: Marking clocksource 'tsc' as unstable because the skew is too large: [ 26.650621] clocksource: 'hpet' wd_nsec: 227874841 wd_now: 1665386d wd_last: 16336f4c mask: ffffffff [ 26.650625] clocksource: 'tsc' cs_nsec: 503973359 cs_now: 13bea89949 cs_last: 1377730fbf mask: ffffffffffffffff [ 26.650628] clocksource: 'tsc' is current clocksource. [ 26.650634] tsc: Marking TSC unstable due to clocksource watchdog [ 26.651389] clocksource: Checking clocksource tsc synchronization from CPU 5 to CPUs 0-1,4. [ 26.651471] clocksource: Switched to clocksource hpet
first generation Lenovo ThinkPad T14 AMD (Ryzen 7 PRO 4750U based) running 5.14.17-301.fc35.x86_64: Nov 16 15:16:16 fedora kernel: clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large: Nov 16 15:16:16 fedora kernel: clocksource: 'hpet' wd_nsec: 500868965 wd_now: 1ba34ef wd_last: 14cc723 mask: ffffffff Nov 16 15:16:16 fedora kernel: clocksource: 'tsc' cs_nsec: 497009832 cs_now: 8bb86f4c0 cs_last: 888decbe6 mask: ffffffffffffffff Nov 16 15:16:16 fedora kernel: clocksource: 'tsc' is current clocksource. Nov 16 15:16:16 fedora kernel: tsc: Marking TSC unstable due to clocksource watchdog Nov 16 15:16:16 fedora kernel: TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'. Nov 16 15:16:16 fedora kernel: sched_clock: Marking unstable (2218464864, 376657)<-(2237226433, -18385478) Nov 16 15:16:16 fedora kernel: clocksource: Checking clocksource tsc synchronization from CPU 6 to CPUs 0-1,4-5,12,15. Nov 16 15:16:16 fedora kernel: clocksource: Switched to clocksource hpet
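To put a number on "the skew is too large", the two detail lines from the watchdog report can be compared directly. This is only a rough sketch: the regular expression and the function name `watchdog_skew` are my own, and it assumes the `wd_nsec` / `cs_nsec` fields cover the same interval, as the log format suggests.

```python
import re

# Matches the two detail lines the watchdog prints, for example:
#   clocksource: 'hpet' wd_nsec: 500868965 ...
#   clocksource: 'tsc' cs_nsec: 497009832 ...
_NSEC_RE = re.compile(r"'(?P<name>[\w-]+)' (?P<kind>wd|cs)_nsec: (?P<nsec>\d+)")


def watchdog_skew(log_text):
    """Return (skew_ns, skew_ppm) of the clocksource relative to the watchdog.

    Negative values mean the clocksource (here the TSC) advanced less than
    the watchdog (here the HPET) over the same interval.
    """
    values = {m.group("kind"): int(m.group("nsec"))
              for m in _NSEC_RE.finditer(log_text)}
    if "wd" not in values or "cs" not in values:
        raise ValueError("need both a wd_nsec and a cs_nsec line")
    skew_ns = values["cs"] - values["wd"]
    skew_ppm = skew_ns / values["wd"] * 1e6
    return skew_ns, skew_ppm


# Lines copied from the ThinkPad T14 report above.
sample = """\
clocksource: 'hpet' wd_nsec: 500868965 wd_now: 1ba34ef wd_last: 14cc723 mask: ffffffff
clocksource: 'tsc' cs_nsec: 497009832 cs_now: 8bb86f4c0 cs_last: 888decbe6 mask: ffffffffffffffff
"""

skew_ns, skew_ppm = watchdog_skew(sample)
print(f"skew: {skew_ns} ns ({skew_ppm:.0f} ppm)")
```

For this report the two clocks disagree by a few milliseconds over a half-second interval, i.e. several thousand ppm. The number alone does not say which of the two clocks is the wrong one.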
Lenovo Ideapad L340-15API ryzen 3 3200u same bug. I found this thread googling about dmesg warn. Kernel 5.15.5 Manjaro. dmesg | egrep -i 'tsc|hpet|clocksource' [ 0.000000] tsc: Fast TSC calibration using PIT [ 0.000000] tsc: Detected 2595.105 MHz processor [ 0.005826] ACPI: HPET 0x00000000B9576000 000038 (v01 LENOVO CB-01 00000001 PTEC 00000002) [ 0.005881] ACPI: Reserving HPET table memory at [mem 0xb9576000-0xb9576037] [ 0.023566] ACPI: HPET id: 0x43538210 base: 0xfed00000 [ 0.023651] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370452778343963 ns [ 0.085358] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns [ 0.102060] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x25682bdad31, max_idle_ns: 440795225592 ns [ 0.218732] TSC synchronization [CPU#0 -> CPU#1]: [ 0.218732] Measured 3733953964 cycles TSC warp between CPUs, turning off TSC clock. [ 0.218732] tsc: Marking TSC unstable due to check_tsc_sync_source failed [ 0.219775] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370867519511994 ns [ 0.254175] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0 [ 0.254175] hpet0: 3 comparators, 32-bit 14.318180 MHz counter [ 0.257491] clocksource: Switched to clocksource hpet [ 0.270560] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns [ 0.411400] rtc_cmos 00:01: alarms up to one month, y3k, 114 bytes nvram, hpet irqs [ 17.689050] vboxdrv: TSC mode is Invariant, tentative frequency 2595121321 Hz
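The warp that the boot-time sync check reports is given in cycles, which is hard to judge by eye. A small sketch that converts it to time using the detected TSC frequency from the same log; the function name `warp_seconds` is hypothetical, and it assumes the "Detected ... MHz" value is close to the true TSC rate:

```python
def warp_seconds(warp_cycles, tsc_mhz):
    """Convert a TSC warp in cycles to seconds, given the TSC rate in MHz."""
    if tsc_mhz <= 0:
        raise ValueError("tsc_mhz must be positive")
    return warp_cycles / (tsc_mhz * 1e6)


# Values copied from the dmesg output above (Ryzen 3 3200U report).
warp = warp_seconds(3733953964, 2595.105)
print(f"TSC warp is roughly {warp * 1000:.0f} ms")
```

So the offset between CPU#0 and CPU#1 here is well over a second, not a few stray cycles. The same arithmetic on the figures from the original A485 report (3542962214 cycles at 2195.873 MHz) gives about 1.6 s, in line with the roughly 1600 ms offset measured there.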
Someone above said that some BIOS update fixed it for ThinkPad L14 Gen 1, but I'm currently having this on DMI: LENOVO 20U5003NRT/20U5003NRT, BIOS R19ET36W (1.20 ) 07/12/2021.
With acpi=off clocksourcec=tsc on the kernel command line, the kernel selects tsc-early and does not mark it unstable, but the 4 cores of the Ryzen 3 3200U disappear and only one is shown. Tested with kernels 5.15.10 and 5.16-rc5.
I got it again on my Thinkpad L14 Gen2 with 5.16-rc6: [ 2.154325] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large: [ 2.154329] clocksource: 'hpet' wd_nsec: 495907650 wd_now: 1ba864b wd_last: 14e2dfc mask: ffffffff [ 2.154333] clocksource: 'tsc' cs_nsec: 503650490 cs_now: 84c6e1957 cs_last: 8197df12b mask: ffffffffffffffff [ 2.154336] clocksource: 'tsc' is current clocksource. [ 2.154342] tsc: Marking TSC unstable due to clocksource watchdog [ 2.154356] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'. [ 2.154358] sched_clock: Marking unstable (2154008130, 346704)<-(2172981241, -18626378) [ 2.154907] clocksource: Checking clocksource tsc synchronization from CPU 13 to CPUs 0,2,4,7,9. This is with: [ 0.000000] DMI: LENOVO 20U50001GE/20U50001GE, BIOS R19ET36W (1.20 ) 07/12/2021 This might be a regression from earlier BIOS versions. I was the one that reported success with BIOS version 1.17. @Nix: acpi=off is not a viable option these days. You need it for various reasons.
Well, I don't know if this information can help the developers or AMD fix it. Yesterday I compiled a vanilla 5.16 kernel and forgot to enable SMP support, so the kernel started with a single core. The TSC clock was stable. Later I compiled it again with SMP support enabled, and the TSC is unstable. With the default 5.16 kernel from Manjaro the TSC is also unstable. Link to a picture of the custom 5.16 without SMP: https://imgur.com/a0ciCrp I am not a developer and I really can't understand why disabling SMP makes the TSC clock stable. And this is now with the stock Manjaro 5.16: [ 0.000000] tsc: Fast TSC calibration using PIT [ 0.000000] tsc: Detected 2595.144 MHz processor [ 0.103950] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x256850c0173, max_idle_ns: 440795271144 ns [ 0.220621] TSC synchronization [CPU#0 -> CPU#1]: [ 0.220621] Measured 4143265984 cycles TSC warp between CPUs, turning off TSC clock. [ 0.220621] tsc: Marking TSC unstable due to check_tsc_sync_source failed [ 18.124359] vboxdrv: TSC mode is Invariant, tentative frequency 2595145458 Hz [ 30.385511] SVM: TSC scaling supported
I've just changed to a 5700G and get this too: [ 2.081370] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large: [ 2.081371] clocksource: 'hpet' wd_nsec: 499140044 wd_now: 1b971c7 wd_last: 14c64ae mask: ffffffff [ 2.081373] clocksource: 'tsc' cs_nsec: 495716395 cs_now: c41661756 cs_last: bd08e8802 mask: ffffffffffffffff [ 2.081374] clocksource: 'tsc' is current clocksource. [ 2.081380] tsc: Marking TSC unstable due to clocksource watchdog [ 2.081386] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'. [ 2.081387] sched_clock: Marking unstable (2081136487, 249626)<-(2097250056, -15864482) [ 2.081545] clocksource: Checking clocksource tsc synchronization from CPU 5 to CPUs 0-1,6,9,14. [ 2.081573] clocksource: Switched to clocksource hpet However the fix posted by Thomas Gleixner for bug 208887 (delay the calibration) works here too. Not sure what I'd report to AMD/MSI on this one.
I take that back, just tried 5.16.9 with the patch - tsc marked unstable. Seems like a warm reboot has an influence.
No idea if related, but on some Ryzen systems fast TSC calibration fails, and *[PATCH v2] x86/tsc: Allow quick PIT calibration despite interruptions* [1] fixes it. [1]: https://lore.kernel.org/all/20190214214608.8672-1-jan@schnhrr.de/
I again tried my luck with Jan H. Schönherr's patch with v5.15.44 (applied without problems) on a "Thinkpad T14 AMD Gen1 20UD" with latest EFI Firmware v1.40: [ 0.000000] DMI: LENOVO 20UD0013GE/20UD0013GE, BIOS R1BET71W(1.40 ) 04/05/2022 After a cold boot TSC is usually unusable: May 31 10:53:55 tsc: Fast TSC calibration using PIT May 31 10:53:55 tsc: Detected 1696.898 MHz processor May 31 10:53:55 clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1875b4ef64c, max_idle_ns: 440795203028 ns May 31 10:53:55 clocksource: Switched to clocksource tsc-early May 31 10:53:55 tsc: Refined TSC clocksource calibration: 1709.706 MHz May 31 10:53:55 clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x18a4f82559b, max_idle_ns: 440795270331 ns May 31 10:53:55 clocksource: Switched to clocksource tsc May 31 10:53:55 clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large: May 31 10:53:55 clocksource: 'tsc' cs_nsec: 496003006 cs_now: 7e747fcff cs_last: 7b4bc3db0 mask: ffffffffffffffff May 31 10:53:55 clocksource: 'tsc' is current clocksource. May 31 10:53:55 tsc: Marking TSC unstable due to clocksource watchdog May 31 10:53:55 clocksource: Checking clocksource tsc synchronization from CPU 0 to CPUs 1-3,5,7,9,15. After a warm start (reboot) TSC is available: May 31 10:54:24 tsc: Fast TSC calibration using PIT May 31 10:54:24 tsc: Detected 1696.656 MHz processor May 31 10:54:24 clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1874d03af40, max_idle_ns: 440795221980 ns May 31 10:54:25 clocksource: Switched to clocksource tsc-early May 31 10:54:25 tsc: Refined TSC clocksource calibration: 1696.819 MHz May 31 10:54:25 clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x18756a4cab8, max_idle_ns: 440795267109 ns May 31 10:54:25 clocksource: Switched to clocksource tsc Neither firmware v1.40 nor the patch changed the status quo (warmstart required). 
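One thing that stands out in the cold boot and warm reboot logs above is how far the "Detected" (fast PIT) frequency and the "Refined" frequency are apart in each case. A quick sketch of that arithmetic; the function name `calibration_mismatch_ppm` is made up, and treating the fast calibration as the baseline is just a choice for illustration:

```python
def calibration_mismatch_ppm(fast_mhz, refined_mhz):
    """Relative difference between the refined and the fast TSC calibration, in ppm."""
    if fast_mhz <= 0:
        raise ValueError("fast_mhz must be positive")
    return (refined_mhz - fast_mhz) / fast_mhz * 1e6


# Frequencies copied from the two boots above.
cold = calibration_mismatch_ppm(1696.898, 1709.706)  # cold boot, TSC later marked unstable
warm = calibration_mismatch_ppm(1696.656, 1696.819)  # warm reboot, TSC stays usable

print(f"cold boot:   {cold:.0f} ppm")
print(f"warm reboot: {warm:.0f} ppm")
```

On the cold boot the two calibrations disagree by thousands of ppm, while after the warm reboot they agree to within about a hundred ppm. That is the same order of magnitude as the watchdog skews reported elsewhere in this thread, though this by itself does not show which reference is off.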
There seems to be a noticeable performance difference between TSC enabled and TSC unavailable.
Mario, could AMD please look into this, and improve Jan’s patches? Mark, as Lenovo ThinkPads are also affected, could your engineers please reproduce the issue, and help to analyze and hopefully fix it?
Thanks Paul for letting me know about this ticket. Yes - this is already on our radar and I've been working with Mario on it and we have plans for fixing it but no ETA available yet. Mark
(In reply to Paul Menzel from comment #55) > No idea if related, on some Ryzen system fast TSC calibration fails, and > *[PATCH v2] x86/tsc: Allow quick PIT calibration despite interruptions* [1] > fixes it. > > [1]: https://lore.kernel.org/all/20190214214608.8672-1-jan@schnhrr.de/ Didn't observe any improvement on the 5700G. I'm now trying tsc=reliable and seeing what the long-term impact is.
The bug has *AMD Ryzen 7 PRO 2700U* in its title, and Steven reported: > Yes, they fixed it for the A485 in the 1.28 BIOS update, with this comment in > > their changelog: > > "(Fix) Fixed TSC synchronization [CPU#0 -> CPU#1] under linux." Should the bug title be renamed, or marked as solved, and a new issue be opened for the other reports?
(In reply to Paul Menzel from comment #60) > The bug has *AMD Ryzen 7 PRO 2700U* in it’s title, and Steven reported: > > > Yes, they fixed it for the A485 in the 1.28 BIOS update, with this comment > in > > > their changelog: > > > > "(Fix) Fixed TSC synchronization [CPU#0 -> CPU#1] under linux." > > Should the bug title be renamed, or marked as solved, and a new issue be > opened for the other reports? I'll just rename the issue. The problem is a broader AMD problem than just affecting the ThinkPad A485. And many people in this thread are talking about a wide variety of devices. I think moving this to a new issue would be a mistake, as people on the CC list who are monitoring this issue would lose visibility, and I suspect the broader issue would fall through the cracks as a result.
Thank you Steven. In my communication with AMD, they eventually said they found a root cause in SBIOS and issued an AGESA update (disproving comment #20), and were investigating a similar problem that on linux-5.15 affected "platforms supporting Modern Standby" (where the fix would also be via a firmware update).
(In reply to James Ettle from comment #59) > Didn't observe any improvement on the 5700G. I'm now trying tsc=reliable and > seeing what the long-term impact is. Ah, 7000ppm slow following a reboot... bad idea. (In reply to Alexander Monakov from comment #62) > In my communication with AMD, they eventually said they found a root cause > in SBIOS and issued an AGESA update (disproving comment #20), and were > investigating a similar problem that on linux-5.15 affected "platforms > supporting Modern Standby" (where the fix would also be via a firmware > update). Is the fixed AGESA version number known?
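For a sense of scale of the 7000 ppm figure mentioned above, here is the conversion to wall-clock drift. Plain arithmetic only; the function name is hypothetical and it assumes the rate error stays constant over the day:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def drift_seconds_per_day(ppm):
    """Clock drift in seconds per day for a constant rate error given in ppm."""
    return ppm * 1e-6 * SECONDS_PER_DAY


drift = drift_seconds_per_day(7000)
print(f"{drift:.0f} s/day, about {drift / 60:.0f} minutes")
```

That is on the order of ten minutes per day, which gives an idea of why forcing tsc=reliable on such a system is a bad idea.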
(In reply to James Ettle from comment #63) > > Is the fixed AGESA version number known? 1.0.0.6 for the initial issue I asked about, no idea about the follow-up issue related to linux-5.15 they were investigating (as of September 2021).
> The problem is a broader AMD problem than just affecting the ThinkPad A485. > And many people in this thread are talking about a wide variety of devices. I > think moving this to a new issue would be a mistake, as people on the CC list > who are monitoring this issue would lose visibility, and I suspect the > broader issue would fall through the cracks as a result. I think it's a mistake to mix this up across different generations. Even if it's an issue which can be reproduced on AMD's reference designs, it should be tracked for one family of APUs/CPUs at a time. So please let's keep this issue on Ryzen 2000. If someone has problems with Ryzen 3000/4000/5000/6000, let's have separate bugs.
As comments #60 and #61 indicate, this is fixed by the OEM BIOS for Ryzen 2000, so I will close *this* issue to indicate that this was not a bug to be addressed by the kernel, but rather a BIOS bug.
(In reply to Mario Limonciello (AMD) from comment #65) > So please let's keep this issue on Ryzen 2000. If someone has problems > with Ryzen 3000/4000/5000/6000 lets have separate bugs. If we must, but I am concerned those issues would not get the same attention this issue now has. It took a very long time for enough users to chime in on this bug for the right people to get CC'd and for AMD to notice this issue. Every AMD laptop I've had since filing this in 2019 has had TSC sync issues (most often after warm reboots). So that includes the 2700U, 4900HS, and 5900HS CPUs.
Although they may "look/fail" the same to the kernel, they need to be root caused and addressed individually as they may be OEM specific issues or generation specific issues. Lumping them all together is a good way for nothing to get solved. Bugs are cheap and can easily be duped if we're wrong and it's the same root cause across OEMs or generations. I've personally confirmed on a variety of Ryzen 6000 systems both cold and warm boot are working properly.
(In reply to Mario Limonciello (AMD) from comment #65) > So please let's keep this > issue on Ryzen 2000. If someone has problems with Ryzen 3000/4000/5000/6000 > lets have separate bugs. Created bug 216146 for 5700G.
Ryzen 6000 series is affected as well, see a review of the Lenovo ThinkPad T14 Gen 3 with BIOS 1.17 [1]. 1: https://www.reddit.com/r/thinkpad/comments/voxsr0/
> Ryzen 6000 series is affected as well, see a review of the Lenovo ThinkPad > T14 Gen 3 with BIOS 1.17 [1]. This is not the same problem; it's a warp between CPU cores. It should have its own bug report.