Bug 216146

Summary: TSC marked unstable on AMD Ryzen 5700G
Product: Platform Specific/Hardware
Reporter: James Ettle (james)
Component: x86-64
Assignee: Mario Limonciello (AMD) (mario.limonciello)
Status: RESOLVED DOCUMENTED
Severity: normal
CC: diego.ce, gaaf, kernelbugzilla, marcos, mario.limonciello, me, ppatry, rafael.ristovski, robie, samy, usama.anjum, vovochka13
Priority: P1
Hardware: AMD
OS: Linux
Kernel Version: 5.18.5
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
  dmesg output
  dmesg output with AGESA 1.2.0.7
  dmesg output with AGESA 1.2.0.7, tsc=unstable

Description James Ettle 2022-06-18 14:44:25 UTC
Created attachment 301209 [details]
dmesg output

(Following on from 202525 and the suggestion to make new BZ entries for different generations of Ryzen.)

Processor: AMD Ryzen 5700G
Motherboard: MSI Mortar Max B450
BIOS: 2.D0 05/17/2021 (not the latest available), AGESA 1.2.0.2 

Observation: TSC is marked as unstable, machine switches to HPET. Forcing TSC occasionally works fine, but on some boots results in a 7000ppm miscalibration.

From 5.18.5:

[    2.088935] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
[    2.088940] clocksource:                       'hpet' wd_nsec: 499081656 wd_now: 1b99cb7 wd_last: 14c92e2 mask: ffffffff
[    2.088943] clocksource:                       'tsc' cs_nsec: 495725287 cs_now: 1e875ad044 cs_last: 1e1686a41c mask: ffffffffffffffff
[    2.088945] clocksource:                       'tsc' is current clocksource.
[    2.088950] tsc: Marking TSC unstable due to clocksource watchdog
[    2.088962] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[    2.088963] sched_clock: Marking unstable (2088727973, 233765)<-(2105642684, -16681222)
[    2.089791] clocksource: Checking clocksource tsc synchronization from CPU 4 to CPUs 0,3,10,12.
[    2.089870] clocksource: Switched to clocksource hpet

Full dmesg attached. (It's from a distro kernel, but experiments with vanilla sources don't alter the results.)
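
To relate the watchdog lines above to the roughly 7000 ppm figure, here is a minimal Python sketch that compares the two reported intervals (`wd_nsec` from the HPET watchdog, `cs_nsec` from the TSC). The helper names `parse_nsec` and `skew_ppm` are made up for illustration, and this is not the kernel's own threshold logic, only the arithmetic on the logged numbers.

```python
import re


def parse_nsec(line: str, key: str) -> int:
    """Extract an integer field such as 'wd_nsec: 499081656' from a dmesg line."""
    match = re.search(rf"\b{re.escape(key)}:\s*(\d+)", line)
    if match is None:
        raise ValueError(f"{key} not found in line")
    return int(match.group(1))


def skew_ppm(wd_nsec: int, cs_nsec: int) -> float:
    """Difference of the clocksource interval from the watchdog interval, in ppm."""
    return (cs_nsec - wd_nsec) / wd_nsec * 1e6


# The two watchdog lines from the 5.18.5 boot quoted above (timestamps trimmed).
hpet_line = "clocksource: 'hpet' wd_nsec: 499081656 wd_now: 1b99cb7 wd_last: 14c92e2 mask: ffffffff"
tsc_line = "clocksource: 'tsc' cs_nsec: 495725287 cs_now: 1e875ad044 cs_last: 1e1686a41c mask: ffffffffffffffff"

skew = skew_ppm(parse_nsec(hpet_line, "wd_nsec"), parse_nsec(tsc_line, "cs_nsec"))
print(f"TSC interval differs from HPET interval by {skew:.0f} ppm")
```

The result is negative and a little under 7000 ppm in magnitude, meaning the TSC-derived interval was shorter than the HPET-derived one over the same window.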
Comment 1 Artem S. Tashkinov 2022-06-18 17:29:19 UTC
Could you try updating BIOS to the latest version?

Could you try adding tsc=unstable to kernel boot arguments?

There's bug 203183 where people experience the same issue, you may wanna check it out.
Comment 2 James Ettle 2022-06-19 12:22:33 UTC
(In reply to Artem S. Tashkinov from comment #1)
> Could you try updating BIOS to the latest version?
> 
> Could you try adding tsc=unstable to kernel boot arguments?
> 
> There's bug 203183 where people experience the same issue, you may wanna
> check it out.

Same result with latest BIOS 2.G0 05/13/2022, AGESA 1.2.0.7:

[    2.083493] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
[    2.083499] clocksource:                       'hpet' wd_nsec: 499419688 wd_now: 1b9b8a8 wd_last: 14c9beb mask: ffffffff
[    2.083501] clocksource:                       'tsc' cs_nsec: 496108391 cs_now: 13c9b4af92 cs_last: 1358ccf0e8 mask: ffffffffffffffff
[    2.083503] clocksource:                       'tsc' is current clocksource.
[    2.083512] tsc: Marking TSC unstable due to clocksource watchdog
[    2.083526] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[    2.083527] sched_clock: Marking unstable (2083291279, 234506)<-(2100510221, -16984590)
[    2.083761] clocksource: Checking clocksource tsc synchronization from CPU 2 to CPUs 0,6-7,15.
[    2.083856] clocksource: Switched to clocksource hpet

With tsc=unstable, it just switches to hpet earlier.

Full dmesg logs in both cases will be attached.
Comment 3 James Ettle 2022-06-19 12:23:04 UTC
Created attachment 301222 [details]
dmesg output with AGESA 1.2.0.7
Comment 4 James Ettle 2022-06-19 12:23:24 UTC
Created attachment 301223 [details]
dmesg output with AGESA 1.2.0.7, tsc=unstable
Comment 5 Mario Limonciello (AMD) 2022-06-20 14:16:02 UTC
Can you please confirm if this is tied to warm boot only on the 5700G?  Or can cold boot also reproduce this?
Comment 6 James Ettle 2022-06-20 20:00:28 UTC
(In reply to Mario Limonciello (AMD) from comment #5)
> Can you please confirm if this is tied to warm boot only on the 5700G?  Or
> can cold boot also reproduce this?

So far, I've only seen it with warm boots and not cold (although I'll keep trying cold boots over the next few days).
Comment 7 Mario Limonciello (AMD) 2022-06-20 20:44:48 UTC
OK thanks - that matches my expectation.  There is a similar issue reported in the mobile 5xxx series.  I'll discuss with others to see if it could be the same root cause.
Comment 8 Mario Limonciello (AMD) 2022-06-21 03:58:11 UTC
From my conversations, the fix for warm boot on mobile will also apply to desktop  as well.  As you saw it's not present in AGESA 1.2.0.7, but as long as no regressions are found it should be in a future release.

We can keep this open to keep track of when that happens.
Comment 9 James Ettle 2022-06-21 13:54:43 UTC
(In reply to Mario Limonciello (AMD) from comment #8)
> From my conversations, the fix for warm boot on mobile will also apply to
> desktop  as well.  As you saw it's not present in AGESA 1.2.0.7, but as long
> as no regressions are found it should be in a future release.

OK, and I'll keep an eye out for beta BIOSes for my motherboard with later AGESA versions (hopefully it'll receive updates for a few more years).
Comment 11 Marcos Scriven 2022-06-26 17:00:11 UTC
I just found this bug (via https://bugzilla.kernel.org/show_bug.cgi?id=202525), having experienced the same issue on my 5700G.

This is on an Aorus b550i Pro AX board. 

It looks to me like their F16d BIOS only has AGESA 1.2.0.7: https://www.gigabyte.com/Motherboard/B550I-AORUS-PRO-AX-rev-10/support#support-dl-bios

I have a few questions if I may:

1) What's the actual impact of this bug?
2) Do you believe it's possible for it to be in 1.2.0.8?
3) Does AMD have any sway in advocating for BIOS manufacturers to include this fix? I see in this instance Aorus still haven't finalised their F16 release anyway (as it still has an alphabetical suffix)

Thanks
Comment 12 Mario Limonciello (AMD) 2022-06-27 19:09:54 UTC
> 1) What's the actual impact of this bug?

The TSC fetched by the OS can't be effectively used for comparing time that passed in two commands.  The kernel switches to using the HPET internally.

> 2) Do you believe it's possible for it to be in 1.2.0.8?

It's possible, but I can't make any commitment here.

> 3) Does AMD have any sway in advocating for BIOS manufacturers to include
> this fix? I see in this instance Aorus still haven't finalised their F16
> release anyway (as it still has an alphabetical suffix)

There is nothing special that needs to be done for this bug report.  When they integrate the appropriate AEGEA release that contains this fix they will pick it up.
Comment 13 Mario Limonciello (AMD) 2022-06-27 19:10:24 UTC
(In reply to Mario Limonciello (AMD) from comment #12)
> > 1) What's the actual impact of this bug?
> 
> The TSC fetched by the OS can't be effectively used for comparing time that
> passed in two commands.  The kernel switches to using the HPET internally.
> 
> > 2) Do you believe it's possible for it to be in 1.2.0.8?
> 
> It's possible, but I can't make any commitment here.
> 
> > 3) Does AMD have any sway in advocating for BIOS manufacturers to include
> > this fix? I see in this instance Aorus still haven't finalised their F16
> > release anyway (as it still has an alphabetical suffix)
> 
> There is nothing special that needs to be done for this bug report.  When
> they integrate the appropriate AEGEA release that contains this fix they
> will pick it up.

s/AEGEA/AGESA/
Comment 14 Andrea Brandi 2022-08-09 20:44:23 UTC
I'm experiencing the same issue with my Ryzen 5 5600G on a Gigabyte B550 AORUS PRO V2 (https://www.gigabyte.com/Motherboard/B550-AORUS-PRO-V2-rev-10) running the latest BIOS with AGESA 1.2.0.7. Like the others, it involves "CPU3" and "tsc" being marked as unstable. As a workaround I added `tsc=nowatchdog` to my boot parameters, it looks like it's working but I'm not quite sure it's 100% safe.

$ uname -rv
5.18.16-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 03 Aug 2022 11:25:04 +0000
Comment 15 James Ettle 2022-08-09 20:48:41 UTC
(In reply to Andrea Brandi from comment #14)
> boot parameters, it looks like it's working but I'm not quite sure it's 100%
> safe.

Check across reboots both warm and cold with chronyc tracking - see if it picks up a large skew rate (like 1000s of ppm off).
Comment 16 Andrea Brandi 2022-08-09 23:18:36 UTC
(In reply to James Ettle from comment #15)
> Check across reboots both warm and cold with chronyc tracking - see if it
> picks up a large skew rate (like 1000s of ppm off).

Thank you James. I tried a couple of reboots, cold and warm, and there was not much difference between them, just a slightly higher "Frequency" on cold boots. I also tried monitoring with `hpet` and observed very similar values. The difference is that with `hpet` Skew stays low all the time, whereas when forcing `tsc` Skew starts much higher after boot (15.000+ ppm) but drops rather quickly, within minutes. These are my values after 30 minutes of work following boot.

Reference ID    : B913B823 (radha.parvati.it)
Stratum         : 3
Ref time (UTC)  : Tue Aug 09 23:18:02 2022
System time     : 0.000024053 seconds slow of NTP time
Last offset     : -0.000068177 seconds
RMS offset      : 0.144366026 seconds
Frequency       : 32.025 ppm slow
Residual freq   : -0.003 ppm
Skew            : 0.422 ppm
Root delay      : 0.019605236 seconds
Root dispersion : 0.002717295 seconds
Update interval : 128.7 seconds
Leap status     : Normal

I don't like that "slow" after Frequency, but it also happens with hpet. To be honest, I don't know how to interpret those values correctly. Do they look good to you? What I can say is that with `hpet` I get bad audio latency in my VMs (virtio), while forcing `tsc` makes the situation much better.
Comment 17 James Ettle 2022-08-10 21:12:32 UTC
For example, this is what I see with tsc=nowatchdog on a 5700G - nearly 7000 ppm off:

$ chronyc tracking
Reference ID    : D993DF4E (bart.nexellent.net)
Stratum         : 3
Ref time (UTC)  : Wed Aug 10 21:07:02 2022
System time     : 0.000964666 seconds fast of NTP time
Last offset     : +0.000494900 seconds
RMS offset      : 0.407102495 seconds
Frequency       : 6963.022 ppm slow
Residual freq   : -0.048 ppm
Skew            : 11.078 ppm
Root delay      : 0.024999851 seconds
Root dispersion : 0.033659816 seconds
Update interval : 64.4 seconds
Leap status     : Normal

I'll give tsc_early_khz a go to see if that gives a workaround.
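
To put the "Frequency" line in perspective, the following Python sketch converts it into clock drift per day. It assumes the chrony documentation's meaning of Frequency (the rate at which the system clock would be wrong if chronyd were not correcting it); the function names and the sign convention (negative for "slow") are choices made for this example only.

```python
import re

SECONDS_PER_DAY = 86_400


def parse_frequency_ppm(tracking_output: str) -> float:
    """Return the 'Frequency' value from `chronyc tracking` output, in ppm.

    Sign convention of this sketch: positive means 'fast', negative means 'slow'.
    """
    match = re.search(
        r"^Frequency\s*:\s*([0-9.]+) ppm (slow|fast)", tracking_output, re.MULTILINE
    )
    if match is None:
        raise ValueError("no Frequency line found")
    value = float(match.group(1))
    return -value if match.group(2) == "slow" else value


def drift_seconds_per_day(freq_ppm: float) -> float:
    """Convert a frequency error in ppm to seconds of drift per day."""
    return freq_ppm * 1e-6 * SECONDS_PER_DAY


# Excerpt of the output above.
sample = """Frequency       : 6963.022 ppm slow
Residual freq   : -0.048 ppm
Skew            : 11.078 ppm"""

ppm = parse_frequency_ppm(sample)
print(f"{ppm} ppm is about {drift_seconds_per_day(ppm):.1f} s/day if left uncorrected")
```

At nearly 7000 ppm the uncorrected drift is on the order of 600 seconds (about ten minutes) per day, whereas the 32.025 ppm reported in comment 16 corresponds to under 3 seconds per day.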
Comment 18 Pascal Patry 2023-01-26 14:48:23 UTC
I'm seeing this issue as well on a Beelink SER5 5560U.

SMC feature version: 0, program: 0, firmware version: 0x00403800 (64.56.0)
Comment 19 Mario Limonciello (AMD) 2023-01-26 15:30:34 UTC
FYI - the fix for this is contained in SMC 64.64.0.
Comment 20 Alex Hermann 2023-01-31 17:14:16 UTC
What does it mean that this bug is resolved by "DOCUMENTATION"? Does that mean the bug is documented and won't be fixed? Where is it documented?

So what's going to happen now? Will the bug be ignored from now as it is "RESOLVED"? Or will there finally be a fix?

In comment #8, it was said a fix would be released in an AGESA release. Won't that happen anymore?

Please reopen this bug report until there is a documented and released fix.

Thanks.
Comment 21 Mario Limonciello (AMD) 2023-01-31 17:17:09 UTC
There is no change to the kernel for this issue.  It's a platform firmware issue.  
When OEMs update to a new AGESA release they would pick this up.

The reason for indicating the SMC version is that it is the platform firmware component that fixes it, and you can tell whether you have the fix by looking at that version.
Comment 22 Alex Hermann 2023-01-31 17:29:00 UTC
What is SMC? My google-fu doesn't give any result.

In what AGESA version is the fix?

Thanks.
Comment 23 Mario Limonciello (AMD) 2023-01-31 17:33:46 UTC
> What is SMC? My google-fu doesn't give any result.

It's one of the platform firmware components.  You can determine what version you have by this command:

# cat /sys/kernel/debug/dri/0/amdgpu_firmware_info  | grep SMC

> In what AGESA version is the fix?

I intentionally didn't indicate the AGESA version because it varies based on the particular ASIC and if an OEM found a regression in this component they can potentially not update it.
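
As a rough aid for reading that output, the Python sketch below decodes the hexadecimal firmware version and compares it with the 64.64.0 release named in comment 19. The byte layout (one byte each for major, minor and patch) is an assumption inferred from the sample in comment 18, where 0x00403800 is shown as 64.56.0; the function names are hypothetical. The comparison is only meaningful for parts on the same SMC version line as the one discussed in this bug.

```python
import re

# Version named in comment 19 as containing the fix.
FIXED_SMC = (64, 64, 0)


def decode_smc(raw: int) -> tuple[int, int, int]:
    """Split a raw SMC firmware version into (major, minor, patch).

    Assumed layout: bits 23-16 major, bits 15-8 minor, bits 7-0 patch.
    """
    return ((raw >> 16) & 0xFF, (raw >> 8) & 0xFF, raw & 0xFF)


def parse_smc_line(line: str) -> tuple[int, int, int]:
    """Decode the version from an 'SMC feature version' line of amdgpu_firmware_info."""
    match = re.search(r"firmware version:\s*0x([0-9a-fA-F]+)", line)
    if match is None:
        raise ValueError("no firmware version found")
    return decode_smc(int(match.group(1), 16))


# Sample line from comment 18.
line = "SMC feature version: 0, program: 0, firmware version: 0x00403800 (64.56.0)"
version = parse_smc_line(line)
print(version, "has fix" if version >= FIXED_SMC else "older than 64.64.0")
```

Tuple comparison is used so that, for example, 64.65.0 sorts after 64.64.0 without any string handling.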
Comment 24 Alex Hermann 2023-03-03 15:02:21 UTC
I just found a BIOS release for B550 with ComboAm4v2PI 1.2.0.8. Is this sufficient information for you to clarify if the fix is included?
Comment 25 Mario Limonciello (AMD) 2023-03-03 15:03:19 UTC
> I just found a BIOS release for B550 with ComboAm4v2PI 1.2.0.8. Is this
> sufficient information for you to clarify if the fix is included?

If you flash it and get me the output of 

# cat /sys/kernel/debug/dri/0/amdgpu_firmware_info  | grep SMC

I can confirm if it's there or not.
Comment 26 Alex Hermann 2023-03-03 15:22:20 UTC
I want to prevent flashing it if it doesn't contain the fix. Considering the reported regressions wrt RAM/XMP in AGESA's above 1.2.0.3c, I'd rather not go through the hassle of an update and potential revert if it doesn't knowingly contain a needed fix.

Isn't it a bit disturbing AMD doesn't know if/when/where its fixes are distributed?
Comment 27 Mario Limonciello (AMD) 2023-03-03 15:26:24 UTC
> Isn't it a bit disturbing AMD doesn't know if/when/where its fixes are
> distributed?

That's a strong statement.  Not everyone in AMD talks to everyone at all the OEMs and knows all the information about their business.
It's entirely up to OEMs to decide if they want to take pieces or not.  Someone in AMD who talks to your motherboard vendor may know.

>I want to prevent flashing it if it doesn't contain the fix. Considering the
>reported regressions wrt RAM/XMP in AGESA's above 1.2.0.3c, I'd rather not go
>through the hassle of an update and potential revert if it doesn't knowingly
>contain a needed fix.

On AMD's reference firmware that fix was included in that version.  So the fix *should* be there if the OEM didn't remove it for some reason.
Comment 28 Rafael Ristovski 2023-03-03 16:51:48 UTC
FWIW, there is an Asus Beta BIOS update for my B450M-A II that has the updated SMC firmware. One can check the versions of components with `psptool` (versions in hex):

previous bios image with SMC 64.61.0:
> $ psptool bios.old 2>/dev/null | grep 0x1574d00 | cut -d '|' -f 9
> 0.40.3D.0

new bios image with SMC 64.65.0:
> $ psptool bios.new 2>/dev/null | grep 0x1574d00 | cut -d '|' -f 9
> 0.40.41.0

the release notes mention "1. Update AGESA version to ComboV2PI 1208", so it is fair to assume that at least Asus is shipping the fixed SMC version in their 1.2.0.8 AGESA.
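
Since the versions in that column are in hex, a small Python helper can turn them into the decimal form used elsewhere in this report. This is only a sketch; `hex_dotted_to_decimal` is a made-up name and it assumes each dot-separated field is an independent hexadecimal number.

```python
def hex_dotted_to_decimal(version: str) -> str:
    """Convert a dotted hex version such as '0.40.3D.0' to dotted decimal."""
    return ".".join(str(int(part, 16)) for part in version.strip().split("."))


for image, hex_version in (("bios.old", "0.40.3D.0"), ("bios.new", "0.40.41.0")):
    print(image, hex_dotted_to_decimal(hex_version))
```

Dropping the leading field gives the 64.61.0 and 64.65.0 values quoted above.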
Comment 29 James Ettle 2023-03-25 15:20:28 UTC
Just updated my MSI Mortar Max B450M board to a (beta) BIOS 7B89v2I, which has AGESA 1.2.0.8 and SMC 64.65.0. Seems OK so far: two boots and it's using TSC happily.
Comment 30 Marcin 2023-03-25 19:44:02 UTC
I updated my Asus ProArt Creator x570 to latest BIOS version - 1101, which mentions "Update AGESA version to ComboV2PI 1208" in its release notes. I've been happily running it for about a month and been checking TSC related dmesg messages once in a while. Neither rebooting nor suspend/wake cycle causes TSC to be marked unstable anymore so I'm pretty sure it's fixed now.
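
Besides reading dmesg, the clocksource currently in use can be checked through sysfs. The Python sketch below reads the standard `current_clocksource` and `available_clocksource` files; the function name `read_clocksource` is hypothetical, and the example simply skips the check on systems where the directory does not exist.

```python
from pathlib import Path

# Standard Linux sysfs location for the system clocksource.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(base: Path = CLOCKSOURCE_DIR) -> tuple[str, list[str]]:
    """Return (current clocksource, list of available clocksources)."""
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


if CLOCKSOURCE_DIR.is_dir():
    try:
        current, available = read_clocksource()
        print(f"current: {current}, available: {', '.join(available)}")
    except OSError as exc:
        print(f"could not read clocksource files: {exc}")
```

If `tsc` no longer appears in the available list after a reboot, the kernel has marked it unstable, which matches the "Switched to clocksource hpet" message in the logs earlier in this report.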
Comment 31 Diego Vasconcelos 2023-04-07 02:02:13 UTC
(In reply to Mario Limonciello (AMD) from comment #25)
> > I just found a BIOS release for B550 with ComboAm4v2PI 1.2.0.8. Is this
> > sufficient information for you to clarify if the fix is included?
> 
> If you flash it and get me the output of 
> 
> # cat /sys/kernel/debug/dri/0/amdgpu_firmware_info  | grep SMC
> 
> I can confirm if it's there or not.

Does AGESA V2 1.2.0.8 fix this for the Ryzen 4000 series?

I updated my motherboard (Gigabyte A520M DS3H) with a Ryzen 4600G to the latest BIOS version, F17b, with AGESA V2 1.2.0.8.

The computer starts with TSC, but after a reboot, TSC is marked as unstable and HPET is chosen.

# cat /sys/kernel/debug/dri/1/amdgpu_firmware_info  | grep SMC
SMC feature version: 0, program: 0, firmware version: 0x00375b00 (55.91.0)
Comment 32 Mario Limonciello (AMD) 2023-04-07 13:46:04 UTC
(In reply to Diego Vasconcelos from comment #31)
> (In reply to Mario Limonciello (AMD) from comment #25)
> > > I just found a BIOS release for B550 with ComboAm4v2PI 1.2.0.8. Is this
> > > sufficient information for you to clarify if the fix is included?
> > 
> > If you flash it and get me the output of 
> > 
> > # cat /sys/kernel/debug/dri/0/amdgpu_firmware_info  | grep SMC
> > 
> > I can confirm if it's there or not.
> 
> AGESA V2 1.2.0.8 fix Ryzen Serie 4000?
> 

Ryzen 4000 is tracked in https://bugzilla.kernel.org/show_bug.cgi?id=216166