Bug 216146
Summary: | TSC marked unstable on AMD Ryzen 5700G | ||
---|---|---|---|
Product: | Platform Specific/Hardware | Reporter: | James Ettle (james) |
Component: | x86-64 | Assignee: | Mario Limonciello (AMD) (mario.limonciello) |
Status: | RESOLVED DOCUMENTED | ||
Severity: | normal | CC: | diego.ce, gaaf, kernelbugzilla, marcos, mario.limonciello, me, ppatry, rafael.ristovski, robie, samy, usama.anjum, vovochka13 |
Priority: | P1 | ||
Hardware: | AMD | ||
OS: | Linux | ||
Kernel Version: | 5.18.5 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | dmesg output
dmesg output with AGESA 1.2.0.7
dmesg output with AGESA 1.2.0.7, tsc=unstable |
Could you try updating BIOS to the latest version?

Could you try adding tsc=unstable to kernel boot arguments?

There's bug 203183 where people experience the same issue, you may wanna check it out.

(In reply to Artem S. Tashkinov from comment #1)
> Could you try updating BIOS to the latest version?
>
> Could you try adding tsc=unstable to kernel boot arguments?
>
> There's bug 203183 where people experience the same issue, you may wanna
> check it out.

Same result with latest BIOS 2.G0 05/13/2022, AGESA 1.2.0.7:

[ 2.083493] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 2.083499] clocksource: 'hpet' wd_nsec: 499419688 wd_now: 1b9b8a8 wd_last: 14c9beb mask: ffffffff
[ 2.083501] clocksource: 'tsc' cs_nsec: 496108391 cs_now: 13c9b4af92 cs_last: 1358ccf0e8 mask: ffffffffffffffff
[ 2.083503] clocksource: 'tsc' is current clocksource.
[ 2.083512] tsc: Marking TSC unstable due to clocksource watchdog
[ 2.083526] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[ 2.083527] sched_clock: Marking unstable (2083291279, 234506)<-(2100510221, -16984590)
[ 2.083761] clocksource: Checking clocksource tsc synchronization from CPU 2 to CPUs 0,6-7,15.
[ 2.083856] clocksource: Switched to clocksource hpet

With tsc=unstable, it just switches to hpet earlier. Full dmesg logs in both cases will be attached.

Created attachment 301222 [details]
dmesg output with AGESA 1.2.0.7
Created attachment 301223 [details]
dmesg output with AGESA 1.2.0.7, tsc=unstable
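
The two watchdog lines in the excerpt above give the time each clock measured over the same interval: wd_nsec for the HPET watchdog and cs_nsec for the TSC. A minimal sketch of turning that pair into a relative skew in ppm follows. The helper name `watchdog_skew_ppm` is hypothetical, and the ppm figure is an interpretation of the logged values, not the kernel's own threshold calculation.

```python
import re

# Two watchdog lines copied from the dmesg excerpt in this report.
DMESG = """\
[ 2.083499] clocksource: 'hpet' wd_nsec: 499419688 wd_now: 1b9b8a8 wd_last: 14c9beb mask: ffffffff
[ 2.083501] clocksource: 'tsc' cs_nsec: 496108391 cs_now: 13c9b4af92 cs_last: 1358ccf0e8 mask: ffffffffffffffff
"""


def watchdog_skew_ppm(log: str) -> float:
    """Relative difference between the clocksource and watchdog intervals, in ppm.

    Hypothetical helper: assumes the text contains one 'wd_nsec:' field and
    one 'cs_nsec:' field in the format quoted in this report.
    """
    wd = int(re.search(r"wd_nsec: (\d+)", log).group(1))
    cs = int(re.search(r"cs_nsec: (\d+)", log).group(1))
    # Negative result: the TSC counted less time than the HPET did.
    return (cs - wd) / wd * 1e6


print(f"TSC vs HPET skew: {watchdog_skew_ppm(DMESG):.0f} ppm")
```

For this excerpt the result is roughly -6600 ppm, the same order of magnitude as the 7000 ppm miscalibration mentioned in the original report.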
Can you please confirm if this is tied to warm boot only on the 5700G? Or can cold boot also reproduce this?

(In reply to Mario Limonciello (AMD) from comment #5)
> Can you please confirm if this is tied to warm boot only on the 5700G? Or
> can cold boot also reproduce this?

So far, I've only seen it with warm boots and not cold (although I'll keep trying cold boots over the next few days).

OK thanks - that matches my expectation. There is a similar issue reported in the mobile 5xxx series. I'll discuss with others to see if it could be the same root cause.

From my conversations, the fix for warm boot on mobile will also apply to desktop as well. As you saw it's not present in AGESA 1.2.0.7, but as long as no regressions are found it should be in a future release. We can keep this open to keep track of when that happens.

(In reply to Mario Limonciello (AMD) from comment #8)
> From my conversations, the fix for warm boot on mobile will also apply to
> desktop as well. As you saw it's not present in AGESA 1.2.0.7, but as long
> as no regressions are found it should be in a future release.

OK, and I'll keep an eye out for beta BIOSes for my motherboard with later AGESA versions (hopefully it'll receive updates for a few more years).

I just found this bug (via https://bugzilla.kernel.org/show_bug.cgi?id=202525), having experienced the same issue on my 5700G. This is on an Aorus b550i Pro AX board. It looks to me like their F16d BIOS only has AGESA 1.2.0.7: https://www.gigabyte.com/Motherboard/B550I-AORUS-PRO-AX-rev-10/support#support-dl-bios

I have a few questions if I may:
1) What's the actual impact of this bug?
2) Do you believe it's possible for it to be in 1.2.0.8?
3) Does AMD have any sway in advocating for BIOS manufacturers to include this fix? I see in this instance Aorus still haven't finalised their F16 release anyway (as it still has an alphabetical suffix)

Thanks

> 1) What's the actual impact of this bug?
The TSC fetched by the OS can't be effectively used for comparing time that passed in two commands. The kernel switches to using the HPET internally.

> 2) Do you believe it's possible for it to be in 1.2.0.8?

It's possible, but I can't make any commitment here.

> 3) Does AMD have any sway in advocating for BIOS manufacturers to include
> this fix? I see in this instance Aorus still haven't finalised their F16
> release anyway (as it still has an alphabetical suffix)

There is nothing special that needs to be done for this bug report. When they integrate the appropriate AEGEA release that contains this fix they will pick it up.

(In reply to Mario Limonciello (AMD) from comment #12)
> > 1) What's the actual impact of this bug?
>
> The TSC fetched by the OS can't be effectively used for comparing time that
> passed in two commands. The kernel switches to using the HPET internally.
>
> > 2) Do you believe it's possible for it to be in 1.2.0.8?
>
> It's possible, but I can't make any commitment here.
>
> > 3) Does AMD have any sway in advocating for BIOS manufacturers to include
> > this fix? I see in this instance Aorus still haven't finalised their F16
> > release anyway (as it still has an alphabetical suffix)
>
> There is nothing special that needs to be done for this bug report. When
> they integrate the appropriate AEGEA release that contains this fix they
> will pick it up.

s/AEGEA/AGESA/

I'm experiencing the same issue with my Ryzen 5 5600G on a Gigabyte B550 AORUS PRO V2 (https://www.gigabyte.com/Motherboard/B550-AORUS-PRO-V2-rev-10) latest bios with AGESA 1.2.0.7. Like the others, it concerns "CPU3" and "tsc" marked as unstable.

As a workaround I added `tsc=nowatchdog` to my boot parameters, it looks like it's working but I'm not quite sure it's 100% safe.
$ uname -rv
5.18.16-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 03 Aug 2022 11:25:04 +0000

(In reply to Andrea Brandi from comment #14)
> boot parameters, it looks like it's working but I'm not quite sure it's 100%
> safe.

Check across reboots both warm and cold with chronyc tracking - see if it picks up a large skew rate (like 1000s of ppm off).

(In reply to James Ettle from comment #15)
> Check across reboots both warm and cold with chronyc tracking - see if it
> picks up a large skew rate (like 1000s of ppm off).

Thank you James, I tried a couple of reboots, cold and warm, in both cases no much differences, just a little bit higher "Frequency" on cold. I tried also monitoring using `hpet` and I observed very similar values, the difference is that Skew stays low all the time, instead forcing `tsc`, Skew starts much higher after boot (15.000+ ppm) but it goes down rather quickly, in minutes.

These are my values 30m of work after boot.

Reference ID : B913B823 (radha.parvati.it)
Stratum : 3
Ref time (UTC) : Tue Aug 09 23:18:02 2022
System time : 0.000024053 seconds slow of NTP time
Last offset : -0.000068177 seconds
RMS offset : 0.144366026 seconds
Frequency : 32.025 ppm slow
Residual freq : -0.003 ppm
Skew : 0.422 ppm
Root delay : 0.019605236 seconds
Root dispersion : 0.002717295 seconds
Update interval : 128.7 seconds
Leap status : Normal

I don't like that "slow" after frequency, but it happens also with hpet. I don't know how to correctly interpret those values to be honest, do they look good to you? What I can say is that using `hpet` I have bad audio latency in my VMs (virtio), forcing `tsc` the situation is much better.
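
One way to read the Frequency field is to convert it into drift per day. In chrony's documentation, Frequency is the rate by which the system clock would be wrong if chronyd were not correcting it. The sketch below does that conversion for the value quoted above. The helper name `drift_seconds_per_day` is hypothetical, and the regular expression assumes the field layout shown in this thread.

```python
import re

# Field copied from the `chronyc tracking` output quoted above.
TRACKING = "Frequency : 32.025 ppm slow"


def drift_seconds_per_day(tracking: str) -> float:
    """Uncorrected drift implied by the Frequency field, in seconds per day.

    Hypothetical helper; a positive result means the clock would run fast,
    a negative result means it would run slow.
    """
    m = re.search(r"Frequency\s*:\s*([\d.]+) ppm (slow|fast)", tracking)
    ppm = float(m.group(1))
    sign = 1.0 if m.group(2) == "fast" else -1.0
    return sign * ppm * 1e-6 * 86400  # 86400 seconds in a day


print(f"{drift_seconds_per_day(TRACKING):+.2f} s/day")
```

By this measure 32 ppm is under three seconds per day, whereas a reading in the thousands of ppm (the level flagged in comment #15) would be well over a minute per day.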
For example, this is what I see with tsc=nowatchdog on a 5700G - nearly 7000 ppm off:

$ chronyc tracking
Reference ID : D993DF4E (bart.nexellent.net)
Stratum : 3
Ref time (UTC) : Wed Aug 10 21:07:02 2022
System time : 0.000964666 seconds fast of NTP time
Last offset : +0.000494900 seconds
RMS offset : 0.407102495 seconds
Frequency : 6963.022 ppm slow
Residual freq : -0.048 ppm
Skew : 11.078 ppm
Root delay : 0.024999851 seconds
Root dispersion : 0.033659816 seconds
Update interval : 64.4 seconds
Leap status : Normal

I'll give tsc_early_khz a go to see if that gives a workaround.

I'm seeing this issue as well on a Beelink SER5 5560U.

SMC feature version: 0, program: 0, firmware version: 0x00403800 (64.56.0)

FYI - the fix for this is contained in SMC 64.64.0.

What does it mean that this bug is resolved by "DOCUMENTATION"? Does that mean the bug is documented and won't be fixed? Where is it documented? So what's going to happen now? Will the bug be ignored from now as it is "RESOLVED"? Or will there finally be a fix? In comment #8, it was said a fix would be released in an AGESA release. Won't that happen anymore? Please reopen this bug report until there is a documented and released fix. Thanks.

There is no change to the kernel for this issue. It's a platform firmware issue. When OEMs update to a new AGESA release they would pick this up. The reason for indicating the SMC version is that is platform firmware component that fixes it, and you can tell whether you have the fix by looking at that version.

What is SMC? My google-fu doesn't give any result. In what AGESA version is the fix? Thanks.

> What is SMC? My google-fu doesn't give any result.

It's one of the platform firmware components. You can determine what version you have by this command:

# cat /sys/kernel/debug/dri/0/amdgpu_firmware_info | grep SMC

> In what AGESA version is the fix?
I intentionally didn't indicate the AGESA version because it varies based on the particular ASIC and if an OEM found a regression in this component they can potentially not update it.

I just found a BIOS release for B550 with ComboAm4v2PI 1.2.0.8. Is this sufficient information for you to clarify if the fix is included?

> I just found a BIOS release for B550 with ComboAm4v2PI 1.2.0.8. Is this
> sufficient information for you to clarify if the fix is included?
If you flash it and get me the output of
# cat /sys/kernel/debug/dri/0/amdgpu_firmware_info | grep SMC
I can confirm if it's there or not.
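
As a rough illustration of the check described in this thread, the sketch below decodes the SMC line and compares it with 64.64.0, the version named earlier as containing the fix. The byte layout (one byte each for the three version parts) is inferred from the pairing of 0x00403800 with 64.56.0 in the output quoted in this thread. The names `smc_version` and `FIXED` are hypothetical, and the comparison is assumed to be meaningful only for the Ryzen 5000 parts this bug covers.

```python
import re

# SMC version named in this report as containing the fix.
FIXED = (64, 64, 0)


def smc_version(line: str) -> tuple[int, int, int]:
    """Decode the hex firmware version from an amdgpu_firmware_info SMC line.

    Hypothetical helper; the byte layout is inferred from the sample output
    in this thread, where 0x00403800 is printed alongside 64.56.0.
    """
    raw = int(re.search(r"firmware version: 0x([0-9a-fA-F]+)", line).group(1), 16)
    return (raw >> 16) & 0xFF, (raw >> 8) & 0xFF, raw & 0xFF


# Sample line reported earlier in this thread.
line = "SMC feature version: 0, program: 0, firmware version: 0x00403800 (64.56.0)"
version = smc_version(line)
print(version, "has fix" if version >= FIXED else "predates fix")
```

Tuple comparison orders the three parts from left to right, so 64.56.0 sorts before 64.64.0 and is reported as predating the fix.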
I want to prevent flashing it if it doesn't contain the fix. Considering the reported regressions wrt RAM/XMP in AGESA's above 1.2.0.3c, I'd rather not go through the hassle of an update and potential revert if it doesn't knowingly contain a needed fix. Isn't it a bit disturbing AMD doesn't know if/when/where its fixes are distributed?

> Isn't it a bit disturbing AMD doesn't know if/when/where its fixes are
> distributed?

That's a strong statement. Not everyone in AMD talks to everyone at all the OEMs and knows all the information about their business. It's entirely up to OEMs to decide if they want to take pieces or not. Someone in AMD who talks to your motherboard vendor may know.

> I want to prevent flashing it if it doesn't contain the fix. Considering the
> reported regressions wrt RAM/XMP in AGESA's above 1.2.0.3c, I'd rather not go
> through the hassle of an update and potential revert if it doesn't knowingly
> contain a needed fix.

On AMD's reference firmware that fix was included in that version. So the fix *should* be there if the OEM didn't remove it for some reason.

FWIW, there is an Asus Beta BIOS update for my B450M-A II that has the updated SMC firmware. One can check the versions of components with `psptool` (versions in hex):

previous bios image with SMC 64.61.0:
> $ psptool bios.old 2>/dev/null | grep 0x1574d00 | cut -d '|' -f 9
> 0.40.3D.0

new bios image with SMC 64.65.0:
> $ psptool bios.new 2>/dev/null | grep 0x1574d00 | cut -d '|' -f 9
> 0.40.41.0

the release notes mention "1. Update AGESA version to ComboV2PI 1208", so it is fair to assume that at least Asus is shipping the fixed SMC version in their 1.2.0.8 AGESA.

Just updated my MSI Mortar Max B450M board to a (beta) BIOS 7B89v2I, has AGESA 1.2.0.8 and SMC 64.65.0. Seems OK so far, two boots and it's using TSC happily.

I updated my Asus ProArt Creator x570 to latest BIOS version - 1101, which mentions "Update AGESA version to ComboV2PI 1208" in its release notes.
I've been happily running it for about a month and been checking TSC related dmesg messages once in a while. Neither rebooting nor suspend/wake cycle causes TSC to be marked unstable anymore so I'm pretty sure it's fixed now.

(In reply to Mario Limonciello (AMD) from comment #25)
> > I just found a BIOS release for B550 with ComboAm4v2PI 1.2.0.8. Is this
> > sufficient information for you to clarify if the fix is included?
>
> If you flash it and get me the output of
>
> # cat /sys/kernel/debug/dri/0/amdgpu_firmware_info | grep SMC
>
> I can confirm if it's there or not.

AGESA V2 1.2.0.8 fix Ryzen Serie 4000? I updated my motherboard (Gigabyte A520M DS3H) with Ryzen 4600g to the latest BIOS version F17b with AGESA V2 1.2.0.8. The computer starts with TSC, but after a reboot, TSC is marked as unstable and HPET is chosen.

# cat /sys/kernel/debug/dri/1/amdgpu_firmware_info | grep SMC
SMC feature version: 0, program: 0, firmware version: 0x00375b00 (55.91.0)

(In reply to Diego Vasconcelos from comment #31)
> (In reply to Mario Limonciello (AMD) from comment #25)
> > > I just found a BIOS release for B550 with ComboAm4v2PI 1.2.0.8. Is this
> > > sufficient information for you to clarify if the fix is included?
> >
> > If you flash it and get me the output of
> >
> > # cat /sys/kernel/debug/dri/0/amdgpu_firmware_info | grep SMC
> >
> > I can confirm if it's there or not.
>
> AGESA V2 1.2.0.8 fix Ryzen Serie 4000?
>

Ryzen 4000 is tracked in https://bugzilla.kernel.org/show_bug.cgi?id=216166 |
Created attachment 301209 [details]
dmesg output

(Following on from 202525 and the suggestion to make new BZ entries for different generations of Ryzen.)

Processor: AMD Ryzen 5700G
Motherboard: MSI Mortar Max B450
BIOS: 2.D0 05/17/2021 (not the latest available), AGESA 1.2.0.2

Observation: TSC is marked as unstable, machine switches to HPET. Forcing TSC occasionally works fine, but on some boots results in a 7000ppm miscalibration.

From 5.18.5:

[ 2.088935] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 2.088940] clocksource: 'hpet' wd_nsec: 499081656 wd_now: 1b99cb7 wd_last: 14c92e2 mask: ffffffff
[ 2.088943] clocksource: 'tsc' cs_nsec: 495725287 cs_now: 1e875ad044 cs_last: 1e1686a41c mask: ffffffffffffffff
[ 2.088945] clocksource: 'tsc' is current clocksource.
[ 2.088950] tsc: Marking TSC unstable due to clocksource watchdog
[ 2.088962] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[ 2.088963] sched_clock: Marking unstable (2088727973, 233765)<-(2105642684, -16681222)
[ 2.089791] clocksource: Checking clocksource tsc synchronization from CPU 4 to CPUs 0,3,10,12.
[ 2.089870] clocksource: Switched to clocksource hpet

Full dmesg attached. (It's from a distro kernel, but experiments with vanilla sources don't alter the results.)