Bug 216967 - TSC marked unstable on AMD Ryzen 4500U
Summary: TSC marked unstable on AMD Ryzen 4500U
Status: RESOLVED DUPLICATE of bug 216166
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: All Linux
Importance: P1 normal
Assignee: platform_x86_64@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-01-25 15:01 UTC by Marco
Modified: 2023-07-05 18:53 UTC
CC List: 3 users

See Also:
Kernel Version: 6.1.8
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Marco 2023-01-25 15:01:42 UTC
Like many other AMD users, I see this issue on my laptop as well when it is booted with tsc=nowatchdog. Same CPU3 drift, and the TSC is marked as unstable on cold boot.

# chronyc tracking
Reference ID    : 877DA587 (ntp74.kashra-server.com)
Stratum         : 3
Ref time (UTC)  : Wed Jan 25 14:56:44 2023
System time     : 0.000261786 seconds fast of NTP time
Last offset     : -0.234086469 seconds
RMS offset      : 0.234086469 seconds
Frequency       : 2096.744 ppm slow
Residual freq   : -1438.992 ppm
Skew            : 2061.489 ppm
Root delay      : 0.027643852 seconds
Root dispersion : 0.172473431 seconds
Update interval : 64.6 seconds
Leap status     : Normal
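
To put the chronyc figures above in perspective, here is a minimal Python sketch (not part of the original report) that converts the reported frequency error into clock drift per day. The function name is hypothetical; the only value taken from the report is the 2096.744 ppm frequency error.

```python
def drift_seconds_per_day(freq_ppm: float) -> float:
    """Convert a clock frequency error in parts per million to seconds of drift per day."""
    seconds_per_day = 24 * 60 * 60
    return freq_ppm * 1e-6 * seconds_per_day


# Frequency error reported by `chronyc tracking` above: 2096.744 ppm slow.
print(f"{drift_seconds_per_day(2096.744):.1f} s/day")  # 181.2 s/day
```

At that rate an uncorrected clock would fall behind by roughly three minutes per day.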

# sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info

VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 53, firmware version: 0x000000a6
PFP feature version: 53, firmware version: 0x000000c2
CE feature version: 53, firmware version: 0x00000050
RLC feature version: 1, firmware version: 0x0000003c
RLC SRLC feature version: 1, firmware version: 0x00000001
RLC SRLG feature version: 1, firmware version: 0x00000001
RLC SRLS feature version: 1, firmware version: 0x00000001
RLCP feature version: 0, firmware version: 0x00000000
RLCV feature version: 0, firmware version: 0x00000000
MEC feature version: 53, firmware version: 0x000001d4
IMU feature version: 0, firmware version: 0x00000000
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 0, firmware version: 0x21000095
TA XGMI feature version: 0x00000000, firmware version: 0x00000000
TA RAS feature version: 0x00000000, firmware version: 0x00000000
TA HDCP feature version: 0x00000000, firmware version: 0x17000031
TA DTM feature version: 0x00000000, firmware version: 0x12000013
TA RAP feature version: 0x00000000, firmware version: 0x00000000
TA SECUREDISPLAY feature version: 0x00000000, firmware version: 0x27000005
SMC feature version: 0, program: 0, firmware version: 0x00375000 (55.80.0)
SDMA0 feature version: 41, firmware version: 0x00000028
VCN feature version: 0, firmware version: 0x05113000
DMCU feature version: 0, firmware version: 0x00000000
DMCUB feature version: 0, firmware version: 0x01010023
TOC feature version: 0, firmware version: 0x00000000
MES_KIQ feature version: 0, firmware version: 0x00000000
MES feature version: 0, firmware version: 0x00000000
VBIOS version: 113-RENOIR-031
Comment 1 Borislav Petkov 2023-01-25 15:07:33 UTC
Lenovo hardware?

If so, you should go ask them to send you a BIOS fix.

Also, can you upload dmesg from that machine?

Thx.
Comment 2 Mario Limonciello (AMD) 2023-01-25 15:09:18 UTC
Yes; this should be a duplicate of #216166 and the same fix for that applies here.  Lenovo will need to provide it.
Comment 3 Marco 2023-01-25 17:15:27 UTC
Nope, ASUS VivoBook TM420IA. Yea, I guessed that it was a firmware bug, but dealing with ASUS Support is the same as trying to carve water out of a rock, so I hoped there was some alternative route besides firmware updates.
Comment 4 Mario Limonciello (AMD) 2023-01-25 17:23:29 UTC
I'm sorry to say but we looked at this in depth and unfortunately in this case there is nothing that can be done from Linux.

I recall the problem only happens with "warm boot", and so you can get into the habit of "cold booting" your machine as a workaround until ASUS can get a fix for it in.
Comment 5 Marco 2023-01-25 17:45:31 UTC
(In reply to Mario Limonciello (AMD) from comment #4)
> I'm sorry to say but we looked at this in depth and unfortunately in this
> case there is nothing that can be done from Linux.
> 
> I recall the problem only happens with "warm boot", and so you can get into
> the habit of "cold booting" your machine as a workaround until ASUS can get a
> fix for it in.

Unfortunately this always happens, regardless of boot state. I'll try pestering ASUS support again, but I'm sure all I will get out of them is "It's not Windows, best of luck to you."

This is actually the second time that I have been shafted by AMD firmware bugs. Earlier this year, amd_sfh was changed and caused a regression on my machine: I lost autorotation on my 2-in-1.

Somehow, the AMD engineer managed to get ASUS to provide a fixed BIOS and sent me the build by email, but ASUS never updated the build on their website.

I wrote to ASUS, providing the email the engineer sent me and the BIOS build, since the USB-C port wasn't working on the new build, and requested a fixed version with both the sensor fix and a working USB-C port. All I got was: if it's not on the main site, it's the latest version.
I do not know how he did it, but as of today I have a newer build than the version on the site, and still with a non-working USB-C port.

Unfortunately, I can't recommend your products anymore to people, especially if they use Linux. Too much stuff is left to vendors to fix, and all they do is forget that they need to actually update the vendor firmware instead of releasing "new" BIOSes with the same AGESA version from two years ago (yes, the six BIOSes released in the span of two years never had an updated AGESA version, only my custom build).

Sorry for the vent,

If you can't do anything more, feel free to close this,

Marco.
Comment 6 Marco 2023-01-25 17:46:04 UTC
/s/this year/last year/
Comment 7 Mario Limonciello (AMD) 2023-01-25 17:54:53 UTC

*** This bug has been marked as a duplicate of bug 216166 ***
Comment 8 Borislav Petkov 2023-01-25 18:24:42 UTC
(In reply to Marco from comment #5)
> Unfortunately, I can't recommend your products anymore to people

The CPU vendor doesn't matter - it is the OEMs who don't care about their laptops supporting Linux properly. I, myself, wouldn't take ASUS hardware even for free, judging by past experience.

What you could do next time you buy, is search the net whether someone else has run Linux on that hw and what her/his feedback about it is. Or, if you can get your hands on the machine you wanna buy, you boot a live CD on it and stare at dmesg and see whether everything gets detected properly and so on.
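
The "stare at dmesg" check can be partly automated. The following is a minimal Python sketch, not something posted in this thread: it filters a dmesg dump for lines that look like TSC instability reports. The function name is hypothetical, and the two message patterns are assumptions about the kernel's wording (no dmesg log is attached to this bug), so they may need adjusting for a given kernel version.

```python
import re

# Assumed wording of the kernel messages of interest; the thread does not
# include a dmesg log, so these patterns are illustrative only.
TSC_PATTERNS = [
    re.compile(r"Marking TSC unstable", re.IGNORECASE),
    re.compile(r"Marking clocksource 'tsc' as unstable", re.IGNORECASE),
]


def find_tsc_warnings(dmesg_text: str) -> list[str]:
    """Return the lines of a dmesg dump that look like TSC instability reports."""
    return [
        line
        for line in dmesg_text.splitlines()
        if any(pattern.search(line) for pattern in TSC_PATTERNS)
    ]
```

One way to use it is to save the output of `dmesg` from a live session to a file and pass the file's contents to `find_tsc_warnings`; an empty result only means that neither assumed pattern matched, not that the TSC is healthy.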

I have a Zen2 laptop here from Lenovo and a perfectly fine TSC which is simply f***ed by the OEM BIOS. Guess what'll happen the next time I have to buy a Lenovo machine...

I'm sorry but this is the reality, unfortunately.
Comment 9 Marco 2023-01-25 18:33:16 UTC
(In reply to Borislav Petkov from comment #8)
> (In reply to Marco from comment #5)
> > Unfortunately, I can't recommend your products anymore to people
> 
> The CPU vendor doesn't matter - it is the OEMs who don't care about their
> laptops supporting Linux properly. I, myself, wouldn't take ASUS hardware
> even for free, judging by past experience.
> 
> What you could do next time you buy, is search the net whether someone else
> has run Linux on that hw and what her/his feedback about it is. Or, if you
> can get your hands on the machine you wanna buy, you boot a live CD on it
> and stare at dmesg and see whether everything gets detected properly and so
> on.
> 
> I have a Zen2 laptop here from Lenovo and a perfectly fine TSC which is
> simply f***ed by the OEM BIOS. Guess what'll happen the next time I have to
> buy a Lenovo machine...
> 
> I'm sorry but this is the reality, unfortunately.

If only the firmware was open, all of this would never happen. I would update my god damn AGESA version by hand, and the problem would be solved.

But no, gotta love closed, unsupported, and unreliable crap.

Unfortunately, I ran into far fewer BIOS bugs on Intel laptops with Linux than the plethora I had with AMD, and yes, a lot of it is just in their shared codebase. Even if part of the blame is on the vendor side, AMD still bears some of it.

Maybe it's just coincidence that I had more luck with Intel and a too small sample size.

On desktops they are forced to update it, since you can replace the CPU freely, but on laptops it's better to make 400 models a month than to do even the minimum required to make them work.

Marco.
Comment 10 Mario Limonciello (AMD) 2023-01-25 18:39:27 UTC
I'm sorry for your situation.  In general, I would suggest purchasing laptops that the manufacturer has certified to work well with a Linux OS vendor.

Everything else, we (AMD) do the best we can to help the ecosystem, but the reality is that there are some things completely out of our control.
Comment 11 Borislav Petkov 2023-01-25 19:03:56 UTC
(In reply to Marco from comment #9)
> If only the firmware was open, all of this would never happen. I would
> update my god damn AGESA version by hand, and the problem would be solved.
> 
> But no, gotta love closed, unsupported, and unreliable crap.

I'd love to have open firmware everywhere but it is not that easy. And the more
you get involved in this, the more you realize that the rabbit hole doesn't end.
Look at all those presentations from the open firmware folks.

> Maybe it's just coincidence that I had more luck with Intel and a too small
> sample size.

Yes, I think it is a coincidence because the kernel is full of fixes for BIOS
bugs, regardless of vendors. And we do try to push back harder on those so that
they do get fixed before they even reach OEM vendors but it's a win-lose battle.

Hell, it is 2023 and we still don't have a f*ckin' serial port on laptops so
that we can get debug dump and you're talking about supporting Linux. Pff.

If you ask me, OEM vendors should have no opportunity to f*ck up the machine.
Simply do your damn bling bling shiny clicky toy BIOS window and GTFO.

In the case of the TSC, they should not even have the opportunity to touch the
TSC MSR. That thing should be read-only. End of story. But that ship has sailed
long ago.

This is what I mean with the endless rabbit hole...

> On desktop they are forced to update it, since you can replace the CPU
> freely, but on laptop it's better to make 400 models a month rather than
> even do the minimum required to make them work.

No, desktop is the same snafu.

The only difference is servers where the end customer says, but but, I want this
fixed. And more often than not it does get fixed for obvious reasons...
Comment 12 Marco 2023-01-25 19:28:15 UTC
(In reply to Mario Limonciello (AMD) from comment #10)
> I'm sorry for your situation.  In general, I would suggest purchasing
> laptops that the manufacturer has certified to work well with a Linux OS
> vendor.
> 
> Everything else, we (AMD) do the best we can to help the ecosystem, but the
> reality is that there are some things completely out of our control.

Yea, I know. I have one AMD desktop and a Steam Deck; if the vendor gives a damn, it usually just works. I got biased by the number of issues on this laptop compared to my previous brand (Clevo). Thanks to the modularity of Clevo barebones, the older machine was far easier to mod, even at the BIOS level (I updated quite a lot of Intel microcode that way, and there was no signature on the UEFI capsule itself, good times). You're good, usually, but on this machine it is one problem after another unfortunately :S
Comment 13 Marco 2023-01-25 19:32:05 UTC
(In reply to Borislav Petkov from comment #11)
> (In reply to Marco from comment #9)
> > If only the firmware was open, all of this would never happen. I would
> > update my god damn AGESA version by hand, and the problem would be solved.
> > 
> > But no, gotta love closed, unsupported, and unreliable crap.
> 
> I'd love to have open firmware everywhere but it is not that easy. And the
> more
> you get involved in this, the more you realize that the rabbit hole doesn't
> end.
> Look at all those presentations from the open firmware folks.
> 
> > Maybe it's just coincidence that I had more luck with Intel and a too small
> > sample size.
> 
> Yes, I think it is a coincidence because the kernel is full of fixes for BIOS
> bugs, regardless of vendors. And we do try to push back harder on those so
> that
> they do get fixed before they even reach OEM vendors but it's a win-lose
> battle.
> 
> Hell, it is 2023 and we still don't have a f*ckin' serial port on laptops so
> that we can get debug dump and you're talking about supporting Linux. Pff.
> 
> If you ask me, OEM vendors should have no opportunity to f*ck up the machine.
> Simply do your damn bling bling shiny clicky toy BIOS window and GTFO.
> 
> In the case of the TSC, they should not even have the opportunity to touch
> the
> TSC MSR. That thing should be read-only. End of story. But that ship has
> sailed
> long ago.
> 
> This is what I mean with the endless rabbit hole...
> 
> > On desktop they are forced to update it, since you can replace the CPU
> > freely, but on laptop it's better to make 400 models a month rather than
> > even do the minimum required to make them work.
> 
> No, desktop is the same snafu.
> 
> The only difference is servers where the end customer says, but but, I want
> this
> fixed. And more often than not it does get fixed for obvious reasons...

Oh, no, I don't mean going as far down as some OS firmware stuff, but if you don't want to support the platform, just release the god damn build chain and the sources of your modifications and let me fix my own machine; that's all I was asking. I am not at the level of wanting to verify the pattern printed on the silicon itself :P
Comment 14 Mario Limonciello (AMD) 2023-01-25 19:34:38 UTC
> And the more you get involved in this, the more you realize that the rabbit
> hole doesn't end

To add to this - even on "open" firmware designs, you still end up with things like FSP that will NEVER open up.

Sure, you can do a lot more with these designs, but more knobs don't immediately mean you can fix all the problems.  You need proper debug tools, documentation, a good understanding of how different IP blocks interact, and knowledge of the bugs you need to dodge when you make solutions.

If tomorrow someone told me there is a source code drop available for all the firmware on your laptop, I still wouldn't expect anyone but a minuscule number of people to be able to fix this TSC issue.

> And we do try to push back harder on those so that
> they do get fixed before they even reach OEM vendors but it's a win-lose battle.

At the end of the day, the OEM owns the product.  Even if we offer firmware fixes for bugs, they might not roll them out.  It sucks for everyone involved.

When this happens, we do our best to at least offer a workaround in the kernel as well if it makes sense.

For example, this change is going into 6.2-rc for a suspend bug where multiple IRQs are active over s2idle:
https://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86.git/commit/?h=review-hans&id=8e60615e8932167057b363c11a7835da7f007106

You can see it has a guard that will only apply it up until a certain firmware version that we have the fix.  If the OEM includes it, the workaround turns off and you have the right behavior.  If the OEM doesn't, at least Linux will behave "better" than before.
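
The guard described above follows a common pattern: apply a workaround only on firmware older than the first release that carries the real fix. The following Python sketch illustrates the idea only; it is not the code from the linked commit, and the version numbers and names are hypothetical.

```python
from typing import Tuple

Version = Tuple[int, int, int]

# Hypothetical first firmware version that contains the real fix.
FIRST_FIXED_VERSION: Version = (1, 2, 3)


def needs_workaround(
    firmware_version: Version, first_fixed: Version = FIRST_FIXED_VERSION
) -> bool:
    """Return True only for firmware older than the first fixed release.

    Tuples compare element by element, so (1, 2, 2) < (1, 2, 3) < (2, 0, 0).
    """
    return firmware_version < first_fixed
```

With a guard like this, the workaround switches itself off as soon as the OEM ships firmware at or above the fixed version, and stays active otherwise.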

If I had something like that to offer to you for the TSC I absolutely would, but I don't.

> The only difference is servers where the end customer says, but but, I want
> this
> fixed. And more often than not it does get fixed for obvious reasons...

The reality is that no matter the ecosystem the platform firmware comes from you need a team to be developing it and testing it.  If you just churn out cookie cutter laptops and don't bother to test them with Linux while you're making them this is exactly where you end up.
Comment 15 Borislav Petkov 2023-01-25 19:43:08 UTC
(In reply to Mario Limonciello (AMD) from comment #14)
> You can see it has a guard that will only apply it up until a certain
> firmware version that we have the fix.  If the OEM includes it, the
> workaround turns off and you have the right behavior.  If the OEM doesn't,
> at least Linux will behave "better" than before.

That one is simple. If it involves touching a lot of code and it becomes real ugly and we have to support it forever, I have no problem shooting it down.  The kernel can't be the dumping ground for BIOS f*ckup fixes. At some point, they will have to face the music.

> The reality is that no matter the ecosystem the platform firmware comes from
> you need a team to be developing it and testing it.  If you just churn out
> cookie cutter laptops and don't bother to test them with Linux while you're
> making them this is exactly where you end up.

Preach brother!

:-)
Comment 16 Calvin Walton 2023-07-05 17:22:03 UTC
According to responses in https://forums.lenovo.com/t5/Other-Linux-Discussions/Unusable-TSC-on-P14s-and-X13-with-the-latest-LTS-kernel/m-p/5064905?page=9#6031445

It seems like Lenovo has just started shipping ThinkPad firmware updates with SMC firmware version 55.93.0, e.g. T14 Gen 1 / P14s Gen 1 ver 1.44 on https://support.lenovo.com/us/en/downloads/ds544977-bios-update-utility-bootable-cd-for-windows-10-64-bit-thinkpad-t14-gen-1-types-20ud-20ue (which I don't see on LVFS yet; it will probably be added soon).

Is the required firmware update included in the AGESA packages that desktop motherboard manufacturers get? I've got a Renoir APU in an ASUS board which currently has ComboAM4v2 PI 1.2.0.8 and SMC firmware version 55.91.0 - any chance you could say what version I should be looking for or asking for to get this fix?
Comment 17 Mario Limonciello (AMD) 2023-07-05 18:53:13 UTC
> any chance you could say what version I should be looking for or asking for
> to get this fix?

For desktop, ComboAM4v2 PI 1.2.0.9 picked up the fix.
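
For anyone comparing SMC firmware versions like the ones mentioned in this thread, here is a minimal Python sketch that extracts the SMC version from amdgpu_firmware_info output. Two things are assumptions: the one-byte-per-field layout is inferred from the single sample line in the description (0x00375000 shown as 55.80.0), and 55.93.0 is only the version Lenovo is reported to ship in comment 16; the thread does not confirm it as the minimum version containing the fix. The function names are hypothetical.

```python
import re
from typing import Optional, Tuple


def decode_smc_version(raw: int) -> Tuple[int, int, int]:
    """Split a packed SMC firmware word into three one-byte fields (assumed layout)."""
    return ((raw >> 16) & 0xFF, (raw >> 8) & 0xFF, raw & 0xFF)


def smc_version_from_info(text: str) -> Optional[Tuple[int, int, int]]:
    """Find the SMC line in amdgpu_firmware_info output and decode its hex version."""
    match = re.search(
        r"^SMC feature version:.*firmware version: (0x[0-9a-fA-F]+)",
        text,
        re.MULTILINE,
    )
    return decode_smc_version(int(match.group(1), 16)) if match else None


# The SMC line from the description of this bug.
line = "SMC feature version: 0, program: 0, firmware version: 0x00375000 (55.80.0)"
version = smc_version_from_info(line)
print(version)                 # (55, 80, 0)
# Comparison against the version Lenovo reportedly ships (an assumption, see above).
print(version >= (55, 93, 0))  # False
```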
