Bug 216161
Summary: | TSC marked unstable on AMD Ryzen 3500U | ||
---|---|---|---|
Product: | Platform Specific/Hardware | Reporter: | Nelson G (konoha02) |
Component: | x86-64 | Assignee: | platform_x86_64 (platform_x86_64) |
Status: | RESOLVED WILL_NOT_FIX | ||
Severity: | normal | CC: | konoha02, mario.limonciello, samy |
Priority: | P1 | ||
Hardware: | AMD | ||
OS: | Linux | ||
Kernel Version: | 5.18.5 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | dmesg-linux5-18-5, amdgpu_firmware_info, amdgpu_firmware_info_ryzen7-3700U, dmesg_ryzen_7_3700U_T495, 6.11-dmesg-3500u-defaults, amdgpu_firmware_info-6.11-rc6 | ||
Description
Nelson G
2022-06-21 17:36:03 UTC
Can you please correlate this to cold/warm boot? Does it only happen in one scenario? Also, can you please share /sys/kernel/debug/dri/0/amdgpu_firmware_info?
Sorry; I see you did confirm it's both cold and warm boot, this is definitely a unique issue to the others then.
Created attachment 301266 [details]
amdgpu_firmware_info
Here is my amdgpu_firmware_info. I hope it helps.
If you need anything else, please let me know.
> SMC feature version: 0, program: 0, firmware version: 0x00041e2a (4.30.42)
Thanks. I'll share this with the right folks.
I want to set expectations on the path forward because it's not a short path. AMD may try to reproduce this on reference hardware. If the team can't reproduce it, the only ones that could fix it are Lenovo.
If the team can repro and comes up with a fix it would need to be provided to OEMs like Lenovo. OEMs would need to do their own testing with it and eventually could roll it out.
---
If any other users with Ryzen 3000 encounter this as well, please provide the same details Neil provided:
1) Add a dmesg confirming the failure.
2) Provide amdgpu_firmware_info from your failing system.
I'm running on a Lenovo T495 with an AMD Ryzen 7 3700U; here is my amdgpu_firmware_info. I just upgraded my kernel, rebooted, and noticed this issue just now.
Created attachment 301798 [details]
amdgpu_firmware_info_ryzen7-3700U
Ryzen 7 3700U Zen+, VBIOS version: 113-PICASSO-114
Created attachment 301799 [details]
dmesg_ryzen_7_3700U_T495
dmesg logs from boot until a USB device is connected. I have a USB-C Thinkpad Dock that also gets connected at startup, with 1 external screen and a few USB devices connected to it.
The AMD internal team checked on reference hardware, and this can't be reproduced on SMU 0x00041e57 (4.30.87). This will need to be fixed by the OEM upgrading the BIOS to include newer firmware.
Hi @Mario, there has been a recent BIOS update, but nothing is mentioned in the changelog other than some Windows-specific fixes. Is there any way I can check that my firmware has been put up to date? Besides being unable to reproduce this bug immediately right now?
BIOS version: R12ET62W (1.32)
Linux changelog: https://download.lenovo.com/pccbbs/mobiles/r12ul62w.txt
Release date: 27 February 2023
Here is the version of the SMC I have at the moment:
sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info|grep -i smc
SMC feature version: 0, program: 0, firmware version: 0x00041e2f (4.30.47)
> is there any way I can check that my firmware has been put up to date ??
It's up to the OEMs to decide whether to pick up anything from AMD AGESA. So "up to date" isn't a term that really makes sense in this context.
> Beside being unable to reproduce this bug immediately right now ?
You can try both cold and warm boot. If it's similar to the Ryzen 5000 issue, that one was tied specifically to warm boot.
> SMC feature version: 0, program: 0, firmware version: 0x00041e2f (4.30.47)
This is older than what AMD validated, but that doesn't mean it will have the issue.
It seems this issue is fixed for me, I haven't encountered it again.
(In reply to Lahfa Samy from comment #11)
> It seems this issue is fixed for me, I haven't encountered it again.
Just curious, you're still using BIOS 1.32, right? Any idea how it got fixed for you? It seems that BIOS hasn't fixed this yet.
I am using BIOS 1.32, yes, and recently I've seen the issue happen again. It wasn't fixed, you are correct.
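The firmware comparison discussed above (0x00041e2f on the T495 versus the 0x00041e57 that AMD validated) can be scripted. The following is a minimal sketch, not an official tool: the function names are hypothetical, and the byte layout (major, minor, patch in the three low bytes) is an assumption inferred from the values quoted in this thread, e.g. 0x00041e57 shown as 4.30.87. As noted above, an older version does not by itself mean the system will show the issue.

```python
import re

# Version AMD reported as validated on reference hardware in this thread.
VALIDATED_SMC = 0x00041E57  # displayed as 4.30.87

# Matches the SMC line as it appears in amdgpu_firmware_info output.
SMC_LINE = re.compile(
    r"SMC feature version:.*firmware version:\s*0x([0-9a-fA-F]{8})"
)


def parse_smc_version(text):
    """Return the raw SMC firmware version found in the text, or None."""
    match = SMC_LINE.search(text)
    return int(match.group(1), 16) if match else None


def decode_smc_version(raw):
    """Split the raw value into (major, minor, patch).

    Assumption: one byte per field, as suggested by the values in this thread.
    """
    return (raw >> 16) & 0xFF, (raw >> 8) & 0xFF, raw & 0xFF


def is_older_than_validated(raw, validated=VALIDATED_SMC):
    """True if the decoded version sorts before the validated one."""
    return decode_smc_version(raw) < decode_smc_version(validated)


if __name__ == "__main__":
    sample = (
        "SMC feature version: 0, program: 0, "
        "firmware version: 0x00041e2f (4.30.47)"
    )
    raw = parse_smc_version(sample)
    print(decode_smc_version(raw), is_older_than_validated(raw))
```

On a real system the input text would come from /sys/kernel/debug/dri/0/amdgpu_firmware_info, which normally requires root to read.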
Hi Mario, today I updated to BIOS 1.27 (Thinkpad E495) and the output of
cat /sys/kernel/debug/dri/0/amdgpu_firmware_info|grep -i smc
is:
SMC feature version: 0, program: 0, firmware version: 0x00041e57 (4.30.87)
but it is still running clocksource hpet, I tried cold boot and warm boot. Could you please clarify to me if this is still Lenovo OEM's faulty BIOS?
The version tested by AMD on reference hardware was also 4.30.87 and it passed. So if you're still seeing the warp, this should be an OEM BIOS issue at this point.
Hi @Mario, nothing against you here, just feedback for AMD as a whole, if AMD cannot push/incentivize OEMs to ship the latest validated firmware (mostly AGESA, I believe), because I'm still hitting this bug and even sound playback is sometimes affected after one BIOS update. My SMC version is still the old one.
> sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info|grep -i smc
> SMC feature version: 0, program: 0, firmware version: 0x00041e2f (4.30.47)
I've seen this:
> [ 8668.176055] hpet: Lost 3094 RTC interrupts
Not sure if it's related to this issue.
I'll start recommending people to buy other devices than with AMD CPUs, Intel seems much more capable of at least having firmware fixes pushed by OEMs (in my case Lenovo) or stable/durable firmware for Linux, even if the performance is less impressive than AMD, I can understand paying more for stability and durability of firmware and device as a whole. I'll also see, if someone with more reach can make people more aware of this specific issue, as there is really no solution AMD can provide (I understand, that they've done their best so far) however OEMs are too slow to provide updates especially BIOS updates and we both know how much they care about their Linux customer base, so basically, I just need to throw my "old" device and just buy a new Intel one and settle for that, that's the real fix to this bug.
> AMD cannot push/incentivize OEMs to ship the latest validated firmware
> (mostly AGESA, I believe)
Unfortunately, when AMD looked at it, many of the TSC issues reported in Linux are also present in Windows but a lot harder to detect. As with all things in the software world, latest doesn't always translate to greatest. There is always risk with taking updates and the most important thing is stability. As many models ship with Windows you need to look at the equation from a business perspective. If OEMs don't get a complaint for a problem from "their" customers, why should they go through the effort to take a new AGESA update from AMD that could introduce risk?
> but it is still running clocksource hpet, I tried cold boot and warm boot.
> Could you please clarify to me if this is still Lenovo OEM's faulty BIOS?
AMD tested reference hardware with the same APU running a modern kernel and same AGESA and didn't have problems with cold or warm boot. It has to be an OEM BIOS issue at this point.
> I'll start recommending people to buy other devices than with AMD CPUs, Intel
> seems much more capable of at least having firmware fixes pushed by OEMs (in
> my case Lenovo) or stable/durable firmware for Linux, even if the performance
> is less impressive than AMD, I can understand paying more for stability and
> durability of firmware and device as a whole.
> I'll also see, if someone with more reach can make people more aware of this
> specific issue, as there is really no solution AMD can provide (I understand,
> that they've done their best so far) however OEMs are too slow to provide
> updates especially BIOS updates and we both know how much they care about
> their Linux customer base, so basically, I just need to throw my "old" device
> and just buy a new Intel one and settle for that, that's the real fix to this
> bug.
With the case of Lenovo, they have certain systems in a "Linux program".
These systems are sold with and regularly tested with Linux, and issues found are fixed through BIOS updates. You might want to check with them if your system is part of this program.
I see, thanks for your insightful input on this issue. Maybe if AMD can make those bugs more impactful for Windows users, then OEMs would be forced to fix them (an artificial incentive, but this could hurt AMD's business); I understand it would be just plain weird to do something like that. It seems the T495 is indeed a Linux certified device: https://support.lenovo.com/my/en/solutions/pd500343-linux-certification-thinkpad-t495-20njz4krus (maybe only specific models are?) but digging deeper, I believe it is certified to run Ubuntu. I think, I'm just bound to wait when the OEM wants to fix it or not, I'll just avoid Lenovo (and many more) as OEMs for now. One last question, if you can answer that: which OEM implements AMD validated firmware in a timely fashion (unlike Lenovo in this case)? That is probably dependent as well on the laptop's release date (the newer the laptop, the more chance they upgrade its BIOS)?
> I think, I'm just bound to wait when the OEM wants to fix it or not, I'll
> just avoid Lenovo (and many more) as OEMs for now.
I suggest you reach out and voice your opinion to your system manufacturer. As I mentioned above, unless they're hearing from their customers they may not prioritize their updates.
> One last question, if you can answer that
I'm sorry; I can't help with that question.
Hey, Lahfa Samy, try writing your issue in this forum https://forums.lenovo.com/t5/Linux-Operating-Systems/ct-p/lx_en it's worth trying, try reaching MarkRHPearson, at least he 'tries' to get the message to the BIOS team.
Hi, Mario, I've already voiced my opinion to the official client support, showing them this link and letting them know that you from AMD have confirmed that a BIOS update is needed. They told me it was coming at the end of 2022, and one BIOS update did indeed come in February 2023.
The firmware inside still seems older than the version AMD said it validated without issues (comment 8 in this bug report). In the changelog, no Linux bug was mentioned, only a Windows fix.
Hey, Nelson G, I've opened a post on the forum and tagged MarkRHPearson, will see how it goes, but I'm still pretty skeptical. If it ever gets fixed for me, I'll mention it here.
Hi, concerning the T495, reaching out to MarkRHPearson, I did receive an answer that it is being worked on and is on the roadmap, however more recent Thinkpads are currently being updated. As such, the T495 will not receive the BIOS update soon. Here is the post I've created: https://forums.lenovo.com/t5/Other-Linux-Discussions/Lenovo-Thinkpad-T495-AMD-TSC-marked-unstable-kernel-bug-needs-a-BIOS-update-for-a-fix/m-p/5237197?page=1#6037331
Just wanted to say for the record that I just tested on a few distros (debian, ubuntu, fedora, and arch) with the boot parameters:
tsc=reliable clocksource=tsc nohpet
and nowadays it enables tsc just fine on my E495, with an important performance improvement (especially latency, which drops dramatically compared to hpet). No BIOS update introduced any change about this topic, so something somewhere in linux 6.1 and beyond seems to have improved/stabilized this? Except that tsc still has to be "forced" with parameters, otherwise it is still marked unstable. But! The result, again, is tsc being stable and performing just great. It is important to say that by the time I reported this, it was buggy, very buggy to force tsc: the system would suspend randomly and constantly, for example, among a lot of other bugs that made it IMPOSSIBLE to even consider enabling it. Now it's been days of using tsc, so thanks to whoever made this possible (perhaps by accident?).
Can you please share an updated kernel log with 6.11-rc6 with no parameters related to TSC applied?
Created attachment 306823 [details]
6.11-dmesg-3500u-defaults
here is dmesg, is that enough?
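Before capturing a log "with no parameters related to TSC applied", it can help to confirm what the running kernel was actually booted with and which clocksource it selected. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and the list of parameters it looks for (tsc=, clocksource=, hpet=, nohpet) covers only the ones mentioned in this thread plus hpet=, not every option that can affect timekeeping.

```python
from pathlib import Path

# Standard sysfs location for the system clocksource on Linux.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")

# Non-exhaustive: the parameters from this thread, plus hpet=.
PARAM_PREFIXES = ("tsc=", "clocksource=", "hpet=")
PARAM_EXACT = {"nohpet"}


def tsc_related_params(cmdline):
    """Return the kernel command-line tokens that touch TSC/HPET selection."""
    return [
        token
        for token in cmdline.split()
        if token in PARAM_EXACT or token.startswith(PARAM_PREFIXES)
    ]


def read_text(path):
    """Read a small text file, returning None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_text("/proc/cmdline") or ""
    print("TSC-related parameters:", tsc_related_params(cmdline) or "none")
    print("current clocksource:",
          read_text(CLOCKSOURCE_DIR / "current_clocksource"))
    print("available clocksources:",
          read_text(CLOCKSOURCE_DIR / "available_clocksource"))
```

An empty parameter list together with `hpet` as the current clocksource would match the default-boot behaviour described in this report.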
Created attachment 306824 [details]
amdgpu_firmware_info-6.11-rc6
just in case
Thanks, so you're still seeing a TSC warp issue between CPUs.
[ 0.286359] TSC synchronization [CPU#0 -> CPU#2]:
[ 0.286363] Measured 9282405873 cycles TSC warp between CPUs, turning off TSC clock.
[ 0.287708] tsc: Marking TSC unstable due to check_tsc_sync_source failed
[ 0.287851] #1 #3 #5 #7
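For other reporters who want to check their own dmesg for the same failure, the lines quoted above can be picked out with a few regular expressions. This is a minimal sketch with hypothetical names; the patterns are written against the exact message wording quoted in this comment and may need adjusting if a kernel version words these messages differently.

```python
import re

# Patterns based on the message wording quoted in this bug report.
SYNC = re.compile(r"TSC synchronization \[CPU#(\d+) -> CPU#(\d+)\]")
WARP = re.compile(r"Measured (\d+) cycles TSC warp between CPUs")
UNSTABLE = re.compile(r"Marking TSC unstable due to (.+)")

# Excerpt from the 6.11-rc6 log discussed above.
SAMPLE = """\
[ 0.286359] TSC synchronization [CPU#0 -> CPU#2]:
[ 0.286363] Measured 9282405873 cycles TSC warp between CPUs, turning off TSC clock.
[ 0.287708] tsc: Marking TSC unstable due to check_tsc_sync_source failed
"""


def summarize_tsc_sync(log_text):
    """Collect the CPU pair, warp size, and reason from a kernel log.

    If a message appears more than once, the last occurrence wins.
    """
    summary = {"cpu_pair": None, "warp_cycles": None, "unstable_reason": None}
    for line in log_text.splitlines():
        if (m := SYNC.search(line)):
            summary["cpu_pair"] = (int(m.group(1)), int(m.group(2)))
        elif (m := WARP.search(line)):
            summary["warp_cycles"] = int(m.group(1))
        elif (m := UNSTABLE.search(line)):
            summary["unstable_reason"] = m.group(1).strip()
    return summary


if __name__ == "__main__":
    print(summarize_tsc_sync(SAMPLE))
```

To run it against a live system, the output of `dmesg` can be passed in as `log_text` in place of `SAMPLE`.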