Created attachment 304711 [details]
dmesg outputs with and without CONFIG_HW_RANDOM_TPM option enabled

I can reproduce the AMD fTPM stuttering bug with the 6.4.3 kernel on an Asus GA402RJ laptop. One workaround that worked for me is to compile the kernel without the CONFIG_HW_RANDOM_TPM option.

My fTPM firmware version is 0x3005700020005. I found this by inserting this debug print line in the kernel:

> diff --git a/drivers/char/tpm/tpm-chip.c b/drivers/char/tpm/tpm-chip.c
> index cd48033b804a..c80bdd92360d 100644
> --- a/drivers/char/tpm/tpm-chip.c
> +++ b/drivers/char/tpm/tpm-chip.c
> @@ -550,6 +550,9 @@ static bool tpm_amd_is_rng_defective(struct tpm_chip *chip)
>  		return false;
>  
>  	version = ((u64)val1 << 32) | val2;
> +	dev_warn(&chip->dev,
> +		 "AMD fTPM version = 0x%llx\n",
> +		 version);
>  	if ((version >> 48) == 6) {
>  		if (version >= 0x0006000000180006ULL)
>  			return false;

I am also attaching my dmesg output from kernels compiled with and without the CONFIG_HW_RANDOM_TPM option.
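For readers following the arithmetic: the debug print shows a 64-bit value assembled from two 32-bit halves, and the check visible in the diff context compares its top 16 bits against 6. Below is a minimal Python sketch of that decoding, assuming only what the diff context shows. The helper names `split_ftpm_version` and `passes_visible_gate` are hypothetical, and any branches of `tpm_amd_is_rng_defective()` that the diff context does not show are not modelled.

```python
def split_ftpm_version(version: int) -> dict:
    """Split a 64-bit fTPM version the way the diff context combines it."""
    val1 = (version >> 32) & 0xFFFFFFFF  # upper half: ((u64)val1 << 32)
    val2 = version & 0xFFFFFFFF          # lower half: | val2
    top16 = version >> 48                # the value the diff compares against 6
    return {"val1": val1, "val2": val2, "top16": top16}


def passes_visible_gate(version: int) -> bool:
    """Mirror only the branch visible in the diff context:
    top 16 bits equal to 6 and version >= 0x0006000000180006."""
    return (version >> 48) == 6 and version >= 0x0006000000180006


reported = 0x3005700020005  # firmware version from this report
parts = split_ftpm_version(reported)
print(hex(parts["val1"]), hex(parts["val2"]), parts["top16"])
```

With the reported value the top 16 bits come out as 3, so the `== 6` branch shown in the diff context is not the one that applies to this firmware.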
> My fTPM firmware version is 0x3005700020005. I found this by inserting this > debug print line in the kernel: Thanks. > ASUSTeK COMPUTER INC. ROG Zephyrus G14 GA402RJ_GA402RJ/GA402RJ, BIOS > GA402RJ.318 03/09/2023 AFAICT this machine does have the fixed firmware and it's also quite modern. You're the first person that has reported this on Rembrandt. This may be a secondary issue. What kind of workload reproduces the stutter? Anything specific?
(In reply to Mario Limonciello (AMD) from comment #1) > > My fTPM firmware version is 0x3005700020005. I found this by inserting this > > debug print line in the kernel: > > Thanks. > > > ASUSTeK COMPUTER INC. ROG Zephyrus G14 GA402RJ_GA402RJ/GA402RJ, BIOS > > GA402RJ.318 03/09/2023 > > AFAICT this machine does have the fixed firmware and it's also quite modern. > You're the first person that has reported this on Rembrandt. > This may be a secondary issue. > > What kind of workload reproduces the stutter? Anything specific? No, it just happens randomly. There are several people with the same laptop that have the same issue on the asusctl project discord. Some of them contacted Asus support and received privately a special BIOS that adds an option to disable fTPM in the settings.
And presumably this special BIOS helps? Does that special BIOS switch it from fTPM to Pluton? or completely disables TPM functionality?
(In reply to Mario Limonciello (AMD) from comment #3) > And presumably this special BIOS helps? Does that special BIOS switch it > from fTPM to Pluton? or completely disables TPM functionality? Yes, it fixes the stutters. I think it just disables TPM completely (which introduces some issues for Windows, so it's not optimal)
Jason - what are your thoughts here?

I'm personally leaning toward just blocking all AMD fTPMs from the feature. With two separate issues showing up like this, it's not worth a hodgepodge of quirks and SMBIOS strings.
Created attachment 304718 [details] potential patch (v1) Here's a patch that should effectively disable it as well if that is the direction we go for this issue.
(In reply to Mario Limonciello (AMD) from comment #5)
> Jason - what are your thoughts here?
>
> I'm personally leaning toward just blocking all AMD fTPMs from the feature.
> With two separate issues showing up like this, it's not worth a hodgepodge
> of quirks and SMBIOS strings.

That seems a bit extreme, but we could be a bit more conservative and do the version gate on a more major version digit. And then inside AMD you can spread the word that they need to bump that version digit if they want the fTPM to work again.

IOW, if we're gonna disqualify a lot of them, let's make an obvious path back in that the engineers can follow.
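To make the "gate on a more major version digit" idea concrete, here is a rough Python sketch. It is not the patch attached in this bug. It assumes the "major digit" is the top 16 bits of the 64-bit version (the field the existing check already compares), and the cutoff `MIN_TRUSTED_MAJOR` is a made-up placeholder, not a value anyone in this discussion proposed.

```python
# Hypothetical cutoff for illustration only; NOT a value from this
# discussion or from the kernel.
MIN_TRUSTED_MAJOR = 7


def ftpm_rng_allowed(version: int, min_major: int = MIN_TRUSTED_MAJOR) -> bool:
    """Allow the fTPM as an hwrng source only from a given major version on.

    Assumption: the major field is the top 16 bits of the 64-bit version.
    """
    major = version >> 48
    return major >= min_major


# Every firmware below the cutoff is disqualified regardless of its minor
# fields; bumping the major field is the single documented way back in.
print(ftpm_rng_allowed(0x3005700020005))       # version reported in this bug
print(ftpm_rng_allowed(MIN_TRUSTED_MAJOR << 48))
```

The point of the design is that the rule is one comparison on one field, so firmware engineers have an unambiguous target instead of a list of per-platform quirks.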
That's a good point. Let me start a discussion internally on it before concluding though.

Something else irking me is whether this is possibly caused by using RDRAND at the same time as the fTPM RNG functionality. Is it possible that whatever app was triggering this could have been using that instruction (possibly heavily)?
Stupid question from a bystander (so maybe it's best to ignore this), as it's not clear to me from the mails and comments I saw (but maybe I missed something): did this work in earlier kernels at all? And if so: did it really start to happen with b006c439d58d, or is this maybe something different?
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #9)
> Stupid question from a bystander (so maybe it's best to ignore this), as
> it's not clear to me from the mails and comments I saw (but maybe I missed
> something): did this work in earlier kernels at all? And if so: did it really
> start to happen with b006c439d58d, or is this maybe something different?

No, this bug has been present for a long time. I have had it since I started using this machine. It was supposed to be fixed at some point recently, which is why I specified the kernel version.
(In reply to daniil.stas from comment #10)
> (In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from
> comment #9)
> > Stupid question from a bystander (so maybe it's best to ignore this), as
> > it's not clear to me from the mails and comments I saw (but maybe I missed
> > something): did this work in earlier kernels at all? And if so: did it
> > really start to happen with b006c439d58d, or is this maybe something
> > different?
>
> No, this bug has been present for a long time. I have had it since I started
> using this machine. It was supposed to be fixed at some point recently, which
> is why I specified the kernel version.

This is the patch that fixes a bug like this for some platforms, but it turns out it doesn't work in my case:
https://lore.kernel.org/lkml/20230214201955.7461-1-mario.limonciello@amd.com/
(In reply to daniil.stas from comment #11)
> (In reply to daniil.stas from comment #10)
> > (In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from
> > comment #9)
> > > Stupid question from a bystander (so maybe it's best to ignore this), as
> > > it's not clear to me from the mails and comments I saw (but maybe I
> > > missed something): did this work in earlier kernels at all? And if so:
> > > did it really start to happen with b006c439d58d, or is this maybe
> > > something different?
> >
> > No, this bug has been present for a long time. I have had it since I
> > started using this machine. It was supposed to be fixed at some point
> > recently, which is why I specified the kernel version.
>
> This is the patch that fixes a bug like this for some platforms, but it turns
> out it doesn't work in my case:
> https://lore.kernel.org/lkml/20230214201955.7461-1-mario.limonciello@amd.com/

I don't know whether it originally started with b006c439d58d.
@daniil: Can you share specific games you've seen this happen on? There are a lot of theories swirling around about this issue so reproducing it the exact same way you have would be ideal to understand it.
(In reply to Mario Limonciello (AMD) from comment #13)
> @daniil:
>
> Can you share specific games you've seen this happen on? There are a lot of
> theories swirling around about this issue so reproducing it the exact same
> way you have would be ideal to understand it.

I don't think it's connected to any specific games; I think it just happens randomly (a few times per day).
I've seen this issue happen while playing heavy games like Street Fighter 6 as well as light ones like Hedgewars. It can also happen while just watching youtube.
This issue is most noticeable when you look at a dynamic picture, which is probably why people more often say they see it while playing games or watching videos. It probably happens during other activities too, but I can't recall specific cases at the moment.
I have been using a kernel without the CONFIG_HW_RANDOM_TPM option for some time now.

I also use a disk encryption, maybe it uses RNG for some cryptographic purposes. The network activity can be using RNG too. Both disk and network are actively used in games and online videos, so these could also be what makes this issue happen more often.
> I've seen this issue happen while playing heavy games like Street Fighter 6
> as well as light ones like Hedgewars.

OK, so it's not very likely tied to any intense anti-cheat usage in the games.

> This issue is most noticeable when you look at a dynamic picture, which is
> probably why people more often say they see it while playing games or
> watching videos.

In your experience, do you only notice graphical stutter, or audio stutter as well?

> I also use a disk encryption, maybe it uses RNG for some cryptographic
> purposes.

That's potentially relevant here. Can you explain the disk encryption solution you use? Or anything else you store in the fTPM non-volatile storage?

Is it possible that you (or any software) were accessing another TPM function *besides* RNG when the issue occurred?

> The network activity can be using RNG too. Both disk and network are actively
> used in games and online videos, so these could also be what makes this
> issue happen more often.

Right; getrandom() is used by a TON of software. So anything that drains the kernel's entropy pool could cause it.

I wonder if maybe you can "easily" trigger it with a more definitive reproducer, like a python script reading /dev/random constantly while you're playing a youtube video. If so, that would help AMD reproduce it too.
> That's potentially relevant here.

In terms of the kernel rng, it likely isn't.

> getrandom() is used by a TON of software. So anything that drains the
> kernel's entropy pool could cause it.

That's not how the rng works.

> a python script reading /dev/random

Also unrelated. The way to repro is more likely reading from /dev/hwrng.
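A reproducer along those lines could look like the sketch below: it reads a few bytes from a device node at a fixed period and logs how long each read took. This is only an illustration, not something anyone in this thread ran; the function names `timed_read` and `poll` and the default sizes are assumptions, and reading /dev/hwrng normally requires root.

```python
import time
from typing import List, Tuple


def timed_read(path: str, nbytes: int) -> Tuple[bytes, float]:
    """Read nbytes from path (unbuffered) and return the data and elapsed seconds."""
    start = time.monotonic()
    with open(path, "rb", buffering=0) as dev:
        data = dev.read(nbytes)
    return data, time.monotonic() - start


def poll(path: str = "/dev/hwrng", nbytes: int = 16,
         period: float = 1.0, count: int = 600) -> List[float]:
    """Read from path `count` times, `period` seconds apart, logging each latency."""
    latencies = []
    for _ in range(count):
        data, elapsed = timed_read(path, nbytes)
        latencies.append(elapsed)
        stamp = time.strftime("%H:%M:%S")
        print(f"{stamp} read {len(data)} bytes in {elapsed * 1000:.1f} ms", flush=True)
        time.sleep(period)
    return latencies
```

Calling `poll()` as root while audio or video is playing gives a timestamped log that can be lined up against the moments a stutter is noticed.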
> > This issue is most noticeable when you look at a dynamic picture, which is
> > probably why people more often say they see it while playing games or
> > watching videos.
>
> In your experience, do you only notice graphical stutter, or audio stutter
> as well?

There are audio stutters too.
There is also a separate issue that causes video-only freezes if I enable hardware video acceleration for VP9 encoding, but that's most likely unrelated to this one.

> > I also use a disk encryption, maybe it uses RNG for some cryptographic
> > purposes.
>
> That's potentially relevant here. Can you explain the disk encryption
> solution you use? Or anything else you store in the fTPM non-volatile
> storage?
>
> Is it possible that you (or any software) were accessing another TPM function
> *besides* RNG when the issue occurred?

I use dm-crypt for encryption. I used this guide to set it up:
https://wiki.archlinux.org/title/Dm-crypt/Encrypting_an_entire_system#LVM_on_LUKS
I don't think it uses TPM directly.
I also tried disabling TPM support completely before (option CONFIG_TCG_TPM), and nothing broke on my system, so I don't think any application uses it.
> In terms of the kernel rng, it likely isn't.

The reason I'm asking is because of how the fTPM works: I'm wondering whether other requests going on at the same time could have led to this behavior.

> The way to repro is more likely reading from /dev/hwrng.

OK.

> I don't think it uses TPM directly.

Yeah, this won't matter for anything but boot time. This isn't the cause.
Created attachment 304752 [details] A script for logging stuttering events
Created attachment 304753 [details] stuttering_test.py script result sample
> I wonder if maybe you can "easily" trigger it with a more definitive
> reproducer, like a python script reading /dev/random constantly while you're
> playing a youtube video. If so, that would help AMD reproduce it too.

I tried to reproduce the issue by running this shell script while watching a youtube video:

> while true; do sudo head -c16 /dev/hwrng; sleep 1; done

I also made a python script that tries to detect and log the stuttering events. The script and the resulting log are in the attachments (stutter_test.py, stutter_test.log). The script seems to detect them well. Judging by the log, the events generally happen at around 10 minute intervals.

I tried changing the shell script to:

> while true; do sudo head -c32 /dev/hwrng; sleep 1; done

(reading twice as much), but it didn't noticeably affect the interval.

The stuttering events were detected by the python script even while the youtube video was paused.
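The attached script itself is not reproduced in this thread. As a rough illustration of one way such a detector could work, the sketch below sleeps for a short interval in a loop and records every iteration where the wall-clock time overshoots the requested sleep by more than a threshold. The function name `detect_stalls` and the default interval and threshold are assumptions, and whether a userspace sleep overrun lines up with the stutter described here is also an assumption.

```python
import time


def detect_stalls(duration, interval=0.01, threshold=0.1,
                  clock=time.monotonic, sleep=time.sleep):
    """Return (seconds_since_start, overshoot) for every short sleep that
    overran by more than `threshold` seconds during `duration` seconds."""
    stalls = []
    start = clock()
    last = start
    while last - start < duration:
        sleep(interval)
        now = clock()
        # Time spent beyond the requested sleep; large values suggest the
        # process (or the whole system) was stalled.
        overshoot = (now - last) - interval
        if overshoot > threshold:
            stalls.append((now - start, overshoot))
        last = now
    return stalls
```

Running `detect_stalls(3600)` alongside the /dev/hwrng read loop and printing the returned offsets would show whether stalls cluster at a regular interval. The `clock` and `sleep` parameters exist only so the loop can be exercised with fake timers.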
I am also experiencing this issue. I have been seeing random stutters once or twice a day almost every day on my HP Omen 16-n0000 laptop (Ryzen 7 6800H + Radeon RX 6650M) ever since I bought it (around the end of March, with kernel 6.1 I think), although I never realised that it had something to do with the fTPM until now.

Using the `while true; do sudo head -c16 /dev/hwrng; sleep 1; done` command I am able to reproduce the stutter. For me it also normally seems to happen around every 10 minutes, although I have also seen a couple of anomalies where it takes 5 or 15 minutes for the stutter to appear.

So far I have not seen any specific workload (other than the hwrng reading script) that triggers the stutter more than any other. I have seen it happen while doing anything from browsing the web, watching Twitch streams and/or YouTube videos, and reading/writing documents to playing Forza Horizon 5.

Currently my system is running kernel 6.4.7 on Arch Linux with the KDE Plasma desktop. I have been using a kernel with the patch attached in comment 6 applied for about two days now, and I have yet to see the stutter (although I did temporarily switch back to an unpatched kernel to test the hwrng read script).
Thanks guys!
FWIW, on my desktop Ryzen 2600 using the latest AGESA (ComboV2PI 1.2.0.A), I can't hear any crackling in audio playback, no matter the load.

So yea, there are probably still some dragons left over from the firmware team, but not in my specific case.
For reference, the mobo I'm running is https://rog.asus.com/it/motherboards/rog-strix/rog-strix-b450-f-gaming-model/helpdesk_bios/ and the BIOS is the latest one released.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=554b841d470338a3b1d6335b14ee1cd0c8f5d754