Bug 217719 - fTPM usage can cause stutters on some AMD platforms
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Other
Hardware: AMD Linux
Importance: P3 normal
Assignee: Mario Limonciello (AMD)
Reported: 2023-07-27 17:03 UTC by daniil.stas
Modified: 2023-08-08 00:12 UTC
CC List: 6 users

Regression: No


Attachments
dmesg outputs with and without CONFIG_HW_RANDOM_TPM option enabled (230.00 KB, application/gzip)
2023-07-27 17:03 UTC, daniil.stas
potential patch (v1) (2.16 KB, patch)
2023-07-27 20:19 UTC, Mario Limonciello (AMD)
A script for logging stuttering events (463 bytes, text/x-python)
2023-08-02 07:59 UTC, daniil.stas
stuttering_test.py script result sample (16.13 KB, text/plain)
2023-08-02 08:00 UTC, daniil.stas

Description daniil.stas 2023-07-27 17:03:40 UTC
Created attachment 304711
dmesg outputs with and without CONFIG_HW_RANDOM_TPM option enabled

I can reproduce the AMD fTPM stuttering bug with the 6.4.3 kernel on an Asus GA402RJ laptop.
One workaround that worked for me is to compile the kernel without the CONFIG_HW_RANDOM_TPM option.
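
To confirm which way a running kernel was built, a minimal Python sketch along these lines may help. It is not part of the original report. It assumes the kernel exposes its configuration at /proc/config.gz (which requires CONFIG_IKCONFIG_PROC), and the helper names option_state and read_running_config are made up for illustration.

```python
import gzip
import re


def option_state(config_text, option):
    """Return 'y', 'm', or 'n' for a CONFIG_ option in kernel config text."""
    # Enabled options appear as "CONFIG_FOO=y" (built in) or "CONFIG_FOO=m" (module).
    match = re.search(rf"^{re.escape(option)}=([ym])$", config_text, re.MULTILINE)
    if match:
        return match.group(1)
    # Disabled options appear as "# CONFIG_FOO is not set" or are absent entirely.
    return "n"


def read_running_config(path="/proc/config.gz"):
    """Read the running kernel's config (assumes CONFIG_IKCONFIG_PROC is enabled)."""
    with gzip.open(path, "rt") as handle:
        return handle.read()
```

With the workaround applied, option_state(read_running_config(), "CONFIG_HW_RANDOM_TPM") would be expected to return "n".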

My fTPM firmware version is 0x3005700020005. I found this by inserting this debug print line in the kernel:

> diff --git a/drivers/char/tpm/tpm-chip.c b/drivers/char/tpm/tpm-chip.c
>index cd48033b804a..c80bdd92360d 100644
>--- a/drivers/char/tpm/tpm-chip.c
>+++ b/drivers/char/tpm/tpm-chip.c
>@@ -550,6 +550,9 @@ static bool tpm_amd_is_rng_defective(struct tpm_chip *chip)
>                return false;
> 
>        version = ((u64)val1 << 32) | val2;
>+       dev_warn(&chip->dev,
>+                "AMD fTPM version = 0x%llx\n",
>+                version);
>        if ((version >> 48) == 6) {
>                if (version >= 0x0006000000180006ULL)
>                        return false;

I am also attaching my dmesg output from kernels compiled with and without the CONFIG_HW_RANDOM_TPM option.
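
As a reading aid for the version value above, here is a small Python sketch of the arithmetic in the quoted diff. The diff only shows the two 32-bit values being combined and the top 16 bits being compared against 6. Splitting the rest into three more 16-bit fields is an assumption made here for readability, and the function names are hypothetical.

```python
def combine_version(val1, val2):
    """Mirror of `((u64)val1 << 32) | val2` from the quoted diff."""
    return ((val1 & 0xFFFFFFFF) << 32) | (val2 & 0xFFFFFFFF)


def split_fields(version):
    """Split a 64-bit version into four 16-bit fields, most significant first.

    The 16-bit field layout is an assumption; the diff only uses `version >> 48`.
    """
    return tuple((version >> shift) & 0xFFFF for shift in (48, 32, 16, 0))


reported = 0x3005700020005  # value from the bug description
print([hex(field) for field in split_fields(reported)])  # ['0x3', '0x57', '0x2', '0x5']
print((reported >> 48) == 6)  # False
```

The top field of the reported value is 3, so the `(version >> 48) == 6` comparison visible in the diff context does not apply to this firmware.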
Comment 1 Mario Limonciello (AMD) 2023-07-27 19:53:24 UTC
> My fTPM firmware version is 0x3005700020005. I found this by inserting this
> debug print line in the kernel:

Thanks.

> ASUSTeK COMPUTER INC. ROG Zephyrus G14 GA402RJ_GA402RJ/GA402RJ, BIOS
> GA402RJ.318 03/09/2023

AFAICT this machine does have the fixed firmware and it's also quite modern.  You're the first person that has reported this on Rembrandt.
This may be a secondary issue.

What kind of workload reproduces the stutter?  Anything specific?
Comment 2 daniil.stas 2023-07-27 20:01:46 UTC
(In reply to Mario Limonciello (AMD) from comment #1)
> > My fTPM firmware version is 0x3005700020005. I found this by inserting this
> > debug print line in the kernel:
> 
> Thanks.
> 
> > ASUSTeK COMPUTER INC. ROG Zephyrus G14 GA402RJ_GA402RJ/GA402RJ, BIOS
> > GA402RJ.318 03/09/2023
> 
> AFAICT this machine does have the fixed firmware and it's also quite modern.
> You're the first person that has reported this on Rembrandt.
> This may be a secondary issue.
> 
> What kind of workload reproduces the stutter?  Anything specific?

No, it just happens randomly.
There are several people on the asusctl project Discord with the same laptop and the same issue. Some of them contacted Asus support and privately received a special BIOS that adds an option to disable fTPM in the settings.
Comment 3 Mario Limonciello (AMD) 2023-07-27 20:10:11 UTC
And presumably this special BIOS helps?  Does it switch from fTPM to Pluton, or does it completely disable TPM functionality?
Comment 4 daniil.stas 2023-07-27 20:12:08 UTC
(In reply to Mario Limonciello (AMD) from comment #3)
> And presumably this special BIOS helps?  Does it switch from fTPM to Pluton,
> or does it completely disable TPM functionality?

Yes, it fixes the stutters. I think it just disables TPM completely (which introduces some issues for Windows, so it's not optimal).
Comment 5 Mario Limonciello (AMD) 2023-07-27 20:13:24 UTC
Jason - what are your thoughts here?

I'm personally leaning toward just blocking all AMD fTPMs from the feature.  With two separate issues showing up like this, it's not worth a hodgepodge of quirks and SMBIOS strings.
Comment 6 Mario Limonciello (AMD) 2023-07-27 20:19:27 UTC
Created attachment 304718
potential patch (v1)

Here's a patch that should effectively disable it as well if that is the direction we go for this issue.
Comment 7 Jason A. Donenfeld 2023-07-27 20:53:50 UTC
(In reply to Mario Limonciello (AMD) from comment #5)
> Jason - what are your thoughts here?
> 
> I'm personally leaning toward just blocking all AMD fTPMs from the feature.
> With two separate issues showing up like this, it's not worth a hodgepodge
> of quirks and SMBIOS strings.

That seems a bit extreme, but we could be a bit more conservative and do the version gate on a more major version digit. And then inside AMD you can spread the word that they need to bump that version digit if they want the fTPM to work again. IOW, if we're gonna disqualify a lot of them, let's make an obvious path back in that the engineers can follow.
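
One possible reading of "gate on a more major version digit", sketched in Python rather than kernel C. This is not the attached patch. It assumes the "major" digit is the top 16 bits of the combined version (the field the existing check switches on), and both rng_allowed and min_major are hypothetical names with no agreed value behind them.

```python
def major_field(version):
    """Top 16 bits of the combined 64-bit firmware version."""
    return (version >> 48) & 0xFFFF


def rng_allowed(version, min_major):
    """Allow the fTPM RNG only at or above a (hypothetical) major version.

    Firmware below min_major is excluded wholesale; bumping the major
    field is then the obvious path back in.
    """
    return major_field(version) >= min_major
```

For example, rng_allowed(0x3005700020005, min_major=7) is False, with 7 chosen purely as a placeholder.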
Comment 8 Mario Limonciello (AMD) 2023-07-27 21:02:41 UTC
That's a good point. Let me start a discussion internally on it before concluding though.

Something else irking me is whether this is possibly caused by using RDRAND at the same time as the fTPM RNG functionality.

Is it possible that whatever app was triggering this could have been using that instruction (possibly heavily)?
Comment 9 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-07-28 12:09:59 UTC
Stupid question from a bystander (so maybe it's best to ignore this), as it's not clear to me from the mails and comments I saw (but maybe I missed something): did this work in earlier kernels at all? And if so, did it really start to happen with b006c439d58d, or is this maybe something different?
Comment 10 daniil.stas 2023-07-28 12:15:08 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #9)
> Stupid question from a bystander (so maybe it's best to ignore this), as
> it's not clear to me from the mails and comments I saw (but maybe I missed
> something): did this work in earlier kernels at all? And if so, did it really
> start to happen with b006c439d58d, or is this maybe something different?

No, this bug has been present for a long time. I have had it since I started using this machine. It was supposed to have been fixed recently, which is why I specified the kernel version.
Comment 11 daniil.stas 2023-07-28 12:18:31 UTC
(In reply to daniil.stas from comment #10)
> No, this bug has been present for a long time. I have had it since I started
> using this machine. It was supposed to have been fixed recently, which is
> why I specified the kernel version.

This is the patch that fixes a bug like this for some platforms, but turns out it doesn't work for my case:
https://lore.kernel.org/lkml/20230214201955.7461-1-mario.limonciello@amd.com/
Comment 12 daniil.stas 2023-07-28 12:23:17 UTC
(In reply to daniil.stas from comment #11)
> This is the patch that fixes a bug like this for some platforms, but turns
> out it doesn't work for my case:
> https://lore.kernel.org/lkml/20230214201955.7461-1-mario.limonciello@amd.com/

I don't know if it originally started with b006c439d58d.
Comment 13 Mario Limonciello (AMD) 2023-08-01 15:55:09 UTC
@daniil:

Can you share specific games you've seen this happen on?  There are a lot of theories swirling around about this issue so reproducing it the exact same way you have would be ideal to understand it.
Comment 14 daniil.stas 2023-08-01 17:11:24 UTC
(In reply to Mario Limonciello (AMD) from comment #13)
> @daniil:
> 
> Can you share specific games you've seen this happen on?  There are a lot of
> theories swirling around about this issue so reproducing it the exact same
> way you have would be ideal to understand it.

I don't think it's connected to any specific game; I think it just happens randomly (a few times per day).
I've seen this issue happen while playing heavy games like Street Fighter 6 and also light ones like Hedgewars.
It can also happen while just watching YouTube.

This issue is most noticeable when you are looking at a moving picture, which is probably why people more often say they see it while playing games or watching videos.

It probably happens during other activities too, but I can't recall specific cases at the moment.
I have been using a kernel without the CONFIG_HW_RANDOM_TPM option for some time now.

I also use disk encryption; maybe it uses the RNG for some cryptographic purposes. Network activity may use the RNG too. Both disk and network are actively used in games and online videos, so these could also be what makes this issue happen more often.
Comment 15 Mario Limonciello (AMD) 2023-08-01 17:18:49 UTC
> I've seen this issue happen while playing heavy games like Street Fighter 6
> and also light ones like Hedgewars.

OK so it's not very likely tied to any intense anti-cheat usage in the games.

> This issue is most noticeable when you are looking at a moving picture,
> which is probably why people more often say they see it while playing games
> or watching videos.

In your experience, do you notice only graphical stutter, or audio stutter as well?

> I also use disk encryption; maybe it uses the RNG for some cryptographic
> purposes. 

That's potentially relevant here.  Can you explain the disk encryption solution you use?  Or anything else you store to the fTPM non-volatile storage?

Is it possible that you (or any software) was accessing another TPM function *besides* RNG when the issue occurred?

> Network activity may use the RNG too. Both disk and network are actively
> used in games and online videos, so these could also be what makes this
> issue happen more often.

Right; getrandom() is used by a TON of software.  So anything that drains the kernel's entropy pool could cause it.

I wonder if maybe you can "easily" trigger it with a more definitive reproducer, like a Python script reading /dev/random constantly while you're playing a YouTube video.  If so, that could help AMD reproduce it too.
Comment 16 Jason A. Donenfeld 2023-08-01 17:29:10 UTC
> That's potentially relevant here.

In terms of kernel rng, likely isn't.

> getrandom() is used by a TON of software.  So anything that drains the
> kernel's entropy pool could cause it.

That's not how the rng works.

> a python script reading /dev/random 

Also unrelated.

The way to repro is more likely reading from /dev/hwrng.
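
A rough Python sketch of what reading from /dev/hwrng with timing could look like. This is an illustration only, not a confirmed reproducer. It assumes /dev/hwrng is present and readable (usually root only) and that the standard hw_random sysfs attribute rng_current is available; the function names are made up.

```python
import time


def current_rng(path="/sys/class/misc/hw_random/rng_current"):
    """Return the name of the hardware RNG currently backing /dev/hwrng."""
    with open(path) as handle:
        return handle.read().strip()


def timed_read(path="/dev/hwrng", nbytes=16):
    """Read up to nbytes from the device and return (data, elapsed_seconds)."""
    start = time.monotonic()
    # buffering=0 so Python requests nbytes instead of filling a larger buffer.
    with open(path, "rb", buffering=0) as device:
        data = device.read(nbytes)
    return data, time.monotonic() - start


def sample(count, delay=1.0, path="/dev/hwrng", nbytes=16):
    """Take count timed reads, pausing delay seconds between them."""
    timings = []
    for _ in range(count):
        _, elapsed = timed_read(path, nbytes)
        timings.append(elapsed)
        time.sleep(delay)
    return timings
```

Checking current_rng() first shows whether the TPM is actually the source behind /dev/hwrng on a given machine; unusually long entries in sample(...) would be the thing to correlate with a visible stutter.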
Comment 17 daniil.stas 2023-08-01 17:50:00 UTC
> > This issue is most noticeable when you are looking at a moving picture,
> > which is probably why people more often say they see it while playing games
> > or watching videos.
> 
> In your experience, do you notice only graphical stutter, or audio stutter as well?
> 


There are audio stutters too.
There is also a separate issue that causes video-only freezes if I enable hardware video acceleration for VP9 encoding, but that's most likely unrelated to this one.

> 
> > I also use disk encryption; maybe it uses the RNG for some cryptographic
> > purposes. 
> 
> That's potentially relevant here.  Can you explain the disk encryption
> solution you use?  Or anything else you store to the fTPM non-volatile
> storage?
> 
> Is it possible that you (or any software) was accessing another TPM function
> *besides* RNG when the issue occurred?
> 


I use dm-crypt for encryption; I used this guide to set it up: https://wiki.archlinux.org/title/Dm-crypt/Encrypting_an_entire_system#LVM_on_LUKS

I don't think it uses TPM directly.
I also tried disabling TPM support completely before (option CONFIG_TCG_TPM), and nothing broke on my system, so I don't think any application uses it.
Comment 18 Mario Limonciello (AMD) 2023-08-01 17:54:54 UTC
> In terms of kernel rng, likely isn't.

The reason I'm asking is how the fTPM works: I'm wondering whether other requests going on at the same time could have led to this behavior.

> The way to repro is more likely reading from /dev/hwrng.

OK.

> I don't think it uses TPM directly.

Yeah, this won't matter for anything but boot time.  This isn't the cause.
Comment 19 daniil.stas 2023-08-02 07:59:30 UTC
Created attachment 304752
A script for logging stuttering events
Comment 20 daniil.stas 2023-08-02 08:00:50 UTC
Created attachment 304753
stuttering_test.py script result sample
Comment 21 daniil.stas 2023-08-02 08:02:03 UTC
> I wonder if maybe you can "easily" trigger it with a more definitive
> reproducer, like a Python script reading /dev/random constantly while you're
> playing a YouTube video.  If so, that could help AMD reproduce it too.

I tried to reproduce the issue by using this shell script while watching a YouTube video:

> while true; do sudo head -c16 /dev/hwrng; sleep 1; done

I also made a Python script that tries to detect and log the stuttering events.
The script and the resulting log are in the attachments (stutter_test.py, stutter_test.log).

It looks like the script detects them well. Judging by the log, the events generally happen at intervals of around 10 minutes.

I tried changing the shell script to:

> while true; do sudo head -c32 /dev/hwrng; sleep 1; done

(reading twice as much), but it didn't noticeably affect the interval.

The stuttering events were detected by the Python script even at times when the YouTube video was paused.
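
The attached script is not reproduced in this thread, so the following is only a guess at one way such a detector could work, not the attachment itself. It assumes a stutter shows up as a short sleep waking much later than requested; the names find_stalls and watch and the default thresholds are hypothetical.

```python
import time
from datetime import datetime


def find_stalls(timestamps, interval, threshold):
    """Return (time, extra_delay) pairs where consecutive samples were
    further apart than interval + threshold seconds."""
    stalls = []
    for previous, current in zip(timestamps, timestamps[1:]):
        extra = (current - previous) - interval
        if extra > threshold:
            stalls.append((current, extra))
    return stalls


def watch(interval=0.01, threshold=0.1):
    """Loop forever, printing a line whenever a short sleep overruns."""
    previous = time.monotonic()
    while True:
        time.sleep(interval)
        now = time.monotonic()
        extra = (now - previous) - interval
        if extra > threshold:
            stamp = datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} stall of {extra:.3f}s")
        previous = now
```

Running watch() alongside the hwrng read loop and comparing the printed wall-clock times against perceived stutters is one way to check whether the two line up.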
Comment 22 Prajna Sariputra 2023-08-02 10:25:18 UTC
I am also experiencing this issue. I have been seeing random stutters once or twice a day, almost every day, on my HP Omen 16-n0000 laptop (Ryzen 7 6800H + Radeon RX 6650M) ever since I bought it (around the end of March, with kernel 6.1 I think), although I never realised that it had something to do with the fTPM until now.

Using the `while true; do sudo head -c16 /dev/hwrng; sleep 1; done` command I am able to reproduce the stutter. For me it also normally seems to happen around every 10 minutes or so, although I have seen a couple of anomalies where it took 5 or 15 minutes for the stutter to appear.

So far I have not seen any specific workload (other than the hwrng reading script) that triggers the stutter more than any other. I have seen it happen while doing anything from browsing the web, watching Twitch streams and/or YouTube videos, and reading/writing documents to playing Forza Horizon 5.

Currently my system is running kernel 6.4.7 on Arch Linux with the KDE Plasma desktop. I have been using a kernel with the patch attached in comment 6 applied for about two days now, and I have yet to see the stutter (although I did temporarily switch back to an unpatched kernel to test the hwrng read script).
Comment 23 Mario Limonciello (AMD) 2023-08-02 12:07:49 UTC
Thanks guys!
Comment 24 Marco 2023-08-03 09:40:17 UTC
FWIW, on my desktop Ryzen 2600 using the latest AGESA (ComboV2PI 1.2.0.A), I can't hear any crackling in audio playback, no matter the load. So yeah, there are probably still some dragons left over from the firmware team, but not in my specific case.
Comment 25 Marco 2023-08-03 09:48:33 UTC
For reference, the mobo I'm running is https://rog.asus.com/it/motherboards/rog-strix/rog-strix-b450-f-gaming-model/helpdesk_bios/ and the BIOS is the latest one released.
