Bug 216989 - Kernels after 6.1 experience AMD Ryzen fTPM stutter
Summary: Kernels after 6.1 experience AMD Ryzen fTPM stutter
Status: RESOLVED CODE_FIX
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: AMD Linux
Importance: P1 normal
Assignee: Mario Limonciello (AMD)
URL:
Keywords:
Duplicates: 217122
Depends on:
Blocks:
 
Reported: 2023-02-02 02:49 UTC by reach622
Modified: 2023-05-12 17:00 UTC
CC List: 16 users

See Also:
Kernel Version: 6.1.8
Subsystem:
Regression: No
Bisected commit-id:


Attachments
tpm2_getcap on ASUS_Zephyrys_G15_6800HS (1.69 KB, text/plain)
2023-02-09 05:51 UTC, Bell
tpm2_getcap (ASUS Strix G15 AMD Advantage Edition) (1.66 KB, text/plain)
2023-02-09 14:40 UTC, Bradley Fourie
potential patch (7.06 KB, patch)
2023-02-10 03:39 UTC, Mario Limonciello (AMD)
tpm2_getcap properties-fixed (1.66 KB, text/plain)
2023-02-11 14:40 UTC, Balazs Vinarz
tpm2_getcap (ACER nitro 5 AN515-45 bios v1.10) (1.66 KB, text/plain)
2023-02-13 20:37 UTC, Tom Englund
potential patch (v2) (7.14 KB, application/mbox)
2023-02-13 21:02 UTC, Mario Limonciello (AMD)
potential patch (v3) (7.37 KB, patch)
2023-02-14 13:23 UTC, Jason A. Donenfeld
potential patch (v3) (7.37 KB, patch)
2023-02-14 13:25 UTC, Jason A. Donenfeld
tpm2_getcap properties-fixed (LENOVO IdeaPad 5 Pro 14ACN6) (1.75 KB, text/plain)
2023-02-28 18:45 UTC, Branko Grubić

Description reach622 2023-02-02 02:49:48 UTC
Linux kernel >=6.1 exhibits a stuttering issue that occurs once every few hours. See https://www.reddit.com/r/archlinux/comments/zvgev0/audio_stuttering_issues_with_kernel_611/ https://www.reddit.com/r/linux_gaming/comments/zzqaf7/having_intermittent_stutters_with_a_ryzen_cpu/ https://bbs.archlinux.org/viewtopic.php?id=282333 for detailed information.

The stutter lasts 1-2 seconds, dramatically lowers the display frame rate, and causes bursts in the audio output.

Additional info:

* linux 6.1.0 or later

Steps to reproduce:

* Use Linux kernel >=6.1

* Use AMD Ryzen CPU with fTPM enabled

* Wait for a few hours
Comment 1 Bell 2023-02-02 03:33:24 UTC
Hey, let me add some extra information to help.
1. This issue can happen in 6.2-rc6 without loading third-party kernel modules (NVIDIA, VirtualBox, and so on).
2. Some desktop/laptop users who were able to disable fTPM did so, and that eliminated the problem.
3. This problem can happen on newer AMD processors, from the 4000 series to the 6000 series.
4. I guess this problem isn't caused by the dedicated graphics card; here are some combinations where the stuttering can happen:
AMD(built-in GPU) + NVIDIA  Laptop
AMD(No built-in GPU) + AMD(dedicated) Desktop
AMD(built-in GPU) + AMD(dedicated) Laptop/Desktop
AMD + AMD(Built-in GPU only) Laptop
all suffer from this.

Hope this can help :)
Comment 2 Mario Limonciello (AMD) 2023-02-02 19:14:44 UTC
Was this stutter introduced for you in kernel 6.1?  Can you please bisect it?
Comment 3 Artem S. Tashkinov 2023-02-02 20:02:51 UTC
Kernel hackers rarely game so your best bet is bisecting.
Comment 4 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-02-03 07:10:35 UTC
(In reply to Artem S. Tashkinov from comment #3)
> Kernel hackers rarely game so your best bet is bisecting.

FWIW, it's *totally fine* to report a regression without bisecting it, as sometimes the maintainers or someone else will know about the problem and the culprit already (or might have an idea what might cause it). But if that's not the case, the duty to find the culprit in the end is the reporter's, as explained in https://docs.kernel.org/admin-guide/reporting-regressions.html
Comment 5 Borislav Petkov 2023-02-03 16:57:36 UTC
I guess no need to bisect:

https://www.amd.com/en/support/kb/faq/pa-410


but look for a BIOS update...
Comment 6 reach622 2023-02-03 23:39:09 UTC
Many ASUS laptops (G15) do not have a recent BIOS update that fixes the issue.
Comment 7 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-02-04 08:21:56 UTC
(In reply to Borislav Petkov from comment #5)
> I guess no need to bisect:
> https://www.amd.com/en/support/kb/faq/pa-410

Well, as mentioned in 
https://lore.kernel.org/all/3a196414-68d8-29c9-24cc-2b8cb4c9d358@leemhuis.info/
this seems to be something that didn't happen in 6.0, so this afaics still makes it a regression.
Comment 8 Bell 2023-02-04 11:21:12 UTC
set CONFIG_HW_RANDOM_TPM=n in the build config does avoid this lag.
So I guess most of the manufacturers didn't follow AMD instructions.
Well......Now we need to check what CONFIG_HW_RANDOM_TPM does and how 6.0.x->6.1.x affect this.
Comment 9 Bell 2023-02-05 07:16:20 UTC
(In reply to Bell from comment #8)
> set CONFIG_HW_RANDOM_TPM=n in the build config does avoid this lag.
> So I guess most of the manufacturers didn't follow AMD instructions.
> Well......Now we need to check what CONFIG_HW_RANDOM_TPM does and how
> 6.0.x->6.1.x affect this.

ok, I found a way to trigger this bug in kernel 6.0.x.
Run "sudo cat /dev/hwrng > /dev/null" for around 5-15 minutes, and there you go.
So there must be something that keeps calling the hardware random number generator in 6.1.x.
As time goes on, at a certain point, minor errors stack up and lead to another error (some overflow, I guess?).
Plus, if you use a simple Rust monitoring program (https://bugs.archlinux.org/task/77340), you will notice timeout errors in 6.1.x even when you do nothing. However, this isn't the case in 6.0.x as long as you are not calling hwrng.
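
A minimal sketch of the manual trigger described above, with a latency probe added. It assumes that a stall shows up as an unusually slow read from /dev/hwrng, which the thread does not confirm. The function name, chunk size, and 0.5 s threshold are illustrative and are not taken from the linked monitoring program.

```python
import time


def find_slow_reads(stream, chunk_size=64, max_reads=1000, threshold_s=0.5,
                    clock=time.monotonic):
    """Read `stream` in chunks and return (index, seconds) for slow reads.

    `threshold_s` is an arbitrary illustrative cutoff, not a value taken
    from the bug report.
    """
    slow = []
    for index in range(max_reads):
        start = clock()
        data = stream.read(chunk_size)
        elapsed = clock() - start
        if not data:  # end of stream; only expected with test inputs
            break
        if elapsed >= threshold_s:
            slow.append((index, elapsed))
    return slow


if __name__ == "__main__":
    # Needs root and an exposed /dev/hwrng; buffering=0 avoids read-ahead.
    with open("/dev/hwrng", "rb", buffering=0) as hwrng:
        for index, elapsed in find_slow_reads(hwrng):
            print(f"read {index} took {elapsed:.3f}s")
```

The clock is injectable so the probe can be exercised against an in-memory stream without real hardware.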
Comment 10 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-02-05 08:03:32 UTC
(In reply to Bell from comment #9)

> ok, I found a way to trigger this bug in kernel 6.0.x.

In that case, could you try to bisect this? Maybe starting by trying before and after the two big random merges during the merge window of 6.1 might be a good idea, but OTOH it also could be a bad idea.
Comment 11 Bell 2023-02-05 10:21:31 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #10)
> (In reply to Bell from comment #9)
> 
> > ok, I found a way to trigger this bug in kernel 6.0.x.
> 
> In that case, could you try to bisect this? Maybe starting by trying before
> and after the two big random merges during the merge window of 6.1 might be
> a good idea, but OTOH it also could be a bad idea.

Oh, I started a git bisect 2 days ago, from v6.0.12 to v6.1,
and I just got a very promising result.
------------------------------------------------------------------------------
b006c439d58db625318bf2207feabf847510a8a6 is the first bad commit
commit b006c439d58db625318bf2207feabf847510a8a6
Author: Dominik Brodowski <linux@dominikbrodowski.net>
Date:   Thu Sep 22 15:59:31 2022 +0200

    hwrng: core - start hwrng kthread also for untrusted sources
    
    Start the hwrng kthread even if the hwrng source has a quality setting
    of zero. Then, every crng reseed interval, one batch of data from this
    zero-quality hwrng source will be mixed into the CRNG pool.
    
    This patch is based on the assumption that data from a hwrng source
    will not actively harm the CRNG state. Instead, many hwrng sources
    (such as TPM devices), even though they are assigend a quality level of
    zero, actually provide some entropy, which is good enough to mix into
    the CRNG pool every once in a while.
    
    Cc: Herbert Xu <herbert@gondor.apana.org.au>
    Cc: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
    Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

 drivers/char/hw_random/core.c | 36 ++++++++++--------------------------
 1 file changed, 10 insertions(+), 26 deletions(-)
------------------------------------------------------------------------------
I am gonna change it in the 6.2-rc6 and see if I can fix it.
Comment 12 Bell 2023-02-05 11:04:46 UTC
I commented out the code like this in drivers/char/hw_random/core.c in 6.2-rc6,
just as a test.
---------------------------------------------
/* if necessary, start hwrng thread */
/*
if (!hwrng_fill) {
	hwrng_fill = kthread_run(hwrng_fillfn, NULL, "hwrng");
	if (IS_ERR(hwrng_fill)) {
		pr_err("hwrng_fill thread creation failed\n");
		hwrng_fill = NULL;
	}
}
*/
---------------------------------------------
And the problem disappears.
So that's pretty much all I can do now --- Locate the problem.
I am not familiar with kernel development and how the code exactly works.
But if you want me to help test, feel free to send me a message :)
Comment 13 Bell 2023-02-05 11:17:53 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #10)
> (In reply to Bell from comment #9)
> 
> > ok, I found a way to trigger this bug in kernel 6.0.x.
> 
> In that case, could you try to bisect this? Maybe starting by trying before
> and after the two big random merges during the merge window of 6.1 might be
> a good idea, but OTOH it also could be a bad idea.

I'm sorry, I just realized that I misunderstood what you said.
You mean to find out why multiple timeouts trigger this bug.
It's not a good idea to regression.
Please give me some more time. I'm ready to bisect from 5.19.
Sorry again for my misunderstanding.
Comment 14 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-02-05 11:23:45 UTC
(In reply to Bell from comment #13)
> I'm sorry, I just realized that I misunderstood what you said.

Whatever, what you did afaics is (nearly all that you did), as if I understood you right, the problem is caused by b006c439d58db625318bf2207feabf847510a8a6. But one more detail might be helpful: does that commit revert clearly in latest mainline and does the problem disappear then? What you wrote in comment 12 indicates that is likely, but testing if a complete revert does the trick might be helpful.

thx for you work,
Comment 15 Bell 2023-02-05 12:09:58 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #14)
> (In reply to Bell from comment #13)
> > I'm sorry, I just realized that I misunderstood what you said.
> 
> Whatever, what you did afaics is (nearly all that you did), as if I
> understood you right, the problem is caused by
> b006c439d58db625318bf2207feabf847510a8a6. But one more detail might be
> helpful: does that commit revert clearly in latest mainline and does the
> problem disappear then? What you wrote in comment 12 indicates that is
> likely, but testing if a complete revert does the trick might be helpful.
> 
> thx for you work,

No, the test shows the problem is still there.
I think "start hwrng kthread for untrusted sources" is only the surface reason.
As I said before, if you manually read /dev/hwrng for around 5-15 minutes, you can trigger it from 6.0.x to 6.2-rc2 too.
So we can "fix" it by commenting out the code that starts the hwrng kthread.
But that is not really a "fix"; it just avoids constantly calling /dev/hwrng in the kthread.
What I really want to figure out now is:
1. Why can calling hwrng too many times lead to a problem?
2. Is this problem introduced by the kernel, or is it a hardware design flaw that can't be fixed?
I will do some further tests and bisect to find out. If this is a design flaw, well, then we can add a condition when it detects AMD CPU then not start hwrng thread.
I am new to kernel development, thank you very much!
Comment 16 Bell 2023-02-05 14:22:47 UTC
(In reply to Bell from comment #15)
Unfortunately, the oldest kernel I can run on my laptop is 5.15.91-LTS, anything below that will get a kernel panic in the boot process. But I think this is enough for the test.
In 5.15.91, the stuttering still happens when I call the /dev/hwrng. It is still possibly a software bug, but I tend to assume this is a hardware bug.
So what can we do now? Send an e-mail to both AMD engineers and Dominik? (The author of commit b006c439d58db625318bf2207feabf847510a8a6)
Comment 17 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-02-05 14:31:17 UTC
(In reply to Bell from comment #16)
> Send an e-mail to both AMD engineers

Won't likely change anything, unless the firmware update Boris pointed to doesn't help.

> and Dominik?
> (The author commit b006c439d58db625318bf2207feabf847510a8a6)

I'll take care of that, but give me a minute please
Comment 18 Echo J. 2023-02-05 16:08:20 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #17)
> (In reply to Bell from comment #16)
> Won't likely change anything, unless the firmware update Boris pointed to
> doesn't help.

The issue is that some PCs (especially laptops) didn't get that update at all

So there needs to be some sort of workaround for the affected PCs
Comment 19 Bell 2023-02-05 16:22:46 UTC
(In reply to Echo J. from comment #18)
> The issue is that some PCs (especially laptops) didn't get that update at all
> 
> So there needs to be some sort of workaround for the affected PCs
That is why I sent an e-mail to ASUS to ask about an fTPM update. Most manufacturers like to give their consumer line-up a middle finger, you know. I never expected them to follow AMD's instructions and update the BIOS.
For now, let's just see if we can fix it with kernel patches rather than BIOS update.
Comment 20 Tom Englund 2023-02-08 11:46:31 UTC
For what it's worth, it's affecting most Acer laptops as well. I've tried contacting them through Reddit, their community forums, and email. The furthest I got was through email, with conversations going back and forth, which eventually ended in the person recommending that I contact the store I bought it from and RMA it. So unless something magical happens and the manufacturers start caring and actually update their BIOSes, quite a lot of Ryzen laptops will face this if they have fTPM/TPM enabled.
Comment 21 Jason A. Donenfeld 2023-02-08 16:03:35 UTC
If you don't have a BIOS update available yet, and your TPM is thus problematic, you should be able to work around this with something like this (depending on which modules you're using; `lsmod|grep tpm` will help):

    # echo blacklist tpm > /etc/modprobe.d/disable-tpm.conf
    # echo blacklist tpm_crb >> /etc/modprobe.d/disable-tpm.conf
    # echo blacklist tpm_tis >> /etc/modprobe.d/disable-tpm.conf
    # echo blacklist tpm_tis_core >> /etc/modprobe.d/disable-tpm.conf

Alternatively, disable the buggy "AMD Platform Security Processor" in your laptop's BIOS menu.
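
The module check can be done before writing the blacklist file. A hedged sketch: it parses text in the format of /proc/modules (the file lsmod reads) and emits blacklist lines only for the four module names listed above. The helper names are hypothetical, and a TPM driver built into the kernel will not show up in that list at all, in which case blacklisting has no effect.

```python
# Module names taken from the workaround above.
TPM_MODULES = ("tpm", "tpm_crb", "tpm_tis", "tpm_tis_core")


def loaded_tpm_modules(proc_modules_text):
    """Return the TPM-related module names present in /proc/modules text."""
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in TPM_MODULES if name in loaded]


def blacklist_conf(module_names):
    """Build the contents of a modprobe.d blacklist file."""
    return "".join(f"blacklist {name}\n" for name in module_names)


if __name__ == "__main__":
    with open("/proc/modules") as handle:
        print(blacklist_conf(loaded_tpm_modules(handle.read())), end="")
```

The output is only printed; writing it to /etc/modprobe.d/disable-tpm.conf is left to the user.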
Comment 22 Mario Limonciello (AMD) 2023-02-09 03:04:29 UTC
> then we can add a condition when it detects AMD CPU then not start hwrng
> thread.

Could those affected please share the values for `tpm2_getcap properties-fixed` from their system?  Also please share the APU in your system and your BIOS version.

I'd like to compare the details against the advisory linked above to make sure we're looking at the same issue manifesting in Linux.

> For now, let's just see if we can fix it with kernel patches rather than BIOS
> update.

As long as it matches my expectations perhaps we can use fixed properties (tpm2_get_tpm_pt) to detect it and then use that as a gate to starting the hwrng thread.
Comment 23 Jason A. Donenfeld 2023-02-09 03:26:46 UTC
> As long as it matches my expectations perhaps we can use fixed properties
> (tpm2_get_tpm_pt) to detect it and then use that as a gate to starting the
> hwrng thread.

More precisely, you'd gate registering a hwrng device at all, so that no thread is started for it.
Comment 24 Bell 2023-02-09 05:51:13 UTC
Created attachment 303706 [details]
tpm2_getcap on ASUS_Zephyrys_G15_6800HS
Comment 25 Bell 2023-02-09 05:55:21 UTC
(In reply to Mario Limonciello (AMD) from comment #22)
> > then we can add a condition when it detects AMD CPU then not start hwrng
> > thread.
> 
> Could those affected please share the values for `tpm2_getcap
> properties-fixed` from their system?  Also please share the APU in your
> system and your BIOS version.
> 
> I'd like to compare the details against the advisory linked above to make
> sure we're looking at the same issue manifesting in Linux.
> 
> > For now, let's just see if we can fix it with kernel patches rather than
> BIOS
> > update.
> 
> As long as it matches my expectations perhaps we can use fixed properties
> (tpm2_get_tpm_pt) to detect it and then use that as a gate to starting the
> hwrng thread.

ok, I added an attachment that includes the information you need.
I use the latest 311 BIOS.
I also try to find out what is the earliest kernel that we can trigger stutter manually. Seems this problem appears between 4.19.5 to 5.4. Anyway, that is another topic, I will keep on working on it.
Thank you very much.
Comment 26 Jason A. Donenfeld 2023-02-09 14:40:06 UTC
From your file:

TPM2_PT_MANUFACTURER:
  raw: 0x414D4400
  value: "AMD"
TPM2_PT_VENDOR_STRING_1:
  raw: 0x414D4400
  value: "AMD"
TPM2_PT_FIRMWARE_VERSION_1:
  raw: 0x3004E
TPM2_PT_FIRMWARE_VERSION_2:
  raw: 0x20005

I wonder if those are useful selectors?
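
To compare reports from several machines, the fields quoted above can be pulled out of the `tpm2_getcap properties-fixed` text. This is a rough sketch that assumes the output keeps the layout shown in this comment (a `TPM2_PT_*:` header line followed by an indented `raw:` line); the function names are made up for illustration.

```python
import re

_HEADER = re.compile(r"^(TPM2_PT_\w+):\s*$")
_RAW = re.compile(r"^\s+raw:\s*(0x[0-9A-Fa-f]+)\s*$")


def parse_raw_properties(getcap_text):
    """Map each TPM2_PT_* property name to its integer `raw` value."""
    props = {}
    current = None
    for line in getcap_text.splitlines():
        header = _HEADER.match(line)
        if header:
            current = header.group(1)
            continue
        raw = _RAW.match(line)
        if raw and current is not None:
            props[current] = int(raw.group(1), 16)
    return props


def decode_manufacturer(raw_value):
    """Turn a 4-byte value such as 0x414D4400 into its ASCII form."""
    return raw_value.to_bytes(4, "big").rstrip(b"\x00").decode("ascii")
```

With the values above, the manufacturer decodes to "AMD" and the two firmware version fields come back as 0x3004E and 0x20005.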
Comment 27 Bradley Fourie 2023-02-09 14:40:16 UTC
Created attachment 303707 [details]
tpm2_getcap (ASUS Strix G15 AMD Advantage Edition)

tpm2_getcap on ASUS G513QY (Ryzen 9 5900HX) with latest BIOS (320)
Comment 28 Mario Limonciello (AMD) 2023-02-09 14:57:30 UTC
> I wonder if those are useful selectors?

Yes; those are exactly the ones I was going to suggest using as well.

As you're more familiar in this area, do you want to start a patch?

I'll confirm the correct values to use for TPM2_PT_FIRMWARE_VERSION_1/TPM2_PT_FIRMWARE_VERSION_2 and get back with those later.
Comment 29 Jason A. Donenfeld 2023-02-09 15:03:00 UTC
Seems like Bradley has a much lower VERSION_2, so I would imagine VERSION_1 is the cut off we care about.

I can try to start a patch, but I don't actually know much about tpm subsystem, just rng stuff (I maintain random.c).
Comment 30 Mario Limonciello (AMD) 2023-02-09 15:06:38 UTC
> Seems like Bradley has a much lower VERSION_2, so I would imagine VERSION_1
> is the cut off we care about.

They're concatenated together.

So if that was the right version to go off of it would be
0x0003004E 0x00020005 (or generally viewed in tools like fwupd as "3.4e.2.5")

> I can try to start a patch, but I don't actually know much about tpm
> subsystem, just rng stuff (I maintain random.c).

I guess it could be right where the TPM gets registered with rng that such a check like this makes sense.
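
The dotted form can be reproduced by splitting each 32-bit field into two 16-bit halves and printing them in hex. This is a sketch inferred from the single example in this comment; whether fwupd formats every version exactly this way is an assumption, and the function name is hypothetical.

```python
def ftpm_version_string(version_1, version_2):
    """Format the two firmware version fields as four dotted hex parts."""
    parts = (
        version_1 >> 16,
        version_1 & 0xFFFF,
        version_2 >> 16,
        version_2 & 0xFFFF,
    )
    return ".".join(f"{part:x}" for part in parts)


if __name__ == "__main__":
    print(ftpm_version_string(0x0003004E, 0x00020005))
```

For the values 0x0003004E and 0x00020005 this yields "3.4e.2.5", matching the string quoted above.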
Comment 31 Tom Englund 2023-02-09 15:09:37 UTC
Sorry, a bit unsure how to "attach" something here, but this is the getcap on my Acer with a 5800H, http://ix.io/4nyQ, which is affected as well.
Comment 32 Jason A. Donenfeld 2023-02-09 15:32:28 UTC
@Mario - okay I sent a little draft patch for you to tweak and test and insert the right numbers. I just made up some garbage.

https://lore.kernel.org/lkml/20230209153120.261904-1-Jason@zx2c4.com/
Comment 33 Mario Limonciello (AMD) 2023-02-10 03:39:29 UTC
Created attachment 303709 [details]
potential patch

Much appreciated.  I'll attach an updated patch with the versions I expect should avoid triggering this.

This sure seems like the same issue as the advisory, but I won't be submitting it to the mailing lists until I can confirm that.  Feel free to use the patch in the mean time.
Comment 34 Bell 2023-02-10 04:11:59 UTC
(In reply to Mario Limonciello (AMD) from comment #33)
> Created attachment 303709 [details]
> potential patch
> 
> Much appreciated.  I'll attach an updated patch with the versions I expect
> should avoid triggering this.
Thank you and Jason very much for your work.
I tested it on 6.2-rc7 and it did solve the problem. Hope this patch can go in ASAP.
Comment 35 Balazs Vinarz 2023-02-11 14:10:55 UTC
I'm having pretty much the same problem with my Asus Pro B550C/CSM@5700G.
Do any of you still need the output of tpm2_getcap?
I have been using a pretty old BIOS version (2407) with AGESA 1.2.0.3B (IIRC), and I saw a changelog entry about possibly fixing the stuttering caused by fTPM, but 2804 didn't ship the fix either.
Comment 36 Mario Limonciello (AMD) 2023-02-11 14:26:41 UTC
If your BIOS changelog claims it was fixed but you're still affected, then yes, I would like to see what the TPM firmware version claims.
Comment 37 Balazs Vinarz 2023-02-11 14:40:27 UTC
Created attachment 303715 [details]
tpm2_getcap properties-fixed

Added
Comment 38 Mario Limonciello (AMD) 2023-02-11 14:59:56 UTC
Thanks, that version doesn't have the stuttering fix from the Windows advisory. The "potential patch" should help you.
Comment 39 Mario Limonciello (AMD) 2023-02-13 19:35:05 UTC
Just want to update where this is..

I've confirmed at least the firmware versions mentioned in the patch are not causing stuttering in audio playback when doing this for 15 minutes:

# sudo cat /dev/hwrng > /dev/null

This was on one of AMD's reference designs.  However, I don't yet have the inverse confirmation of the failure on the same hardware (without the patch on the same hardware but older fTPM).  Anti-rollback protection means reproducing the failure is a lot more difficult.

Will continue to try to reproduce the failure but I'm "leaning" towards submitting the patch "as is".

If it turns out to be a different root cause than the advisory, then another option is to drop the version check part of it entirely and forbid all AMD fTPM's from RNG.
Comment 40 Jason A. Donenfeld 2023-02-13 19:42:28 UTC
One of these might help https://www.amazon.com/s?k=eeprom+clip+programmer for loading an old firmware (unless there are efuses or something).
Comment 41 Mario Limonciello (AMD) 2023-02-13 19:44:24 UTC
Thanks, but the efuses are blown for rollback protection.  An older firmware will not be able to boot even when flashed from a hardware SPI programmer.
Comment 42 Tom Englund 2023-02-13 20:37:58 UTC
Created attachment 303720 [details]
tpm2_getcap (ACER nitro 5 AN515-45 bios v1.10)

attaching tpm2_getcap from this acer, i can reliably reproduce this with

"sudo cat /dev/hwrng > /dev/null"

in 2 minutes of running or so.

With the patch applied to 6.1.11, it still occurs. But will this only fix this one cause of the stutter? Won't other things, like systemd using tpm2 for encryption, boot time measuring etc., cause it as well eventually? or is my versions not included in that patch hm.
Comment 43 Mario Limonciello (AMD) 2023-02-13 20:42:27 UTC
> TPM2_PT_FIRMWARE_VERSION_1:
>  raw: 0x30039
> TPM2_PT_FIRMWARE_VERSION_2:
>  raw: 0x5

The patch /should/ be disabling fTPM use for RNG on your system.

> > or is my versions not included in that patch hm.

Your system should have shown this message in the kernel log when the patch was applied:

"AMD fTPM version 0x%llx causes system stutter and "
"will not be used for random number generation\n",
((u64)val1 << 32) | val2);

Did you see that message?  It's possible I made a mistake in the patch too.
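
To know what to grep for in dmesg, the printed value can be worked out from the two quoted fields. A small sketch, assuming the message is formatted exactly as in the snippet above, with the two 32-bit values joined into one 64-bit number; the helper names are hypothetical.

```python
def combined_version(val1, val2):
    """Join two 32-bit fields the way ((u64)val1 << 32) | val2 does."""
    return ((val1 & 0xFFFFFFFF) << 32) | (val2 & 0xFFFFFFFF)


def expected_warning(val1, val2):
    """Build the text of the warning quoted above for a given version."""
    return (
        "AMD fTPM version 0x%x causes system stutter and "
        "will not be used for random number generation"
        % combined_version(val1, val2)
    )


if __name__ == "__main__":
    # Values quoted in this comment: VERSION_1 = 0x30039, VERSION_2 = 0x5.
    print(expected_warning(0x30039, 0x5))
```

For 0x30039 and 0x5 the version part of the message comes out as 0x3003900000005.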
Comment 44 Tom Englund 2023-02-13 20:48:00 UTC
(In reply to Mario Limonciello (AMD) from comment #43)
> > TPM2_PT_FIRMWARE_VERSION_1:
> >  raw: 0x30039
> > TPM2_PT_FIRMWARE_VERSION_2:
> >  raw: 0x5
> 
> The patch /should/ be disabling fTPM use for RNG on your system.
> 
> > > or is my versions not included in that patch hm.
> 
> Your system should have shown this message in the kernel log when the patch
> was applied:
> 
> "AMD fTPM version 0x%llx causes system stutter and "
> "will not be used for random number generation\n",
> ((u64)val1 << 32) | val2);
> 
> Did you see that message?  It's possible I made a mistake in the patch too.

Hm, nothing in dmesg about that. I'm fairly sure the patch got applied. I'll see if I messed something up in patching it.
Comment 45 Mario Limonciello (AMD) 2023-02-13 21:02:26 UTC
Created attachment 303721 [details]
potential patch (v2)

You did prompt me to double check and I did find a minor logic error that could cause the wrong paths to match, but I don't expect it was the reason you didn't see the message.

If anything it would have blocked more systems from using RNG than before.  In any case, attaching a new version.
Comment 46 Tom Englund 2023-02-13 21:33:59 UTC
(In reply to Mario Limonciello (AMD) from comment #45)
> Created attachment 303721 [details]
> potential patch (v2)
> 
> You did prompt me to double check and I did find a minor logic error that
> could cause the wrong paths to match, but I don't expect it was the reason
> you didn't see the message.
> 
> If anything it would have blocked more systems from using RNG than before. 
> In any case, attaching a new version.

Sorry for causing trouble, but that's a great find nonetheless, hehe. Yeah, it was my mistake: I added the patch to my PKGBUILD source but it wasn't actually being applied in the prepare function. Now it's applied. I also added a "dev_warn(&chip->dev, "Test RNG Defective\n");" at the top of the tpm_is_rng_defective function just to make sure, and sure enough, dmesg now has:
[    0.438285] tpm tpm0: Test RNG Defective
[    0.441444] tpm tpm0: AMD fTPM version 0x3003900000005 causes system stutter and will not be used for random number generation

im not sure if its intended but now "sudo cat /dev/hwrng > /dev/null" "cat: /dev/hwrng: No such device" it seems /dev/hwrng is missing. so i cant reproduce it as reliably as before i guess. il run normal tasks for a while and report back otherwise.
Comment 47 Jason A. Donenfeld 2023-02-13 21:53:49 UTC
(In reply to Tom Englund from comment #46)
> im not sure if its intended but now "sudo cat /dev/hwrng > /dev/null" "cat:
> /dev/hwrng: No such device" it seems /dev/hwrng is missing. so i cant
> reproduce it as reliably as before i guess. il run normal tasks for a while
> and report back otherwise.

That's intended [good] behavior and means the patch is working as expected.

@Mario - by the way, does this bug affect only the RNG or all aspects of the fTPM? If it affects all operations, maybe the check should abort registration very early on and not expose any functionality that might trigger this.
Comment 48 Mario Limonciello (AMD) 2023-02-13 22:01:19 UTC
RNG is the most likely way to trigger it because so many applications rely upon random numbers and so it's used more regularly.

If another use case does pop up that is triggering it with regularity, I agree with you we should abort init earlier on.  But also removing fTPM functionality can be a lot more detrimental to someone's system if they were storing important data and no longer have access to it after a kernel upgrade.

So let's keep that in mind.  If we have to revert to that approach it should be moving this function for detection around and we would probably need some sort of parameter to allow a user to turn it back on.
Comment 49 Bell 2023-02-14 10:54:14 UTC
> I also try to find out what is the earliest kernel that we can trigger
> stutter manually. Seems this problem appears between 4.19.5 to 5.4. Anyway,
> that is another topic, I will keep on working on it.

Quick update:

I tested it all the way back to the 4.18 kernel with Ubuntu 18.04, which is the oldest Ubuntu that supports TPM hwrng. (Anything earlier gives a "device not found" error when using "sudo cat /dev/hwrng")

Unfortunately, I still can manually trigger the stutter.

I'm wondering whether this problem has existed (I mean when triggered manually) ever since TPM hwrng was implemented, after the 4.1x.x kernels.

But due to poor hardware support in old kernels, It is very hard for me to do kernel bisection in newer hardware, so I will not do that.

btw, the patch can go into the mainline before the 6.2 release? Or do we have to wait for more until everything is sure?

Thanks!
Comment 50 Jason A. Donenfeld 2023-02-14 13:00:33 UTC
> btw, the patch can go into the mainline before the 6.2 release? Or do we have
> to wait for more until everything is sure?

If Mario CCs Linus (and Thorsten) on the patch, it might make it in if Linus takes it directly. Otherwise it'll go through Jarkko's tree, and who knows how long that'll take to percolate. @Mario - if you do intend to CC Linus, you might want to do that todayish, since we're already in the last rc.
Comment 51 Jason A. Donenfeld 2023-02-14 13:14:35 UTC
Hi @Mario --

+	if ((val1 & (6<<16)) == (6<<16)) {
+		if (val1 >= 0x60000 && val2 >= 0x180006)
+			return false;
+	} else if ((val1 & (3<<16)) == (3<<16)) {
+		if (val1 >= 0x30057 && val2 >= 0x5)
+			return false;
+	}

The logic here still looks weird with that bitmask, since it doesn't preclude a version 7 series. Maybe just do it the boring way of:

	if ((val1 >> 16) == 6) {
		if ((((u64)val1 << 32) | val2) >= 0x0006000000180006ULL)
			return false;
	} else if ((val1 >> 16) == 3) {
		if ((((u64)val1 << 32) | val2) >= 0x0003005700000005ULL)
			return false;
	}

The compiler will then do the right thing with those 64-bit comparisons.

We can actually clean up the ORing in the whole function though by hoisting it out. I'll upload a new patch in a second.
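
The difference between the two snippets can be checked on a few inputs. This is a sketch that only models whether each snippet reaches its `return false`; what the surrounding function does after falling through is not shown in this comment, so it is not modelled. The function names are hypothetical, and the 0x70000 and 0x30058 inputs are made-up versions used to exercise the edge cases.

```python
FIXED_SERIES_6 = 0x0006000000180006
FIXED_SERIES_3 = 0x0003005700000005


def takes_early_return_fieldwise(val1, val2):
    """First snippet: bitmask series test, fields compared separately."""
    if (val1 & (6 << 16)) == (6 << 16):
        return val1 >= 0x60000 and val2 >= 0x180006
    if (val1 & (3 << 16)) == (3 << 16):
        return val1 >= 0x30057 and val2 >= 0x5
    return False


def takes_early_return_combined(val1, val2):
    """Suggested rewrite: exact series test, single 64-bit comparison."""
    version = (val1 << 32) | val2
    if (val1 >> 16) == 6:
        return version >= FIXED_SERIES_6
    if (val1 >> 16) == 3:
        return version >= FIXED_SERIES_3
    return False


if __name__ == "__main__":
    for val1, val2 in [(0x30039, 0x5), (0x30057, 0x5),
                       (0x70000, 0x180006), (0x30058, 0x0)]:
        print(hex(val1), hex(val2),
              takes_early_return_fieldwise(val1, val2),
              takes_early_return_combined(val1, val2))
```

The two versions agree on 0x30039/0x5 and 0x30057/0x5. They disagree on the hypothetical 7 series value, which the bitmask treats as series 6, and on 0x30058/0x0, where comparing the fields separately rejects a version that is numerically newer than the cutoff.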
Comment 52 Jason A. Donenfeld 2023-02-14 13:23:03 UTC
Created attachment 303725 [details]
potential patch (v3)

@Mario - let me know how this looks to you. Perhaps it's suitable for submission?
Comment 53 Jason A. Donenfeld 2023-02-14 13:25:21 UTC
Created attachment 303726 [details]
potential patch (v3)

Er, rather, this.
Comment 54 Tom Englund 2023-02-14 16:13:46 UTC
So after a whole day of heavy usage, compilations, benchmarking, YouTube, and gaming, I've yet to hit a stutter with the patch applied. So I would say this regression is dealt with. I don't have any encrypted disks or such, and as far as I can tell the only thing using the TPM on my system is systemd for boot time measuring. So in my case at least, it's working as intended.
Comment 55 Mario Limonciello (AMD) 2023-02-14 18:22:47 UTC
> @Mario - let me know how this looks to you.
> Er, rather, this.

Thanks for the enhancements!  I think Co-authored-by should be Co-developed-by IIRC?  Also needs your S-o-b when we're both working on something.

> Perhaps it's suitable for submission?

My team still hasn't been able to reproduce the failure as mentioned above, but I think it's reasonable.  At best this helps a lot of people, and at worst we find out some day it's not enough and need to block more.

> If Mario CCs Linus (and Thorsten) on the patch, it might make it in if Linus
> takes it directly. Otherwise it'll go through Jarkko's tree, and who knows
> how long that'll take to percolate. @Mario - if you do intend to CC Linus,
> you might want to do that todayish, since we're already in the last rc.

If I can get your ack to change that tag and add the S-o-b, I'll send it out today.
Comment 56 Jason A. Donenfeld 2023-02-14 18:47:09 UTC
Yea, feel free to change the tag and add my S-o-b.
Comment 57 Bell 2023-02-16 02:39:59 UTC
I forgot to mention it, but last week I reached out to ASUS to ask about fTPM fix BIOS. They said the fTPM fix doesn't include their laptop line-up, but they did fix it on the desktop motherboard.

So that is unfortunate. I don't know how hard to add a minor fix into a laptop BIOS, or just pure laziness to them.
Comment 58 Balazs Vinarz 2023-02-19 05:16:20 UTC
(In reply to Bell from comment #57)
> I forgot to mention it, but last week I reached out to ASUS to ask about
> fTPM fix BIOS. They said the fTPM fix doesn't include their laptop line-up,
> but they did fix it on the desktop motherboard.
> 
> So that is unfortunate. I don't know how hard to add a minor fix into a
> laptop BIOS, or just pure laziness to them.

will they ship a fix for the whole AM4 and AM5 lineup?
Comment 59 Guenter Roeck 2023-02-23 20:05:29 UTC
I have an MSI B550 Mortar with a 5900X and BIOS version 1.D0. This is the latest BIOS version for this board, and it is supposed to support AGESA 1207 which, according to the information in AMD pa-410, is supposed to include the fix. However, with patch version 3 applied, I get

tpm tpm0: AMD fTPM version 0x3005400000005 causes system stutter; hwrng disabled

Where does 0x0003005700000005ULL come from ? Is it known that version 0x3005400000005 is indeed bad, or is that just a guess ?
Comment 60 Mario Limonciello (AMD) 2023-02-23 20:08:05 UTC
> Where does 0x0003005700000005ULL come from ? Is it known that version
> 0x3005400000005 is indeed bad, or is that just a guess ?

You should be able to trigger the problem with this version.  As other reporters have said it can take some time though.

This version string comes from internal information at AMD about where the fix is included.
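
To illustrate the comparison being discussed, here is a rough sketch in Python. It assumes the check is a plain numeric comparison of the 64-bit version against the 0x0003005700000005 value quoted above; the real patch may apply further conditions, and `predates_fix` is a hypothetical name used only for this example.

```python
# Threshold quoted in this thread; treated here as the first version
# that includes the firmware fix (an assumption based on the comment above).
FIRST_FIXED_VERSION = 0x0003005700000005


def predates_fix(version):
    """Return True if the 64-bit fTPM version is numerically below the threshold."""
    return version < FIRST_FIXED_VERSION


if __name__ == "__main__":
    # Version reported in comment 59.
    print(predates_fix(0x3005400000005))
```

Under that assumption, 0x3005400000005 sorts below the threshold because 0x30054 is smaller than 0x30057 in the upper half of the value.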
Comment 61 Guenter Roeck 2023-02-23 20:09:15 UTC
(In reply to Mario Limonciello (AMD) from comment #60)
> > Where does 0x0003005700000005ULL come from ? Is it known that version
> > 0x3005400000005 is indeed bad, or is that just a guess ?
> 
> You should be able to trigger the problem with this version.  As other
> reporters have said it can take some time though.
> 
> This version string comes from internal information at AMD about where the
> fix is included.

Thanks!
Comment 62 Mario Limonciello (AMD) 2023-02-27 13:21:12 UTC
Can someone affected please test V2 (https://lore.kernel.org/lkml/20230220180729.23862-1-mario.limonciello@amd.com/T/)

and, if all looks good, reply to it with a Tested-by tag? For example:

Tested-by: Foo Bar <foo@bar.com>
Comment 63 Bell 2023-02-27 13:29:15 UTC
> Can someone affected please test V2

Um, what is the difference between V3 and V2? I think there are only some logic changes in V3.

Is it OK if I test the V3 patch from the attachment?

anyway, I will test V2 and give a result asap, thanks!
Comment 64 Mario Limonciello (AMD) 2023-02-27 13:34:33 UTC
Sorry, the versioning gets confusing because we versioned both the patches in this bug report and the ones "submitted" to upstream.

Please test the "v2 that was submitted upstream".  We need testers of this version for it to be accepted even if it's logically the same.
Comment 65 Jason A. Donenfeld 2023-02-27 13:56:07 UTC
We need `Tested-by:` tags specifically for https://lore.kernel.org/lkml/20230220180729.23862-1-mario.limonciello@amd.com/T/ . Otherwise the upstream maintainer won't apply it. He's usually pretty tuned out of discussions and such, so that kind of very visible and explicit "we know that this does work!" is needed. So if https://lore.kernel.org/lkml/20230220180729.23862-1-mario.limonciello@amd.com/T/ works, just reply here with your `Tested-by: First Last <email>` and Mario will send that onward.
Comment 66 Bell 2023-02-27 14:44:14 UTC
Tested-by: Bell <1138267643@qq.com>

I applied and tested the upstream patch on both mainline and the v6.2 release tag, and everything looks fine.

I got a dmesg warning about the defective TPM, which shows the patch works.

With all my daily-driver software open, I didn't notice any difference.
Comment 67 reach622 2023-02-28 02:36:44 UTC
I applied the patch to 6.2.1, and I get this in the logs:

tpm tpm0: AMD fTPM version 0x3003900000005 causes system stutter; hwrng disabled

It looks like the patch is working.
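
For anyone checking their own logs, the warning line quoted above can be matched mechanically. This is a minimal sketch in Python, assuming the message format is exactly as quoted in this report; the function name `find_ftpm_warning` is made up for illustration and is not part of any tool.

```python
import re

# Pattern built from the log line quoted in this report; the exact wording is
# an assumption and could differ in other kernel versions.
FTPM_WARNING = re.compile(
    r"AMD fTPM version (0x[0-9a-fA-F]+) causes system stutter; hwrng disabled"
)


def find_ftpm_warning(log_text):
    """Return the reported fTPM version as an int, or None if no warning is found."""
    match = FTPM_WARNING.search(log_text)
    if match is None:
        return None
    return int(match.group(1), 16)


if __name__ == "__main__":
    sample = (
        "tpm tpm0: AMD fTPM version 0x3003900000005 "
        "causes system stutter; hwrng disabled"
    )
    version = find_ftpm_warning(sample)
    print(hex(version) if version is not None else "no fTPM warning found")
```

The function takes the log text as a string, so it can be fed the saved output of whichever log viewer is in use.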
Comment 68 Branko Grubić 2023-02-28 18:45:36 UTC
Created attachment 303811 [details]
tpm2_getcap properties-fixed (LENOVO IdeaPad 5 Pro 14ACN6)

Hi,

Not sure whether this is useful at all (feel free to skip); I was not able to test the patch yet. Here is my system info:
LENOVO IdeaPad 5 Pro 14ACN6 BIOS/Firmware GECN33WW(V1.17)
AMD Ryzen 5 5600U
currently running Fedora shipped kernel: 6.1.14-200.fc37.x86_64

This is the latest BIOS/firmware, released about a month ago, and I doubt they will ship an update for this issue in a new version.

There is no real option in the BIOS/firmware to disable the fTPM or to use another implementation. I tried some tricks to "unlock" hidden BIOS/firmware features (https://wiki.archlinux.org/title/Lenovo_IdeaPad_5_Pro_14ACN6 | 5.1.1 Hidden BIOS menus ); there was a setting there with 3 options, I tried one, and it didn't help.
Comment 69 Jason A. Donenfeld 2023-02-28 19:19:29 UTC
This patch is now in the maintainer's tree: https://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd.git/commit/?id=ffe4a34b29aa672f1fa05f49a7e4c3ea4ab4e12f

So hopefully it'll make it to Linus' tree in the next rc.
Comment 70 Balazs Vinarz 2023-03-05 06:07:26 UTC
This fix will be introduced in the 6.3-RC series, right?
Comment 71 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-03-06 08:05:40 UTC
(In reply to Balazs Vinarz from comment #70)
> This fix will be introduced in the 6.3-RC series, right?

Sadly it did not make it in. I mentioned the patch to Linus in my weekly report, but he didn't pick it up.

I will ask the maintainer in a few days to send it to Linus this week, if that hasn't happened by then.
Comment 72 proteve 2023-03-06 19:02:01 UTC
arch linux/hp laptop 4500u/only AMD apu video card/tpm disabled from bios

With kernel 6.2.2, games stutter like hell.
Downgrading back to 6.1 solved the problem.
The stutter started with 6.2. I had no problem with 6.1, but I have had the TPM disabled for a long time.
Comment 73 Mario Limonciello (AMD) 2023-03-06 19:08:21 UTC
(In reply to proteve from comment #72)
> arch linux/hp laptop 4500u/only AMD apu video card/tpm disabled from bios
> 
> kernel 6.2.2 games stutters like hell
> downgrading back to 6.1 problem solved.
> stutter started with 6.2. i had no problem with 6.1 but i have tpm disabled
> since a long time ago.

If you've got your TPM disabled but are getting a stutter issue, then I am afraid you're hitting a different problem.

Please open your own issue and if possible bisect it.

https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html
Comment 74 Balazs Vinarz 2023-03-08 20:37:55 UTC
That's pretty much the same as what I experience. I hoped the patch would resolve the problem; however, the stutter persists regardless of which TPM I choose (discrete or fTPM) and even if I add the modules to the blacklist.
I have two custom tkg kernels at the moment with roughly the same stutter, but I'll give it a go with the 5.15 LTS tomorrow.
5700G here with latest BIOS.
Comment 75 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-03-09 09:20:50 UTC
(In reply to Balazs Vinarz from comment #74)
> That's pretty much the same, what I experience.

Just to clarify: you have been experiencing that for a while already, and not only since 6.2, like proteve@mail.com apparently does (FWIW, who thankfully filed a separate issue here to avoid confusion: https://bugzilla.kernel.org/show_bug.cgi?id=217158 )?
Comment 76 Bell 2023-03-10 15:44:09 UTC
Wow, this issue has been open for over a month.

I think if this bug causes stuttering on Linus's PC he will be mad AF lol.

I understand why it can take such a long time to get into mainline.
Not every contributor works on the kernel full-time, and I don't want to push people too much, because that would be rude.

Anyway, I will wait patiently and still appreciate everyone's effort/work. At worst, it might not be merged until some big changes happen in the TPM tree (how long will that take? idk.)
Comment 77 Balazs Vinarz 2023-03-11 05:36:52 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #75)
> (In reply to Balazs Vinarz from comment #74)
> > That's pretty much the same, what I experience.
> 
> Just to clarify: but you experience that since a while already, and not only
> since 6.2, like proteve@mail.com apparently does (FWIW, who thankfully filed
> a separate issue here to avoid confusion:
> https://bugzilla.kernel.org/show_bug.cgi?id=217158 )?

Not really, I just recently bought the 5700G.
Just to confirm, 5.15 is not as bad as 6.2, however I can still see some stutter.
Comment 78 Branko Grubić 2023-03-14 07:15:20 UTC
I just upgraded to 6.2.6-200.fc37.x86_64 (6.2.6 should have this workaround included if I'm not wrong), but I don't see anything in the kernel log indicating that the device has a problematic TPM...

Is that expected?

Also, the output of `tpm2_getcap properties-fixed` is the same as before in comment #68 (so no firmware update in between).
Comment 79 Bell 2023-03-14 07:31:08 UTC
Created attachment 303945 [details]
attachment-6574-0.html

How are you checking kernel messages?
I think dmesg gets cleared a few minutes after you boot into the system.
Try to use journalctl -k —no-pager | grep tpm instead.




Comment 80 Echo J. 2023-03-14 07:43:41 UTC
I still see no TPM kernel messages with 6.2.6 for some reason with that journalctl command

My TPM2_PT_FIRMWARE_VERSION_1 is 0x3002A and VERSION_2 is 0x5 (and the MANUFACTURER is 0x414D4400) 🐸
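
For reference, the two 32-bit values reported by tpm2_getcap appear to combine into the single 64-bit number printed in the kernel message (0x30054 and 0x5 correspond to 0x3005400000005 elsewhere in this report). Here is a small Python sketch under that assumption; the helper names are invented for illustration.

```python
def combine_version(version_1, version_2):
    """Combine the two 32-bit firmware version fields into one 64-bit value."""
    return (version_1 << 32) | version_2


def decode_manufacturer(raw):
    """Read the 32-bit manufacturer value as big-endian ASCII, dropping NUL padding."""
    return raw.to_bytes(4, "big").rstrip(b"\x00").decode("ascii")


if __name__ == "__main__":
    print(hex(combine_version(0x3002A, 0x5)))  # 0x3002a00000005
    print(decode_manufacturer(0x414D4400))     # AMD
```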
Comment 81 Bell 2023-03-14 10:31:59 UTC
Hmm, I guess we will have to wait for AMD engineers to respond to you. But you can use your computer as normal to see if this problem still happens.
Also, sorry for the crap attachment I created. I am in a car, using a third-party email client to reply to this thread. I have no idea why it created an attachment. Never mind.
Comment 82 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-03-14 10:39:17 UTC
FWIW, if anyone still has issues, please file a new report and afterwards briefly mention it here while dropping a link to the report, as things otherwise quickly might get confusing
Comment 83 Echo J. 2023-03-14 10:41:25 UTC
Actually I realized that Arch still ships 6.2.5 🐸

So disregard comment 80 (I can't edit it)
Comment 84 Bell 2023-03-14 10:47:35 UTC
(In reply to Bell from comment #79)

> Try to use journalctl -k —no-pager | grep tpm instead.

this is my fault, the correct command is "journalctl -b --no-pager | grep tpm"
My phone auto-completion turned "--" into "—".

Anyway, this patch works perfectly on my Arch Linux with 6.1.19-LTS and 6.2.6 (still in testing repo btw)

I tend to think this is a user issue rather than a problem with the patch itself.

Just keep using your system. If the problem still happens, like Thorsten said, create a new report :) thanks.
Comment 85 Branko Grubić 2023-03-17 18:42:22 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #82)
> FWIW, if anyone still has issues, please file a new report and afterwards
> briefly mention it here while dropping a link to the report, as things
> otherwise quickly might get confusing

Hi,

Created bug 217212

Regards,
Branko
Comment 86 Balazs Vinarz 2023-03-18 05:57:19 UTC
(In reply to Echo J. from comment #80)
> I still see no TPM kernel messages with 6.2.6 for some reason with that
> journalctl command
> 
> My TPM2_PT_FIRMWARE_VERSION_1 is 0x3002A and VERSION_2 is 0x5 (and the
> MANUFACTURER is 0x414D4400) 🐸

(In reply to Branko Grubić from comment #78)
> I just upgraded to 6.2.6-200.fc37.x86_64 (6.2.6 should have this workaround
> included if I'm not wrong), but I don't see anything in the kernel log
> recognizing that devices has problematic TPM...
> 
> Is that expected?
> 
> Also output of `tpm2_getcap properties-fixed` is same as before comment#68
> (so no firmware update in beteween).

I don't think this was backported to the 6.2 series by the distro maintainers. The commit was merged to the 6.3-rc2 in Linus's tree:
https://github.com/torvalds/linux/commit/f1324bbc4011ed8aef3f4552210fc429bcd616da
Comment 87 Bell 2023-03-18 06:00:00 UTC
(In reply to Balazs Vinarz from comment #86)

> I don't think this was backported to the 6.2 series by the distro
> maintainers.

pls check the latest bug report, I can confirm this patch works in fedora.

And we will have a further discussion in that newer report :)
Comment 88 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-03-18 06:02:37 UTC
(In reply to Balazs Vinarz from comment #86)

> I don't think this was backported to the 6.2 series

It was backported to the upstream 6.1.y and 6.2.y series early this week, as can be seen by searching in trees like this:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/?h=linux-6.2.y

Whether distros picked up those releases or that fix is obviously a different story.
Comment 89 Balazs Vinarz 2023-03-18 06:13:31 UTC
Sorry, my bad:
I also sent an edit to my previous reply, but it wasn't applied.
In the meantime I found the patch:
https://github.com/archlinux/linux/commit/e143354b441786c4f356f7c9b1852bc723dbd81b
Comment 90 Maurizio Avogadro 2023-03-18 18:00:39 UTC
(In reply to Mario Limonciello (AMD) from comment #38)
> Thanks, that version doesn't have the stuttering fix from the Windows
> advisory. The "potential patch" should help you.

Hi Mario

Here, on an MSI MAG X570S TORPEDO MAX mobo with the latest stable BIOS (v7D54A4, AGESA v1.2.0.7), I have

TPM2_PT_FIRMWARE_VERSION_1:
  raw: 0x30054
TPM2_PT_FIRMWARE_VERSION_2:
  raw: 0x5

just like the author of comment #37; I never experienced any stuttering.
I just let

cat /dev/hwrng > /dev/null

running for 25 minutes on kernel v6.2.5 while watching YouTube and everything was smooth and fine all the time.
Since upgrading to kernel v6.2.7, the message

tpm tpm0: AMD fTPM version 0x3005400000005 causes system stutter; hwrng disabled

started to appear in dmesg, and the /dev/hwrng device is no longer available.
The AMD advisory states that AGESA v1.2.0.7 fixed the issue and having the TPM2 hwrng available can improve entropy.
Can you please investigate a little bit further?
Is there any kernel parameter available to force-enable the fTPM hwrng?

Thanks
Comment 91 Jason A. Donenfeld 2023-03-18 18:07:01 UTC
From a security perspective, on your platform, I don't think there is any computable difference in terms of entropy with or without that particular hwrng. It's unlikely to make a difference in ways that matter.
Comment 92 Maurizio Avogadro 2023-03-18 18:11:45 UTC
(In reply to Jason A. Donenfeld from comment #91)
> From a security perspective, on your platform, I don't think there is any
> computable difference in terms of entropy with or without that particular
> hwrng. It's unlikely to make a difference in ways that matter.

Thanks Jason, this reassures me. Nonetheless, this version of fTPM should probably be excluded from the workaround.
Comment 93 Maurizio Avogadro 2023-03-18 19:33:23 UTC
(In reply to Mario Limonciello (AMD) from comment #60)
> > Where does 0x0003005700000005ULL come from ? Is it known that version
> > 0x3005400000005 is indeed bad, or is that just a guess ?
> 
> You should be able to trigger the problem with this version.  As other
> reporters have said it can take some time though.
> 
> This version string comes from internal information at AMD about where the
> fix is included.

Oops, my bad: I didn't read the whole thread thoroughly. Sorry for the noise.
Comment 94 Mario Limonciello (AMD) 2023-03-19 01:45:32 UTC
> Nonetheless, this version of fTPM should probably be excluded from the
> workaround.

In discussion with others I've found out that it's actually a combination of factors that can lead to this behavior, and one of those is the fTPM version that the patch guards against. The other factors can't be detected from Linux, so excluding this version would mean maintaining a hardcoded list of systems that have the "bad" fTPM version but don't have this issue, which isn't really sustainable.

As Jason pointed out it's likely not worth doing this.
Comment 95 A. James Lewis 2023-03-20 20:34:32 UTC
I have an older INTEL Asus RoG Strix Z370-F system, with the latest 3004 BIOS (which is from 2021), and I noticed pronounced audio stutters since upgrading to a 6.1 kernel.  After finding this bug, and seeing a few comments which suggested it /might/ affect other implementations, I decided to disable my TPM to test, and when looking in the BIOS to do that, I noticed that the upgrade to version 3004 (needed for Resizable BAR) had left the setting on "Firmware", so I switched it back to "Hardware" and immediately the audio issues were gone.

I have to conclude that this issue is not limited to AMD and /does/ impact other fTPM implementations.
Comment 96 Bell 2023-03-21 05:29:53 UTC
(In reply to A. James Lewis from comment #95)

> I have to conclude that this issue is not limited to AMD and is /does/
> impact other fTPM implementations.

Thanks very much for this information.

If the Intel platform can hit this problem when using fTPM, I guess there must be something wrong with the fTPM design or with the Linux TPM implementation.

Anyway, I hope someday someone can find out exactly why.
Comment 97 A. James Lewis 2023-03-21 06:10:26 UTC
OK, now I have to apologise for jumping to conclusions, because after more than an hour of not stuttering, it happened again... so now I cannot jump to the conclusion that it's directly related to this issue, although it does seem to be much reduced.

It's definitely something that started when I jumped from 5.19 to 6.1, but I'll have to see if I can isolate it more precisely.
Comment 98 Mario Limonciello (AMD) 2023-05-12 17:00:04 UTC
*** Bug 217122 has been marked as a duplicate of this bug. ***
