Bug 216989
Summary: Kernels after 6.1 experience AMD Ryzen fTPM stutter
Product: Platform Specific/Hardware
Component: x86-64
Status: RESOLVED CODE_FIX
Severity: normal
Priority: P1
Hardware: AMD
OS: Linux
Kernel Version: 6.1.8
Subsystem:
Regression: No
Bisected commit-id:
Reporter: reach622
Assignee: Mario Limonciello (AMD) (mario.limonciello)
CC: 1138267643, aidas957, bitlord0xff, blueadep7, bp, bradleyfourie, Jason, linux, mario.limonciello, mavoga, moviuro.kernel, proteve, regressions, t-5, tomenglund26, vinibali1
Attachments:
    tpm2_getcap on ASUS_Zephyrys_G15_6800HS
    tpm2_getcap (ASUS Strix G15 AMD Advantage Edition)
    potential patch
    tpm2_getcap properties-fixed
    tpm2_getcap (ACER nitro 5 AN515-45 bios v1.10)
    potential patch (v2)
    potential patch (v3)
    potential patch (v3)
    tpm2_getcap properties-fixed (LENOVO IdeaPad 5 Pro 14ACN6)
Description
reach622
2023-02-02 02:49:48 UTC
Hey, let me add some extra information to help.

1. This issue can happen in 6.2-rc6 without loading third-party kernel modules (NVIDIA, VirtualBox and so on).
2. Some users on desktops/laptops who were able to disable fTPM did eliminate the problem.
3. This problem can happen on newer AMD processors, from the 4000 series to the 6000 series.
4. I guess this problem isn't caused by the dedicated graphics card. Here are some combinations where stuttering can happen:
   AMD (built-in GPU) + NVIDIA: laptop
   AMD (no built-in GPU) + AMD (dedicated): desktop
   AMD (built-in GPU) + AMD (dedicated): laptop/desktop
   AMD + AMD (built-in GPU only): laptop
   All of these suffer from it.

Hope this can help :)

Was this stutter introduced for you in kernel 6.1? Can you please bisect it? Kernel hackers rarely game, so your best bet is bisecting.

(In reply to Artem S. Tashkinov from comment #3)
> Kernel hackers rarely game so your best bet is bisecting.
FWIW, it's *totally fine* to report a regression without bisecting it, as sometimes the maintainers or someone else will know about the problem and the culprit already (or might have an idea what might cause it). But if that's not the case, the duty to find the culprit in the end is the reporter's, as explained in https://docs.kernel.org/admin-guide/reporting-regressions.html

I guess no need to bisect: https://www.amd.com/en/support/kb/faq/pa-410 but look for a BIOS update...

Many ASUS laptops (G15) do not have a recent BIOS update that fixes the issue.

(In reply to Borislav Petkov from comment #5)
> I guess no need to bisect:
> https://www.amd.com/en/support/kb/faq/pa-410
Well, as mentioned in https://lore.kernel.org/all/3a196414-68d8-29c9-24cc-2b8cb4c9d358@leemhuis.info/ this seems to be something that didn't happen in 6.0, so this afaics still makes it a regression.

Setting CONFIG_HW_RANDOM_TPM=n in the build config does avoid this lag. So I guess most of the manufacturers didn't follow AMD's instructions.
Well... now we need to check what CONFIG_HW_RANDOM_TPM does and how the change from 6.0.x to 6.1.x affects this.

(In reply to Bell from comment #8)
> set CONFIG_HW_RANDOM_TPM=n in the build config does avoid this lag.
> So I guess most of the manufacturers didn't follow AMD instructions.
> Well......Now we need to check what CONFIG_HW_RANDOM_TPM does and how
> 6.0.x->6.1.x affect this.
OK, I found a way to trigger this bug in kernel 6.0.x: run "sudo cat /dev/hwrng > /dev/null" for around 5-15 minutes, and there you go. So there must be something that keeps calling the hardware random number generator in 6.1.x. As time goes on, at a certain point, minor errors stack up and lead to another error (some kind of overflow, I guess?).

Plus, if you use a simple Rust monitoring program (https://bugs.archlinux.org/task/77340), you will notice timeout errors in 6.1.x even when you do nothing. However, this isn't the case in 6.0.x as long as you are not calling hwrng.

(In reply to Bell from comment #9)
> ok, I found a way to trigger this bug in kernel 6.0.x.
In that case, could you try to bisect this? Maybe starting by trying before and after the two big random merges during the merge window of 6.1 might be a good idea, but OTOH it also could be a bad idea.

(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #10)
> (In reply to Bell from comment #9)
> > ok, I found a way to trigger this bug in kernel 6.0.x.
>
> In that case, could you try to bisect this? Maybe starting by trying before
> and after the two big random merges during the merge window of 6.1 might be
> a good idea, but OTOH it also could be a bad idea.
Oh, I started a git bisect 2 days ago, from v6.0.12 to v6.1, and I just got a very promising result.
------------------------------------------------------------------------------
b006c439d58db625318bf2207feabf847510a8a6 is the first bad commit
commit b006c439d58db625318bf2207feabf847510a8a6
Author: Dominik Brodowski <linux@dominikbrodowski.net>
Date:   Thu Sep 22 15:59:31 2022 +0200

    hwrng: core - start hwrng kthread also for untrusted sources

    Start the hwrng kthread even if the hwrng source has a quality setting
    of zero. Then, every crng reseed interval, one batch of data from this
    zero-quality hwrng source will be mixed into the CRNG pool.

    This patch is based on the assumption that data from a hwrng source
    will not actively harm the CRNG state. Instead, many hwrng sources
    (such as TPM devices), even though they are assigend a quality level of
    zero, actually provide some entropy, which is good enough to mix into
    the CRNG pool every once in a while.

    Cc: Herbert Xu <herbert@gondor.apana.org.au>
    Cc: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
    Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

 drivers/char/hw_random/core.c | 36 ++++++++++--------------------------
 1 file changed, 10 insertions(+), 26 deletions(-)
------------------------------------------------------------------------------

I am going to change it in 6.2-rc6 and see if I can fix it.

I commented out the code like this in drivers/char/hw_random/core.c in 6.2-rc6, just for the test.
---------------------------------------------
    /* if necessary, start hwrng thread */
    /*
    if (!hwrng_fill) {
        hwrng_fill = kthread_run(hwrng_fillfn, NULL, "hwrng");
        if (IS_ERR(hwrng_fill)) {
            pr_err("hwrng_fill thread creation failed\n");
            hwrng_fill = NULL;
        }
    }
    */
---------------------------------------------
And the problem disappears.

So that's pretty much all I can do for now: locate the problem. I am not familiar with kernel development and how the code exactly works.
But if you want me to help test, feel free to send me a message :)

(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #10)
> (In reply to Bell from comment #9)
> > ok, I found a way to trigger this bug in kernel 6.0.x.
>
> In that case, could you try to bisect this? Maybe starting by trying before
> and after the two big random merges during the merge window of 6.1 might be
> a good idea, but OTOH it also could be a bad idea.
I'm sorry, I just realized that I misunderstood what you said. You meant to find out why multiple timeouts trigger this bug. It's not a good idea to regression. Please give me some more time. I'm ready to bisect from 5.19. Sorry again for my misunderstanding.

(In reply to Bell from comment #13)
> I'm sorry, I just realized that I misunderstood what you said.
Whatever, what you did afaics is (nearly all that you did), as if I understood you right, the problem is caused by b006c439d58db625318bf2207feabf847510a8a6. But one more detail might be helpful: does that commit revert clearly in latest mainline and does the problem disappear then? What you wrote in comment 12 indicates that is likely, but testing if a complete revert does the trick might be helpful. Thx for your work,

(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #14)
> (In reply to Bell from comment #13)
> > I'm sorry, I just realized that I misunderstood what you said.
>
> Whatever, what you did afaics is (nearly all that you did), as if I
> understood you right, the problem is caused by
> b006c439d58db625318bf2207feabf847510a8a6. But one more detail might be
> helpful: does that commit revert clearly in latest mainline and does the
> problem disappear then? What you wrote in comment 12 indicates that is
> likely, but testing if a complete revert does the trick might be helpful.
>
> thx for you work,
No, the test shows the problem is still there.
I think "start hwrng kthread for untrusted sources" is only the surface reason. As I said before, if you manually read /dev/hwrng for around 5-15 minutes, you can trigger it from 6.0.x to 6.2-rc2 too. So we can "fix" it by commenting out the code that starts the hwrng kthread. But that is not really a fix; it just avoids the constant hwrng calls from the kthread.

What I really want to figure out now is:
1. Why can calling hwrng too many times lead to a problem?
2. Is this problem introduced by the kernel? Or is it a hardware design failure that can't be fixed?

I will do some further tests and bisect to find out. If this is a design failure, well, then we can add a condition so that the hwrng thread is not started when an AMD CPU is detected. I am new to kernel development, thank you very much!

(In reply to Bell from comment #15)
Unfortunately, the oldest kernel I can run on my laptop is 5.15.91-LTS; anything older gets a kernel panic during boot. But I think this is enough for the test. In 5.15.91, the stuttering still happens when I read /dev/hwrng. It is still possibly a software bug, but I tend to assume this is a hardware bug.

So what can we do now? Send an e-mail to both the AMD engineers and Dominik (the author of commit b006c439d58db625318bf2207feabf847510a8a6)?

(In reply to Bell from comment #16)
> Send an e-mail to both AMD engineers
Won't likely change anything, unless the firmware update Boris pointed to doesn't help.
> and Dominik?
> (The author commit b006c439d58db625318bf2207feabf847510a8a6)
I'll take care of that, but give me a minute please

(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #17)
> (In reply to Bell from comment #16)
> Won't likely change anything, unless the firmware update Boris pointed to
> doesn't help.
The issue is that some PCs (especially laptops) didn't get that update at all.

So there needs to be some sort of workaround for the affected PCs.

(In reply to Echo J.
from comment #18)
> The issue is that some PCs (especially laptops) didn't get that update at all
>
> So there needs to be some sort of workaround for the affected PCs
That is why I sent an e-mail to ASUS to ask about an fTPM update. Most manufacturers like to give their consumer line-up the middle finger, you know. I never expect them to follow AMD's instructions and update the BIOS. For now, let's just see if we can fix it with kernel patches rather than a BIOS update.

For what it's worth, it's affecting most Acer laptops as well. I've tried contacting them through Reddit, their community forums, and email. The furthest I got was through email, with conversations going back and forth, which eventually ended in the person recommending that I contact the store I bought it from and RMA it. So unless something magical happens and the manufacturers start caring and actually update their BIOSes, quite a lot of Ryzen laptops will face this if they have fTPM/TPM enabled.

If you don't have a BIOS update available yet, and your TPM is thus problematic, you should be able to work around this with something like this (depending on which modules you're using; `lsmod|grep tpm` will help):

# echo blacklist tpm > /etc/modprobe.d/disable-tpm.conf
# echo blacklist tpm_crb >> /etc/modprobe.d/disable-tpm.conf
# echo blacklist tpm_tis >> /etc/modprobe.d/disable-tpm.conf
# echo blacklist tpm_tis_core >> /etc/modprobe.d/disable-tpm.conf

Alternatively, disable the buggy "AMD Platform Security Processor" in your laptop's BIOS menu.

> then we can add a condition when it detects AMD CPU then not start hwrng
> thread.
Could those affected please share the values for `tpm2_getcap properties-fixed` from their system? Also please share the APU in your system and your BIOS version.

I'd like to compare the details against the advisory linked above to make sure we're looking at the same issue manifesting in Linux.

> For now, let's just see if we can fix it with kernel patches rather than BIOS
> update.
As long as it matches my expectations perhaps we can use fixed properties (tpm2_get_tpm_pt) to detect it and then use that as a gate to starting the hwrng thread.

> As long as it matches my expectations perhaps we can use fixed properties
> (tpm2_get_tpm_pt) to detect it and then use that as a gate to starting the
> hwrng thread.
More precisely, you'd gate registering a hwrng device at all, so that no thread is started for it.
Created attachment 303706 [details]
tpm2_getcap on ASUS_Zephyrys_G15_6800HS
(In reply to Mario Limonciello (AMD) from comment #22)
> > then we can add a condition when it detects AMD CPU then not start hwrng
> > thread.
>
> Could those affected please share the values for `tpm2_getcap
> properties-fixed` from their system? Also please share the APU in your
> system and your BIOS version.
>
> I'd like to compare the details against the advisory linked above to make
> sure we're looking at the same issue manifesting in Linux.
>
> > For now, let's just see if we can fix it with kernel patches rather than
> BIOS
> > update.
>
> As long as it matches my expectations perhaps we can use fixed properties
> (tpm2_get_tpm_pt) to detect it and then use that as a gate to starting the
> hwrng thread.
OK, I added an attachment that includes the information you need. I use the latest BIOS, 311. I am also trying to find the earliest kernel in which the stutter can be triggered manually. It seems this problem appears between 4.19.5 and 5.4. Anyway, that is another topic; I will keep working on it. Thank you very much.

From your file:

TPM2_PT_MANUFACTURER:
  raw: 0x414D4400
  value: "AMD"
TPM2_PT_VENDOR_STRING_1:
  raw: 0x414D4400
  value: "AMD"
TPM2_PT_FIRMWARE_VERSION_1:
  raw: 0x3004E
TPM2_PT_FIRMWARE_VERSION_2:
  raw: 0x20005

I wonder if those are useful selectors?

Created attachment 303707 [details]
tpm2_getcap (ASUS Strix G15 AMD Advantage Edition)
tpm2_getcap on ASUS G513QY (Ryzen 9 5900HX) with latest BIOS (320)
> I wonder if those are useful selectors?
Yes; those are exactly the ones I was going to suggest using as well.
As you're more familiar in this area, do you want to start a patch?
I'll confirm the correct values to use for TPM2_PT_FIRMWARE_VERSION_1/TPM2_PT_FIRMWARE_VERSION_2 and get back with those later.
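For readers following the selectors quoted above: the raw TPM2_PT_MANUFACTURER value 0x414D4400 is just four ASCII bytes packed big-endian ('A', 'M', 'D', NUL). A minimal userspace sketch of that decoding follows. The helper names are hypothetical and are not taken from the kernel or from tpm2-tools; this only illustrates the byte layout, not how the patch performs the check.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (illustration only): unpack the four big-endian
 * bytes of a TPM2_PT_MANUFACTURER value into a NUL-terminated string.
 * 'out' must have room for 5 bytes.
 */
static void tpm_manufacturer_to_str(uint32_t raw, char out[5])
{
    out[0] = (char)(raw >> 24);
    out[1] = (char)(raw >> 16);
    out[2] = (char)(raw >> 8);
    out[3] = (char)raw;
    out[4] = '\0';
}

/* Returns 1 when the raw value spells "AMD" followed by a NUL byte. */
static int tpm_manufacturer_is_amd(uint32_t raw)
{
    char id[5];

    tpm_manufacturer_to_str(raw, id);
    return strcmp(id, "AMD") == 0;
}
```

With the value from the attachment, tpm_manufacturer_is_amd(0x414D4400) is true; a vendor ID with a different spelling compares unequal.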
Seems like Bradley has a much lower VERSION_2, so I would imagine VERSION_1 is the cut off we care about. I can try to start a patch, but I don't actually know much about the tpm subsystem, just rng stuff (I maintain random.c).

> Seems like Bradley has a much lower VERSION_2, so I would imagine VERSION_1
> is the cut off we care about.
They're concatenated together. So if that was the right version to go off of, it would be 0x0003004E 0x00020005 (generally shown in tools like fwupd as "3.4e.2.5").

> I can try to start a patch, but I don't actually know much about tpm
> subsystem, just rng stuff (I maintain random.c).
I guess it could be right where the TPM gets registered with rng that a check like this makes sense.

Sorry, a bit unsure how to "attach" something here, but this is the getcap on my Acer with a 5800H, which is affected as well: http://ix.io/4nyQ

@Mario - okay I sent a little draft patch for you to tweak and test and insert the right numbers. I just made up some garbage. https://lore.kernel.org/lkml/20230209153120.261904-1-Jason@zx2c4.com/

Created attachment 303709 [details]
potential patch
Much appreciated. I'll attach an updated patch with the versions I expect should avoid triggering this.
This sure seems like the same issue as the advisory, but I won't be submitting it to the mailing lists until I can confirm that. Feel free to use the patch in the meantime.
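To make the "concatenated together" remark from a few comments up concrete, here is a small userspace sketch that joins TPM2_PT_FIRMWARE_VERSION_1 and TPM2_PT_FIRMWARE_VERSION_2 into one 64-bit number and prints it as four 16-bit fields. The function names are hypothetical, and the dotted format is an assumption inferred from the "3.4e.2.5" example in this thread, not taken from fwupd's source.

```c
#include <stdint.h>
#include <stdio.h>

/* Combine the two 32-bit properties; VERSION_1 forms the high half. */
static uint64_t ftpm_version(uint32_t val1, uint32_t val2)
{
    return ((uint64_t)val1 << 32) | val2;
}

/*
 * Format the 64-bit version as four 16-bit fields in lowercase hex.
 * Assumption: the dotted form is simply these four fields in order.
 * Returns what snprintf returns.
 */
static int ftpm_version_str(uint64_t ver, char *buf, size_t len)
{
    return snprintf(buf, len, "%x.%x.%x.%x",
                    (unsigned int)((ver >> 48) & 0xffff),
                    (unsigned int)((ver >> 32) & 0xffff),
                    (unsigned int)((ver >> 16) & 0xffff),
                    (unsigned int)(ver & 0xffff));
}
```

Feeding in the values from the first attachment (0x3004E and 0x20005) gives the string "3.4e.2.5", which matches the form quoted above.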
(In reply to Mario Limonciello (AMD) from comment #33)
> Created attachment 303709 [details]
> potential patch
>
> Much appreciated. I'll attach an updated patch with the versions I expect
> should avoid triggering this.
Thanks very much for your and Jason's work. I tested it on 6.2-rc7 and it did solve the problem. Hope this patch can go in ASAP.

I'm having pretty much the same problem with my Asus Pro B550C/CSM with a 5700G. Do any of you still need the output of tpm2_getcap? I have been using a pretty old BIOS version (2407) with AGESA 1.2.0.3B (IIRC). I saw a changelog entry about possibly fixing the stuttering caused by fTPM, but 2804 didn't ship the fix either.

If your BIOS change log claims it was fixed but you're still affected, yes, I would like to see what the TPM firmware version claims.

Created attachment 303715 [details]
tpm2_getcap properties-fixed
Added
Thanks, that version doesn't have the stuttering fix from the Windows advisory. The "potential patch" should help you.

Just want to update where this is. I've confirmed that at least the firmware versions mentioned in the patch do not cause stuttering in audio playback when doing this for 15 minutes:

# sudo cat /dev/hwrng > /dev/null

This was on one of AMD's reference designs. However, I don't yet have the inverse confirmation of the failure on the same hardware (without the patch, on the same hardware but with an older fTPM). Anti-rollback protection means reproducing the failure is a lot more difficult.

I will continue to try to reproduce the failure, but I'm "leaning" towards submitting the patch "as is". If it turns out to be a different root cause than the advisory, then another option is to drop the version check part of it entirely and forbid all AMD fTPMs from RNG.

One of these might help https://www.amazon.com/s?k=eeprom+clip+programmer for loading an old firmware (unless there are efuses or something).

Thanks, but the efuses are blown for rollback protection. An older firmware will not be able to boot even when flashed from a hardware SPI programmer.

Created attachment 303720 [details]
tpm2_getcap (ACER nitro 5 AN515-45 bios v1.10)
Attaching tpm2_getcap from this Acer. I can reliably reproduce this with
"sudo cat /dev/hwrng > /dev/null"
in 2 minutes of running or so.
With the patch applied to 6.1.11, it still occurs. But won't this only fix this one cause of stutter? Won't other things, like systemd using tpm2 for encryption, boot time measuring etc., cause it as well eventually? Or are my versions not included in that patch, hm.
> TPM2_PT_FIRMWARE_VERSION_1:
> raw: 0x30039
> TPM2_PT_FIRMWARE_VERSION_2:
> raw: 0x5
The patch /should/ be disabling fTPM use for RNG on your system.

> > or is my versions not included in that patch hm.
Your system should have shown this message in the kernel log when the patch was applied:

    "AMD fTPM version 0x%llx causes system stutter and "
    "will not be used for random number generation\n",
    ((u64)val1 << 32) | val2);

Did you see that message? It's possible I made a mistake in the patch too.

(In reply to Mario Limonciello (AMD) from comment #43)
> > TPM2_PT_FIRMWARE_VERSION_1:
> > raw: 0x30039
> > TPM2_PT_FIRMWARE_VERSION_2:
> > raw: 0x5
>
> The patch /should/ be disabling fTPM use for RNG on your system.
> >
> > or is my versions not included in that patch hm.
>
> Your system should have shown this message in the kernel log when the patch
> was applied:
>
> "AMD fTPM version 0x%llx causes system stutter and "
> "will not be used for random number generation\n",
> ((u64)val1 << 32) | val2);
>
> Did you see that message? It's possible I made a mistake in the patch too.
Hm, nothing in dmesg about that. I'm fairly sure the patch got applied. I'll see if I messed something up in patching it.

Created attachment 303721 [details]
potential patch (v2)
You did prompt me to double check and I did find a minor logic error that could cause the wrong paths to match, but I don't expect it was the reason you didn't see the message.
If anything it would have blocked more systems from using RNG than before. In any case, attaching a new version.
(In reply to Mario Limonciello (AMD) from comment #45)
> Created attachment 303721 [details]
> potential patch (v2)
>
> You did prompt me to double check and I did find a minor logic error that
> could cause the wrong paths to match, but I don't expect it was the reason
> you didn't see the message.
>
> If anything it would have blocked more systems from using RNG than before.
> In any case, attaching a new version.
Sorry for causing trouble, but it's great that you found that nonetheless, hehe.

Yeah, it was my mistake: I added the patch to my PKGBUILD source, but it wasn't actually being applied in the prepare function. Now it's applied. I also added a
"dev_warn(&chip->dev, "Test RNG Defective\n");"
at the top of the tpm_is_rng_defective function just to make sure. And sure enough, dmesg now has:

[ 0.438285] tpm tpm0: Test RNG Defective
[ 0.441444] tpm tpm0: AMD fTPM version 0x3003900000005 causes system stutter and will not be used for random number generation

I'm not sure if it's intended, but now "sudo cat /dev/hwrng > /dev/null" gives "cat: /dev/hwrng: No such device". It seems /dev/hwrng is missing, so I can't reproduce it as reliably as before, I guess. I'll run normal tasks for a while and report back otherwise.

(In reply to Tom Englund from comment #46)
> im not sure if its intended but now "sudo cat /dev/hwrng > /dev/null" "cat:
> /dev/hwrng: No such device" it seems /dev/hwrng is missing. so i cant
> reproduce it as reliably as before i guess. il run normal tasks for a while
> and report back otherwise.
That's intended [good] behavior and means the patch is working as expected.

@Mario - by the way, does this bug affect only the RNG or all aspects of the fTPM? If it affects all operations, maybe the check should abort registration very early on and not expose any functionality that might trigger this.

RNG is the most likely way to trigger it because so many applications rely upon random numbers and so it's used more regularly.
If another use case does pop up that is triggering it with regularity, I agree with you we should abort init earlier on. But also removing fTPM functionality can be a lot more detrimental to someone's system if they were storing important data and no longer have access to it after a kernel upgrade. So let's keep that in mind. If we have to revert to that approach, it should be a matter of moving this detection function around, and we would probably need some sort of parameter to allow a user to turn it back on.

> I also try to find out what is the earliest kernel that we can trigger
> stutter manually. Seems this problem appears between 4.19.5 to 5.4. Anyway,
> that is another topic, I will keep on working on it.
Quick update:
I tested it all the way back to the 4.18 kernel with Ubuntu 18.04, which is the oldest Ubuntu that supports TPM hwrng. (Anything earlier gives a "device not found" error when using "sudo cat /dev/hwrng".)
Unfortunately, I can still manually trigger the stutter.
I wonder if this problem (I mean triggering it manually) has existed ever since TPM hwrng was implemented in the 4.1x.x kernels.
But due to poor support for newer hardware in old kernels, it is very hard for me to bisect the kernel on newer hardware, so I will not do that.
By the way, can the patch go into the mainline before the 6.2 release? Or do we have to wait longer until everything is certain?
Thanks!
> btw, the patch can go into the mainline before the 6.2 release? Or do we have
> to wait for more until everything is sure?
If Mario CCs Linus (and Thorsten) on the patch, it might make it in if Linus takes it directly. Otherwise it'll go through Jarkko's tree, and who knows how long that'll take to percolate. @Mario - if you do intend to CC Linus, you might want to do that todayish, since we're already in the last rc.
Hi @Mario --

+    if ((val1 & (6<<16)) == (6<<16)) {
+        if (val1 >= 0x60000 && val2 >= 0x180006)
+            return false;
+    } else if ((val1 & (3<<16)) == (3<<16)) {
+        if (val1 >= 0x30057 && val2 >= 0x5)
+            return false;
+    }

The logic here still looks weird with that bitmask, since it doesn't preclude a version 7 series. Maybe just do it the boring way of:

    if ((val1 >> 16) == 6) {
        if ((((u64)val1 << 32) | val2) >= 0x0006000000180006ULL)
            return false;
    } else if ((val1 >> 16) == 3) {
        if ((((u64)val1 << 32) | val2) >= 0x0003005700000005ULL)
            return false;
    }

The compiler will then do the right thing with those 64-bit comparisons. We can actually clean up the ORing in the whole function though by hoisting it out. I'll upload a new patch in a second.

Created attachment 303725 [details]
potential patch (v3)
@Mario - let me know how this looks to you. Perhaps it's suitable for submission?
Created attachment 303726 [details]
potential patch (v3)
Er, rather, this.
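The difference between the two checks quoted above can be shown in a standalone userspace sketch. Both functions below are hypothetical rewrites of the quoted snippets, not the submitted patch; they return true where the snippet would hit `return false` (assumed here to mean "firmware contains the fix, RNG not defective"). Apart from the thresholds and the versions reported in this thread, the version numbers mentioned in the note below are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* v2-style check as quoted: bitmask series test, per-field comparison. */
static bool ftpm_fixed_v2_style(uint32_t val1, uint32_t val2)
{
    if ((val1 & (6u << 16)) == (6u << 16))
        return val1 >= 0x60000 && val2 >= 0x180006;
    if ((val1 & (3u << 16)) == (3u << 16))
        return val1 >= 0x30057 && val2 >= 0x5;
    return false;
}

/* "Boring" check: exact series match, single 64-bit comparison. */
static bool ftpm_fixed_u64_style(uint32_t val1, uint32_t val2)
{
    uint64_t ver = ((uint64_t)val1 << 32) | val2;

    if ((val1 >> 16) == 6)
        return ver >= 0x0006000000180006ULL;
    if ((val1 >> 16) == 3)
        return ver >= 0x0003005700000005ULL;
    return false;
}
```

For the versions reported in this thread (0x30039/0x5 and 0x30054/0x5) both variants agree that the fix is absent. They diverge on edge cases: a made-up 0x30058/0x4 is newer than the 3-series threshold as a 64-bit number but fails the per-field test, and a made-up 7-series value such as 0x70000/0x180006 passes the 6-series bitmask test even though it is not a 6-series version.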
So after a whole day of heavy usage (compilations, benchmarking, YouTube, gaming) I've yet to hit a stutter with the patch applied, so I would say this regression is dealt with. I don't have any encrypted disks or such, and as far as I can tell the only thing using the TPM on my system is systemd for boot time measuring. So in my case at least, it's working as intended.

> @Mario - let me know how this looks to you.
> Er, rather, this.
Thanks for the enhancements! I think Co-authored-by should be Co-developed-by IIRC? Also needs your S-o-b when we're both working on something.

> Perhaps it's suitable for submission?
My team still hasn't been able to reproduce the failure as mentioned above, but I think it's reasonable. At best this helps a lot of people, and at worst we find out some day it's not enough and need to block more.

> If Mario CCs Linus (and Thorsten) on the patch, it might make it in if Linus
> takes it directly. Otherwise it'll go through Jarkko's tree, and who knows
> how long that'll take to percolate. @Mario - if you do intend to CC Linus,
> you might want to do that todayish, since we're already in the last rc.
If I can get your ack to change that tag and add the S-o-b, I'll send it out today.

Yea, feel free to change the tag and add my S-o-b.

I forgot to mention it, but last week I reached out to ASUS to ask about a BIOS with the fTPM fix. They said the fTPM fix doesn't cover their laptop line-up, but they did fix it on the desktop motherboards.

So that is unfortunate. I don't know how hard it is to add a minor fix to a laptop BIOS, or whether it is just pure laziness on their part.

(In reply to Bell from comment #57)
> I forgot to mention it, but last week I reached out to ASUS to ask about
> fTPM fix BIOS. They said the fTPM fix doesn't include their laptop line-up,
> but they did fix it on the desktop motherboard.
>
> So that is unfortunate. I don't know how hard to add a minor fix into a
> laptop BIOS, or just pure laziness to them.
Will they ship a fix for the whole AM4 and AM5 lineup?

I have an MSI B550 Mortar with a 5900X and BIOS version 1.D0. This is the latest BIOS version for this board, and it is supposed to support AGESA 1207 which, according to the information in AMD pa-410, is supposed to include the fix. However, with patch version 3 applied, I get

tpm tpm0: AMD fTPM version 0x3005400000005 causes system stutter; hwrng disabled

Where does 0x0003005700000005ULL come from? Is it known that version 0x3005400000005 is indeed bad, or is that just a guess?

> Where does 0x0003005700000005ULL come from ? Is it known that version
> 0x3005400000005 is indeed bad, or is that just a guess ?
You should be able to trigger the problem with this version. As other reporters have said it can take some time though.
This version string comes from internal information at AMD about where the fix is included.
(In reply to Mario Limonciello (AMD) from comment #60)
> > Where does 0x0003005700000005ULL come from ? Is it known that version
> > 0x3005400000005 is indeed bad, or is that just a guess ?
>
> You should be able to trigger the problem with this version. As other
> reporters have said it can take some time though.
>
> This version string comes from internal information at AMD about where the
> fix is included.
Thanks!

Can someone affected please test V2 (https://lore.kernel.org/lkml/20230220180729.23862-1-mario.limonciello@amd.com/T/) and if all looks good reply to it with a tested tag? IE

Tested-by: Foo Bar <foo@bar.com>

> Can someone affected please test V2
Um, what is the difference between V3 and V2? I think there are only some logic changes in V3.
Is it OK if I try the V3 patch from the attachment?
Anyway, I will test V2 and report the result ASAP, thanks!
Sorry, the versioning gets confusing because we versioned both the patches in this bug report as well as the ones "submitted" to upstream. Please test the "v2 that was submitted upstream". We need testers of this version for it to be accepted even if it's logically the same.

We need `Tested-by:` tags specifically for https://lore.kernel.org/lkml/20230220180729.23862-1-mario.limonciello@amd.com/T/ . Otherwise the upstream maintainer won't apply it. He's usually pretty tuned out of discussions and such, so that kind of very visible and explicit "we know that this does work!" is needed. So if https://lore.kernel.org/lkml/20230220180729.23862-1-mario.limonciello@amd.com/T/ works, just reply here with your `Tested-by: First Last <email>` and Mario will send that onward.

Tested-by: Bell <1138267643@qq.com>

I applied and tested the upstream patch on both mainline and the v6.2 release tag, and everything looks fine. I got a dmesg warning about the defective TPM, which shows the patch works. With all my daily-driver software open, I didn't notice any difference.

I applied the patch to 6.2.1, and I get this in the logs:

tpm tpm0: AMD fTPM version 0x3003900000005 causes system stutter; hwrng disabled

It looks like the patch is working.

Created attachment 303811 [details]
tpm2_getcap properties-fixed (LENOVO IdeaPad 5 Pro 14ACN6)

Hi,
Not sure if this is useful at all (feel free to skip); I was not able to test the patch yet. Here is my system info:

LENOVO IdeaPad 5 Pro 14ACN6
BIOS/Firmware GECN33WW(V1.17)
AMD Ryzen 5 5600U
currently running Fedora shipped kernel: 6.1.14-200.fc37.x86_64

This is the latest BIOS/firmware, but I doubt they will ship an update for this issue in a new version. This one was released recently, about a month ago. There is no real option in the BIOS/firmware to disable it or to use another implementation.
I tried some tricks to "unlock" hidden BIOS/firmware features (https://wiki.archlinux.org/title/Lenovo_IdeaPad_5_Pro_14ACN6, section 5.1.1 "Hidden BIOS menus"); there was an option there (3 choices), I tried one, and it didn't help.

This patch is now in the maintainer's tree: https://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd.git/commit/?id=ffe4a34b29aa672f1fa05f49a7e4c3ea4ab4e12f So hopefully it'll make it to Linus' tree in the next rc.

This fix will be introduced in the 6.3-RC series, right?

(In reply to Balazs Vinarz from comment #70)
> This fix will be introduced in the 6.3-RC series, right?
Sadly it did not -- and I mentioned the patch to Linus in my weekly report, but he didn't pick it up. Will ask the maintainer in a few days to send it to Linus this week, if that hasn't happened by then.

Arch Linux / HP laptop 4500U / only AMD APU video card / TPM disabled from BIOS

With kernel 6.2.2, games stutter like hell.
Downgrading back to 6.1 solved the problem.
The stutter started with 6.2. I had no problem with 6.1, but I have had the TPM disabled for a long time.

(In reply to proteve from comment #72)
> arch linux/hp laptop 4500u/only AMD apu video card/tpm disabled from bios
>
> kernel 6.2.2 games stutters like hell
> downgrading back to 6.1 problem solved.
> stutter started wtih 6.2. i had no problem with 6.1 but i have tpm disabled
> since a long time ago.
If you've got your TPM disabled but are getting a stutter issue, then I am afraid you're hitting a different problem. Please open your own issue and if possible bisect it. https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html

That's pretty much the same as what I experience. I hoped the patch would resolve the problem; however, it doesn't matter which TPM I choose (discrete or fTPM) or whether I add the modules to the blacklist. I have two custom tkg kernels at the moment with somewhat the same stutter, but I'll give it a go with the 5.15 LTS tomorrow. 5700G here with the latest BIOS.
(In reply to Balazs Vinarz from comment #74)
> That's pretty much the same as what I experience.
Just to clarify: but you have experienced that for a while already, and not only since 6.2, like proteve@mail.com apparently does (FWIW, who thankfully filed a separate issue here to avoid confusion: https://bugzilla.kernel.org/show_bug.cgi?id=217158 )?

wow, this issue has been open for over a month. I think if this bug caused stuttering on Linus's PC he would be mad AF lol. I understand why it can take such a long time to get into mainline, because not every contributor works on the kernel full-time. I don't want to push people too much; that would be rude. Anyway, I will wait patiently and still appreciate everyone's effort/work. At worst, it might only get merged when some big changes happen in the TPM tree (how long will that take? idk.)

(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #75)
> (In reply to Balazs Vinarz from comment #74)
> > That's pretty much the same as what I experience.
>
> Just to clarify: but you have experienced that for a while already, and not only
> since 6.2, like proteve@mail.com apparently does (FWIW, who thankfully filed
> a separate issue here to avoid confusion:
> https://bugzilla.kernel.org/show_bug.cgi?id=217158 )?
Not really, I just recently bought the 5700G. Just to confirm, 5.15 is not as bad as 6.2, however I can still see some stutter.

I just upgraded to 6.2.6-200.fc37.x86_64 (6.2.6 should have this workaround included if I'm not wrong), but I don't see anything in the kernel log recognizing that the device has a problematic TPM... Is that expected? Also the output of `tpm2_getcap properties-fixed` is the same as before in comment #68 (so no firmware update in between).

Created attachment 303945 [details] attachment-6574-0.html
How do you check kernel messages? I think dmesg gets cleared a few minutes after you boot into the system. Try to use journalctl -k —no-pager | grep tpm instead.
I still see no TPM kernel messages with 6.2.6 for some reason with that journalctl command.
My TPM2_PT_FIRMWARE_VERSION_1 is 0x3002A and VERSION_2 is 0x5 (and the MANUFACTURER is 0x414D4400) 🐸

Hmm, I guess we will have to wait for AMD engineers to respond to you. But you can use your computer as normal to see if this problem still happens. Also sorry for the crap attachment I created. I am in a car using a third-party email client to reply to this thread. I have no idea why it created an attachment. Never mind.

FWIW, if anyone still has issues, please file a new report and afterwards briefly mention it here while dropping a link to the report, as things otherwise quickly might get confusing.

Actually I realized that Arch still ships 6.2.5 🐸 So disregard comment 80 (I can't edit it)

(In reply to Bell from comment #79)
> Try to use journalctl -k —no-pager | grep tpm instead.
This is my fault, the correct command is "journalctl -b --no-pager | grep tpm". My phone's auto-completion turned "--" into "—". Anyway, this patch works perfectly on my Arch Linux with 6.1.19-LTS and 6.2.6 (still in the testing repo btw). I tend to think this is a user issue rather than the patch itself. Just keep using your system; if the problem still happens, like Thorsten said, create a new report :) thanks.

(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #82)
> FWIW, if anyone still has issues, please file a new report and afterwards
> briefly mention it here while dropping a link to the report, as things
> otherwise quickly might get confusing
Hi, Created bug 217212
Regards, Branko

(In reply to Echo J. from comment #80)
> I still see no TPM kernel messages with 6.2.6 for some reason with that
> journalctl command
>
> My TPM2_PT_FIRMWARE_VERSION_1 is 0x3002A and VERSION_2 is 0x5 (and the
> MANUFACTURER is 0x414D4400) 🐸
(In reply to Branko Grubić from comment #78)
> I just upgraded to 6.2.6-200.fc37.x86_64 (6.2.6 should have this workaround
> included if I'm not wrong), but I don't see anything in the kernel log
> recognizing that the device has a problematic TPM...
>
> Is that expected?
>
> Also the output of `tpm2_getcap properties-fixed` is the same as before in comment #68
> (so no firmware update in between).
I don't think this was backported to the 6.2 series by the distro maintainers. The commit was merged into 6.3-rc2 in Linus's tree: https://github.com/torvalds/linux/commit/f1324bbc4011ed8aef3f4552210fc429bcd616da

(In reply to Balazs Vinarz from comment #86)
> I don't think this was backported to the 6.2 series by the distro
> maintainers.
Please check the latest bug report; I can confirm this patch works in Fedora. And we will have a further discussion in that newer report :)

(In reply to Balazs Vinarz from comment #86)
> I don't think this was backported to the 6.2 series
It was backported to the upstream 6.1.y and 6.2.y series early this week, as can be seen by searching in trees like this: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/?h=linux-6.2.y
Whether distros picked up those releases or that fix is obviously a different story.

Sorry, my bad: I also sent an edit to my previous reply and that wasn't updated. So I've found the patch in the meantime: https://github.com/archlinux/linux/commit/e143354b441786c4f356f7c9b1852bc723dbd81b

(In reply to Mario Limonciello (AMD) from comment #38)
> Thanks, that version doesn't have the stuttering fix from the Windows
> advisory. The "potential patch" should help you.
Hi Mario, here on an MSI MAG X570S TORPEDO MAX mobo with the latest stable BIOS (v7D54A4, AGESA v1.2.0.7) I have
TPM2_PT_FIRMWARE_VERSION_1: raw: 0x30054
TPM2_PT_FIRMWARE_VERSION_2: raw: 0x5
just like the author of comment #37; I never experienced any stuttering. I just let cat /dev/hwrng > /dev/null run for 25 minutes on kernel v6.2.5 while watching YouTube and everything was smooth and fine all the time. Since the upgrade to kernel v6.2.7 the message
tpm tpm0: AMD fTPM version 0x3005400000005 causes system stutter; hwrng disabled
has started to appear in dmesg and the /dev/hwrng device is no longer available. The AMD advisory states that AGESA v1.2.0.7 fixed the issue, and having the TPM2 hwrng available can improve entropy. Can you please investigate a little bit further? Is there any kernel parameter available to force-enable the fTPM hwrng? Thanks

From a security perspective, on your platform, I don't think there is any computable difference in terms of entropy with or without that particular hwrng. It's unlikely to make a difference in ways that matter.

(In reply to Jason A. Donenfeld from comment #91)
> From a security perspective, on your platform, I don't think there is any
> computable difference in terms of entropy with or without that particular
> hwrng. It's unlikely to make a difference in ways that matter.
Thanks Jason, this reassures me. Nonetheless, this version of fTPM should probably be excluded from the workaround.

(In reply to Mario Limonciello (AMD) from comment #60)
> > Where does 0x0003005700000005ULL come from ? Is it known that version
> > 0x3005400000005 is indeed bad, or is that just a guess ?
>
> You should be able to trigger the problem with this version. As other
> reporters have said it can take some time though.
>
> This version string comes from internal information at AMD about where the
> fix is included.
Oops, my bad: I didn't read the whole thread thoroughly. Sorry for the noise.
> Nonetheless, this version of fTPM should probably be excluded from the
> workaround.
In discussion with others I've found out it's actually a combination of factors that can lead to this behavior but one of those is the fTPM version that the patch will guard against. Those other factors can't be detected from Linux and so it would mean a hardcoded list of systems that have the "bad" fTPM version but don't have this issue which isn't really sustainable.
As Jason pointed out it's likely not worth doing this.
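
For readers following the version numbers in this thread, here is a minimal sketch of how the logged version appears to relate to the `tpm2_getcap` values. It assumes the 64-bit version is VERSION_1 in the high 32 bits and VERSION_2 in the low 32 bits (which matches the dmesg lines quoted above), and it assumes, as a simplification, that anything below the 0x0003005700000005 value from comment #60 is treated as affected. The function names are hypothetical and this is not the upstream kernel code, which may apply further conditions.

```python
# Value quoted in comment #60; using it as an exclusive upper bound for
# "affected" firmware is an assumption made for this sketch.
FIXED_FTPM_VERSION = 0x0003005700000005

# TPM2_PT_MANUFACTURER value reported for AMD fTPMs in this thread.
AMD_MANUFACTURER = 0x414D4400


def combine_version(version_1: int, version_2: int) -> int:
    """Pack TPM2_PT_FIRMWARE_VERSION_1/_2 into one 64-bit number."""
    return (version_1 << 32) | version_2


def looks_affected(manufacturer: int, version_1: int, version_2: int) -> bool:
    """Hypothetical check: AMD fTPM with firmware older than the fixed version."""
    if manufacturer != AMD_MANUFACTURER:
        return False
    return combine_version(version_1, version_2) < FIXED_FTPM_VERSION


if __name__ == "__main__":
    # Values from the MSI X570S report above (VERSION_1 0x30054, VERSION_2 0x5).
    print(hex(combine_version(0x30054, 0x5)))
    print(looks_affected(AMD_MANUFACTURER, 0x30054, 0x5))
```

Under these assumptions a system reporting 0x30054 / 0x5 is flagged purely by its version number, even if it never stuttered, which is the trade-off described in the comment above.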
I have an older INTEL Asus RoG Strix Z370-F system, with the latest 3004 BIOS (which is from 2021), and I noticed pronounced audio stutters since upgrading to a 6.1 kernel. After finding this bug, and seeing a few comments which suggested it /might/ affect other implementations, I decided to disable my TPM to test. When looking in the BIOS to do that, I noticed that the upgrade to version 3004 (needed for Resizable BAR) had left the setting on "Firmware", so I switched it back to "Hardware" and immediately the audio issues were gone. I have to conclude that this issue is not limited to AMD and /does/ impact other fTPM implementations.

(In reply to A. James Lewis from comment #95)
> I have to conclude that this issue is not limited to AMD and /does/
> impact other fTPM implementations.
Thanks very much for this information. If the Intel platform can hit this problem when using fTPM, I guess there must be something wrong with the fTPM design or the Linux TPM implementation. Anyway, I hope someday someone can find out exactly why.

OK, now I have to apologise for jumping to conclusions, because after more than an hour of not stuttering, it happened again... so now I cannot conclude that it's directly related to this issue, although it does seem to be much reduced. It's definitely something that started when I jumped from 5.19 to 6.1, but I'll have to see if I can isolate it more precisely.

*** Bug 217122 has been marked as a duplicate of this bug. ***