Bug 217212 - AMD Ryzen fTPM stutter even after #216989
Status: RESOLVED CODE_FIX
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: AMD Linux
Importance: P1 normal
Assignee: platform_x86_64@kernel-bugs.osdl.org
 
Reported: 2023-03-17 18:39 UTC by Branko Grubić
Modified: 2023-08-08 00:12 UTC
CC List: 6 users

See Also:
Kernel Version: 6.2.6
Subsystem:
Regression: No
Bisected commit-id:


Attachments
tpm2_getcap properties-fixed (LENOVO IdeaPad 5 Pro 14ACN6) (1.75 KB, text/plain)
2023-03-17 18:39 UTC, Branko Grubić
change the value to see if it is defected (666 bytes, text/x-csrc)
2023-03-18 06:18 UTC, Bell

Description Branko Grubić 2023-03-17 18:39:55 UTC
Created attachment 303974
tpm2_getcap properties-fixed (LENOVO IdeaPad 5 Pro 14ACN6)

Hi,

As I originally commented in bug 216989 comment 68, my system seems to be affected by this firmware bug, and the vendor has not released a firmware update that addresses it.


Even with the fix in Linux 6.2.6, I'm still experiencing the same issue as before (bug 216989, comment 78), and I don't see any kernel log message saying that the workaround for the buggy firmware version has been applied (checked with `dmesg | grep -i tpm` and `journalctl -b -k | grep -i tpm`).

Currently running Fedora 37 with linux:
6.2.6-200.fc37.x86_64

Hardware:  
LENOVO IdeaPad 5 Pro 14ACN6 BIOS/Firmware GECN33WW(V1.17)  
AMD Ryzen 5 5600U

I cannot be 100% sure that this is the issue, but it started happening recently; then I found the fTPM issue announcement and bug 216989, and the issue manifests as described everywhere. Everything works fine, then randomly everything stutters (audio, mouse cursor, ...) for a second or two, and then everything goes back to normal. General system load is fine (CPU, memory usage, disk I/O).
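
The kernel-log check described above can be scripted so that it also works on saved logs. This is a minimal sketch: the helper name `ftpm_workaround_logged` is hypothetical, and the default pattern `hwrng disabled` is only an assumption about the wording of the workaround message, so pass whatever text your kernel actually prints.

```shell
# Hypothetical helper: reads a kernel log on stdin and succeeds (exit 0)
# if any TPM-related line also matches the given pattern.
# The default pattern is an ASSUMPTION about the workaround message wording.
ftpm_workaround_logged() {
    local pattern=${1:-'hwrng disabled'}
    # First keep only lines that mention "tpm", then look for the pattern.
    grep -i 'tpm' | grep -qi -- "$pattern"
}
```

Possible usage: `dmesg | ftpm_workaround_logged && echo "workaround applied"`, or the same with `journalctl -b -k` as the input.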
Comment 1 Jason A. Donenfeld 2023-03-17 18:46:07 UTC
Can you bisect?
Comment 2 reach622 2023-03-17 19:27:39 UTC
Does it still happen if the kernel is compiled with CONFIG_HW_RANDOM_TPM=n ?
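
To see how a given kernel was configured before or after rebuilding, the option can be read from the kernel config file. A minimal sketch, assuming the distribution installs the config as `/boot/config-$(uname -r)` (as Fedora does); the helper name `kconfig_state` is hypothetical.

```shell
# Hypothetical helper: print the value of a Kconfig option from a config
# file ("y", "m", or another assigned value), or "n" if the option is
# commented out ("# CONFIG_FOO is not set") or absent.
kconfig_state() {
    local option=$1 file=$2 value
    value=$(sed -n "s/^${option}=//p" "$file" | head -n 1)
    echo "${value:-n}"
}
```

Possible usage: `kconfig_state CONFIG_HW_RANDOM_TPM "/boot/config-$(uname -r)"`. The file describes the kernel it was installed with, so make sure it belongs to the kernel that is actually running.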
Comment 3 Bell 2023-03-18 04:00:56 UTC
Um, since you are using a Lenovo product, I think there is an option in the BIOS that lets you turn off fTPM completely.

The patch itself doesn't truly solve the problem; the goal is to help people who can't disable fTPM in the BIOS (like me).

But if you don't have this option in your BIOS, feel free to report this bug.
Comment 4 Bell 2023-03-18 05:48:12 UTC
Quick update: I installed Fedora 37 and updated it to kernel 6.2.6 on my ASUS system.

Can confirm this patch works in the 6.2.6-fedora kernel.

I checked your "tpm2_getcap" file, and it looks like you have a newer firmware version that might not be recognized as "defective" by the patch.

This needs some investigation.
Comment 5 Bell 2023-03-18 06:18:16 UTC
Created attachment 303978
change the value to see if it is defected

Another quick update:

I took Mario's patch and turned it into a simple C program that takes an fTPM version as input and checks whether it is on the "defective" list.

It turns out your fTPM version is not included.
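
The attached C program is not reproduced here. The following is a rough shell sketch of the same kind of check, under stated assumptions: the helper name `ftpm_is_flagged` is hypothetical, and the version layout (family in the top 16 bits) and the two threshold constants are quoted from memory of the upstream workaround, so verify them against the kernel source before relying on them. The inputs are the raw `TPM2_PT_FIRMWARE_VERSION_1` and `TPM2_PT_FIRMWARE_VERSION_2` values from `tpm2_getcap properties-fixed`. The kernel also checks that the TPM manufacturer is AMD, which this sketch does not do.

```shell
# Hypothetical helper: succeeds (exit 0) if the fTPM firmware version would
# be treated as affected. Thresholds are ASSUMPTIONS recalled from the
# upstream patch; requires a shell with 64-bit arithmetic (e.g. bash).
ftpm_is_flagged() {
    [ $# -eq 2 ] || return 2
    local version family
    version=$(( ($1 << 32) | $2 ))   # combine the two 32-bit halves
    family=$(( version >> 48 ))      # top 16 bits select the firmware family
    case $family in
        6) [ "$version" -lt $(( 0x0006000000180006 )) ] ;;
        3) [ "$version" -lt $(( 0x0003005700000005 )) ] ;;
        *) return 1 ;;               # other families are not flagged
    esac
}
```

Possible usage: `ftpm_is_flagged 0x00030057 0x00000004 && echo "flagged"`. A version that is not flagged here only means it is outside the assumed ranges, not that the firmware is known to be good.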
Comment 6 reach622 2023-03-18 07:00:23 UTC
That might mean more fTPM versions need to be included in the patch.
Comment 7 Mario Limonciello (AMD) 2023-03-19 01:41:43 UTC
There can very well be multiple issues that manifest as a stutter, but they don't all have the same root cause as the HW RNG functionality.

Please let's follow the suggestion above to turn off HW RNG in the kernel config and see whether it still happens for you.
Comment 8 Branko Grubić 2023-03-20 08:39:16 UTC
Thanks everyone for the comments and suggestions, and sorry for the late answer.

I haven't compiled a kernel for some time. Yesterday I tried rebuilding it as an RPM package rather than building the kernel manually (I was not sure whether reusing the Fedora kernel config would work if they carry custom downstream or backported patches).

But I'm not sure if I did everything right.

I just manually changed the config, commenting out the line (as other disabled options are):

```
$ grep CONFIG_HW_RANDOM_TPM /boot/config-$(uname -r)
# CONFIG_HW_RANDOM_TPM is not set
```

But I cannot verify whether it is disabled or not. Is there a way to verify it?

The device node (/dev/hwrng) still exists, but reading from it fails:

```
$ ls -l /dev/hwrng
crw-------. 1 root root 10, 183 Mar 20 09:28 /dev/hwrng

$ cat /dev/hwrng > /dev/null
cat: /dev/hwrng: No such device
```

If this confirms it is disabled, I'll continue using this kernel for a few days and check whether the stutter still happens.


Once again, sorry for not answering everyone or testing all options. Bisecting wouldn't be an easy option on this hardware (it takes a lot of time to compile and makes a lot of noise/heat).

Regards,
Branko
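
One way to answer the verification question above is to look at which hardware RNGs the running kernel has registered. A minimal sketch, assuming the hw_random sysfs interface at `/sys/class/misc/hw_random/rng_available` and assuming that the TPM registers under a name starting with `tpm-rng`; the helper name `tpm_rng_registered` is hypothetical.

```shell
# Hypothetical helper: succeeds (exit 0) if a TPM-backed RNG appears in the
# list of registered hardware RNGs, fails with 1 if it does not, and with 2
# if the list cannot be read. The "tpm-rng" name prefix is an ASSUMPTION.
tpm_rng_registered() {
    local file=${1:-/sys/class/misc/hw_random/rng_available}
    [ -r "$file" ] || return 2
    # The file holds a space-separated list of names; check one per line.
    tr ' ' '\n' < "$file" | grep -q '^tpm-rng'
}
```

Possible usage: `tpm_rng_registered && echo "TPM RNG registered" || echo "no TPM RNG"`. If the list is empty, no hardware RNG backs `/dev/hwrng`, which would be consistent with the `No such device` error from `cat`.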
Comment 9 Mario Limonciello (AMD) 2023-03-20 20:32:16 UTC
Sounds right to me.

You might also look at bug 217158, which points to a very specific combination of graphical software as the cause.
Comment 10 Branko Grubić 2023-03-21 08:04:35 UTC
(In reply to Mario Limonciello (AMD) from comment #9)
> Sounds right to me.
> 
> You might also look at bug 217158, which points to a very
> specific combination of graphical software as the cause.

I saw that bug already mentioned in the original fTPM report, but the problem started for me before kernel 6.2, most likely on 6.1 like everyone else (I have had this system for 6~8 months).

Regarding the kernel built without CONFIG_HW_RANDOM_TPM: so far it has been running for 24h (I haven't used it for all 24h), and up to this moment I haven't experienced any stutter.

I'll continue using it like this and see over the next day or two how it behaves.
Comment 11 Branko Grubić 2023-03-23 07:42:25 UTC
After a few days of use with a kernel built without CONFIG_HW_RANDOM_TPM, I couldn't reproduce this issue. Using the computer the same way as I do every day.

Is it possible to verify if this specific firmware version is also affected?
Comment 12 Mario Limonciello (AMD) 2023-03-23 13:41:46 UTC
>Is it possible to verify if this specific firmware version is also affected?

Yeah, I checked with the internal team, but it shouldn't be.  Can you confirm the AGESA version in your BIOS?

And you are on the latest BIOS?

>After a few days of use with a kernel built without CONFIG_HW_RANDOM_TPM, I
>couldn't reproduce this issue. Using the computer the same way as I do every day.

Something we can "consider" is to just disable fTPM RNG entirely for AMD.
Comment 13 Jason A. Donenfeld 2023-03-23 14:23:43 UTC
> Something we can "consider" is to just disable fTPM RNG entirely for AMD.

The quality of bug reporting has dramatically decreased in the last weeks, so I wouldn't make any decisions based on the quasi-information from here. Until we get some really solid bug reports where it's clear that a rigorous process is being carried out, I suspect we should treat these as "user testing error" or "an unrelated bug".
Comment 14 Branko Grubić 2023-03-23 17:15:57 UTC
(In reply to Mario Limonciello (AMD) from comment #12)
> >Is it possible to verify if this specific firmware version is also affected?
> 
> Yeah, I checked with the internal team, but it shouldn't be.  Can you confirm
> the AGESA version in your BIOS?

What I got from the BIOS is:
CezannePI-FP6 1.0.0.B

> 
> And you are on the latest BIOS?

Yes, this is currently the latest officially available BIOS (I double-checked on the support website, and it's still the same version: https://download.lenovo.com/consumer/mobiles/gecn33ww.txt).

> 
> >After a few days of use with a kernel built without CONFIG_HW_RANDOM_TPM, I
> >couldn't reproduce this issue. Using the computer the same way as I do every
> >day.
> 
> Something we can "consider" is to just disable fTPM RNG entirely for AMD.
Comment 15 Mario Limonciello (AMD) 2023-03-23 19:13:20 UTC
> What I got from the BIOS is:
> CezannePI-FP6 1.0.0.B

Yes; the AMD fix was introduced two versions before this.

You might be hitting an OEM/model specific BIOS bug.  

> Until we get some really solid bug reports where it's clear that a rigorous
> process is being carried out, I suspect we should treat these as "user
> testing error" or "an unrelated bug".

Maybe we should just blacklist this system from registering fTPM with RNG?
Comment 16 manliodp 2023-05-06 20:15:57 UTC
Hello,
I can report the same issue even after the supposed fix, and it's very bothersome as it impacts gaming, browsing and media playback.
Please expose a parameter that makes the TPM HW RNG optional without the need to recompile the kernel; I think this is crucial for everyday PC usage.
Leave the choice up to the user.

Thanks a lot in advance
Comment 17 Bell 2023-05-07 02:59:25 UTC
Seems like some people are still suffering from this problem even though their fTPM version is not listed as "defective".

So is it a good idea to add a parameter that lets the user control fTPM manually? Or just block them all? @Mario Limonciello (AMD)
Comment 18 Jason A. Donenfeld 2023-05-07 10:15:52 UTC
With all of these one-off reports, I think we need a compelling test description to confirm what the actual issue is. So far I haven't heard a compelling description that this is caused by fTPM other than reporters mentioning it in this bug report.
Comment 19 manliodp 2023-05-07 12:22:41 UTC
It's very difficult to reproduce the issue in a systematic way.

Is it so "wrong" to leave power to the user?

Is there some "convenience" in avoiding a parameter for something that is causing issues for some users?

Unfortunately I have no way to disable TPM in UEFI, so please make it optional at the OS layer.

Linux has always been about choice.

Just my opinion.

Thanks
Comment 20 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-05-07 12:32:02 UTC
(In reply to manliodp from comment #19)
> Linux has always been about choice.

Nope: http://www.islinuxaboutchoice.com/
Comment 21 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-05-07 12:35:06 UTC
(In reply to manliodp from comment #16)
> I can report the same issue even after the supposed fix 

Then please open a separate bug and drop the link here; we got a few reports already that had similar symptoms, but turned out to be different issues. Trying to sort this out in a ticket about a bug that solved the issue for some people just gets confusing, hence it's a recipe to make developers ignore a bug.
Comment 22 manliodp 2023-05-07 13:02:15 UTC
Sorry but this bug is in "NEW" status and already separate from the original one with the fix you are mentioning.
Comment 23 Bell 2023-05-07 13:09:41 UTC
My problem was already solved, but I picked this up again to see if I can find more useful information.

First, I finally booted into the 4.15.0 kernel on my laptop (Ubuntu 16.04.7 LTS).

Second, I found that in 4.15.0 /dev/hwrng doesn't exist (no such device or file). Then I downloaded 4.16.0-rc1 from the Ubuntu archive, and this time /dev/hwrng is actually there and I can cat it.

Third, "sudo cat /dev/hwrng > /dev/null" or just "sudo cat /dev/hwrng" can still trigger stuttering in 4.16.0-rc1.

Fourth, looking into the 4.15 to 4.16 changelog, I didn't see any change to hwrng or TPM. So why does 4.15 -> 4.16 make a difference for /dev/hwrng?

This needs a git bisect to find out.

But after that many tests, I am starting to doubt whether the problem is actually caused by the fTPM. We need a lot more tests (different hardware/settings/kernels). I'm going to try my best to find more hardware.
Comment 24 Bell 2023-05-07 13:54:43 UTC
Well, I borrowed two Lenovo laptops from my mates, one with a 6800H and one with a 5800H. Both show "defective firmware" (checked by booting into the 6.2.8 Arch ISO and looking at dmesg).

However, I can't reproduce any stuttering on the 6.1.0 kernel (tested on a USB Arch installation), even with ten "cat /dev/hwrng" running in parallel for half an hour.

So yeah, I have no idea now.
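
The parallel-read test described above can be written down so that others run it the same way. A minimal sketch, assuming GNU coreutils `timeout` is available; the helper name `hwrng_stress` and its defaults (ten readers, 1800 seconds) are hypothetical and simply mirror the numbers in this comment. It only generates the load; the stutter itself still has to be observed by the user.

```shell
# Hypothetical helper: start several parallel readers on a device for a
# fixed time and wait for all of them to finish.
# Arguments: device (default /dev/hwrng), readers (default 10),
# duration in seconds (default 1800).
hwrng_stress() {
    local dev=${1:-/dev/hwrng} readers=${2:-10} secs=${3:-1800} i=0
    [ -r "$dev" ] || { echo "cannot read $dev" >&2; return 1; }
    while [ "$i" -lt "$readers" ]; do
        # Each reader discards its data and is stopped by timeout.
        timeout "$secs" cat "$dev" > /dev/null 2>&1 &
        i=$((i + 1))
    done
    wait
    echo "ran $readers readers on $dev for ${secs}s"
}
```

Possible usage: run `hwrng_stress /dev/hwrng 10 1800` as root (the device node is typically readable by root only) while using the desktop normally, and note the kernel version and whether any stutter occurs.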
Comment 25 Jason A. Donenfeld 2023-05-07 14:38:45 UTC
(In reply to Bell from comment #23)
> My problem was already solved, but I picked this up again to see if I can
> find more useful information.
> 
> First, I finally booted into the 4.15.0 kernel on my laptop (Ubuntu 16.04.7 LTS).
> 

Vendor kernels can go through vendors.
