Bug 217804 - REGRESSION WITH BISECT: TPM patch breaks S3 on some Intel systems
Summary: REGRESSION WITH BISECT: TPM patch breaks S3 on some Intel systems
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Other
Hardware: Intel Linux
Importance: P3 normal
Assignee: drivers_other
URL:
Keywords:
Depends on:
Blocks: 178231
Reported: 2023-08-17 20:59 UTC by Todd Brandt
Modified: 2023-09-07 13:59 UTC
CC List: 10 users

See Also:
Kernel Version: 6.5.0-rc6
Subsystem:
Regression: Yes
Bisected commit-id: 554b841d470338a3b1d6335b14ee1cd0c8f5d754


Attachments
6.5.0-rc5-otcpl-alder-lake1-dmesg-boot.log (123.81 KB, text/plain)
2023-08-18 16:56 UTC, Todd Brandt
6.5.0-rc5-otcpl-alder-lake2-dmesg-boot.log (118.67 KB, text/plain)
2023-08-18 16:57 UTC, Todd Brandt
6.5.0-rc5-otcpl-jasper-lake-dmesg-boot.log (125.94 KB, text/plain)
2023-08-18 16:58 UTC, Todd Brandt
6.5.0-rc5-otcpl-raptor-lake.dmesg-boot.log (124.85 KB, text/plain)
2023-08-18 16:59 UTC, Todd Brandt
6.5.0-rc5-otcpl-rocket-lake-dmesg-boot.log (110.46 KB, text/plain)
2023-08-18 16:59 UTC, Todd Brandt
6.5.0-rc6-otcpl-alder-lake1-dmesg-boot.log (124.01 KB, text/plain)
2023-08-18 17:00 UTC, Todd Brandt
6.5.0-rc6-otcpl-alder-lake2-dmesg-boot.log (121.39 KB, text/plain)
2023-08-18 17:00 UTC, Todd Brandt
6.5.0-rc6-otcpl-jasper-lake-dmesg-boot.log (125.93 KB, text/plain)
2023-08-18 17:00 UTC, Todd Brandt
6.5.0-rc6-otcpl-raptor-lake-dmesg-boot.log (126.35 KB, text/plain)
2023-08-18 17:01 UTC, Todd Brandt
6.5.0-rc6-otcpl-rocket-lake-dmesg-boot.log (110.51 KB, text/plain)
2023-08-18 17:01 UTC, Todd Brandt
6.5.0-rc6-mariofix-otcpl-alder-lake1-dmesg-boot-and-s3.log (103.05 KB, text/plain)
2023-08-18 17:38 UTC, Todd Brandt
6.5.0-rc6-mariofix-otcpl-alder-lake2-dmesg-boot-and-s3.log (109.92 KB, text/plain)
2023-08-18 17:39 UTC, Todd Brandt
6.5.0-rc6-mariofix-otcpl-jasper-lake-dmesg-boot-and-s3.log (96.63 KB, text/plain)
2023-08-18 17:39 UTC, Todd Brandt
6.5.0-rc6-mariofix-otcpl-raptor-lake-dmesg-boot-and-s3.log (111.57 KB, text/plain)
2023-08-18 17:40 UTC, Todd Brandt
6.5.0-rc6-mariofix-otcpl-rocket-lake-dmesg-boot-and-s3.log (97.87 KB, text/plain)
2023-08-18 17:40 UTC, Todd Brandt
otcpl-alder-lake1-tpm2_getcap.log (4.69 KB, text/plain)
2023-08-18 18:01 UTC, Todd Brandt

Description Todd Brandt 2023-08-17 20:59:31 UTC
While testing S3 on 6.5.0-rc6 we've found that 5 systems crash and reboot when S3 suspend is initiated. To reproduce it, this call is all that's required: "sudo sleepgraph -m mem -rtcwake 15". I've bisected the issue to this patch:

commit 554b841d470338a3b1d6335b14ee1cd0c8f5d754
Author: Mario Limonciello <mario.limonciello@amd.com>
Date:   Wed Aug 2 07:25:33 2023 -0500

    tpm: Disable RNG for all AMD fTPMs
    
    The TPM RNG functionality is not necessary for entropy when the CPU
    already supports the RDRAND instruction. The TPM RNG functionality
    was previously disabled on a subset of AMD fTPM series, but reports
    continue to show problems on some systems causing stutter root caused
    to TPM RNG functionality.
    
    Expand disabling TPM RNG use for all AMD fTPMs whether they have versions
    that claim to have fixed or not. To accomplish this, move the detection
    into part of the TPM CRB registration and add a flag indicating that
    the TPM should opt-out of registration to hwrng.

By reverting this patch in 6.5.0-rc6 the problem goes away, so it's pretty clear that this commit is at fault. I've done further debugging and I've found that if I simply comment out these lines in 6.5.0-rc6 the problem goes away. So the "crb_check_flags" call is the root cause.

diff --git a/drivers/char/tpm/tpm_crb.c b/drivers/char/tpm/tpm_crb.c
index 9eb1a1859012..20ce8102e6bd 100644
--- a/drivers/char/tpm/tpm_crb.c
+++ b/drivers/char/tpm/tpm_crb.c
@@ -826,9 +826,9 @@ static int crb_acpi_add(struct acpi_device *device)
        if (rc)
                goto out;
 
-       rc = crb_check_flags(chip);
-       if (rc)
-               goto out;
+//     rc = crb_check_flags(chip);
+//     if (rc)
+//             goto out;
 
        rc = tpm_chip_register(chip);
Comment 1 Todd Brandt 2023-08-17 23:18:08 UTC
Narrowing it even further, the problem actually isn't in any new code, it's from an existing call inside the new code. So tpm2_get_tpm_pt with these arguments causes the crash. By removing these lines, the problem goes away.

diff --git a/drivers/char/tpm/tpm_crb.c b/drivers/char/tpm/tpm_crb.c
index 9eb1a1859012..2ecaa3b7d1aa 100644
--- a/drivers/char/tpm/tpm_crb.c
+++ b/drivers/char/tpm/tpm_crb.c
@@ -472,9 +472,9 @@ static int crb_check_flags(struct tpm_chip *chip)
        if (ret)
                return ret;
 
-       ret = tpm2_get_tpm_pt(chip, TPM2_PT_MANUFACTURER, &val, NULL);
-       if (ret)
-               goto release;
+//     ret = tpm2_get_tpm_pt(chip, TPM2_PT_MANUFACTURER, &val, NULL);
+//     if (ret)
+//             goto release;
 
        if (val == 0x414D4400U /* AMD */)
                chip->flags |= TPM_CHIP_FLAG_HWRNG_DISABLED;
Comment 2 Mario Limonciello (AMD) 2023-08-18 01:20:08 UTC
Can you post the kernel log?  Particularly including the crash.
Does the TPM otherwise work before suspend?
Without the patch does the TPM work after suspend?
Comment 3 Raymond Jay Golo 2023-08-18 10:50:34 UTC
(In reply to Todd Brandt from comment #1)
> Narrowing it even further, the problem actually isn't in any new code, it's
> from an existing call inside the new code. So tpm2_get_tpm_pt with these
> arguments causes the crash. By removing these lines, the problem goes away.
> 
> diff --git a/drivers/char/tpm/tpm_crb.c b/drivers/char/tpm/tpm_crb.c
> index 9eb1a1859012..2ecaa3b7d1aa 100644
> --- a/drivers/char/tpm/tpm_crb.c
> +++ b/drivers/char/tpm/tpm_crb.c
> @@ -472,9 +472,9 @@ static int crb_check_flags(struct tpm_chip *chip)
>         if (ret)
>                 return ret;
>  
> -       ret = tpm2_get_tpm_pt(chip, TPM2_PT_MANUFACTURER, &val, NULL);
> -       if (ret)
> -               goto release;
> +//     ret = tpm2_get_tpm_pt(chip, TPM2_PT_MANUFACTURER, &val, NULL);
> +//     if (ret)
> +//             goto release;
>  
>         if (val == 0x414D4400U /* AMD */)
>                 chip->flags |= TPM_CHIP_FLAG_HWRNG_DISABLED;

This also helps with a different problem I've encountered where a Lenovo Legion laptop can't decrypt the root partition after updating to 6.4.11 since it can't seem to access TPM stored keys. After applying this patch, boot succeeds as expected.
Comment 4 Mario Limonciello (AMD) 2023-08-18 13:48:53 UTC
> This also helps with a different problem I've encountered where a Lenovo
> Legion laptop can't decrypt the root partition after updating to 6.4.11 since
> it can't seem to access TPM stored keys. After applying this patch, boot
> succeeds as expected.

Can you please share a kernel log from 6.4.10 and another from 6.4.11?
Comment 5 Raymond Jay Golo 2023-08-18 13:57:08 UTC
(In reply to Mario Limonciello (AMD) from comment #4)
> > This also helps with a different problem I've encountered where a Lenovo
> > Legion laptop can't decrypt the root partition after updating to 6.4.11
> since
> > it can't seem to access TPM stored keys. After applying this patch, boot
> > succeeds as expected.
> 
> Can you please share a kernel log from 6.4.10 and another from 6.4.11?

Unfortunately, it's already outside work hours and the affected laptop is at my office. I'll try to forward the dmesgs on Tuesday if ever.
Comment 6 Mario Limonciello (AMD) 2023-08-18 13:59:41 UTC
> Unfortunately, it's already outside work hours and the affected laptop is at
> my office. I'll try to forward the dmesgs on Tuesday if ever.

While waiting for logs with useful information can you tell me more about your system?  Is it a fTPM, dTPM or Pluton?  Is it Intel or AMD system?

Assuming you had a way to finish your boot (backup key or similar) with 6.4.11 did this totally break TPM functionality?  IE could you still run things like
# tpm2_getcap properties-fixed
Comment 7 Raymond Jay Golo 2023-08-18 14:18:22 UTC
(In reply to Mario Limonciello (AMD) from comment #6)
> > Unfortunately, it's already outside work hours and the affected laptop is
> at
> > my office. I'll try to forward the dmesgs on Tuesday if ever.
> 
> While waiting for logs with useful information can you tell me more about
> your system?  Is it a fTPM, dTPM or Pluton?  Is it Intel or AMD system?
> 
> Assuming you had a way to finish your boot (backup key or similar) with
> 6.4.11 did this totally break TPM functionality?  IE could you still run
> things like
> # tpm2_getcap properties-fixed

It's an Intel system. Not sure if it's fTPM or dTPM but it's for sure not Pluton.
Comment 8 Mario Limonciello (AMD) 2023-08-18 14:24:35 UTC
> It's an Intel system. Not sure if it's fTPM or dTPM but it's for sure not
> Pluton.

As it's a Lenovo system, it's also possible it got caught up in some other commits that happened in 6.4.11 related to interrupt handling.

These are the two sets of TPM-related commits that landed in 6.4.11:

d78177be9b01 tpm_tis: Opt-in interrupts
27722a5a5c30 tpm: tpm_tis: Fix UPX-i11 DMI_MATCH condition
6b718101cd99 tpm/tpm_tis: Disable interrupts for Lenovo P620 devices
bc3d1e146f83 tpm/tpm_tis: Disable interrupts for TUXEDO InfinityBook S 15/17 Gen7

d75c2b5e06bc tpm: Add a helper for checking hwrng enabled
872fe964648c tpm: Disable RNG for all AMD fTPMs
Comment 9 Mario Limonciello (AMD) 2023-08-18 14:25:16 UTC
You might try this patch: https://lore.kernel.org/stable/20230810182433.518523-1-jarkko@kernel.org/
Comment 10 Raymond Jay Golo 2023-08-18 14:32:49 UTC
(In reply to Mario Limonciello (AMD) from comment #8)
> > It's an Intel system. Not sure if it's fTPM or dTPM but it's for sure not
> > Pluton.
> 
> As it's a Lenovo system, it's also possible it got caught up in some other
> commits that happened in 6.4.11 related to interrupt handling.
> 
> These are the two sets of commits that happened TPM related in 6.4.11:
> 
> d78177be9b01 tpm_tis: Opt-in interrupts
> 27722a5a5c30 tpm: tpm_tis: Fix UPX-i11 DMI_MATCH condition
> 6b718101cd99 tpm/tpm_tis: Disable interrupts for Lenovo P620 devices
> bc3d1e146f83 tpm/tpm_tis: Disable interrupts for TUXEDO InfinityBook S 15/17
> Gen7
> 
> d75c2b5e06bc tpm: Add a helper for checking hwrng enabled
> 872fe964648c tpm: Disable RNG for all AMD fTPMs

I already tried with the d78177be9b01 revert and it didn't work. As for 6b718101cd99, I didn't think it matches the PRODUCT_VERSION since mine is a Legion Laptop and not a CPU workstation type.
Comment 11 Mario Limonciello (AMD) 2023-08-18 14:36:59 UTC
> As for 6b718101cd99, I didn't think it matches the PRODUCT_VERSION since mine
> is a Legion Laptop and not a CPU workstation type.

Right; that's why I suggested the other patch that matches all Lenovo.  But...

> I already tried with the d78177be9b01 revert and it didn't work. 

If reverting that didn't help then I suspect it is caused by the property query command at initialization for some reason.
Comment 12 Todd Brandt 2023-08-18 16:11:18 UTC
(In reply to Mario Limonciello (AMD) from comment #2)
> Can you post the kernel log?  Particularly including the crash.
> Does the TPM otherwise work before suspend?
> Without the patch does the TPM work after suspend?

Sadly the S3 crash wipes the dmesg logs, so there's nothing to share, but also I should note that these 5 systems don't have TPM chips. So it's not a problem with TPM hardware, it's a problem with the lack of TPM hardware.

I'm digging through all the machines I can to find others that show the issue, but so far it's only with the test systems. I can share the boot dmesg logs from one of them for 6.5.0-rc5 and 6.5.0-rc6 if you think that will help?
Comment 13 Mario Limonciello (AMD) 2023-08-18 16:13:58 UTC
>  So it's not a problem with TPM hardware, it's a problem with the lack of TPM
>  hardware.

When you say lack of TPM hardware; do the machines revert to Intel fTPM when no dTPM is installed?  Perhaps that's where the issue is.

> I'm digging through all the machines I can to find others that show the
> issue, but so far it's only with the test systems. I can share the boot dmesg
> logs from one of them for 6.5.0-rc5 and 6.5.0-rc6 if you think that will
> help?

Yeah, I think that would be really helpful.

Assuming they have fTPM the other thing that I think we need is the output from this command on both kernels:

# tpm2_getcap properties-fixed
Comment 14 Todd Brandt 2023-08-18 16:20:30 UTC
Oh hey, another big piece of the puzzle! The dmesg boot logs do in fact change between 6.5.0-rc5 and 6.5.0-rc6. I'll go through and attach boot logs for the 5 affected machines; in the meantime, here's a grep of the TPM parts of the boot dmesg for 2 of them:

otcpl alder lake 6.5.0-rc5 boot

[    0.000000] efi: ACPI=0x44bfe000 ACPI 2.0=0x44bfe014 TPMFinalLog=0x44a6b000 SMBIOS=0x41793000 MEMATTR=0x3ea86018 ESRT=0x3eaa0698 TPMEventLog=0x3ea71018 
[    0.003958] ACPI: TPM2 0x0000000044BEC000 00004C (v04 INTEL  ADL-P-M  00000002      01000013)
[    0.004020] ACPI: Reserving TPM2 table memory at [mem 0x44bec000-0x44bec04b]

otcpl alder lake 6.5.0-rc6 boot

[    0.000000] efi: ACPI=0x44bfe000 ACPI 2.0=0x44bfe014 TPMFinalLog=0x44a6b000 SMBIOS=0x41793000 MEMATTR=0x3ea86018 ESRT=0x3eaa0698 TPMEventLog=0x3ea71018 
[    0.004051] ACPI: TPM2 0x0000000044BEC000 00004C (v04 INTEL  ADL-P-M  00000002      01000013)
[    0.004119] ACPI: Reserving TPM2 table memory at [mem 0x44bec000-0x44bec04b]
[    1.982340] ima: No TPM chip found, activating TPM-bypass!

otcpl jasper lake 6.5.0-rc5 boot

[    0.000000] efi: ACPI=0x76bfe000 ACPI 2.0=0x76bfe014 TPMFinalLog=0x76b20000 SMBIOS=0x73a8f000 MEMATTR=0x708d6018 ESRT=0x716f9b98 TPMEventLog=0x70e6a018 
[    0.004317] ACPI: TPM2 0x0000000076BF8000 00004C (v04 INTEL  JSL-ULX  00000002      01000013)
[    0.004386] ACPI: Reserving TPM2 table memory at [mem 0x76bf8000-0x76bf804b]
[    1.174248] pinctrl core: registered pin 24 (SPI0_TPM_CSB) on INT34C8:00

otcpl jasper lake 6.5.0-rc6 boot

[    0.000000] efi: ACPI=0x76bfe000 ACPI 2.0=0x76bfe014 TPMFinalLog=0x76b20000 SMBIOS=0x73a8f000 MEMATTR=0x708d6018 ESRT=0x716f9b98 TPMEventLog=0x70e6a018 
[    0.004776] ACPI: TPM2 0x0000000076BF8000 00004C (v04 INTEL  JSL-ULX  00000002      01000013)
[    0.004851] ACPI: Reserving TPM2 table memory at [mem 0x76bf8000-0x76bf804b]
[    1.013354] ima: No TPM chip found, activating TPM-bypass!
[    1.150460] pinctrl core: registered pin 24 (SPI0_TPM_CSB) on INT34C8:00
Comment 15 Mario Limonciello (AMD) 2023-08-18 16:27:36 UTC
> oh hey, another big piece of the puzzle! the dmesg boot logs do in fact
> change between 6.5.0-rc5 and 6.5.0-rc6.

That's good news for root causing this issue.
It's bad news for the impact of it though.

As a guess, does this "fix" it?  If so, I think we still need to dig a level deeper into why tpm2_get_tpm_pt isn't working.

diff --git a/drivers/char/tpm/tpm_crb.c b/drivers/char/tpm/tpm_crb.c
index 9eb1a18590123..b0e9931fe436c 100644
--- a/drivers/char/tpm/tpm_crb.c
+++ b/drivers/char/tpm/tpm_crb.c
@@ -472,8 +472,7 @@ static int crb_check_flags(struct tpm_chip *chip)
        if (ret)
                return ret;

-       ret = tpm2_get_tpm_pt(chip, TPM2_PT_MANUFACTURER, &val, NULL);
-       if (ret)
+       if (tpm2_get_tpm_pt(chip, TPM2_PT_MANUFACTURER, &val, NULL))
                goto release;

        if (val == 0x414D4400U /* AMD */)
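The idea behind the diff above can be sketched in userspace C: a failed manufacturer query is tolerated instead of being returned to the caller, so the probe carries on. This is only an illustration of the pattern in this comment, not the fix that was eventually merged. Every name below (check_flags, the query_* stubs, FLAG_HWRNG_DISABLED) is hypothetical and stands in for the kernel symbols; only the 0x414D4400 constant is taken from the diff.

```c
#include <stdint.h>

/* Hypothetical stand-in for TPM_CHIP_FLAG_HWRNG_DISABLED. */
#define FLAG_HWRNG_DISABLED 0x1U
/* "AMD\0" as a big-endian 32-bit value, as used in the diff. */
#define MANUFACTURER_AMD 0x414D4400U

/* Hypothetical callback type replacing tpm2_get_tpm_pt(). */
typedef int (*query_fn)(uint32_t *val);

/* Stub: query fails the way the affected Intel systems did. */
static int query_fails(uint32_t *val)
{
	(void)val;
	return 378;
}

/* Stub: query succeeds and reports an AMD manufacturer ID. */
static int query_amd(uint32_t *val)
{
	*val = MANUFACTURER_AMD;
	return 0;
}

/* Stub: query succeeds and reports "INTC". */
static int query_intel(uint32_t *val)
{
	*val = 0x494E5443U;
	return 0;
}

/*
 * Set the hwrng-disabled flag only when the manufacturer is known to be
 * AMD. A failed query leaves the flags untouched and still returns 0,
 * so the caller does not abort the probe.
 */
static int check_flags(query_fn query_manufacturer, unsigned int *flags)
{
	uint32_t val = 0;

	if (query_manufacturer(&val))
		return 0; /* tolerate the failure */

	if (val == MANUFACTURER_AMD)
		*flags |= FLAG_HWRNG_DISABLED;
	return 0;
}
```

With the original code, the 378 from the failing stub would have been returned as the probe result, which matches the "failed with error 378" symptom reported later in this thread.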
Comment 16 Mario Limonciello (AMD) 2023-08-18 16:31:31 UTC
And this should at least let us dig a level deeper on the error:

diff --git a/drivers/char/tpm/tpm2-cmd.c b/drivers/char/tpm/tpm2-cmd.c
index 93545be190a50..77893e39dcffb 100644
--- a/drivers/char/tpm/tpm2-cmd.c
+++ b/drivers/char/tpm/tpm2-cmd.c
@@ -391,8 +391,10 @@ ssize_t tpm2_get_tpm_pt(struct tpm_chip *chip, u32 property_id,  u32 *value,
        int rc;

        rc = tpm_buf_init(&buf, TPM2_ST_NO_SESSIONS, TPM2_CC_GET_CAPABILITY);
-       if (rc)
+       if (rc) {
+               dev_err(&chip->dev, "tpm_buf_init failed with %d\n", rc);
                return rc;
+       }
        tpm_buf_append_u32(&buf, TPM2_CAP_TPM_PROPERTIES);
        tpm_buf_append_u32(&buf, property_id);
        tpm_buf_append_u32(&buf, 1);
@@ -410,6 +412,8 @@ ssize_t tpm2_get_tpm_pt(struct tpm_chip *chip, u32 property_id,  u32 *value,
                        *value = be32_to_cpu(out->value);
                else
                        rc = -ENODATA;
+       } else {
+               dev_err(&chip->dev, "tpm_transmit_cmd failed with %d\n", rc);
        }
        tpm_buf_destroy(&buf);
        return rc;
Comment 17 Todd Brandt 2023-08-18 16:46:49 UTC
(In reply to Mario Limonciello (AMD) from comment #16)
> And this should at least let us dig a level deeper on the error:
> 
> diff --git a/drivers/char/tpm/tpm2-cmd.c b/drivers/char/tpm/tpm2-cmd.c
> index 93545be190a50..77893e39dcffb 100644
> --- a/drivers/char/tpm/tpm2-cmd.c
> +++ b/drivers/char/tpm/tpm2-cmd.c
> @@ -391,8 +391,10 @@ ssize_t tpm2_get_tpm_pt(struct tpm_chip *chip, u32 property_id,  u32 *value,
>         int rc;
> 
>         rc = tpm_buf_init(&buf, TPM2_ST_NO_SESSIONS, TPM2_CC_GET_CAPABILITY);
> -       if (rc)
> +       if (rc) {
> +               dev_err(&chip->dev, "tpm_buf_init failed with %d\n", rc);
>                 return rc;
> +       }
>         tpm_buf_append_u32(&buf, TPM2_CAP_TPM_PROPERTIES);
>         tpm_buf_append_u32(&buf, property_id);
>         tpm_buf_append_u32(&buf, 1);
> @@ -410,6 +412,8 @@ ssize_t tpm2_get_tpm_pt(struct tpm_chip *chip, u32 property_id,  u32 *value,
>                         *value = be32_to_cpu(out->value);
>                 else
>                         rc = -ENODATA;
> +       } else {
> +               dev_err(&chip->dev, "tpm_transmit_cmd failed with %d\n", rc);
>         }
>         tpm_buf_destroy(&buf);
>         return rc;

will do, I'm building a kernel now with both patches applied and will test on all 5 affected machines, it'll take a couple hours.
Comment 18 Todd Brandt 2023-08-18 16:56:54 UTC
Created attachment 304889 [details]
6.5.0-rc5-otcpl-alder-lake1-dmesg-boot.log
Comment 19 Todd Brandt 2023-08-18 16:57:53 UTC
Created attachment 304890 [details]
6.5.0-rc5-otcpl-alder-lake2-dmesg-boot.log
Comment 20 Todd Brandt 2023-08-18 16:58:20 UTC
Created attachment 304891 [details]
6.5.0-rc5-otcpl-jasper-lake-dmesg-boot.log
Comment 21 Todd Brandt 2023-08-18 16:59:11 UTC
Created attachment 304892 [details]
6.5.0-rc5-otcpl-raptor-lake.dmesg-boot.log
Comment 22 Todd Brandt 2023-08-18 16:59:36 UTC
Created attachment 304893 [details]
6.5.0-rc5-otcpl-rocket-lake-dmesg-boot.log
Comment 23 Todd Brandt 2023-08-18 17:00:00 UTC
Created attachment 304894 [details]
6.5.0-rc6-otcpl-alder-lake1-dmesg-boot.log
Comment 24 Todd Brandt 2023-08-18 17:00:24 UTC
Created attachment 304895 [details]
6.5.0-rc6-otcpl-alder-lake2-dmesg-boot.log
Comment 25 Todd Brandt 2023-08-18 17:00:49 UTC
Created attachment 304896 [details]
6.5.0-rc6-otcpl-jasper-lake-dmesg-boot.log
Comment 26 Todd Brandt 2023-08-18 17:01:13 UTC
Created attachment 304897 [details]
6.5.0-rc6-otcpl-raptor-lake-dmesg-boot.log
Comment 27 Todd Brandt 2023-08-18 17:01:36 UTC
Created attachment 304898 [details]
6.5.0-rc6-otcpl-rocket-lake-dmesg-boot.log
Comment 28 Mario Limonciello (AMD) 2023-08-18 17:12:02 UTC
The problem shows up in your log (looked at alder lake with rc6).

[    1.132521] tpm_crb: probe of INTC6001:00 failed with error 378

So the issue is that the probe now fails.  Looking at the call path I think the failure is likely from tpm_transmit() called from tpm_transmit_cmd().
I think that return code was populated from header->return_code meaning it's a TPM_RC return code.

It happens to match TPM2_CC_GET_CAPABILITY, which is what the request was originally populated with.  This makes me think that the TPM might not have been ready at the time of the transaction.
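
As a quick check of the arithmetic behind this observation: 378 decimal is 0x17A, the TPM 2.0 command code for GetCapability. The macro below reuses the kernel's name TPM2_CC_GET_CAPABILITY but is redefined locally for illustration, and the helper function name is hypothetical.

```c
#include <stdint.h>

/* TPM 2.0 command code for GetCapability, redefined locally for this sketch. */
#define TPM2_CC_GET_CAPABILITY 0x017AU

/* Returns 1 when a probe error value equals the GetCapability command code. */
static int looks_like_echoed_get_capability(uint32_t probe_error)
{
	return probe_error == TPM2_CC_GET_CAPABILITY;
}
```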
Comment 29 Todd Brandt 2023-08-18 17:27:42 UTC
I just tried running 6.5.0-rc6 with your two patches applied and it fixed the S3 crash on all 5 machines. So you're on the right track. I'll attach the dmesg boot logs from the 5 machines...
Comment 30 Todd Brandt 2023-08-18 17:38:56 UTC
Created attachment 304899 [details]
6.5.0-rc6-mariofix-otcpl-alder-lake1-dmesg-boot-and-s3.log
Comment 31 Todd Brandt 2023-08-18 17:39:19 UTC
Created attachment 304900 [details]
6.5.0-rc6-mariofix-otcpl-alder-lake2-dmesg-boot-and-s3.log
Comment 32 Todd Brandt 2023-08-18 17:39:43 UTC
Created attachment 304901 [details]
6.5.0-rc6-mariofix-otcpl-jasper-lake-dmesg-boot-and-s3.log
Comment 33 Todd Brandt 2023-08-18 17:40:06 UTC
Created attachment 304902 [details]
6.5.0-rc6-mariofix-otcpl-raptor-lake-dmesg-boot-and-s3.log
Comment 34 Todd Brandt 2023-08-18 17:40:31 UTC
Created attachment 304903 [details]
6.5.0-rc6-mariofix-otcpl-rocket-lake-dmesg-boot-and-s3.log
Comment 35 Mario Limonciello (AMD) 2023-08-18 17:47:57 UTC
> [    1.226178] tpm tpm0: tpm_transmit_cmd failed with 378

Looking at the source:

	/* the command code is where the return code will be */
	u32 cc = be32_to_cpu(header->return_code);

This confirms my suspicion about where the error comes from: the return_code never gets updated.

tpm_transmit uses tpm_try_transmit which calls crb_send()

I *guess* this means that crb_send() doesn't work early on for some reason.

Can you please try running this command when the system boots up with those patches in place?  Just one system is fine.

# tpm2_getcap properties-fixed
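
The reasoning in this comment can be illustrated with a minimal userspace sketch. It assumes the standard 10-byte TPM 2.0 header layout (2-byte tag, 4-byte length, 4-byte code), in which the command code of a request and the return code of a response occupy the same bytes. If the device never writes a response into the buffer, reading the "return code" simply gives back the command code that was sent. The helper names are hypothetical and are not kernel functions.

```c
#include <stdint.h>

#define HDR_SIZE    10 /* tag (2) + length (4) + code (4) */
#define CODE_OFFSET 6  /* command code / return code share this offset */

/* Big-endian store/load helpers, standing in for cpu_to_be32() and friends. */
static void store_be16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v >> 8);
	p[1] = (uint8_t)v;
}

static void store_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

static uint32_t load_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Fill buf with a header-only command carrying the given command code. */
static void build_command_header(uint8_t buf[HDR_SIZE], uint32_t command_code)
{
	store_be16(buf, 0x8001); /* TPM2_ST_NO_SESSIONS */
	store_be32(buf + 2, HDR_SIZE);
	store_be32(buf + CODE_OFFSET, command_code);
}

/* Read the field a caller would interpret as the response's return code. */
static uint32_t read_return_code(const uint8_t buf[HDR_SIZE])
{
	return load_be32(buf + CODE_OFFSET);
}
```

Building a header with command code 0x17A and reading it back without any response being written yields 378, the same value seen in the probe failure.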
Comment 36 Todd Brandt 2023-08-18 18:01:14 UTC
Created attachment 304904 [details]
otcpl-alder-lake1-tpm2_getcap.log

Outputs for "tpm2_getcap properties-fixed" on 6.5-rc5, 6.5-rc6, and 6.5-rc6-mariofix
Comment 37 Todd Brandt 2023-08-18 18:19:28 UTC
(In reply to Mario Limonciello (AMD) from comment #35)
> > [    1.226178] tpm tpm0: tpm_transmit_cmd failed with 378
> 
> Looking at the source:
> 
>       /* the command code is where the return code will be */
>       u32 cc = be32_to_cpu(header->return_code);
> 
> It does confirm my suspicion of where the error comes from is that it's the
> return_code never gets updated.
> 
> tpm_transmit uses tpm_try_transmit which calls crb_send()
> 
> I *guess* this means that crb_send() doesn't work early on for some reason.
> 
> Can you please try running this command when the system boots up with those
> patches in place?  Just one system is fine.
> 
> # tpm2_getcap properties-fixed

ok, I just attached a run of tpm2_getcap for all three kernels on the alder lake1.
Comment 38 Mario Limonciello (AMD) 2023-08-18 18:20:10 UTC
OK, based on M/L discussion I've posted the patch.

https://lore.kernel.org/linux-kernel/20230818181516.19167-1-mario.limonciello@amd.com/T/#u
Comment 39 Todd Brandt 2023-08-18 19:03:19 UTC
(In reply to Mario Limonciello (AMD) from comment #38)
> OK, based on M/L discussion I've posted the patch.
> 
> https://lore.kernel.org/linux-kernel/20230818181516.19167-1-mario.limonciello@amd.com/T/#u

Excellent, I've just tested it on the very tip of upstream to be sure and it works. I'll also do a full 24 hour S3 block this weekend with the patch to be absolutely sure. I'll keep the bug open til the patch makes it upstream.

So it would appear the bug is in the Intel TPM and this new patch just exposed it. 
I'm glad we know about it now. Thanks for your help Mario!
Comment 40 jarkko 2023-08-18 23:36:56 UTC
(In reply to Raymond Jay Golo from comment #5)
> (In reply to Mario Limonciello (AMD) from comment #4)
> > > This also helps with a different problem I've encountered where a Lenovo
> > > Legion laptop can't decrypt the root partition after updating to 6.4.11
> > since
> > > it can't seem to access TPM stored keys. After applying this patch, boot
> > > succeeds as expected.
> > 
> > Can you please share a kernel log from 6.4.10 and another from 6.4.11?
> 
> Unfortunately, it's already outside work hours and the affected laptop is at
> my office. I'll try to forward the dmesgs on Tuesday if ever.

Can you also post the laptop model. Thanks.

BR, Jarkko
Comment 41 Ronan Pigott 2023-08-19 02:22:08 UTC
I'm not OP, but noticed a regression on 6.4.11 where the tpm fails to initialize, causing my LUKSv2 volume to not unlock. Reverting to 6.4.10 successfully fixed the issue for me. My journal has just the following additional warning compared to 6.4.10:

  Aug 18 15:30:37 kernel: tpm_crb: probe of MSFT0101:00 failed with error 378

Additionally, I recompiled my kernel (6.4.11-arch1-1) with Mario's patch [1] applied, and in my case it resolves the issue.

My desktop is a Dell XPS 8950. I am currently booted from the patched 6.4.11, and fwupdmgr gives the following information on the TPM device:

> TPM:
>   Device ID:       c6a80ac3a22083423992a3cb15018989f37834d6
>   Summary:         TPM 2.0 Device
>   Current version: 600.18.0.0
>   Vendor:          Intel (TPM:INTC)
>   GUIDs:           ff71992e-52f7-5eea-94ef-883e56e034c6 <- system-tpm
>                    34801700-3a50-5b05-820c-fe14580e4c2d <-
>                    TPM\VEN_INTC&DEV_0000
>                    93532b61-86ce-57cf-8a31-f5a5553966c7 <-
>                    TPM\VEN_INTC&MOD_ADL
>                    03f304f4-223e-54f4-b2c1-c3cf3b5817c6 <-
>                    TPM\VEN_INTC&DEV_0000&VER_2.0
>                    12e61a33-eef7-58d6-855d-ece38114294d <-
>                    TPM\VEN_INTC&MOD_ADL&VER_2.0

The output of `tpm2_getcap properties-fixed` is identical to Todd's, except for one prop:

TPM2_PT_FIRMWARE_VERSION_2:
  raw: 0x0

[1] https://lore.kernel.org/all/20230818181516.19167-1-mario.limonciello@amd.com/
Comment 42 Nadim Kobeissi 2023-08-21 18:52:54 UTC
This bug is causing a TPM failure on my laptop as well (MSI Stealth A13v):


> archangel:/home/nadim # dmesg | grep -i tpm
> [    0.000000] efi: ACPI=0x64912000 ACPI 2.0=0x64912014 TPMFinalLog=0x649ac000 SMBIOS=0x67c5a000 SMBIOS 3.0=0x67c59000 MEMATTR=0x5dd8b018 ESRT=0x5f2cd698 MOKvar=0x5a6c0000 RNG=0x64830018 TPMEventLog=0x5a684018
> [    0.013304] ACPI: TPM2 0x0000000064837000 00004C (v04 MSI_NB MEGABOOK 00000001 AMI  00000000)
> [    0.013325] ACPI: Reserving TPM2 table memory at [mem 0x64837000-0x6483704b]
> [    0.846149] tpm_crb: probe of MSFT0101:00 failed with error 378
> [    0.893380] ima: No TPM chip found, activating TPM-bypass!


> archangel:/home/nadim # uname -a
> Linux archangel.home 6.4.11-1-default #1 SMP PREEMPT_DYNAMIC Thu Aug 17 04:57:43 UTC 2023 (2a5b3f6) x86_64 x86_64 x86_64 GNU/Linux
Comment 43 Mario Limonciello (AMD) 2023-08-21 19:25:05 UTC
Here's the v2 submission.  

https://lore.kernel.org/stable/20230821140230.1168-1-mario.limonciello@amd.com/

Feel free to reply with a Tested-by: tag if it works for you.
Comment 44 Raymond Jay Golo 2023-08-22 03:21:44 UTC
(In reply to Mario Limonciello (AMD) from comment #6)
> > Unfortunately, it's already outside work hours and the affected laptop is
> at
> > my office. I'll try to forward the dmesgs on Tuesday if ever.
> 
> While waiting for logs with useful information can you tell me more about
> your system?  Is it a fTPM, dTPM or Pluton?  Is it Intel or AMD system?
> 
> Assuming you had a way to finish your boot (backup key or similar) with
> 6.4.11 did this totally break TPM functionality?  IE could you still run
> things like
> # tpm2_getcap properties-fixed

TPM2_PT_FAMILY_INDICATOR:
  raw: 0x322E3000
  value: "2.0"
TPM2_PT_LEVEL:
  raw: 0
TPM2_PT_REVISION:
  raw: 0x8A
  value: 1.38
TPM2_PT_DAY_OF_YEAR:
  raw: 0x8
TPM2_PT_YEAR:
  raw: 0x7E2
TPM2_PT_MANUFACTURER:
  raw: 0x494E5443
  value: "INTC"
TPM2_PT_VENDOR_STRING_1:
  raw: 0x496E7465
  value: "Inte"
TPM2_PT_VENDOR_STRING_2:
  raw: 0x6C000000
  value: "l"
TPM2_PT_VENDOR_STRING_3:
  raw: 0x0
  value: ""
TPM2_PT_VENDOR_STRING_4:
  raw: 0x0
  value: ""
TPM2_PT_VENDOR_TPM_TYPE:
  raw: 0x0
TPM2_PT_FIRMWARE_VERSION_1:
  raw: 0x1930001
TPM2_PT_FIRMWARE_VERSION_2:
  raw: 0x0
TPM2_PT_INPUT_BUFFER:
  raw: 0x400
TPM2_PT_HR_TRANSIENT_MIN:
  raw: 0x3
TPM2_PT_HR_PERSISTENT_MIN:
  raw: 0x7
TPM2_PT_HR_LOADED_MIN:
  raw: 0x3
TPM2_PT_ACTIVE_SESSIONS_MAX:
  raw: 0x40
TPM2_PT_PCR_COUNT:
  raw: 0x18
TPM2_PT_PCR_SELECT_MIN:
  raw: 0x3
TPM2_PT_CONTEXT_GAP_MAX:
  raw: 0xFFFF
TPM2_PT_NV_COUNTERS_MAX:
  raw: 0x80
TPM2_PT_NV_INDEX_MAX:
  raw: 0x800
TPM2_PT_MEMORY:
  raw: 0x6
TPM2_PT_CLOCK_UPDATE:
  raw: 0x400000
TPM2_PT_CONTEXT_HASH:
  raw: 0xB
TPM2_PT_CONTEXT_SYM:
  raw: 0x6
TPM2_PT_CONTEXT_SYM_SIZE:
  raw: 0x100
TPM2_PT_ORDERLY_COUNT:
  raw: 0xFF
TPM2_PT_MAX_COMMAND_SIZE:
  raw: 0xF80
TPM2_PT_MAX_RESPONSE_SIZE:
  raw: 0xF80
TPM2_PT_MAX_DIGEST:
  raw: 0x20
TPM2_PT_MAX_OBJECT_CONTEXT:
  raw: 0x3A0
TPM2_PT_MAX_SESSION_CONTEXT:
  raw: 0xF8
TPM2_PT_PS_FAMILY_INDICATOR:
  raw: 0x1
TPM2_PT_PS_LEVEL:
  raw: 0x0
TPM2_PT_PS_REVISION:
  raw: 0x103
TPM2_PT_PS_DAY_OF_YEAR:
  raw: 0x0
TPM2_PT_PS_YEAR:
  raw: 0x0
TPM2_PT_SPLIT_MAX:
  raw: 0x80
TPM2_PT_TOTAL_COMMANDS:
  raw: 0x64
TPM2_PT_LIBRARY_COMMANDS:
  raw: 0x64
TPM2_PT_VENDOR_COMMANDS:
  raw: 0x0
TPM2_PT_NV_BUFFER_MAX:
  raw: 0x800
TPM2_PT_MODES:
  raw: 0x0

I'll test using the previous post's v2 submission and reply later.
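
For reference, the TPM2_PT_MANUFACTURER raw value above is four ASCII characters packed big-endian, which is why 0x494E5443 reads as "INTC" and why the driver check earlier in this thread compares against 0x414D4400 ("AMD" plus a trailing NUL). A minimal sketch of that decoding follows; the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Unpack a 32-bit manufacturer ID into a NUL-terminated 4-character string. */
static void manufacturer_to_string(uint32_t raw, char out[5])
{
	out[0] = (char)((raw >> 24) & 0xFF);
	out[1] = (char)((raw >> 16) & 0xFF);
	out[2] = (char)((raw >> 8) & 0xFF);
	out[3] = (char)(raw & 0xFF);
	out[4] = '\0';
}

/* Returns 1 when the decoded manufacturer ID equals the given name. */
static int manufacturer_matches(uint32_t raw, const char *name)
{
	char decoded[5];

	manufacturer_to_string(raw, decoded);
	return strcmp(decoded, name) == 0;
}
```

A value of 0x494E5443 therefore does not match the AMD constant, so on this system the manufacturer check would leave the hwrng flag alone once the query succeeds.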
Comment 45 Nadim Kobeissi 2023-08-22 04:40:07 UTC
I'd like to confirm that the following patch by Mario fixed the problem for me:

https://lore.kernel.org/linux-kernel/20230818181516.19167-1-mario.limonciello@amd.com/

I hope this information is useful. Thanks.
Comment 46 Raymond Jay Golo 2023-08-22 05:44:50 UTC
> Here's the v2 submission.  
> 
> https://lore.kernel.org/stable/20230821140230.1168-1-mario.limonciello@amd.com/
> 
> Feel free to reply with a Tested-by: tag if it works for you.

I can also confirm that the above works for the problem in my machine.
Comment 47 Todd Brandt 2023-08-22 08:06:54 UTC
Confirmed that it works for me as well, all 5 affected test machines are working properly with the latest patch applied both to 6.5.0-rc6 and 6.5.0-rc7.
Comment 48 Todd Brandt 2023-08-22 08:09:30 UTC
Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Comment 49 Raymond Jay Golo 2023-08-23 05:01:30 UTC
(In reply to jarkko from comment #40)
> (In reply to Raymond Jay Golo from comment #5)
> > (In reply to Mario Limonciello (AMD) from comment #4)
> > > > This also helps with a different problem I've encountered where a
> Lenovo
> > > > Legion laptop can't decrypt the root partition after updating to 6.4.11
> > > since
> > > > it can't seem to access TPM stored keys. After applying this patch,
> boot
> > > > succeeds as expected.
> > > 
> > > Can you please share a kernel log from 6.4.10 and another from 6.4.11?
> > 
> > Unfortunately, it's already outside work hours and the affected laptop is
> at
> > my office. I'll try to forward the dmesgs on Tuesday if ever.
> 
> Can you also post the laptop model. Thanks.
> 
> BR, Jarkko

The model is Lenovo Legion Y540. Sorry for the late reply. Missed it due to the activity and me being on a short vacation.
Comment 50 jarkko 2023-08-23 17:34:31 UTC
On Wed Aug 23, 2023 at 8:01 AM EEST,  wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=217804
>
> --- Comment #49 from Raymond Jay Golo (rjgolo@gmail.com) ---
> (In reply to jarkko from comment #40)
> > (In reply to Raymond Jay Golo from comment #5)
> > > (In reply to Mario Limonciello (AMD) from comment #4)
> > > > > This also helps with a different problem I've encountered where a
> > Lenovo
> > > > > Legion laptop can't decrypt the root partition after updating to
> 6.4.11
> > > > since
> > > > > it can't seem to access TPM stored keys. After applying this patch,
> > boot
> > > > > succeeds as expected.
> > > > 
> > > > Can you please share a kernel log from 6.4.10 and another from 6.4.11?
> > > 
> > > Unfortunately, it's already outside work hours and the affected laptop is
> > at
> > > my office. I'll try to forward the dmesgs on Tuesday if ever.
> > 
> > Can you also post the laptop model. Thanks.
> > 
> > BR, Jarkko
>
> The model is Lenovo Legion Y540. Sorry for the late reply. Missed it due to
> the
> activity and me being on a short vacation.

OK, cool. I'll mention this in the commit message. Thanks for providing
the details.

BR, Jarkko
Comment 51 chriscjsus 2023-08-26 02:27:43 UTC
This also affects 6.1 LTS but not 5.15 or earlier LTS kernels.
Comment 52 Mikhail Novosyolov 2023-08-30 13:55:58 UTC
The following machine with Intel is affected at kernel v6.1.46:

https://linux-hardware.org/?probe=91d3472ba6
From dmesg:
tpm_crb: probe of MSFT0101:00 failed with error 378

Will test it with the patch.
Comment 53 Marcos Alano 2023-09-07 13:51:42 UTC
I'm getting an error similar to Mikhail's:
tpm_crb: probe of INTC6000:00 failed with error 378
The difference is my kernel is 6.5.1.
Comment 54 Mario Limonciello (AMD) 2023-09-07 13:59:12 UTC
It's fixed by this commit: https://github.com/torvalds/linux/commit/8f7f35e5aa6f2182eabcfa3abef4d898a48e9aa8

It is in the process of backporting to stable kernels as well.
