Bug 219383

Summary: System reboot on S3 sleep/wakeup test
Product: Platform Specific/Hardware Reporter: Mike Seo (mikeseohyungjin)
Component: x86-64Assignee: platform_x86_64 (platform_x86_64)
Status: NEW ---    
Severity: normal CC: jarkko
Priority: P3    
Hardware: All   
OS: Linux   
Kernel Version: Subsystem:
Regression: No Bisected commit-id:
Attachments: A fix candidate
Fix candidate v2
Fix candidate v3
Final Fix

Description Mike Seo 2024-10-14 00:46:26 UTC
I'm working for LG laptops, and I have run serveral LG PC with ubuntu OS. You may know, most LG laptops has intel soc.
I found out a critical issue, system reboot on S3 sleep/wake up.

Enviornments:
- PC BIOS : Phoenix Technologies
- Intel Jasperlake or Intel Lunarlake 
- OS Ubuntu 22.04(Jasperlake), 24.04.1(Lunarlake)
- linux kernel version 6.x.0(Jasperlake) or up-to-date 6.11(Lunarlake)

Symptom:

Running the aging scripts like below, system reboots.
-------------------------
#!/bin/bash
<snip>
for (( i=1; i<=10000 ; i++ ))
sudo rtcwake -m mem -s 10 >> ${LOG} 2>&1
<snip>
-------------------------
The scripts works like below,
1. waits 10 secs
2. echo mem > /sys/power/state
3. waits 10 secs again and wake up system like press power button.


My analysis:

I had reproduced several times to find that BIOS side triggered the system reboots.
| pm_suspend() | syscore_suspend() | acpi_suspend_enter() | ... |  < BIOS > |  ...| acpi_suspend_enter() |  syscore_resume() | ...|

Debugging on BIOS, TPM2 can generate cold reset when it detects something wrong after TPM resuming.
In the BIOS code, if there are active PCR banks that are not supported by the Platform mask, it supposes to be update the TPM allocations and reboot the machine.

It means that something in linux kernel side can effect operations of  tpm when going to sleep.
So, I have debuggered and traced the functions related to tpm, such as tpm_chip_start whenever the symptoms represented.

In normal case, tpm_chip_start() called once like below,
 tpm_pm_suspend()-> tpm_chip_start().
but issued case, additionally called below
 hwrng_fillfn ->
  rng_get_data ->
    tpm_hwrng_read ->
      tpm_get_random ->
        tpm_find_get_ops ->
           tpm_try_get_ops ->
             tpm_chip_start ->

I found out that when running hwrng_fillfn(), related to Hardware random number generator,  called during system_sleep, it can cause system reboots.
To Verify it, I have tested with custom kernel which includes below patch.

-----------------------
From 373e92bb6d471c5fb42bacb97a4caf5375df5522 Mon Sep 17 00:00:00 2001
From: mike Seo <mikeseohyungjin@gmail.com>
Date: Thu, 10 Oct 2024 14:04:57 +0900
Subject: [PATCH] test_patch

test_patch for reboot while sleep/wakeup

Signed-off-by: mike Seo <mikeseohyungjin@gmail.com>
---
 drivers/char/hw_random/core.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
index 57c51efa5..d3f0059a4 100644
--- a/drivers/char/hw_random/core.c
+++ b/drivers/char/hw_random/core.c
@@ -25,6 +25,7 @@
 #include <linux/slab.h>
 #include <linux/string.h>
 #include <linux/uaccess.h>
+#include <linux/suspend.h>
 
 #define RNG_MODULE_NAME		"hw_random"
 
@@ -469,6 +470,22 @@ static struct attribute *rng_dev_attrs[] = {
 
 ATTRIBUTE_GROUPS(rng_dev);
 
+
+static int hwrng_pm_notification(struct notifier_block *nb, unsigned long action, void *data)
+{
+
+	switch (action) {
+	case PM_SUSPEND_PREPARE:
+		is_suspend_prepare = 1;
+		break;
+	default:
+		is_suspend_prepare = 0;
+		break;
+	}
+	return 0;
+}
+
+static struct notifier_block pm_notifier = { .notifier_call = hwrng_pm_notification };
 static int hwrng_fillfn(void *unused)
 {
 	size_t entropy, entropy_credit = 0; /* in 1/1024 of a bit */
@@ -478,6 +495,9 @@ static int hwrng_fillfn(void *unused)
 		unsigned short quality;
 		struct hwrng *rng;
 
+		while (is_suspend_prepare)
+			msleep(500);
+
 		rng = get_current_rng();
 		if (IS_ERR(rng) || !rng)
 			break;
@@ -549,6 +569,7 @@ int hwrng_register(struct hwrng *rng)
 			goto out_unlock;
 	}
 	mutex_unlock(&rng_mutex);
+	WARN_ON(register_pm_notifier(&pm_notifier));
 	return 0;
 out_unlock:
 	mutex_unlock(&rng_mutex);
-- 
2.43.0
------------------------

And I had passed over 10000 times of s3 wake/sleep aging test.

Can you make some patches for this issue and merges?

Thank you,
Mike
Comment 1 Mike Seo 2024-10-14 00:47:26 UTC
*** Bug 219368 has been marked as a duplicate of this bug. ***
Comment 2 Mike Seo 2024-10-21 01:56:29 UTC
Dear All,

It could be a big help if anyone responds....
Comment 3 Borislav Petkov 2024-10-21 16:02:21 UTC
Looks like TPM. CCing the proper people.

On Mon, Oct 14, 2024 at 12:46:26AM +0000, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=219383
> 
>             Bug ID: 219383
>            Summary: System reboot on S3 sleep/wakeup test
>            Product: Platform Specific/Hardware
>            Version: 2.5
>           Hardware: All
>                 OS: Linux
>             Status: NEW
>           Severity: normal
>           Priority: P3
>          Component: x86-64
>           Assignee: platform_x86_64@kernel-bugs.osdl.org
>           Reporter: mikeseohyungjin@gmail.com
>         Regression: No
> 
> I'm working for LG laptops, and I have run serveral LG PC with ubuntu OS. You
> may know, most LG laptops has intel soc.
> I found out a critical issue, system reboot on S3 sleep/wake up.
> 
> Enviornments:
> - PC BIOS : Phoenix Technologies
> - Intel Jasperlake or Intel Lunarlake 
> - OS Ubuntu 22.04(Jasperlake), 24.04.1(Lunarlake)
> - linux kernel version 6.x.0(Jasperlake) or up-to-date 6.11(Lunarlake)
> 
> Symptom:
> 
> Running the aging scripts like below, system reboots.
> -------------------------
> #!/bin/bash
> <snip>
> for (( i=1; i<=10000 ; i++ ))
> sudo rtcwake -m mem -s 10 >> ${LOG} 2>&1
> <snip>
> -------------------------
> The scripts works like below,
> 1. waits 10 secs
> 2. echo mem > /sys/power/state
> 3. waits 10 secs again and wake up system like press power button.
> 
> 
> My analysis:
> 
> I had reproduced several times to find that BIOS side triggered the system
> reboots.
> | pm_suspend() | syscore_suspend() | acpi_suspend_enter() | ... |  < BIOS > | 
> ...| acpi_suspend_enter() |  syscore_resume() | ...|
> 
> Debugging on BIOS, TPM2 can generate cold reset when it detects something
> wrong
> after TPM resuming.
> In the BIOS code, if there are active PCR banks that are not supported by the
> Platform mask, it supposes to be update the TPM allocations and reboot the
> machine.
> 
> It means that something in linux kernel side can effect operations of  tpm
> when
> going to sleep.
> So, I have debuggered and traced the functions related to tpm, such as
> tpm_chip_start whenever the symptoms represented.
> 
> In normal case, tpm_chip_start() called once like below,
>  tpm_pm_suspend()-> tpm_chip_start().
> but issued case, additionally called below
>  hwrng_fillfn ->
>   rng_get_data ->
>     tpm_hwrng_read ->
>       tpm_get_random ->
>         tpm_find_get_ops ->
>            tpm_try_get_ops ->
>              tpm_chip_start ->
> 
> I found out that when running hwrng_fillfn(), related to Hardware random
> number
> generator,  called during system_sleep, it can cause system reboots.
> To Verify it, I have tested with custom kernel which includes below patch.
> 
> -----------------------
> From 373e92bb6d471c5fb42bacb97a4caf5375df5522 Mon Sep 17 00:00:00 2001
> From: mike Seo <mikeseohyungjin@gmail.com>
> Date: Thu, 10 Oct 2024 14:04:57 +0900
> Subject: [PATCH] test_patch
> 
> test_patch for reboot while sleep/wakeup
> 
> Signed-off-by: mike Seo <mikeseohyungjin@gmail.com>
> ---
>  drivers/char/hw_random/core.c | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
> 
> diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
> index 57c51efa5..d3f0059a4 100644
> --- a/drivers/char/hw_random/core.c
> +++ b/drivers/char/hw_random/core.c
> @@ -25,6 +25,7 @@
>  #include <linux/slab.h>
>  #include <linux/string.h>
>  #include <linux/uaccess.h>
> +#include <linux/suspend.h>
> 
>  #define RNG_MODULE_NAME                "hw_random"
> 
> @@ -469,6 +470,22 @@ static struct attribute *rng_dev_attrs[] = {
> 
>  ATTRIBUTE_GROUPS(rng_dev);
> 
> +
> +static int hwrng_pm_notification(struct notifier_block *nb, unsigned long
> action, void *data)
> +{
> +
> +       switch (action) {
> +       case PM_SUSPEND_PREPARE:
> +               is_suspend_prepare = 1;
> +               break;
> +       default:
> +               is_suspend_prepare = 0;
> +               break;
> +       }
> +       return 0;
> +}
> +
> +static struct notifier_block pm_notifier = { .notifier_call =
> hwrng_pm_notification };
>  static int hwrng_fillfn(void *unused)
>  {
>         size_t entropy, entropy_credit = 0; /* in 1/1024 of a bit */
> @@ -478,6 +495,9 @@ static int hwrng_fillfn(void *unused)
>                 unsigned short quality;
>                 struct hwrng *rng;
> 
> +               while (is_suspend_prepare)
> +                       msleep(500);
> +
>                 rng = get_current_rng();
>                 if (IS_ERR(rng) || !rng)
>                         break;
> @@ -549,6 +569,7 @@ int hwrng_register(struct hwrng *rng)
>                         goto out_unlock;
>         }
>         mutex_unlock(&rng_mutex);
> +       WARN_ON(register_pm_notifier(&pm_notifier));
>         return 0;
>  out_unlock:
>         mutex_unlock(&rng_mutex);
> -- 
> 2.43.0
> ------------------------
> 
> And I had passed over 10000 times of s3 wake/sleep aging test.
> 
> Can you make some patches for this issue and merges?
> 
> Thank you,
> Mike
> 
> -- 
> You may reply to this email to add a comment.
> 
> You are receiving this mail because:
> You are watching the assignee of the bug.
Comment 4 jarkko 2024-10-22 09:38:23 UTC
On Mon Oct 21, 2024 at 7:02 PM EEST, Borislav Petkov wrote:
> Looks like TPM. CCing the proper people.

Thanks for forwarding this to me.

My fixes for an issue with TPM bus encryption and AMD boot time have
been stuck for a while (over a month) [1].

They've also tested by the reporter [2] but seems to be hard to get any
reviewed-by's for this, which is weird given how dead obvious the
changes are.

There's already a decent plan for overhaul of that functionality [3]
so looking into to the new issue you forwarded would make a lot of
sense to me while at it.

BR, Jarkko

[1] https://lore.kernel.org/linux-integrity/20241021053921.33274-1-jarkko@kernel.org/T/#t
[2] https://lore.kernel.org/linux-integrity/CALSz7m3SXE3v-yB=_E3Xf5zCDv6bAYhjb+KHrnZ6J14ay2q9sw@mail.gmail.com/
[3] https://chaos.social/@gromit/113345582873908273
Comment 5 jarkko 2024-10-25 14:37:07 UTC
I think next week I can send the PR for AMD boot-time fixes (thanks to Stefan Berger's help with reviewing them. As soon as I have bandwidth that week, I will look more into this.
Comment 6 Mike Seo 2024-10-28 01:30:43 UTC
Thank you. Mr. jarkko

I have tested if your patch is affectable for 5 days, over 100 hours,  made a based on linux-next, v6.12.0-rc4.
And my issue does not reproduced.
diff is like this, which is a little bit different for your [1] patch.

diff --git a/drivers/char/tpm/tpm-chip.c b/drivers/char/tpm/tpm-chip.c
index 854546000c92..f223f497587d 100644
--- a/drivers/char/tpm/tpm-chip.c
+++ b/drivers/char/tpm/tpm-chip.c
@@ -678,8 +678,10 @@ void tpm_chip_unregister(struct tpm_chip *chip)
 	if (tpm_is_hwrng_enabled(chip))
 		hwrng_unregister(&chip->hwrng);
 	tpm_bios_log_teardown(chip);
-	if (chip->flags & TPM_CHIP_FLAG_TPM2 && !tpm_is_firmware_upgrade(chip))
+	if (chip->flags & TPM_CHIP_FLAG_TPM2 && !tpm_is_firmware_upgrade(chip)){
+		tpm2_end_auth_session(chip);
 		tpm_devs_remove(chip);
+	}
 	tpm_del_char_device(chip);
 }
 EXPORT_SYMBOL_GPL(tpm_chip_unregister);
diff --git a/drivers/char/tpm/tpm-dev-common.c b/drivers/char/tpm/tpm-dev-common.c
index c3fbbf4d3db7..7be817504a53 100644
--- a/drivers/char/tpm/tpm-dev-common.c
+++ b/drivers/char/tpm/tpm-dev-common.c
@@ -47,8 +47,10 @@ static ssize_t tpm_dev_transmit(struct tpm_chip *chip, struct tpm_space *space,
 
 	if (!ret)
 		ret = tpm2_commit_space(chip, space, buf, &len);
-	else
+	else{
+		tpm2_end_auth_session(chip);
 		tpm2_flush_space(chip);
+	}
 
 out_rc:
 	return ret ? ret : len;
diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
index 5da134f12c9a..8b2f5f4c682a 100644
--- a/drivers/char/tpm/tpm-interface.c
+++ b/drivers/char/tpm/tpm-interface.c
@@ -379,9 +379,10 @@ int tpm_pm_suspend(struct device *dev)
 
 	rc = tpm_try_get_ops(chip);
 	if (!rc) {
-		if (chip->flags & TPM_CHIP_FLAG_TPM2)
+		if (chip->flags & TPM_CHIP_FLAG_TPM2){
+			tpm2_end_auth_session(chip);
 			tpm2_shutdown(chip, TPM2_SU_STATE);
-		else
+		}else
 			rc = tpm1_pm_suspend(chip, tpm_suspend_pcr);
 
 		tpm_put_ops(chip);
diff --git a/drivers/char/tpm/tpm2-sessions.c b/drivers/char/tpm/tpm2-sessions.c
index 511c67061728..e7f887efce24 100644
--- a/drivers/char/tpm/tpm2-sessions.c
+++ b/drivers/char/tpm/tpm2-sessions.c
@@ -333,6 +333,8 @@ void tpm_buf_append_hmac_session(struct tpm_chip *chip, struct tpm_buf *buf,
 	}
 
 #ifdef CONFIG_TCG_TPM2_HMAC
+	/* The first write to /dev/tpm{rm0} will flush the session. */
+	attributes |= TPM2_SA_CONTINUE_SESSION;
 	/*
 	 * The Architecture Guide requires us to strip trailing zeros
 	 * before computing the HMAC


I hope my test was helpful for your works.
Thank you.
Comment 7 jarkko 2024-10-29 22:34:44 UTC
Created attachment 307087 [details]
A fix candidate
Comment 8 jarkko 2024-10-29 22:35:28 UTC
(In reply to Mike Seo from comment #6)
> Thank you. Mr. jarkko
> 
> I have tested if your patch is affectable for 5 days, over 100 hours,  made
> a based on linux-next, v6.12.0-rc4.
> And my issue does not reproduced.
> diff is like this, which is a little bit different for your [1] patch.
> 
> diff --git a/drivers/char/tpm/tpm-chip.c b/drivers/char/tpm/tpm-chip.c
> index 854546000c92..f223f497587d 100644
> --- a/drivers/char/tpm/tpm-chip.c
> +++ b/drivers/char/tpm/tpm-chip.c
> @@ -678,8 +678,10 @@ void tpm_chip_unregister(struct tpm_chip *chip)
>       if (tpm_is_hwrng_enabled(chip))
>               hwrng_unregister(&chip->hwrng);
>       tpm_bios_log_teardown(chip);
> -     if (chip->flags & TPM_CHIP_FLAG_TPM2 && !tpm_is_firmware_upgrade(chip))
> +     if (chip->flags & TPM_CHIP_FLAG_TPM2 &&
> !tpm_is_firmware_upgrade(chip)){
> +             tpm2_end_auth_session(chip);
>               tpm_devs_remove(chip);
> +     }
>       tpm_del_char_device(chip);
>  }
>  EXPORT_SYMBOL_GPL(tpm_chip_unregister);
> diff --git a/drivers/char/tpm/tpm-dev-common.c
> b/drivers/char/tpm/tpm-dev-common.c
> index c3fbbf4d3db7..7be817504a53 100644
> --- a/drivers/char/tpm/tpm-dev-common.c
> +++ b/drivers/char/tpm/tpm-dev-common.c
> @@ -47,8 +47,10 @@ static ssize_t tpm_dev_transmit(struct tpm_chip *chip,
> struct tpm_space *space,
>  
>       if (!ret)
>               ret = tpm2_commit_space(chip, space, buf, &len);
> -     else
> +     else{
> +             tpm2_end_auth_session(chip);
>               tpm2_flush_space(chip);
> +     }
>  
>  out_rc:
>       return ret ? ret : len;
> diff --git a/drivers/char/tpm/tpm-interface.c
> b/drivers/char/tpm/tpm-interface.c
> index 5da134f12c9a..8b2f5f4c682a 100644
> --- a/drivers/char/tpm/tpm-interface.c
> +++ b/drivers/char/tpm/tpm-interface.c
> @@ -379,9 +379,10 @@ int tpm_pm_suspend(struct device *dev)
>  
>       rc = tpm_try_get_ops(chip);
>       if (!rc) {
> -             if (chip->flags & TPM_CHIP_FLAG_TPM2)
> +             if (chip->flags & TPM_CHIP_FLAG_TPM2){
> +                     tpm2_end_auth_session(chip);
>                       tpm2_shutdown(chip, TPM2_SU_STATE);
> -             else
> +             }else
>                       rc = tpm1_pm_suspend(chip, tpm_suspend_pcr);
>  
>               tpm_put_ops(chip);
> diff --git a/drivers/char/tpm/tpm2-sessions.c
> b/drivers/char/tpm/tpm2-sessions.c
> index 511c67061728..e7f887efce24 100644
> --- a/drivers/char/tpm/tpm2-sessions.c
> +++ b/drivers/char/tpm/tpm2-sessions.c
> @@ -333,6 +333,8 @@ void tpm_buf_append_hmac_session(struct tpm_chip *chip,
> struct tpm_buf *buf,
>       }
>  
>  #ifdef CONFIG_TCG_TPM2_HMAC
> +     /* The first write to /dev/tpm{rm0} will flush the session. */
> +     attributes |= TPM2_SA_CONTINUE_SESSION;
>       /*
>        * The Architecture Guide requires us to strip trailing zeros
>        * before computing the HMAC
> 
> 
> I hope my test was helpful for your works.
> Thank you.

I did not expect my fixes to fix this. It was fix for https://bugzilla.kernel.org/show_bug.cgi?id=219229.

I sent pull request for it so should merged soon to mainline: https://lore.kernel.org/linux-integrity/D57UNU1ZUXJS.E2RDQDB8XFKI@kernel.org/T/#u

I attached a fix candidate to the bug.
Comment 9 Mike Seo 2024-10-31 00:02:23 UTC
Dear Mr.jarkko,

I have experienced this bug reported by me, Bug 219383  System reboot on S3 sleep/wakeup test, on various kernel versions among 6.1 to 6.12.

On comment #6, actually I had tested in lg laptop based on lunarlake, uptodate platform of intel. 
But  unstable kernel such as 6.11 and 6.12 can operate normally in the device.
Anyway some devices without your boot time patch also didn't represent this symptom.
So my comment #6 may be wrong.

To correct this, I had japerlake devices which have had this bug.
Let me test your fix candidate and let you know.
Thank you for your supports.
Comment 10 jarkko 2024-10-31 00:46:42 UTC
(In reply to Mike Seo from comment #9)
> Dear Mr.jarkko,
> 
> I have experienced this bug reported by me, Bug 219383  System reboot on S3
> sleep/wakeup test, on various kernel versions among 6.1 to 6.12.
> 
> On comment #6, actually I had tested in lg laptop based on lunarlake,
> uptodate platform of intel. 
> But  unstable kernel such as 6.11 and 6.12 can operate normally in the
> device.
> Anyway some devices without your boot time patch also didn't represent this
> symptom.
> So my comment #6 may be wrong.
> 
> To correct this, I had japerlake devices which have had this bug.
> Let me test your fix candidate and let you know.
> Thank you for your supports.

I refined my fix to be (fully I believe) race-free: https://lore.kernel.org/linux-integrity/20241031004501.165027-1-jarkko@kernel.org/T/#u
Comment 11 jarkko 2024-10-31 00:47:11 UTC
Created attachment 307096 [details]
Fix candidate v2
Comment 12 jarkko 2024-10-31 09:49:39 UTC
So: I believe there was an actual race condition in tpm_pm_suspend(). If the 2nd candidate works for you I'll add "Tested-by: Mike Seo <mikeseohyungjin@gmail.com>" to the bug and it should be done then, and can still make into 6.12.
Comment 13 jarkko 2024-11-01 00:26:23 UTC
Created attachment 307111 [details]
Fix candidate v3
Comment 15 jarkko 2024-11-01 21:28:02 UTC
Created attachment 307118 [details]
Final Fix
Comment 16 jarkko 2024-11-01 21:29:35 UTC
(In reply to Mike Seo from comment #9)
> Dear Mr.jarkko,
> 
> I have experienced this bug reported by me, Bug 219383  System reboot on S3
> sleep/wakeup test, on various kernel versions among 6.1 to 6.12.
> 
> On comment #6, actually I had tested in lg laptop based on lunarlake,
> uptodate platform of intel. 
> But  unstable kernel such as 6.11 and 6.12 can operate normally in the
> device.
> Anyway some devices without your boot time patch also didn't represent this
> symptom.
> So my comment #6 may be wrong.
> 
> To correct this, I had japerlake devices which have had this bug.
> Let me test your fix candidate and let you know.
> Thank you for your supports.

See the attachment ("Final Fix"). It should fully address the issue so if possible check if it works for you, thanks.
Comment 17 Mike Seo 2024-11-03 23:31:05 UTC
Dear Mr. Jarkko,

I tested your Fix candidate v3 with 10487 sleep/wake tests, almost 84 hours, and found out no problem.
Without your patch, the symptom reproduced under 1000 tests in the device, which has kernel v6.8 and ubuntu 22.04.

Let me test your Final Fix as well.
Comment 18 Mike Seo 2024-11-03 23:41:58 UTC
Dear Mr. Jarkko,

I think your Final Fix and Fix candidate v3 are same.
I am sure that your The Final Fix fixes this issue as well.
Thank you for your efforts.
Comment 19 jarkko 2024-11-03 23:57:28 UTC
(In reply to Mike Seo from comment #18)
> Dear Mr. Jarkko,
> 
> I think your Final Fix and Fix candidate v3 are same.
> I am sure that your The Final Fix fixes this issue as well.
> Thank you for your efforts.

Yeah for most part yes. OK I'll add your tested-by before sending the patch to Linus. Thank you for reporting this and help with testing!
Comment 21 jarkko 2024-11-05 09:16:36 UTC
The fix landed. Thanks for the report!