Bug 217626 - Read from RTC broken upon resume from suspend on HP Pro x360 435 G9 w/ AMD Ryzen 7 5825U
Summary: Read from RTC broken upon resume from suspend on HP Pro x360 435 G9 w/ AMD Ry...
Status: RESOLVED CODE_FIX
Alias: None
Product: Timers
Classification: Unclassified
Component: Realtime Clock (show other bugs)
Hardware: AMD Linux
: P3 high
Assignee: timers_realtime-clock
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-07-03 07:55 UTC by Carsten Hatger
Modified: 2024-02-12 20:56 UTC (History)
7 users (show)

See Also:
Kernel Version: 6.1.34
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg output post resume from standby (161.75 KB, text/plain)
2023-07-03 07:55 UTC, Carsten Hatger
Details
[PATCH 1/2] rtc-mc146818-lib: increase delay in mc146818_avoid_UIP (1.90 KB, patch)
2023-07-03 20:54 UTC, Mateusz Jończyk
Details | Diff
[PATCH 2/2] x86/rtc: read RTC multiple times in mach_get_cmos_time (1.40 KB, patch)
2023-07-03 20:55 UTC, Mateusz Jończyk
Details | Diff
attachment-30894-0.html (3.14 KB, text/html)
2023-07-04 07:24 UTC, Carsten Hatger
Details
attachment-13150-0.html (3.65 KB, text/html)
2023-07-04 08:01 UTC, Carsten Hatger
Details
Output of journalctl -k -b all | grep --context=5 'read current time' (3.28 KB, text/plain)
2023-07-04 08:05 UTC, Carsten Hatger
Details
dmesg: RTC access storm on an Intel Kaby Lake laptop (1.05 MB, text/plain)
2023-07-10 19:34 UTC, Mateusz Jończyk
Details
attachment-27807-0.html (2.01 KB, text/html)
2023-07-20 09:37 UTC, Carsten Hatger
Details
attachment-28635-0.html (4.94 KB, text/html)
2023-07-20 09:40 UTC, Carsten Hatger
Details
RTC_issue-w_stack_dump_enabled-dmesg.txt (118.22 KB, text/plain)
2023-07-23 10:02 UTC, Carsten Hatger
Details
[PATCH] x86/rtc, s2idle: debugging messages for rtc access storm in s2idle (1.60 KB, patch)
2023-07-26 09:06 UTC, Mateusz Jończyk
Details | Diff
attachment-27328-0.html (4.13 KB, text/html)
2023-07-27 18:56 UTC, Carsten Hatger
Details
dmesg-RTC_issue.txt (531.32 KB, text/plain)
2023-08-03 08:01 UTC, Carsten Hatger
Details
[PATCH] DEBUG: rtc-mc146818-lib: raise iteration limit, warn for high iteration count (1.56 KB, patch)
2023-08-04 19:27 UTC, Mateusz Jończyk
Details | Diff
attachment-17994-0.html (1.78 KB, text/html)
2023-08-07 18:24 UTC, Carsten Hatger
Details
dmesg_2023-08-16.txt (1.18 MB, text/plain)
2023-08-16 07:54 UTC, Carsten Hatger
Details
attachment-15243-0.html (2.12 KB, text/html)
2023-08-17 09:13 UTC, Carsten Hatger
Details
dmesg-w-pm_debug_messages.txt.gz (180.97 KB, application/gzip)
2023-08-17 13:24 UTC, Carsten Hatger
Details
gpio-post-suspend-w-pm_debug_messages (19.10 KB, application/octet-stream)
2023-08-17 20:41 UTC, Carsten Hatger
Details
gpio-pre-suspend-w-pm_debug_messages (19.10 KB, application/octet-stream)
2023-08-17 20:41 UTC, Carsten Hatger
Details
gpio-post-suspend-wo-pm_debug_messages (19.10 KB, application/octet-stream)
2023-08-17 20:41 UTC, Carsten Hatger
Details
gpio-pre-suspend-wo-pm_debug_messages (19.11 KB, application/octet-stream)
2023-08-17 20:41 UTC, Carsten Hatger
Details
gpio-pre_suspend-6.1.34.txt (32.77 KB, text/plain)
2023-08-18 08:34 UTC, Carsten Hatger
Details
attachment-6924-0.html (2.74 KB, text/html)
2023-08-18 20:38 UTC, Carsten Hatger
Details
dmesg-post-bios_update.txt.gz (47.67 KB, application/gzip)
2023-08-23 19:33 UTC, Carsten Hatger
Details
attachment-15770-0.html (1.82 KB, text/html)
2023-08-23 20:00 UTC, Carsten Hatger
Details

Description Carsten Hatger 2023-07-03 07:55:39 UTC
Created attachment 304533 [details]
dmesg output post resume from standby

Dear all,

read access to the device's RTC fails upon resume from suspend with the following error message grabbed from the kernel ring buffer:

[409610.163139] Unable to read current time from RTC

The issue is rare and more likely to appear if the system has been in standby for more than 24hrs w/o interruption.

Same issue is reported to appear on HP Elite G9 devices, cf. https://bbs.archlinux.org/viewtopic.php?id=283620

The issue screws up the systems time resulting in erroneous dated log entries, failing authentication, ..., thus requiring a reboot.

Please see attached dmesg output (line 2069) for further information. 

AMD staff member suspected commit d2a632a8a11756197deb1341bbb09c09abaf20ce to cause a race condition, thus being responsible for the issue. They suggested to increase the delays introduced by latter commmit to 300us. Unfortunately this does not resolve the issue.

Any help would be appreciated.

Yours,
Carsten
Comment 1 Bagas Sanjaya 2023-07-03 13:08:35 UTC
(In reply to Carsten Hatger from comment #0)
> Created attachment 304533 [details]
> dmesg output post resume from standby
> 
> Dear all,
> 
> read access to the device's RTC fails upon resume from suspend with the
> following error message grabbed from the kernel ring buffer:
> 
> [409610.163139] Unable to read current time from RTC
> 
> The issue is rare and more likely to appear if the system has been in
> standby for more than 24hrs w/o interruption.
> 
> Same issue is reported to appear on HP Elite G9 devices, cf.
> https://bbs.archlinux.org/viewtopic.php?id=283620
> 
> The issue screws up the systems time resulting in erroneous dated log
> entries, failing authentication, ..., thus requiring a reboot.
> 
> Please see attached dmesg output (line 2069) for further information. 
> 
> AMD staff member suspected commit d2a632a8a11756197deb1341bbb09c09abaf20ce
> to cause a race condition, thus being responsible for the issue. They
> suggested to increase the delays introduced by latter commmit to 300us.
> Unfortunately this does not resolve the issue.
> 
> Any help would be appreciated.
> 
> Yours,
> Carsten

The culprit is from more than a year ago. Can you confirm that the latest mainline also have this regression?
Comment 2 Mateusz Jończyk 2023-07-03 20:54:04 UTC
Hello,

Mr Bagas Sanjaya has notified me of this issue (thanks).

> AMD staff member suspected commit d2a632a8a11756197deb1341bbb09c09abaf20ce to
> cause a race condition, thus being responsible for the issue.

Excuse me, but I cannot see how this commit could cause a race condition. All RTC accesses are / should be guarded by the rtc_lock. Could you elaborate on this, or perhaps quote this AMD staff member, if possible?

Regardless, I'll attach a patch that partially reverts this commit.

I'm not sure if this bug is present only during s2idle; it is likely that it would not be visible during normal usage. Is it present in other kernel logs? Try something like this:

    journalctl -k -b all | grep --context=5 'read current time'

I'm also attaching another patch that should make this bug appear more frequently in kernel logs. Could you test both patches? They are prepared on top of 6.4.1.

Both patches only compile-tested.
Comment 3 Mateusz Jończyk 2023-07-03 20:54:53 UTC
Created attachment 304538 [details]
[PATCH 1/2] rtc-mc146818-lib: increase delay in mc146818_avoid_UIP
Comment 4 Mateusz Jończyk 2023-07-03 20:55:19 UTC
Created attachment 304539 [details]
[PATCH 2/2] x86/rtc: read RTC multiple times in mach_get_cmos_time
Comment 5 Carsten Hatger 2023-07-04 07:24:30 UTC
Created attachment 304543 [details]
attachment-30894-0.html

I can confirm the bug to be present in the most recent long term stable
branch release 6.1.34 at least.

I didn't test mainline releases yet because I wanted to reproduce the issue
w/ some AMD staff suggested patch applied to the kernel revision the issue
first appeared in - which totally makes sense to me.

Cheers,
Carsten

On Mon, Jul 3, 2023 at 3:08 PM <bugzilla-daemon@kernel.org> wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=217626
>
> Bagas Sanjaya (bagasdotme@gmail.com) changed:
>
>            What    |Removed                     |Added
>
> ----------------------------------------------------------------------------
>                  CC|                            |bagasdotme@gmail.com
>
> --- Comment #1 from Bagas Sanjaya (bagasdotme@gmail.com) ---
> (In reply to Carsten Hatger from comment #0)
> > Created attachment 304533 [details]
> > dmesg output post resume from standby
> >
> > Dear all,
> >
> > read access to the device's RTC fails upon resume from suspend with the
> > following error message grabbed from the kernel ring buffer:
> >
> > [409610.163139] Unable to read current time from RTC
> >
> > The issue is rare and more likely to appear if the system has been in
> > standby for more than 24hrs w/o interruption.
> >
> > Same issue is reported to appear on HP Elite G9 devices, cf.
> > https://bbs.archlinux.org/viewtopic.php?id=283620
> >
> > The issue screws up the systems time resulting in erroneous dated log
> > entries, failing authentication, ..., thus requiring a reboot.
> >
> > Please see attached dmesg output (line 2069) for further information.
> >
> > AMD staff member suspected commit
> d2a632a8a11756197deb1341bbb09c09abaf20ce
> > to cause a race condition, thus being responsible for the issue. They
> > suggested to increase the delays introduced by latter commmit to 300us.
> > Unfortunately this does not resolve the issue.
> >
> > Any help would be appreciated.
> >
> > Yours,
> > Carsten
>
> The culprit is from more than a year ago. Can you confirm that the latest
> mainline also have this regression?
>
> --
> You may reply to this email to add a comment.
>
> You are receiving this mail because:
> You reported the bug.
Comment 6 Carsten Hatger 2023-07-04 08:01:17 UTC
Created attachment 304544 [details]
attachment-13150-0.html

Hi Mateusz,

On Mon, Jul 3, 2023 at 10:54 PM <bugzilla-daemon@kernel.org> wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=217626
>
> Mateusz Jończyk (mat.jonczyk@o2.pl) changed:
>
>            What    |Removed                     |Added
>
> ----------------------------------------------------------------------------
>                  CC|                            |mat.jonczyk@o2.pl
>
> --- Comment #2 from Mateusz Jończyk (mat.jonczyk@o2.pl) ---
> Hello,
>
> Mr Bagas Sanjaya has notified me of this issue (thanks).
>
> > AMD staff member suspected commit
> d2a632a8a11756197deb1341bbb09c09abaf20ce to
> > cause a race condition, thus being responsible for the issue.
>
> Excuse me, but I cannot see how this commit could cause a race condition.
> All
> RTC accesses are / should be guarded by the rtc_lock. Could you elaborate
> on
> this, or perhaps quote this AMD staff member, if possible?
>
> Regardless, I'll attach a patch that partially reverts this commit.
>
> I'm not sure if this bug is present only during s2idle; it is likely that
> it
> would not be visible during normal usage. Is it present in other kernel
> logs?
> Try something like this:
>
>     journalctl -k -b all | grep --context=5 'read current time'
>

please find attached to this mail the context'd system's journal for your
information.


> I'm also attaching another patch that should make this bug appear more
> frequently in kernel logs. Could you test both patches? They are prepared
> on
> top of 6.4.1.
>
> Both patches only compile-tested.
>

the patch #1 you're suggesting is almost identical to AMD's proposal.
However, delays induced have been set to 300 usec instead of 1000 usec.
Since I've been able to reproduce the issue using 300 usec delay in atomic
ctx, I'll do the following:

- stick to 300 usec delays since they're able to trigger the issue
- stick to kernel 6.1.34, since this is the most recent long term stable
version
- apply patch #2 to have more info available in the log(s)
- come back to this report on further occurrence of the issue - this may
take a while, since the issue usually needs more than 24hrs of standby.

Afterwards I could upgrade to the mainline's most recent release in order
to reproduce the issue right there. FYA: I found no further changes
to rtc-mc146818-lib.c since commit d2a632a8a11756197deb1341bbb09c09abaf20ce

Yours,
Carsten


>
> --
> You may reply to this email to add a comment.
>
> You are receiving this mail because:
> You reported the bug.
Comment 7 Carsten Hatger 2023-07-04 08:05:28 UTC
Created attachment 304545 [details]
Output of journalctl -k -b all | grep --context=5 'read current time'

Ooops, 

forgot to attach the system's journal a produced by 

$ journalctl -k -b all | grep --context=5 'read current time'

Sorry for that,
Carsten
Comment 8 Mateusz Jończyk 2023-07-07 19:01:31 UTC
> Created attachment 304545 [details]
> Output of journalctl -k -b all | grep --context=5 'read current time'

So this happens only during s2idle, sometimes multiple times in a row.

What are the contents of /sys/power/mem_sleep on this laptop?

> - apply patch #2 to have more info available in the log(s)

This patch will probably mask this problem: there will only be a message in dmesg, but the system clock will run as normal afterwards.

I'll send you a third patch once it stops compiling - it prints values returned from all invocations of mc146818_get_time() to see if this issue happens during RTC updates.
Comment 9 Mateusz Jończyk 2023-07-10 19:31:54 UTC
Hello,

I think I know what is going on. At least on my laptop [1], during s2idle, the mach_get_cmos_time() function is executed several hundred times per second (around 400, to be precise).

The stacktraces of the accesses look like so:

[  718.707429] mach_get_cmos_time: returning 1688934290
[  718.707430] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 6.4.2unif25-nokaslr-00006-ga704b8946697 #386
[  718.707432] Hardware name: HP HP Laptop 17-by0xxx/84CA, BIOS F.69 04/21/2023
[  718.707432] Call Trace:
[  718.707433]  <TASK>
[  718.707434]  dump_stack_lvl+0x37/0x50
[  718.707437]  dump_stack+0x10/0x20
[  718.707440]  mach_get_cmos_time+0xbe/0x100
[  718.707444]  read_persistent_clock64+0x15/0x20
[  718.707448]  timekeeping_resume+0x31/0x160
[  718.707450]  ? __this_cpu_preempt_check+0x13/0x20
[  718.707453]  ? ct_kernel_enter.constprop.0+0x98/0xf0
[  718.707456]  tick_unfreeze+0x4b/0xb0
[  718.707459]  enter_s2idle_proper+0x87/0x170
[  718.707462]  cpuidle_enter_s2idle+0xaa/0x110
[  718.707465]  do_idle+0x21f/0x250
[  718.707468]  cpu_startup_entry+0x20/0x30
[  718.707471]  start_secondary+0x128/0x160
[  718.707472]  secondary_startup_64_no_verify+0xf4/0xfb
[  718.707476]  </TASK>
[  718.709957] mach_get_cmos_time: returning 1688934290
[  718.709958] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 6.4.2unif25-nokaslr-00006-ga704b8946697 #386
[  718.709959] Hardware name: HP HP Laptop 17-by0xxx/84CA, BIOS F.69 04/21/2023
[  718.709960] Call Trace:
[  718.709961]  <TASK>
[  718.709962]  dump_stack_lvl+0x37/0x50
[  718.709965]  dump_stack+0x10/0x20
[  718.709968]  mach_get_cmos_time+0xbe/0x100
[  718.709972]  ? finish_task_switch+0x99/0x300
[  718.709976]  read_persistent_clock64+0x15/0x20
[  718.709979]  timekeeping_suspend+0x2d/0x2e0
[  718.709982]  tick_freeze+0x5b/0xc0
[  718.709985]  enter_s2idle_proper+0x34/0x170
[  718.709988]  cpuidle_enter_s2idle+0xaa/0x110
[  718.709991]  do_idle+0x21f/0x250
[  718.709994]  cpu_startup_entry+0x20/0x30
[  718.709997]  start_secondary+0x128/0x160
[  718.709999]  secondary_startup_64_no_verify+0xf4/0xfb
[  718.710003]  </TASK>
[  718.712481] mach_get_cmos_time: returning 1688934290

The CPU numbers in the stacktraces switch during the same s2idle session.

There is one special thing about these stacktraces: there is no s2idle_loop() function there (as defined in kernel/power/suspend.c ) or friends. Functions in these stacktraces are executed in the CPU idle thread (swapper), which runs when no program uses the CPU and continually sleeps. See kernel/sched/idle.c (where do_idle() and cpu_startup_entry() are defined). cpuidle_enter_s2idle() is executed AFAIK after every CPU wakeup, which happens multiple times per second. It calls tick_freeze() and later tick_unfreeze(), both of which call mach_get_cmos_time().

Caveat: the above has been shown with "[PATCH 2/2] x86/rtc: read RTC multiple times in mach_get_cmos_time" applied, which might influence things.

It is likely that so frequent RTC accesses uncover things (or that the RTC is overheating, or sth else). During testing of my patch series back then (commit d2a632a8a11 and friends that you mentioned) I did run experiments like

while true; do cat /sys/class/rtc/rtc0/time ; done

but not during such long time periods.

I'll prepare you patches to see if such frequent RTC accesses happen also on your hardware.

Greetings,

Mateusz

[1] HP 17-by0001nw, from 2019-ish year, with a Kaby Lake Intel CPU.
Comment 10 Mateusz Jończyk 2023-07-10 19:34:15 UTC
Created attachment 304600 [details]
dmesg: RTC access storm on an Intel Kaby Lake laptop

dmesg with the stacktraces from my laptop.
Comment 11 Carsten Hatger 2023-07-20 09:37:51 UTC
Created attachment 304670 [details]
attachment-27807-0.html

Dear all,


On Fri, Jul 7, 2023 at 9:01 PM <bugzilla-daemon@kernel.org> wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=217626
>
> --- Comment #8 from Mateusz Jończyk (mat.jonczyk@o2.pl) ---
> > Created attachment 304545 [details]
> > Output of journalctl -k -b all | grep --context=5 'read current time'
>
> So this happens only during s2idle, sometimes multiple times in a row.
>
> What are the contents of /sys/power/mem_sleep on this laptop?
>

file contents are "s2idle"


>
> > - apply patch #2 to have more info available in the log(s)
>
> This patch will probably mask this problem: there will only be a message in
> dmesg, but the system clock will run as normal afterwards.
>
> I'll send you a third patch once it stops compiling - it prints values
> returned
> from all invocations of mc146818_get_time() to see if this issue happens
> during
> RTC updates.
>
>
THX, I'll apply them and launch another experiment when they become
available.


> --
> You may reply to this email to add a comment.
>
> You are receiving this mail because:
> You reported the bug.


Yours,
Carsten
Comment 12 Carsten Hatger 2023-07-20 09:40:35 UTC
Created attachment 304671 [details]
attachment-28635-0.html

Dear Mateusz,


On Mon, Jul 10, 2023 at 9:31 PM <bugzilla-daemon@kernel.org> wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=217626
>
> --- Comment #9 from Mateusz Jończyk (mat.jonczyk@o2.pl) ---
> Hello,
>
> I think I know what is going on. At least on my laptop [1], during s2idle,
> the
> mach_get_cmos_time() function is executed several hundred times per second
> (around 400, to be precise).
>
> The stacktraces of the accesses look like so:
>
> [  718.707429] mach_get_cmos_time: returning 1688934290
> [  718.707430] CPU: 2 PID: 0 Comm: swapper/2 Not tainted
> 6.4.2unif25-nokaslr-00006-ga704b8946697 #386
> [  718.707432] Hardware name: HP HP Laptop 17-by0xxx/84CA, BIOS F.69
> 04/21/2023
> [  718.707432] Call Trace:
> [  718.707433]  <TASK>
> [  718.707434]  dump_stack_lvl+0x37/0x50
> [  718.707437]  dump_stack+0x10/0x20
> [  718.707440]  mach_get_cmos_time+0xbe/0x100
> [  718.707444]  read_persistent_clock64+0x15/0x20
> [  718.707448]  timekeeping_resume+0x31/0x160
> [  718.707450]  ? __this_cpu_preempt_check+0x13/0x20
> [  718.707453]  ? ct_kernel_enter.constprop.0+0x98/0xf0
> [  718.707456]  tick_unfreeze+0x4b/0xb0
> [  718.707459]  enter_s2idle_proper+0x87/0x170
> [  718.707462]  cpuidle_enter_s2idle+0xaa/0x110
> [  718.707465]  do_idle+0x21f/0x250
> [  718.707468]  cpu_startup_entry+0x20/0x30
> [  718.707471]  start_secondary+0x128/0x160
> [  718.707472]  secondary_startup_64_no_verify+0xf4/0xfb
> [  718.707476]  </TASK>
> [  718.709957] mach_get_cmos_time: returning 1688934290
> [  718.709958] CPU: 2 PID: 0 Comm: swapper/2 Not tainted
> 6.4.2unif25-nokaslr-00006-ga704b8946697 #386
> [  718.709959] Hardware name: HP HP Laptop 17-by0xxx/84CA, BIOS F.69
> 04/21/2023
> [  718.709960] Call Trace:
> [  718.709961]  <TASK>
> [  718.709962]  dump_stack_lvl+0x37/0x50
> [  718.709965]  dump_stack+0x10/0x20
> [  718.709968]  mach_get_cmos_time+0xbe/0x100
> [  718.709972]  ? finish_task_switch+0x99/0x300
> [  718.709976]  read_persistent_clock64+0x15/0x20
> [  718.709979]  timekeeping_suspend+0x2d/0x2e0
> [  718.709982]  tick_freeze+0x5b/0xc0
> [  718.709985]  enter_s2idle_proper+0x34/0x170
> [  718.709988]  cpuidle_enter_s2idle+0xaa/0x110
> [  718.709991]  do_idle+0x21f/0x250
> [  718.709994]  cpu_startup_entry+0x20/0x30
> [  718.709997]  start_secondary+0x128/0x160
> [  718.709999]  secondary_startup_64_no_verify+0xf4/0xfb
> [  718.710003]  </TASK>
> [  718.712481] mach_get_cmos_time: returning 1688934290
>
> The CPU numbers in the stacktraces switch during the same s2idle session.
>
> There is one special thing about these stacktraces: there is no
> s2idle_loop()
> function there (as defined in kernel/power/suspend.c ) or friends.
> Functions in
> these stacktraces are executed in the CPU idle thread (swapper), which runs
> when no program uses the CPU and continually sleeps. See
> kernel/sched/idle.c
> (where do_idle() and cpu_startup_entry() are defined).
> cpuidle_enter_s2idle()
> is executed AFAIK after every CPU wakeup, which happens multiple times per
> second. It calls tick_freeze() and later tick_unfreeze(), both of which
> call
> mach_get_cmos_time().
>
> Caveat: the above has been shown with "[PATCH 2/2] x86/rtc: read RTC
> multiple
> times in mach_get_cmos_time" applied, which might influence things.
>
> It is likely that so frequent RTC accesses uncover things (or that the RTC
> is
> overheating, or sth else). During testing of my patch series back then
> (commit
> d2a632a8a11 and friends that you mentioned) I did run experiments like
>
> while true; do cat /sys/class/rtc/rtc0/time ; done
>
> but not during such long time periods.
>
> I'll prepare you patches to see if such frequent RTC accesses happen also
> on
> your hardware.
>
>
I'd appreciate that very much. Thanks for your support.


> Greetings,
>
> Mateusz
>
> [1] HP 17-by0001nw, from 2019-ish year, with a Kaby Lake Intel CPU.
>
> --
> You may reply to this email to add a comment.
>
> You are receiving this mail because:
> You reported the bug.


Yours,
Carsten
Comment 13 Carsten Hatger 2023-07-23 10:02:37 UTC
Created attachment 304687 [details]
RTC_issue-w_stack_dump_enabled-dmesg.txt

Hi folks,

finally I managed to reproduce the issue of not being able to read the RTC
post resume from suspend (s2idle), having the stack dumped to the kernel
ring buffer. Please find the output of dmesg attached to this message. The
whole story starts at line 1566.

I'm pretty sure the issue will occur only if the system has been on standby
for more than 24hrs. Please note that the system has been attached to the
AC after being in s2idle for more than 2 dys in order to prevent the
battery from being fully drained.

Any support would be great,

Yours,
Carsten

On Thu, Jul 20, 2023 at 11:40 AM Carsten Hatger <xmb8dsv4@gmail.com> wrote:

> Dear Mateusz,
>
>
> On Mon, Jul 10, 2023 at 9:31 PM <bugzilla-daemon@kernel.org> wrote:
>
>> https://bugzilla.kernel.org/show_bug.cgi?id=217626
>>
>> --- Comment #9 from Mateusz Jończyk (mat.jonczyk@o2.pl) ---
>> Hello,
>>
>> I think I know what is going on. At least on my laptop [1], during
>> s2idle, the
>> mach_get_cmos_time() function is executed several hundred times per second
>> (around 400, to be precise).
>>
>> The stacktraces of the accesses look like so:
>>
>> [  718.707429] mach_get_cmos_time: returning 1688934290
>> [  718.707430] CPU: 2 PID: 0 Comm: swapper/2 Not tainted
>> 6.4.2unif25-nokaslr-00006-ga704b8946697 #386
>> [  718.707432] Hardware name: HP HP Laptop 17-by0xxx/84CA, BIOS F.69
>> 04/21/2023
>> [  718.707432] Call Trace:
>> [  718.707433]  <TASK>
>> [  718.707434]  dump_stack_lvl+0x37/0x50
>> [  718.707437]  dump_stack+0x10/0x20
>> [  718.707440]  mach_get_cmos_time+0xbe/0x100
>> [  718.707444]  read_persistent_clock64+0x15/0x20
>> [  718.707448]  timekeeping_resume+0x31/0x160
>> [  718.707450]  ? __this_cpu_preempt_check+0x13/0x20
>> [  718.707453]  ? ct_kernel_enter.constprop.0+0x98/0xf0
>> [  718.707456]  tick_unfreeze+0x4b/0xb0
>> [  718.707459]  enter_s2idle_proper+0x87/0x170
>> [  718.707462]  cpuidle_enter_s2idle+0xaa/0x110
>> [  718.707465]  do_idle+0x21f/0x250
>> [  718.707468]  cpu_startup_entry+0x20/0x30
>> [  718.707471]  start_secondary+0x128/0x160
>> [  718.707472]  secondary_startup_64_no_verify+0xf4/0xfb
>> [  718.707476]  </TASK>
>> [  718.709957] mach_get_cmos_time: returning 1688934290
>> [  718.709958] CPU: 2 PID: 0 Comm: swapper/2 Not tainted
>> 6.4.2unif25-nokaslr-00006-ga704b8946697 #386
>> [  718.709959] Hardware name: HP HP Laptop 17-by0xxx/84CA, BIOS F.69
>> 04/21/2023
>> [  718.709960] Call Trace:
>> [  718.709961]  <TASK>
>> [  718.709962]  dump_stack_lvl+0x37/0x50
>> [  718.709965]  dump_stack+0x10/0x20
>> [  718.709968]  mach_get_cmos_time+0xbe/0x100
>> [  718.709972]  ? finish_task_switch+0x99/0x300
>> [  718.709976]  read_persistent_clock64+0x15/0x20
>> [  718.709979]  timekeeping_suspend+0x2d/0x2e0
>> [  718.709982]  tick_freeze+0x5b/0xc0
>> [  718.709985]  enter_s2idle_proper+0x34/0x170
>> [  718.709988]  cpuidle_enter_s2idle+0xaa/0x110
>> [  718.709991]  do_idle+0x21f/0x250
>> [  718.709994]  cpu_startup_entry+0x20/0x30
>> [  718.709997]  start_secondary+0x128/0x160
>> [  718.709999]  secondary_startup_64_no_verify+0xf4/0xfb
>> [  718.710003]  </TASK>
>> [  718.712481] mach_get_cmos_time: returning 1688934290
>>
>> The CPU numbers in the stacktraces switch during the same s2idle session.
>>
>> There is one special thing about these stacktraces: there is no
>> s2idle_loop()
>> function there (as defined in kernel/power/suspend.c ) or friends.
>> Functions in
>> these stacktraces are executed in the CPU idle thread (swapper), which
>> runs
>> when no program uses the CPU and continually sleeps. See
>> kernel/sched/idle.c
>> (where do_idle() and cpu_startup_entry() are defined).
>> cpuidle_enter_s2idle()
>> is executed AFAIK after every CPU wakeup, which happens multiple times per
>> second. It calls tick_freeze() and later tick_unfreeze(), both of which
>> call
>> mach_get_cmos_time().
>>
>> Caveat: the above has been shown with "[PATCH 2/2] x86/rtc: read RTC
>> multiple
>> times in mach_get_cmos_time" applied, which might influence things.
>>
>> It is likely that so frequent RTC accesses uncover things (or that the
>> RTC is
>> overheating, or sth else). During testing of my patch series back then
>> (commit
>> d2a632a8a11 and friends that you mentioned) I did run experiments like
>>
>> while true; do cat /sys/class/rtc/rtc0/time ; done
>>
>> but not during such long time periods.
>>
>> I'll prepare you patches to see if such frequent RTC accesses happen also
>> on
>> your hardware.
>>
>>
> I'd appreciate that very much. Thanks for your support.
>
>
>> Greetings,
>>
>> Mateusz
>>
>> [1] HP 17-by0001nw, from 2019-ish year, with a Kaby Lake Intel CPU.
>>
>> --
>> You may reply to this email to add a comment.
>>
>> You are receiving this mail because:
>> You reported the bug.
>
>
> Yours,
> Carsten
>
Comment 14 Mateusz Jończyk 2023-07-26 09:05:26 UTC
Hello,

I'm sorry for the delay.

It turns out that after removing the modifications to mach_get_cmos_time() that caused it to read the RTC 20 times, the storm of accesses to mach_get_cmos_time() during s2idle disappeared on my laptop. So possibly longer execution times of mach_get_cmos_time() were somehow affecting the cpuidle mechanism.

Please test the following patch on top of unmodified 6.1.x or 6.4.x kernel. I'd like to ask you for a dmesg after several minutes of s2idle. Set CONFIG_LOG_BUF_SHIFT (in "General setup" -> "Kernel log buffer size") to a higher value, or boot with sth like log_buf_len=32M so that the log does not overflow.

It is, however, possible that RTC access storm happens only during some runs of s2idle and it won't be possible to catch it the first time.

Does running something like this:

    while true; do cat /sys/class/rtc/rtc0/time ; sleep 0.05; done

for several minutes trigger this problem (give an "Unable to read current time from RTC" message in dmesg)?
Comment 15 Mateusz Jończyk 2023-07-26 09:06:24 UTC
Created attachment 304698 [details]
[PATCH] x86/rtc, s2idle: debugging messages for rtc access storm in s2idle
Comment 16 Carsten Hatger 2023-07-27 18:56:47 UTC
Created attachment 304714 [details]
attachment-27328-0.html

Hi all, Mateusz


On Wed, Jul 26, 2023 at 11:05 AM <bugzilla-daemon@kernel.org> wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=217626
>
> --- Comment #14 from Mateusz Jończyk (mat.jonczyk@o2.pl) ---
> Hello,
>
> I'm sorry for the delay.
>
> It turns out that after removing the modifications to mach_get_cmos_time()
> that
> caused it to read the RTC 20 times, the storm of accesses to
> mach_get_cmos_time() during s2idle disappeared on my laptop. So possibly
> longer
> execution times of mach_get_cmos_time() were somehow affecting the cpuidle
> mechanism.
>
> Please test the following patch on top of unmodified 6.1.x or 6.4.x
> kernel. I'd
> like to ask you for a dmesg after several minutes of s2idle. Set
> CONFIG_LOG_BUF_SHIFT (in "General setup" -> "Kernel log buffer size") to a
> higher value, or boot with sth like log_buf_len=32M so that the log does
> not
> overflow.
>

I've built a kernel including your patch on top of 6.1.40. I'm gathering
data right now and will come back to you upon there is sufficient data
available.


>
> It is, however, possible that RTC access storm happens only during some
> runs of
> s2idle and it won't be possible to catch it the first time.
>
> Does running something like this:
>
>     while true; do cat /sys/class/rtc/rtc0/time ; sleep 0.05; done
>
> for several minutes trigger this problem (give an "Unable to read current
> time
> from RTC" message in dmesg)?
>

Nope, everythings fine here, tests did run for approx. 60 min :-|


>
> --
> You may reply to this email to add a comment.
>
> You are receiving this mail because:
> You reported the bug.


I did some further research on this bug to further narrow down the search
space. I'd like to draw your attention to this report:
-
https://www.reddit.com/r/debian/comments/popag0/strange_clock_drift_after_suspendresume/


What I'd suggest to conclude from this is the following:
- only laptop devices are affected
- it's not a CPU issue since there are reports from both Intel and AMD
based systems
- the issue is triggered after the system has been suspended for a while.

I'm not a hardware/firmware or BIOS engineer, but I have the strong feeling
that the issue is related to the way power is delivered to the RTC. At some
point in time - while being in s2idle - the system's firmware/BIOS needs to
switch the RTCs power supply from the laptop's battery to the small CR3032
battery cell because the laptop's battery charge level has fallen below
some threshold and thus the RTC is about to lose its state unless the power
supply is switched to the CR3023 backup battery device. This would explain
why the issue occurs only if the system has been in s2idle for some time.
I've no idea whether there is an Interrupt raised or timings do change upon
switching to the alternate powersource. Any help would be appreciated.

I hope this is not too much poking about in the fog for all you experts.

THX so far, Yours,
Carsten
Comment 17 Carsten Hatger 2023-08-03 07:59:53 UTC
Dear all,

it took some time to reproduce the issue w/ debug info enabled. Nevertheless here it comes. Please find attached to this message the latest dmesg info titled dmesg-RTC_issue.txt. 

The whole story starts at line # 6000.

Hope that will help to resolve the bug.

Yours,
Carsten
Comment 18 Carsten Hatger 2023-08-03 08:01:17 UTC
Created attachment 304765 [details]
dmesg-RTC_issue.txt

Latest dmesg info w/ RTC issue recorded at line # 6000
Comment 19 Mateusz Jończyk 2023-08-04 19:20:52 UTC
Hello,

> it took some time to reproduce the issue w/ debug info enabled. Nevertheless
> here it comes. Please find attached to this message the latest dmesg info
> titled dmesg-RTC_issue.txt. 

Thank you. This excludes the "RTC access storm" hypothesis, as there is nothing like this in the log. The RTC accesses are not so frequent.

The RTC read problems happened twice, probably both during s2idle wakeup:

	[241612.509214] Unable to read current time from RTC
	[...]
	[241612.560575] rtc_cmos 00:00: unable to read current time

> I hope this is not too much poking about in the fog for all you experts.

No problem. :) I don't consider myself a Linux kernel expert yet.

> I've no idea whether there is an Interrupt raised or timings do change upon
> switching to the alternate powersource.

The RTC CMOS is a simple beast. There are one-byte registers for seconds, minutes, hours, days of a month, etc. It is not possible to read all of these values at once, which could cause incorrect results when some of them are updated while the time is being read. So there is a UIP (Update In Progress) bit in one of the registers that is set when the RTC is updating its time. This happens once a second and should last somewhere around < 2ms. The kernel should not read the RTC time while UIP=1 as it may be incorrect.

More: https://wiki.osdev.org/CMOS#RTC_Update_In_Progress

The (vanilla) kernel gives up when the UIP is set longer then ~10ms (as this means that the clock is malfunctioning).

This has given me an idea. Try to read the RTC for up to ~10s, but warn if it takes longer then the previous limit of ~10ms. This will give an idea how long it takes when the RTC is malfunctioning.

If the time is reasonable (like <100ms), I'll submit a patch that just increases the max loop count in mach_get_cmos_time(), at least except for when it is used in RTC interrupt emulation with HPET (arch/x86/kernel/hpet.c).

The documentation of the CPUs / chipsets in recent AMD systems is sadly not public. I also am not sure if the RTC is present in the CPU, or somewhere on the motherboard.

Greetings,

Mateusz
Comment 20 Mario Limonciello (AMD) 2023-08-04 19:26:09 UTC
> This has given me an idea. Try to read the RTC for up to ~10s, but warn if it
> takes longer then the previous limit of ~10ms. This will give an idea how
> long it takes when the RTC is malfunctioning.

This sounds like a great idea to me.

> The documentation of the CPUs / chipsets in recent AMD systems is sadly not
> public. I also am not sure if the RTC is present in the CPU, or somewhere on
> the motherboard.

This case it's HPET emulation isn't it? 

In any case; here's the public PPR for a recent APU:
https://www.amd.com/system/files/TechDocs/55901_B1_pub_053.zip

You can see in volume 5 that HPET and RTC are both in FCH (within APU).
Comment 21 Mateusz Jończyk 2023-08-04 19:27:34 UTC
Created attachment 304779 [details]
[PATCH] DEBUG: rtc-mc146818-lib: raise iteration limit, warn for high iteration count

There is an issue that reading the RTC time fails during s2idle in some
circumstances. Try much longer to read the RTC, but print a warning if
the previous limit of 100 loop iterations (~10ms) is exceeded.

Please test this patch, OFC with unmodified delays in mc146818_avoid_UIP().
Comment 22 Mateusz Jończyk 2023-08-04 20:15:48 UTC
> In any case; here's the public PPR for a recent APU:
> https://www.amd.com/system/files/TechDocs/55901_B1_pub_053.zip
>
> You can see in volume 5 that HPET and RTC are both in FCH (within APU).

Thank you and sorry for the accusations. Apparently my Google skills failed me.

> This case it's HPET emulation isn't it? 

Do you mean the software emulation of RTC interrupts that happen when the HPET takes the RTC interrupt line (CONFIG_HPET_EMULATE_RTC, at the end of arch/x86/kernel/hpet.c )? If so, that's not HPET emulation.

The kernel is regularly accessing the RTC in the middle of s2idle. This is done in the CPU idle thread (swapper) and (from what I understand) during every cpuidle wakeup. enter_s2idle_proper() in drivers/cpuidle/cpuidle.c calls tick_freeze() and then tick_unfreeze() and it is in a cpuidle loop. Both of these access the RTC.

The problems here happen only during s2idle or during wakeup from it. Very frequent RTC accesses when the system is running did not trigger them.
Comment 23 Carsten Hatger 2023-08-07 18:24:22 UTC
Created attachment 304790 [details]
attachment-17994-0.html

Hello Mateusz,

I've applied the patch to 6.1.43. Delays in mc146818_avoid_UIP are
currently at 100 usec (unmodified).

I got from the logs that it takes 2.7 days on average to trigger the bug.
I'll come back to you when the system has been in s2idle for that duration
(at least).

Thanks for the patch,
Carsten

On Fri, Aug 4, 2023 at 9:27 PM <bugzilla-daemon@kernel.org> wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=217626
>
> --- Comment #21 from Mateusz Jończyk (mat.jonczyk@o2.pl) ---
> Created attachment 304779 [details]
>   --> https://bugzilla.kernel.org/attachment.cgi?id=304779&action=edit
> [PATCH] DEBUG: rtc-mc146818-lib: raise iteration limit, warn for high
> iteration
> count
>
> There is an issue that reading the RTC time fails during s2idle in some
> circumstances. Try much longer to read the RTC, but print a warning if
> the previous limit of 100 loop iterations (~10ms) is exceeded.
>
> Please test this patch, OFC with unmodified delays in mc146818_avoid_UIP().
>
> --
> You may reply to this email to add a comment.
>
> You are receiving this mail because:
> You reported the bug.
Comment 24 Carsten Hatger 2023-08-16 07:51:32 UTC
Dear Mateusz, Mario,

I've been running the SUT for a couple of days now but haven't been able to put it into s2idle for more than 2 days without interruption due to personal constraints. I'll continue to test until I've been able to keep the SUT in s2idle for more than 2 days.

Anyway, this morning some interesting detail showed up in the kernel ring buffer I'd like to share with you. Please find below the latest dmesg output w/ your patch applied. 

The file which holds latest kernel ring buffer's contents is named dmesg_2023-08-16.txt, the detail of interest starts at line no. 22829.


Yours,
Carsten
Comment 25 Carsten Hatger 2023-08-16 07:54:51 UTC
Created attachment 304864 [details]
dmesg_2023-08-16.txt

Latest kernel ring buffer contents w/ warning raised in mc146818_avoid_UIP.cold+0x0/0x1d.

Please refer to line no. 22828
Comment 26 Mario Limonciello (AMD) 2023-08-17 01:54:12 UTC
Did you add this extra debugging statement for "executed %d iterations" somewhere?  I've never seen that in kernel code before.  I guess you're counting the number of s2idle loops?

Can you please turn on /sys/power/pm_debug_messages as well?  It's a bit weird to me that the loop would run that many times in the first place.  That means the hardware has been returning control to the kernel a lot more frequently than expected (and maybe that's allowing to trigger this issue).
Comment 27 Carsten Hatger 2023-08-17 09:13:07 UTC
Created attachment 304878 [details]
attachment-15243-0.html

Hi Mario, Mateusz

On Thu, Aug 17, 2023 at 3:54 AM <bugzilla-daemon@kernel.org> wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=217626
>
> --- Comment #26 from Mario Limonciello (AMD) (mario.limonciello@amd.com)
> ---
> Did you add this extra debugging statement for "executed %d iterations"
> somewhere?  I've never seen that in kernel code before.  I guess you're
> counting the number of s2idle loops?
>


The debugging statements stem from
https://bugzilla.kernel.org/attachment.cgi?id=304779 which is the latest
patch provided by Mateusz.
The patch aims to track down the issue of not being able to read from the
RTC in cases of being in s2idle for more than approx. 2,5 days.


> Can you please turn on /sys/power/pm_debug_messages as well?  It's a bit
> weird
> to me that the loop would run that many times in the first place.  That
> means
> the hardware has been returning control to the kernel a lot more frequently
> than expected (and maybe that's allowing to trigger this issue).
>

I'll do so.


>
> --
> You may reply to this email to add a comment.
>
> You are receiving this mail because:
> You reported the bug.


Yours,
Carsten
Comment 28 Carsten Hatger 2023-08-17 13:24:21 UTC
Created attachment 304879 [details]
dmesg-w-pm_debug_messages.txt.gz

Hi Mario, Mateusz,

please find the latest dmesg w/ pm_debug_messages enabled attached to this
email. I'm not quite sure whether the session did work out as expected: The
system did suspend successfully but drained the battery within ~4 hrs down
to 50%. In addition to this, it became quite warm but not hot due to lack
of cooling.

The whole story w/ pm_debug_enabled starts at line # 25410.

I'll continue testing in order to trigger the RTC issue after the system
has been in s2idle for > 2,5 days w/o pm_debug_messages enabled.

Yours,
Carsten
Comment 29 Mateusz Jończyk 2023-08-17 13:46:13 UTC
Hello,

> Anyway, this morning some interesting detail showed up in the kernel ring
> buffer I'd like to share with you. Please find below the latest dmesg output
> w/ your patch applied. 

That's what I have been looking for:

	[783644.771489] mc146818_avoid_UIP: reading the RTC time required 350 loop iterations

and later, in dmesg-w-pm_debug_messages.txt.gz:

	[869685.158594] mc146818_avoid_UIP: reading the RTC time required 480 loop iterations

So it seems that this happens also during shorter s2idle intervals.

> Did you add this extra debugging statement for "executed %d iterations"
> somewhere?  I've never seen that in kernel code before.  I guess you're
> counting the number of s2idle loops?

Yes, this was introduced by my previous patch:

	https://bugzilla.kernel.org/attachment.cgi?id=304698&action=diff

that was counting the number of iterations in s2idle_loop().

This is in addition to cpuidle loops that wake up around the same time as the main s2idle thread (incl. on other CPUs) and read the RTC.
Comment 30 Mario Limonciello (AMD) 2023-08-17 18:57:26 UTC
> please find the latest dmesg w/ pm_debug_messages enabled attached to this
email. I'm not quite sure whether the session did work out as expected: The
system did suspend successfully but drained the battery within ~4 hrs down
to 50%. In addition to this, it became quite warm but not hot due to lack
of cooling.

This might actually be the same cause as https://bugzilla.kernel.org/show_bug.cgi?id=217754.  To confirm can you please give me the output of /sys/kernel/debug/gpio on a kernel that behaves this way and one without?

It was introduced in 6.1.40.

It may also "exacerbate" this issue because it's causing more frequent wakeups exposing the RTC to more reads.

> So it seems that this happens also during shorter s2idle intervals.

If that behavior is caused by the same cause of that bug please pick up my proposed patch for that bug before reading into the results captured inso far.

> Yes, this was introduced by my previous patch:

Got it, thanks.
Comment 31 Carsten Hatger 2023-08-17 20:41:26 UTC
Created attachment 304882 [details]
gpio-post-suspend-w-pm_debug_messages

Hi Mario,  Mateusz,



On Thu, Aug 17, 2023 at 8:57 PM <bugzilla-daemon@kernel.org> wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=217626
>
> --- Comment #30 from Mario Limonciello (AMD) (mario.limonciello@amd.com)
> ---
> > please find the latest dmesg w/ pm_debug_messages enabled attached to
> this
> email. I'm not quite sure whether the session did work out as expected: The
> system did suspend successfully but drained the battery within ~4 hrs down
> to 50%. In addition to this, it became quite warm but not hot due to lack
> of cooling.
>
> This might actually be the same cause as
> https://bugzilla.kernel.org/show_bug.cgi?id=217754.  To confirm can you
> please
> give me the output of /sys/kernel/debug/gpio on a kernel that behaves this
> way
> and one without?
>

I'm pretty sure the battery draining and temperature rising has been caused
by setting the pm_debug_messages to 1 ... Anyway, I compiled various gpio
states to provide further insight into the new draining/temperature issue.

I found the gpio states to differ pre and post suspend cycles. In order to
separate states from those that are related to pm_debug_messages set to 1,
I've compiled four files. All information gathered from 6.1.43 + the latest
patch (https://bugzilla.kernel.org/attachment.cgi?id=304779) provided by
Mateusz

Please find attached to this message the following information:
- gpio states post clean boot pre 1st suspend cycle, pm_debug_messages set
to 0, named gpio-pre-suspend-wo-pm_debug_messages
- gpio states post 1st suspend cycle,  pm_debug_messages set to 0,
named gpio-post-suspend-wo-pm_debug_messages
- gpio states pre 2nd suspend cycle, pm_debug_messages set to 1,
named gpio-pre-suspend-w-pm_debug_messages
- gpio states post 2nd suspend cycle,  pm_debug_messages set to 1,
named gpio-post-suspend-w-pm_debug_messages



> It was introduced in 6.1.40.
>
> It may also "exacerbate" this issue because it's causing more frequent
> wakeups
> exposing the RTC to more reads.
>

IIRC "my"/the RTC issue is significantly older than 6.1.40. Aside from poor
support for speaker and mic mute It's been one of the reasons to start
building a custom/vanilla kernel instead of relying on the official Ubuntu
ones.


>
> > So it seems that this happens also during shorter s2idle intervals.
>
> If that behavior is caused by the same cause of that bug please pick up my
> proposed patch for that bug before reading into the results captured inso
> far.
>
> > Yes, this was introduced by my previous patch:
>
> Got it, thanks.
>
> --
> You may reply to this email to add a comment.
>
> You are receiving this mail because:
> You reported the bug.


Hope that helps,
Carsten
Comment 32 Carsten Hatger 2023-08-17 20:41:27 UTC
Created attachment 304883 [details]
gpio-pre-suspend-w-pm_debug_messages
Comment 33 Carsten Hatger 2023-08-17 20:41:27 UTC
Created attachment 304884 [details]
gpio-post-suspend-wo-pm_debug_messages
Comment 34 Carsten Hatger 2023-08-17 20:41:27 UTC
Created attachment 304885 [details]
gpio-pre-suspend-wo-pm_debug_messages
Comment 35 Mario Limonciello (AMD) 2023-08-18 01:42:46 UTC
> I'm pretty sure the battery draining and temperature rising has been caused
by setting the pm_debug_messages to 1 ... Anyway, I compiled various gpio
states to provide further insight into the new draining/temperature issue.

No; turning on the debugging messages shouldn't cause that.

> IIRC "my"/the RTC issue is significantly older than 6.1.40. Aside from poor
support for speaker and mic mute It's been one of the reasons to start
building a custom/vanilla kernel instead of relying on the official Ubuntu
ones.

Yeah I'm not saying it's the solution to your original RTC issue; but your newly described behavior it may explain.  It would also explain why you're able to trigger it more easily now.

> I found the gpio states to differ pre and post suspend cycles.

Can you please give me a sample of this file from 6.1.39 or older before suspend?
Comment 36 Carsten Hatger 2023-08-18 08:33:59 UTC
Created attachment 304886 [details]
gpio-pre_suspend-6.1.34.txt

Hi Mario,


On Fri, Aug 18, 2023 at 3:42 AM <bugzilla-daemon@kernel.org> wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=217626
>
> --- Comment #35 from Mario Limonciello (AMD) (mario.limonciello@amd.com)
> ---
> > I'm pretty sure the battery draining and temperature rising has been
> caused
> by setting the pm_debug_messages to 1 ... Anyway, I compiled various gpio
> states to provide further insight into the new draining/temperature issue.
>
> No; turning on the debugging messages shouldn't cause that.
>

Got it.


> > IIRC "my"/the RTC issue is significantly older than 6.1.40. Aside from
> poor
> support for speaker and mic mute It's been one of the reasons to start
> building a custom/vanilla kernel instead of relying on the official Ubuntu
> ones.
>
> Yeah I'm not saying it's the solution to your original RTC issue; but your
> newly described behavior it may explain.  It would also explain why you're
> able
> to trigger it more easily now.
>
> > I found the gpio states to differ pre and post suspend cycles.
>
> Can you please give me a sample of this file from 6.1.39 or older before
> suspend?
>

Please find attached to this message gpio states pre suspend for kernel
rev. 6.1.34


>
> --
> You may reply to this email to add a comment.
>
> You are receiving this mail because:
> You reported the bug.


THX & yours,
Carsten
Comment 37 Mario Limonciello (AMD) 2023-08-18 14:33:13 UTC
Yeah; your GPIO pin 18 changes to an s0i3 wake source.  Examining a previously shared acpidump by you this is also your touchpad (similar to the other issue reported).

                        GpioInt (Level, ActiveLow, ExclusiveAndWake, PullNone, 0x0000,
                            "\\_SB.GPIO", 0x00, ResourceConsumer, ,
                            )
                            {   // Pin list
                                0x0012
                            }

Please add that patch from that bug to your kernel.  Hopefully it fixes your high power consumption.
Comment 38 Carsten Hatger 2023-08-18 20:38:30 UTC
Created attachment 304905 [details]
attachment-6924-0.html

Mario,

thanks for the patch. Will it make it's path into mainline eventually or
should I keep it within my custom "probook" branch?

Could it be the systems finger print sensor being also a wake source? Its
connected via USB, bus 1, device 3. This device wakes from modern standby
using MS W11, at least.

Output of lsusb:
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 003 Device 002: ID 04f2:b6af Chicony Electronics Co., Ltd HP HD Camera
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 003: ID 04f3:0c7e Elan Microelectronics Corp. ELAN:ARM-M4
Bus 001 Device 002: ID 0489:e0d6 Foxconn / Hon Hai
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

FYI: Further MS W11 behavior:
- The touchscreen doesn't function as a wake source.
- Unplugging the AC doesn't wake the system
- Plugging in the AC wakes the system briefly but reenters modern standby
afterwards immediately
- USB Port plugging untested so far

Yours,
Carsten

On Fri, Aug 18, 2023 at 4:33 PM <bugzilla-daemon@kernel.org> wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=217626
>
> --- Comment #37 from Mario Limonciello (AMD) (mario.limonciello@amd.com)
> ---
> Yeah; your GPIO pin 18 changes to an s0i3 wake source.  Examining a
> previously
> shared acpidump by you this is also your touchpad (similar to the other
> issue
> reported).
>
>                         GpioInt (Level, ActiveLow, ExclusiveAndWake,
> PullNone,
> 0x0000,
>                             "\\_SB.GPIO", 0x00, ResourceConsumer, ,
>                             )
>                             {   // Pin list
>                                 0x0012
>                             }
>
> Please add that patch from that bug to your kernel.  Hopefully it fixes
> your
> high power consumption.
>
> --
> You may reply to this email to add a comment.
>
> You are receiving this mail because:
> You reported the bug.
Comment 39 Mario Limonciello (AMD) 2023-08-18 21:20:41 UTC
> thanks for the patch. Will it make it's path into mainline eventually or
should I keep it within my custom "probook" branch?

It will eventually; but please include it in your branch for the purpose of this issue.  It may "contribute" to the issue this bug is about.

> Could it be the systems finger print sensor being also a wake source? Its
connected via USB, bus 1, device 3. This device wakes from modern standby
using MS W11, at least.

If it can from Windows it can from Linux.  If it's not working let's talk about it in a different issue.
Comment 40 Mateusz Jończyk 2023-08-20 17:07:26 UTC
Hello,

Thank you, Mario. When searching the internet for the error message "Unable to read current time from RTC", I have found it in a log posted on the AMD DRM GitLab:

https://gitlab.freedesktop.org/drm/amd/-/issues/2481

I have been writing with Mr Ivan Stanković there, he has tested my two patches and provided (among others) these two log lines:

[ 7696.434872] mc146818_avoid_UIP: reading the RTC time required 1000 loop iterations 
[ 8835.428427] mc146818_avoid_UIP: reading the RTC time required 933 loop iterations
Comment 41 Mario Limonciello (AMD) 2023-08-21 13:14:57 UTC
Thanks.  I left a comment there suggesting them to pick up this other patch for the GPIO issue.  Hopefully that decreases the numbers for them too.
Comment 42 Carsten Hatger 2023-08-23 19:33:16 UTC
Created attachment 304928 [details]
dmesg-post-bios_update.txt.gz

Dear all,

sorry for the late reply.

I've been able to run a rather long (>2,8 dys) s2idle session without any
errors while reading the RTC. Due to security constraints (CVE-2023-20555),
I had to update the system's BIOS to its latest revision prior to kicking
off the test. This has affected the test results.

@Mateusz: The maximum number of loop iterations in s2idle decreased from
143 within 8,8 hrs down to 53 within 68,5 hrs. There were no recordings of
loop iterations while reading the RTC in mc146818_avoid_UIP.

It remains unclear to me why the system needs to iterate through s2idle
each 1/2 hour on average. But that's another issue.

The vendors BIOS revision for device HP Pro x360 435 G9 increased from
01.11.01 Rev.A to 01.12.00 Rev.A. BIOS components have been updated like
this
- Embedded Controller (EC), from version 17.61.00 to version 17.63.00
- PSP Firmware, from version 0.11.1.7F  to 0.11.1.81
- SMU Firmware, from version 64.66.0 to  64.71.0

For the sake of completeness please find attached to this message the
contents of the kernel ring buffer. The full story starts at line # 2799.

It seems to me, the issue can be solved by updating the system to the
latest BIOS as available today. I'll continue to monitor the logs for a
couple of weeks to make sure I'm on the right track, otherwise I'll come
back here.

Thank you all for your great support, it's been a pleasure to support your
work. Please do not hesitate to contact me in case I could further assist
in resolving this issue.

Yours,
Carsten



On Mon, Aug 21, 2023 at 3:14 PM <bugzilla-daemon@kernel.org> wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=217626
>
> --- Comment #41 from Mario Limonciello (AMD) (mario.limonciello@amd.com)
> ---
> Thanks.  I left a comment there suggesting them to pick up this other
> patch for
> the GPIO issue.  Hopefully that decreases the numbers for them too.
>
> --
> You may reply to this email to add a comment.
>
> You are receiving this mail because:
> You reported the bug.
Comment 43 Mario Limonciello (AMD) 2023-08-23 19:35:25 UTC
This comment was with specifically the high power consumption patch fix included, right?  If so; that could also be used to explain the improvement.
Comment 44 Carsten Hatger 2023-08-23 20:00:14 UTC
Created attachment 304929 [details]
attachment-15770-0.html

HI Mario,

On Wed, Aug 23, 2023 at 9:35 PM <bugzilla-daemon@kernel.org> wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=217626
>
> --- Comment #43 from Mario Limonciello (AMD) (mario.limonciello@amd.com)
> ---
> This comment was with specifically the high power consumption patch fix
> included, right?  If so; that could also be used to explain the
> improvement.
>

Nope, it wasn't. I had this battery draining and lack of cooling issue,
too. But it's quite a long time (>4 months) ago. Unfortunately I haven't
been able to reproduce the issue - so I decided it's somewhat related to
differing states post boot depending on the selected OS and different OS
strategies to init fw/hw successfully.

Anyway, applying your proposed patch is the next thing to do on my todolist
;-) I'll come back to you if it can make a difference, preferably in a
different thread. I'd very much appreciate it if I could make one or more
components to consume less power in s2idle.

Yours,
Carsten


> --
> You may reply to this email to add a comment.
>
> You are receiving this mail because:
> You reported the bug.
Comment 45 Mario Limonciello (AMD) 2023-08-23 20:14:26 UTC
> It remains unclear to me why the system needs to iterate through s2idle
each 1/2 hour on average. But that's another issue.

I suspect it's HP's EC notifying the OS of the updated battery level.  So each time it "drops a percent" or so then the OS wakes up and reads the battery level again.

> Nope, it wasn't.

Got it.  In that case you're probably right that this was improved by BIOS.

> Anyway, applying your proposed patch is the next thing to do on my todolist
;-) I'll come back to you if it can make a difference

OK.

> I'd very much appreciate it if I could make one or more
components to consume less power in s2idle.

If the APU is in the deepest state and you're not getting spurious wakeups from the touchpad (anymore) then I don't see a lot more that can be done from Linux.  I'd expect that you should be on par with Windows then.
Comment 46 Simon Alexander 2023-10-13 07:55:46 UTC
I'm also affected by this.

* Make and model: HP EliteBook 865 16 inch G9 Notebook PC
* OS: fedora 38
* Kernel version: 6.5.6-200.fc38.x86_64
* BIOS Version: U82 Ver. 01.06.00
* BIOS Release Date: 07/03/2023

Seems like I'm the latest version of the BIOS and what fedora ships. 

I experience that waking the PC from sleep will yield either: correct date, date in the past (3 hours behind this morning) or many years in the future (year 2077).
Comment 47 Mario Limonciello (AMD) 2023-10-17 18:20:59 UTC
> HP EliteBook 865 16 inch G9 Notebook PC

Is this Rembrandt or Barcelo?

> I experience that waking the PC from sleep will yield either: correct date,
> date in the past (3 hours behind this morning) or many years in the future
> (year 2077).

When you have the failures are they specific to cycles that are in s2idle for a long time?
Comment 48 Simon Alexander 2023-10-17 18:44:02 UTC
(In reply to Mario Limonciello (AMD) from comment #47)
> > HP EliteBook 865 16 inch G9 Notebook PC
> 
> Is this Rembrandt or Barcelo?


lscpu gives me this: AMD Ryzen 7 PRO 6850U with Radeon Graphics


> > I experience that waking the PC from sleep will yield either: correct date,
> > date in the past (3 hours behind this morning) or many years in the future
> > (year 2077).
> 
> When you have the failures are they specific to cycles that are in s2idle
> for a long time?

I have no idea how to answer that question (I don't know what you're asking). 
I guess it has been in "sleep" for anything between 8 and 24 hours at the time of resuming the computer.
Comment 49 Mario Limonciello (AMD) 2023-10-17 18:52:24 UTC
> lscpu gives me this: AMD Ryzen 7 PRO 6850U with Radeon Graphics

That's Rembrandt; which is a different APU than Carsten reported this on.  This makes a stronger argument that it's a bug with the Linux behavior.

> I have no idea how to answer that question (I don't know what you're asking). 
I guess it has been in "sleep" for anything between 8 and 24 hours at the time of resuming the computer.

There is a hypothesis that it only happens with longer sleep cycles.  Maybe can you try to come up with any correlation to determine if there is a linear relationship in any way?
Comment 50 Simon Alexander 2023-10-17 19:21:28 UTC
(In reply to Mario Limonciello (AMD) from comment #49)
> > lscpu gives me this: AMD Ryzen 7 PRO 6850U with Radeon Graphics
> 
> That's Rembrandt; which is a different APU than Carsten reported this on. 
> This makes a stronger argument that it's a bug with the Linux behavior.
> 
> > I have no idea how to answer that question (I don't know what you're
> asking). 
> I guess it has been in "sleep" for anything between 8 and 24 hours at the
> time of resuming the computer.
> 
> There is a hypothesis that it only happens with longer sleep cycles.  Maybe
> can you try to come up with any correlation to determine if there is a
> linear relationship in any way?

Right, I'm pretty sure (will check) that it has only happened after prolonged sleeps - I've stopped working somewhere between 16 and 21 and started again probably around 7.

unfortunately I can only test this theory when I'm working from home as I also have issues with the machine waking from sleep when out of the charger...! 
I know this because the computer is usually always out of battery the next day -.-
Comment 51 Simon Alexander 2023-10-17 19:23:08 UTC
So to clarify my previous message; I think I only experience this when I wake from sleep, and as such the machine has been asleep for at least 8 hours.
Again will verify.
Comment 52 Mario Limonciello (AMD) 2023-10-17 19:29:11 UTC
> unfortunately I can only test this theory when I'm working from home as I
> also have issues with the machine waking from sleep when out of the
> charger...! 
I know this because the computer is usually always out of battery the next day -.-

Separately from this RTC issue, please reproduce under the context of this script:
https://gitlab.freedesktop.org/drm/amd/-/blob/master/scripts/amd_s2idle.py

And report an issue here with the report it generates.
https://gitlab.freedesktop.org/drm/amd/-/issues

FYI - there is an open issue with some HP laptops I'm aware that suspending in software followed by closing the lid prevents deep sleep, that might be what you're hitting.

> Again will verify.

OK
Comment 53 Simon Alexander 2023-10-17 19:36:00 UTC
(In reply to Mario Limonciello (AMD) from comment #52)
> > unfortunately I can only test this theory when I'm working from home as I
> > also have issues with the machine waking from sleep when out of the
> > charger...! 
> I know this because the computer is usually always out of battery the next
> day -.-
> 
> Separately from this RTC issue, please reproduce under the context of this
> script:
> https://gitlab.freedesktop.org/drm/amd/-/blob/master/scripts/amd_s2idle.py
> 
> And report an issue here with the report it generates.
> https://gitlab.freedesktop.org/drm/amd/-/issues
> 
> FYI - there is an open issue with some HP laptops I'm aware that suspending
> in software followed by closing the lid prevents deep sleep, that might be
> what you're hitting.
> 
> > Again will verify.
> 
> OK

Will be sure to do, thanks :)
Comment 54 Mario Limonciello (AMD) 2023-11-09 14:35:55 UTC
For some other unrelated issues I recently submitted a patch that changes the policy to not use ACPI alarm.  Can you please see if this also affects your issue in any way?

https://lore.kernel.org/linux-kernel/20231106162310.85711-1-mario.limonciello@amd.com/

You can also add rtc_cmos.use_acpi_alarm=1 on your kernel command line instead of applying the patch.
Comment 55 Mario Limonciello (AMD) 2023-11-09 14:36:28 UTC
Sorry, typo.  *change policy TO use ACPI alarm.
Comment 56 Mario Limonciello (AMD) 2023-11-16 15:08:34 UTC
A few people have reported this on the Framework 13" product which has an AMD Phoenix APU as well here:

https://community.frame.work/t/responded-fw13-amd-fedora-39-system-clock-advances-50-years-during-overnight-suspend/40188

I've asked in that thread for additional data points using the debugging patch above.
Comment 57 Mario Limonciello (AMD) 2023-11-17 00:37:25 UTC
@Mateusz

> https://wiki.osdev.org/CMOS#RTC_Update_In_Progress

I was reading through this some more.  The kernel implementation is closest to "the second alternative" mentioned in that post.

Something stands out to me though that I wanted to talk about:

> Unfortunately, doing it correctly (waiting until the "Update in progress"
> flag becomes set and then waiting until it becomes clear) is very slow - it
> may take an entire second of waiting/polling before you can read the
> registers. There are 2 alternatives.

Right now the code will only really retry for up to 10ms, but the specs don't actually guarantee in any way that it will be "done" in 10ms.  

They actually only say that another UIP won't occur for 244uS.  That leaves quite a long window of time that it *might* not be valid.

So I would think it makes sense to change the loop to 10000 iterations (notice this is a factor of 10 less than your debugging patch).
This will ensure that you check for up to a full second.

Most cases shouldn't need it, but I would think issue can never be hit again with it because you'll always catch one end of the UIP.
Comment 58 Mateusz Jończyk 2023-11-17 09:02:41 UTC
Hello,

I'm sorry for the delay and not following this bug report.

It is indeed feasible to increase the number of iterations in mc146818_avoid_UIP() to 10000 (delay up to 1s). However, this can potentially cause problems in hpet_rtc_interrupt() in arch/x86/kernel/hpet.c that executes mc146818_get_time() many times per second. I'm also not 100% sure that other code can tolerate long execution times of mc146818_get_time() (but that should be better than failing in this function, which other code does not handle well).

So there would be necessary to add a second parameter to mc146818_get_time() that specifies how long this function may wait. I have been thinking about this, but $life.

Greetings,

Mateusz
Comment 59 Tobias Brohl 2023-11-17 10:05:02 UTC
Hello, I have the same issue withe the clock being off multiple decades. 
It happens even after 2h in suspend on my R7 7840HS (Phoenix, Lenovo 83AM0007GE) 

Greetings
Tobias Brohl
Comment 60 Mateusz Jończyk 2023-11-23 10:27:15 UTC
Hello,

Mario has sent patches to LKML that are going to fix this issue:

https://lore.kernel.org/linux-rtc/20231120141555.458-1-mario.limonciello@amd.com/T/

. Please test them (it does not have to be on top of 6.7-rc2, they also work on older kernels, 6.1 upwards) and provide dmesg output after some suspend time.

Greetings,

Mateusdz
Comment 61 Mateusz Jończyk 2024-02-12 20:56:10 UTC
Hello,

FTR: Patches to fix this issue were included in Linux 6.8.0-rc1:

cef9ecc8e938 rtc: Extend timeout for waiting for UIP to clear to 1s
120931db07b4 rtc: Add support for configuring the UIP timeout for RTC reads
1311a8f0d4b2 rtc: Adjust failure return code for cmos_set_alarm()
af838635a3eb rtc: mc146818-lib: Adjust failure return code for mc146818_get_time()

and released in Linux stable - versions 6.1.76, 6.6.15 and 6.7.3 around end of January. They were not yet shipped in Ubuntu's 6.5 kernel FWIW.

Greetings,

Mateusz

Note You need to log in before you can comment on or make changes to this bug.