Bug 216216

Summary: [BISECT INCLUDED] PNP0501 serial driver takes almost 2 seconds to suspend/resume
Product: Drivers Reporter: Todd Brandt (todd.e.brandt)
Component: SerialAssignee: Russell King (rmk)
Status: RESOLVED CODE_FIX    
Severity: high CC: john.ogness, lenb, pmladek, rjw, tglx
Priority: P1    
Hardware: Intel   
OS: Linux   
Kernel Version: 5.19.0-rc1 Subsystem:
Regression: No Bisected commit-id:
Bug Depends on:    
Bug Blocks: 178231    
Attachments: Good freeze on 5.18.0
Bad freeze on 5.19.0-rc1
otcpl-aml-y-5.19.0-rc6-boot-dmesg.txt
otcpl-aml-y-5.19.0-rc6-freeze.html
otcpl-asus-e200-cht-5.19.0-rc6-boot-dmesg.txt
otcpl-asus-e200-cht-5.19.0-rc6-freeze.html
otcpl-whl-u-5.19.0-rc6-boot-dmesg.txt
otcpl-whl-u-5.19.0-rc6-freeze.html
fix pr_flush() timeout on suspend
v2 patch to fix reported issue

Description Todd Brandt 2022-07-07 20:11:08 UTC
Created attachment 301362 [details]
Good freeze on 5.18.0

Ever since 5.19.0-rc1 the serial device has taken almost a second longer in both suspend and resume. This effect is witnessed in half the machines in our lab (~10 machines). It occurs on dell, asus, gigbyte and other machines so it'
s not hardware specific. I've included a good and bad sleepgraph timeline showing the effect. 5.18.0 is ok, 5.19.0-rc1-4 are bad. I've performed a bisect and found the offending commit:

commit 3b604ca81202eea2a917eb6491e90f610fba0ec7
Author: John Ogness <john.ogness@linutronix.de>
Date:   Thu Apr 21 23:28:46 2022 +0206

    printk: add pr_flush()

    Provide a might-sleep function to allow waiting for console printers
    to catch up to the latest logged message.

    Use pr_flush() whenever it is desirable to get buffered messages
    printed before continuing: suspend_console(), resume_console(),
    console_stop(), console_start(), console_unblank().

    Signed-off-by: John Ogness <john.ogness@linutronix.de>
    Reviewed-by: Petr Mladek <pmladek@suse.com>
    Signed-off-by: Petr Mladek <pmladek@suse.com>
    Link: https://lore.kernel.org/r/20220421212250.565456-12-john.ogness@linutronix.de

 include/linux/printk.h |  7 +++++
 kernel/printk/printk.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 90 insertions(+)
Comment 1 Todd Brandt 2022-07-07 20:11:45 UTC
Created attachment 301363 [details]
Bad freeze on 5.19.0-rc1
Comment 2 Todd Brandt 2022-07-07 20:14:51 UTC
This is the device detail itself. from the two timelines.

Good 5.18.0:

serial @ PNP0501 [00:06] {serial} (Total Suspend: 0.135 ms Total Resume: 0.486 ms)

Bad 5.19.0-rc1:

serial @ PNP0501 [00:06] {serial} (Total Suspend: 1077.880 ms Total Resume: 1078.346 ms)
Comment 3 Todd Brandt 2022-07-07 22:03:10 UTC
also, to reproduce the issue, run sleepgraph over a freeze suspend/resume

%> git clone http://github.com/intel/pm-graph.git
%> cd pm-graph
%> sudo ./sleepgraph.py -m freeze -rtcwake 15

The output of the tool will be in a folder named suspend-<date>-<time>
%> cd suspend-220707-194603
%> grep "PNP0501 \[" *.html
<div id="a1327" title="PNP0501 [00:06] {serial} (sync) (1077.244 ms) suspend" class="thread sync" style="left:2.603172%;top:20.000px;height:55.714px;width:73.695250%;">00:06 {serial}</div>
<div id="a1584" title="PNP0501 [00:06] {serial} (sync) (1076.816 ms) resume" class="thread sync" style="left:12.244835%;top:20.000px;height:39.000px;width:86.799960%;">00:06 {serial}</div>
<div id="a2240" title="PNP0501 [00:06] {serial} (sync) (0.001 ms) resume_complete" class="thread sync" style="left:99.192953%;top:20.000px;height:390.000px;width:0.000081%;">00:06 {serial}</div>

The "suspend" and "resume" callbacks take a full second instead of a millisecond. You can see the PNP501 device in the timeline or just grap for it in the html.
Comment 4 Todd Brandt 2022-07-13 16:57:10 UTC
Created attachment 301407 [details]
otcpl-aml-y-5.19.0-rc6-boot-dmesg.txt
Comment 5 Todd Brandt 2022-07-13 16:57:42 UTC
Created attachment 301408 [details]
otcpl-aml-y-5.19.0-rc6-freeze.html
Comment 6 Todd Brandt 2022-07-13 16:58:18 UTC
Created attachment 301409 [details]
otcpl-asus-e200-cht-5.19.0-rc6-boot-dmesg.txt
Comment 7 Todd Brandt 2022-07-13 16:58:49 UTC
Created attachment 301410 [details]
otcpl-asus-e200-cht-5.19.0-rc6-freeze.html
Comment 8 Todd Brandt 2022-07-13 16:59:18 UTC
Created attachment 301411 [details]
otcpl-whl-u-5.19.0-rc6-boot-dmesg.txt
Comment 9 Todd Brandt 2022-07-13 16:59:47 UTC
Created attachment 301412 [details]
otcpl-whl-u-5.19.0-rc6-freeze.html
Comment 10 Todd Brandt 2022-07-13 17:04:36 UTC
I just added 3 new tests for 5.19.0-rc6 on 3 machines that see this issue: otcpl-asus-e200-cht (cherry trail), otcpl-aml-y (amber lake), and otcpl-whl-u (whiskey lake). The kernel has the CONFIG_PRINTK_CALLER option enabled.

The test is a S2idle and is run thusly:
%> sleepgraph -dev -sync -wifi -netfix -display on -gzip -m freeze -rtcwake 15

I've included the dmesg boot logs for all three. The dmesg suspend/resume logs are included in the html timelines by clicking the "dmesg" button in the upper right hand corner of the timeline. There's a "log" button as well that shows other system into.
Comment 11 Todd Brandt 2022-07-13 18:20:33 UTC
UPDATE: I just ran a 5.19.0-rc6 kernel with the offending commit removed and it fixed the problem completely on all 3 machines. To be clear the offending commit is:

commit 3b604ca81202eea2a917eb6491e90f610fba0ec7

I strongly recommend that this commit be pulled (or fixed) before the 5.19 release or 1 in 10 linux machines running 5.19 will take up to 2 seconds longer in suspend/resume.
Comment 12 John Ogness 2022-07-13 20:45:23 UTC
Do you have direct access to a machine that allows you to reproduce this? Some important questions:

1. What is the last message you see on the serial port after freezing?

2. Does it reproduce if you remove "console=tty0" from the boot args? (i.e. if your only console is a serial console)

3. If removing the vt console avoided the problem, do you have the problem if you only have vt as your console? (i.e. remove console=ttyS0 but keep console=tty0)

4. On lkml you claimed to have enabled CONFIG_PRINTK_CALLER but the caller is not shown in the kernel log. This additional information could be helpful.
Comment 13 John Ogness 2022-07-13 21:18:06 UTC
Created attachment 301413 [details]
fix pr_flush() timeout on suspend

I was able to track down the issue. Please try this patch.
Comment 14 John Ogness 2022-07-13 21:57:57 UTC
Created attachment 301414 [details]
v2 patch to fix reported issue

The first patch simply disabled flushing for stop/start. But flushing should only be skipped if the console is suspended. The v2 patch adds the appropriate check for that.
Comment 15 Todd Brandt 2022-07-14 15:28:40 UTC
Thanks, I'm building it now to test, I'll run it on all 30 machines today.

P.S. I did enable CONFIG_PRINTK_DEBUG but it seems a little off on the otcpl-whl-u. It only pre-fixes certain lines (not sure why):

[  125.915605] PM: early resume of devices complete after 1.388 msecs
[  125.917270] cannonlake-pinctrl INT34BB:00: disable wake for pin 180
[  125.926148] serial 00:06: activated
[  126.152582] atkbd serio0: Failed to deactivate keyboard on isa0060/serio0

The other two machines have it working.
Comment 16 Todd Brandt 2022-07-14 22:15:04 UTC
I ran with your fix on 30 systems. On the 7 machines where this problem appears, the patch fixes it completely. On the 23 machines where this problem does not appear, there are no new issues or regressions.

Fix verified, thank you!
Comment 18 Todd Brandt 2022-07-18 18:15:17 UTC
Just completed testing on upstream 5.19.0-rc7 and the fix is there. All machines are working fine now. Thanks John!