Bug 216216
Summary: | [BISECT INCLUDED] PNP0501 serial driver takes almost 2 seconds to suspend/resume | ||
---|---|---|---|
Product: | Drivers | Reporter: | Todd Brandt (todd.e.brandt) |
Component: | Serial | Assignee: | Russell King (rmk) |
Status: | RESOLVED CODE_FIX | ||
Severity: | high | CC: | john.ogness, lenb, pmladek, rjw, tglx |
Priority: | P1 | ||
Hardware: | Intel | ||
OS: | Linux | ||
Kernel Version: | 5.19.0-rc1 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 178231 | ||
Attachments: |
Good freeze on 5.18.0
Bad freeze on 5.19.0-rc1 otcpl-aml-y-5.19.0-rc6-boot-dmesg.txt otcpl-aml-y-5.19.0-rc6-freeze.html otcpl-asus-e200-cht-5.19.0-rc6-boot-dmesg.txt otcpl-asus-e200-cht-5.19.0-rc6-freeze.html otcpl-whl-u-5.19.0-rc6-boot-dmesg.txt otcpl-whl-u-5.19.0-rc6-freeze.html fix pr_flush() timeout on suspend v2 patch to fix reported issue |
Description
Todd Brandt
2022-07-07 20:11:08 UTC
Created attachment 301363 [details]
Bad freeze on 5.19.0-rc1
This is the device detail itself. from the two timelines. Good 5.18.0: serial @ PNP0501 [00:06] {serial} (Total Suspend: 0.135 ms Total Resume: 0.486 ms) Bad 5.19.0-rc1: serial @ PNP0501 [00:06] {serial} (Total Suspend: 1077.880 ms Total Resume: 1078.346 ms) also, to reproduce the issue, run sleepgraph over a freeze suspend/resume %> git clone http://github.com/intel/pm-graph.git %> cd pm-graph %> sudo ./sleepgraph.py -m freeze -rtcwake 15 The output of the tool will be in a folder named suspend-<date>-<time> %> cd suspend-220707-194603 %> grep "PNP0501 \[" *.html <div id="a1327" title="PNP0501 [00:06] {serial} (sync) (1077.244 ms) suspend" class="thread sync" style="left:2.603172%;top:20.000px;height:55.714px;width:73.695250%;">00:06 {serial}</div> <div id="a1584" title="PNP0501 [00:06] {serial} (sync) (1076.816 ms) resume" class="thread sync" style="left:12.244835%;top:20.000px;height:39.000px;width:86.799960%;">00:06 {serial}</div> <div id="a2240" title="PNP0501 [00:06] {serial} (sync) (0.001 ms) resume_complete" class="thread sync" style="left:99.192953%;top:20.000px;height:390.000px;width:0.000081%;">00:06 {serial}</div> The "suspend" and "resume" callbacks take a full second instead of a millisecond. You can see the PNP501 device in the timeline or just grap for it in the html. Created attachment 301407 [details]
otcpl-aml-y-5.19.0-rc6-boot-dmesg.txt
Created attachment 301408 [details]
otcpl-aml-y-5.19.0-rc6-freeze.html
Created attachment 301409 [details]
otcpl-asus-e200-cht-5.19.0-rc6-boot-dmesg.txt
Created attachment 301410 [details]
otcpl-asus-e200-cht-5.19.0-rc6-freeze.html
Created attachment 301411 [details]
otcpl-whl-u-5.19.0-rc6-boot-dmesg.txt
Created attachment 301412 [details]
otcpl-whl-u-5.19.0-rc6-freeze.html
I just added 3 new tests for 5.19.0-rc6 on 3 machines that see this issue: otcpl-asus-e200-cht (cherry trail), otcpl-aml-y (amber lake), and otcpl-whl-u (whiskey lake). The kernel has the CONFIG_PRINTK_CALLER option enabled. The test is a S2idle and is run thusly: %> sleepgraph -dev -sync -wifi -netfix -display on -gzip -m freeze -rtcwake 15 I've included the dmesg boot logs for all three. The dmesg suspend/resume logs are included in the html timelines by clicking the "dmesg" button in the upper right hand corner of the timeline. There's a "log" button as well that shows other system into. UPDATE: I just ran a 5.19.0-rc6 kernel with the offending commit removed and it fixed the problem completely on all 3 machines. To be clear the offending commit is: commit 3b604ca81202eea2a917eb6491e90f610fba0ec7 I strongly recommend that this commit be pulled (or fixed) before the 5.19 release or 1 in 10 linux machines running 5.19 will take up to 2 seconds longer in suspend/resume. Do you have direct access to a machine that allows you to reproduce this? Some important questions: 1. What is the last message you see on the serial port after freezing? 2. Does it reproduce if you remove "console=tty0" from the boot args? (i.e. if your only console is a serial console) 3. If removing the vt console avoided the problem, do you have the problem if you only have vt as your console? (i.e. remove console=ttyS0 but keep console=tty0) 4. On lkml you claimed to have enabled CONFIG_PRINTK_CALLER but the caller is not shown in the kernel log. This additional information could be helpful. Created attachment 301413 [details]
fix pr_flush() timeout on suspend
I was able to track down the issue. Please try this patch.
Created attachment 301414 [details]
v2 patch to fix reported issue
The first patch simply disabled flushing for stop/start. But flushing should only be skipped if the console is suspended. The v2 patch adds the appropriate check for that.
Thanks, I'm building it now to test, I'll run it on all 30 machines today. P.S. I did enable CONFIG_PRINTK_DEBUG but it seems a little off on the otcpl-whl-u. It only pre-fixes certain lines (not sure why): [ 125.915605] PM: early resume of devices complete after 1.388 msecs [ 125.917270] cannonlake-pinctrl INT34BB:00: disable wake for pin 180 [ 125.926148] serial 00:06: activated [ 126.152582] atkbd serio0: Failed to deactivate keyboard on isa0060/serio0 The other two machines have it working. I ran with your fix on 30 systems. On the 7 machines where this problem appears, the patch fixes it completely. On the 23 machines where this problem does not appear, there are no new issues or regressions. Fix verified, thank you! A fix has been merged into Linus' tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/kernel/printk?id=9023ca0866250d268b047f21e1392e7a81277a54 Just completed testing on upstream 5.19.0-rc7 and the fix is there. All machines are working fine now. Thanks John! |