Copying the relevant info from the closed Bug 185611. The XPS 13 resume time from suspend can be heavily optimized by activating busy polling during the noirq suspend/resume phases. Doing so reduces the time spent in acpi_pm_finish to ~40 ms, but only when booted with nopti. The following benchmarks show that PTI considerably slows down the wake-up from suspend on this machine:
Resume from suspend without PTI (acpi_pm_finish: 38.897 ms): https://bugzilla.kernel.org/attachment.cgi?id=274139
Resume from suspend with PTI (acpi_pm_finish: 391.431 ms): https://bugzilla.kernel.org/attachment.cgi?id=274141
Could KPTI be safely disabled just before entering suspend, at a very late stage, and enabled again at the very early stage of reactivating user space?
Hmm, why PTI would impact acpi_pm_finish is interesting! But I don't think PTI can be dynamically turned off/on during suspend/resume. Could you run pm-graph with the -f option to see the backtrace of this stage?
Thanks very much for having a look at the bug. I tested 4.16-rc5 today and could not reproduce the issue anymore :-). @pmenzel: By the way, without any special boot parameters, resume from suspend now takes 31 ms.
OK, it turns out I was too fast :-(. I don't know why, but suddenly the problem reappeared. I'll provide the debug output as soon as possible.
(In reply to Esokrarkose from comment #3) > Ok it turns out I was too fast :-(. I don't know why but suddenly the > problem reappeared. I'll provide the debug output as soon as possible. On my system, acpi_pm_finish takes very long if I suspend the system the *second* time. Can you verify that?
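A quick way to check this is to run two back-to-back suspend cycles with pm-graph (e.g. "sleepgraph -m mem -rtcwake 15 -multi 2 0") and compare the acpi_pm_finish durations it reports for the first and second resume. The comparison itself can be sketched as below; the two sample values are the ones from the bug description, used only as illustrative input, and the 5x factor is an arbitrary threshold:

```shell
# Compare two acpi_pm_finish durations (ms): first vs. second resume.
# Sample input values taken from the report above, not a fresh run.
printf '38.897\n391.431\n' |
  awk 'NR == 1 { first = $1 }
       NR == 2 { second = $1 }
       END { print (second > 5 * first) ? "second much slower" : "similar" }'
```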
Created attachment 274817 [details] attachment-23468-0.html Hello, I will be on vacation through 3/26/18. Expect delayed response.
Can you attach your dmidecode? I also have a Dell XPS 13 9360 on hand but cannot reproduce the issue. I'm using a 4.16-rc4 kernel, and acpi_pm_finish always takes around 21 ms. In your case, do the long latency and the short latency happen in the same boot, or is the behavior consistent (always long or always short) within a single boot?
*** Bug 198583 has been marked as a duplicate of this bug. ***
@Esokrarkose @Paul Do you have a chance to test with pm-graph using the -f option? It might take some time to generate/parse the HTML result, but it is very helpful for narrowing down which stage is the most time-consuming.
Wait... or maybe you can just test by only tracing acpi_pm_finish():
1. append no_console_suspend in grub
2. after boot up: echo acpi_pm_finish > /sys/kernel/debug/tracing/set_graph_function
3. do the test
4. provide dmesg and /sys/kernel/debug/tracing/trace
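The tracing steps above can be sketched as a small script. This is a minimal sketch, assuming debugfs is mounted at /sys/kernel/debug and run as root; switching current_tracer to function_graph is an assumption (the comment only mentions set_graph_function), and the trace line with its duration is a hypothetical example, not a real measurement:

```shell
# Sketch of the ftrace steps from the comment above (run as root,
# after booting with no_console_suspend on the kernel command line):
#   echo acpi_pm_finish > /sys/kernel/debug/tracing/set_graph_function
#   echo function_graph > /sys/kernel/debug/tracing/current_tracer
#   ...suspend and resume...
#   dmesg > dmesg.txt
#   cat /sys/kernel/debug/tracing/trace > trace.txt
# A function_graph leaf line looks roughly like the hypothetical
# sample below; the duration can be pulled out with awk:
printf ' 1)   @ 391431.0 us |  acpi_pm_finish();\n' |
  awk '/acpi_pm_finish/ { print $3 " us" }'
```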
@Esokrarkose@gmail.com Could you provide your kernel config? We have an XPS 9360 on hand and could not reproduce; maybe we can have a try based on your config.
ping Esokrarkose@gmail.com
Hi Chen, first of all, kudos for your commitment! Sorry for the very late response; I have been very busy with university recently. The last kernel I tested was 4.16.0-rc6, which I did not build myself (it is from Debian experimental), and strangely I did not observe the issue there (I invoked pm-graph a few times at random and never got the inconsistent results anymore). The prior kernel (rc5) was built by me, and there I observed the problem, with long latency and short latency in the SAME boot. So, to be honest, I need to test again; I just have not had the time to look into it yet. Just before writing this, I invoked pm-graph again and got the attached error output.
Created attachment 275519 [details] Errors appearing during testing
Created attachment 275521 [details] Errors appearing during testing
Created attachment 275523 [details] Errors appearing during testing I can reproduce the errors appearing during resume right now; I have not rebooted yet.
(In reply to Esokrarkose from comment #12) […] > Just before writing this I invoked pm-graph again and got the attached error > output. Thank you for testing, but to keep this bug focused, these errors should be dealt with in separate bug reports. Looking at the output, the `acpi_pm_finish()` took 40 ms. The traces in the Linux kernel log should be good enough to point developers in the right direction.
Created attachment 275637 [details] Output of today I tested again today and observed the issue.
Created attachment 275639 [details] Output of today (1)
Created attachment 275641 [details] Output of today (ftrace)
I could not try the -f option for pm-graph; every time I try I get:

CAPTURING DMESG
PROCESSING DATA
ERROR: Not enough test data to build a timeline
DONE
Traceback (most recent call last):
  File "./analyze_suspend.py", line 5929, in <module>
    runTest()
  File "./analyze_suspend.py", line 5265, in runTest
    testruns, stamp = processData(True)
  File "./analyze_suspend.py", line 5227, in processData
    data = testruns[0]
IndexError: list index out of range

I will attach the files it generated while throwing these errors.
Created attachment 275643 [details] Kernel config
Created attachment 275645 [details] Dmidecode output
Please find the big ftrace file where python threw the traceback (comment 20) here: https://drive.google.com/file/d/1DL7Sq4fxqCnnZ38SnXTjh6DhOHQe0C0c/view?usp=sharing
Hi Esokrarkose, you can test without the analyze_suspend tool by tracking the ftrace result directly, as described in comment #9. Please make sure to append 'no_console_suspend ignore_loglevel' to the boot command line, and after the test provide both the dmesg and the ftrace output. The time is likely spent in the device power-on phase, say, in the _ON method.
I'm confused, so the 10x delay is not 100% reproducible with KPTI enabled? How many suspend cycles have you tried to reproduce the problem? I'd suggest we do a similar number of suspend cycles with KPTI disabled, to see whether the problem is really related to KPTI or not. BTW, usually I use "analyze_suspend -m mem -multi 50 2 -rtcwake 10 -dev" to run suspend-to-mem for 50 cycles, with rtcwake set to 10 seconds, or even 5 seconds.
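To judge whether the long resumes correlate with KPTI, the per-cycle acpi_pm_finish durations from such a multi-cycle run can be summarized and compared between the two configurations. A minimal sketch; the five input values are made-up sample data, not measurements from this bug, and the 100 ms outlier threshold is an arbitrary choice:

```shell
# Summarize a list of per-cycle acpi_pm_finish durations (ms) and
# count outliers above 100 ms. Input values are hypothetical samples.
printf '41.2\n39.8\n402.7\n40.5\n398.1\n' |
  awk '{ sum += $1; if ($1 > 100) slow++ }
       END { printf "cycles=%d avg=%.1fms slow=%d\n", NR, sum / NR, slow }'
```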
@Zhang, @Chen, to be honest, the tests back then were not statistically significant. I now have the feeling that it was a coincidence that I got the shorter resume cycle on the boot with PTI disabled; I did not have enough time to test back then. However, things have changed. In the meantime, I have upgraded to the latest BIOS from Dell (2.6.2), and now acpi_pm_finish() always takes around 300 ms, reproducibly and without exception! :-( PTI does not change that at all. @Paul: Which BIOS version are you on? Can you confirm what I am getting?
Created attachment 275861 [details] dmidecode (Bios v2.6.2)
Created attachment 275863 [details] analyze_suspend with kernel 4.16.7, BIOS 2.6.2 This is 100% reproducible for me. I do not get the 40 ms anymore; I get 400 ms every time I try. The kernel is from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.16.7/
Created attachment 275865 [details] dmesg of when I did the analyze_suspend
Created attachment 275867 [details] Trying to track acpi_pm_finish() (In reply to Chen Yu from comment #9)
> Wait...or maybe you can just test only tracking acpi_pm_finish() by:
> 1. append no_console_suspend in grub
> 2. after boot up,
> echo acpi_pm_finish > /sys/kernel/debug/tracing/set_graph_function
> 3. do the test
> 4. provide dmesg and /sys/kernel/debug/tracing/trace
I tried that; you can find all the files in the attached zip. However, I am really skeptical that this worked; the trace does not look like much of anything.
As this is still tagged NEEDINFO, which information would you need? Anything I could try for you or do you have a system to debug?
(In reply to Esokrarkose from comment #28)
> This is 100% reproducible for me. I do not get the 40ms anymore. I get 400ms
> everytime I try.
> Kernel is from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.16.7/
Today's tests show that sometimes I do get 40 ms again. When I get 40 ms, I always seem to get 40 ms, and when I get 400 ms, I always seem to get 400 ms, until the machine is restarted. I still have not carried out enough tests, though. Could one of you address the following?
> I could not try the -f option for pm-graph, everytime I try I get:
>
> CAPTURING DMESG
> PROCESSING DATA
> ERROR: Not enough test data to build a timeline
> DONE
> Traceback (most recent call last):
>   File "./analyze_suspend.py", line 5929, in <module>
>     runTest()
>   File "./analyze_suspend.py", line 5265, in runTest
>     testruns, stamp = processData(True)
>   File "./analyze_suspend.py", line 5227, in processData
>     data = testruns[0]
> IndexError: list index out of range
What's wrong with using the -f option?
Hmmm, then can you please attach a tarball of the sleepgraph tool output in -dev mode, for both the good and the bad case? Say, "sleepgraph.py -m mem -dev".
ping
Created attachment 278021 [details] Bad case, Linux 4.17.17, BIOS 2.8.1 At the moment, I cannot reproduce the good case.
Created attachment 278085 [details] Bad case, Linux 4.17.17, BIOS 2.8.1 I could not manage to get the 40 ms case from half a year ago; I often get 400 ms.
What's the status with the latest upstream kernel? It seems that the 400 ms is spent in the _WAK method. Can you please attach the acpidump output?
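The table dump and a look at _WAK can be sketched as follows, assuming the acpica-tools package (acpidump, iasl) is installed; the ASL snippet piped through grep below is a hypothetical stand-in for the machine's real disassembled DSDT:

```shell
# Dump the ACPI tables and disassemble the DSDT to inspect _WAK
# (needs root and acpica-tools):
#   acpidump -b          # writes each table as a binary, e.g. dsdt.dat
#   iasl -d dsdt.dat     # disassembles to dsdt.dsl
# The _WAK method can then be located in the ASL source; the snippet
# below stands in for grepping the real dsdt.dsl:
printf 'Method (_WAK, 1, Serialized)\n{\n    Notify (LID0, 0x80)\n}\n' |
  grep -c 'Method (_WAK'
```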
Created attachment 281019 [details] Output of `acpidump` (Firmware 2.10.0) (In reply to Zhang Rui from comment #37) […] > it seems that the 400ms is took in _WAK method. can you please attach the > acpidump output? Please find it attached from BIOS 2.10.0 09/27/2018.
Created attachment 281021 [details] attachment-21503-0.html Hello, I'm home sick Feb 6, expect delayed response.
Is this still a problem in the latest upstream? We have made quite a few EC-related changes recently. Plus, we have two XPS 13 9360 machines running s3/s2idle stress tests almost every week, but we have not observed this issue.
(In reply to Zhang Rui from comment #40) > Is this still a problem in the latest upstream? I’ll have to check. > We have made quite some EC related changes recently. Are these changes in the 5.7.x release? > Plus, we have 2 XPS 13 9360 machines running s3/s2idle stress test almost > every week, but didn't observe this issue. Are the results public? How long does `acpi_pm_finish()` take in the first and second resume cycle?
Created attachment 289881 [details] Screenshot of HTML output of `./analyze_suspend.py` I just tested with Ubuntu 20.04 LTS with Linux 5.4.0-37-generic, and ran `./sleepgraph.py -config config/suspend-callgraph.cfg` with `maxdepth: 15`. In both the first and the second suspend/resume cycle, `acpi_pm_finish()` took around 350 ms. > acpi_pm_finish (351.081 ms) resume_machine It could be, though, that this is caused by the active tracing. The method below is the hotspot: > acpi_ps_parse_aml (326.995 ms @ 121.989604) It calls hundreds(?) of methods, each taking less than 1 ms, but a few take up to 15 ms.
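The per-method times under acpi_ps_parse_aml can be summed to confirm where the 327 ms go. A sketch over hypothetical callgraph lines mimicking the "name (duration ms @ timestamp)" format above; these are not the real values from the attachment:

```shell
# Sum hypothetical per-call durations (ms) in callgraph-style lines.
# Fields split on runs of '(' or ' ', so $2 is the duration.
printf 'foo (0.8 ms @ 1.0)\nbar (15.2 ms @ 1.1)\nbaz (0.5 ms @ 1.2)\n' |
  awk -F'[( ]+' '{ sum += $2 } END { printf "total=%.1f ms\n", sum }'
```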
I rechecked our s2ram test results on three different 9360 laptops, and they confirm that this problem does exist, but only occasionally. The average resume time is around 40 ms, but it can increase to 400 ms in some cases. Todd, I will create an issue.def first so that we can get an overview of this problem across our test results.
Created attachment 289903 [details] issue.def for long _WAK evaluation in acpi_pm_finish()
I can confirm your results. Even with no callgraph there are outliers, in my case 350 ms. I ran `./sleepgraph.py -config config/suspend.cfg -multi 50 15`.
Zhang Rui, any update on this issue? Were you able to get an overview?