Bug 135691
Summary: | EC parallelism: parallel _Qxx evaluation problems | ||
---|---|---|---|
Product: | ACPI | Reporter: | Lv Zheng (lv.zheng) |
Component: | EC | Assignee: | Lv Zheng (lv.zheng) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | rui.zhang, ubuntu |
Priority: | P1 | ||
Hardware: | Intel | ||
OS: | Linux | ||
Kernel Version: | 4.3+ | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
acpidump output
dmidecode output The boot kernel log [PATCH] Disable tracer after the culprit is detected /proc/cpuinfo Dmesg output with 224351 patch Dmesg with CONFIG_ACPI_DEBUG=y and acpi.trace_state=enabled [PATCH] ACPI / EC: Reduce parallelism for EC _Qxx evaluations dmesg with ec-paral1.patch [PATCH] ACPI / EC: Fix a regression by reducing parallel _Qxx evaluations |
Description
Lv Zheng
2016-07-20 05:20:08 UTC
Created attachment 224321 [details]
acpidump output
Created attachment 224331 [details]
dmidecode output
Created attachment 224341 [details]
The boot kernel log
This looks like a different issue that the EC one. It's more about a bug of ACPICA's thread_count, not EC. Let's first do some tests to capture the affected control method. I'll paste a debug patch later. Also please upload /proc/cpuinfo here. Thanks and best regards -Lv Created attachment 224351 [details]
[PATCH] Disable tracer after the culprit is detected
1. Please upload /proc/cpuinfo here.
2. Please apply this patch and build the kernel, then boot the kernel with:
acpi.trace_state=enable
And upload the dmesg output here.
Hope the attached patch can help to reduce the output to obtain what we want.
You can also tune the log buffer size with "log_buf_len=" kernel parameter.
If you still cannot obtain the full boot log, please just upload the partial one and let me think different debugging approach.
Thanks and best regards
-Lv
Created attachment 224371 [details]
/proc/cpuinfo
Created attachment 224381 [details]
Dmesg output with 224351 patch
It just occurred to me, I did not CONFIG_ACPI_DEBUG in .config. Maybe I should have? If so, let me know and I will recompile and test.
(In reply to H Buus from comment #7) > Created attachment 224381 [details] > Dmesg output with 224351 patch > > It just occurred to me, I did not CONFIG_ACPI_DEBUG in .config. Maybe I > should have? If so, let me know and I will recompile and test. Yes, you should have CONFIG_ACPI_DEBUG enabled. Sorry that I forgot to mentione. The full documentation of this facility can be found at: Documentation/acpi/method-tracing.txt. Thanks and best regards -Lv Created attachment 224841 [details]
Dmesg with CONFIG_ACPI_DEBUG=y and acpi.trace_state=enabled
I found 380 _Q47 Begin, but only 108 _Q47 End. Thus the "reentrancy limit" error is triggered. However I need to know why, let me think it first. Thanks and best regards -Lv Created attachment 224951 [details]
[PATCH] ACPI / EC: Reduce parallelism for EC _Qxx evaluations
Looks like the system_wq is not suitable during the period ACPICA is not improved.
Let me work this around using a special workqueue with parallel executions limitted.
Please try this patch to see if it can stop the regression.
Thanks
-Lv
Created attachment 225521 [details]
dmesg with ec-paral1.patch
That seems to work. The attached dmesg3.out is stitched together from two captures of dmesg output. When I grepped, I got these counts:
$ grep -c 'Begin.*_Q47' dmesg3.out
4380
$ grep -c 'End.*_Q47' dmesg3.out
4323
4380-4323=57 looks much better than 380-108=272
After running for over 1820 seconds, I disabled trace_state, especially since I know I have overflowed the dmesg log. That is with log_buf_len=4M.
I will keep it running, since the longest it has taken to trigger the regression was after 2140 seconds of uptime. I'll report back in a few hours.
Looking good. The kernel has been running for almost 7 hours, and still no "Method reached maximum reentrancy limit" errors. However this is still dependent on the number of online cpus. Due to the ACPICA limitation, we could further make the max_active as 2 or configurable when this is upstreamed. In the future, when ACPICA is improved, we can use a greater number. Anyway, thanks for the report. I'll refine the patch and send it to the upstream. In the meanwhile, I'll mark this bug as resolved. Best regards -Lv (In reply to H Buus from comment #12) > Created attachment 225521 [details] > dmesg with ec-paral1.patch > > That seems to work. The attached dmesg3.out is stitched together from two > captures of dmesg output. When I grepped, I got these counts: > $ grep -c 'Begin.*_Q47' dmesg3.out > 4380 > $ grep -c 'End.*_Q47' dmesg3.out > 4323 > > 4380-4323=57 looks much better than 380-108=272 We should use WQ_UNBOUND here. Please help me to do one more test. I'll post the patch later. Created attachment 225791 [details]
[PATCH] ACPI / EC: Fix a regression by reducing parallel _Qxx evaluations
Please try this patch and upload the dmesg output.
I want to know if Begin-End can be limited to a reasonable value.
If you still can see breakage.
Please modify /sys/devices/virtual/workqueue/kec_query/max_alive to help to find a best fit value.
Thanks in advance.
Best regards
-Lv
I won't be able to do this until Wednesday. I assume I should revert the ec-paral1.patch, right? (In reply to H Buus from comment #17) > I won't be able to do this until Wednesday. That's OK. > I assume I should revert the ec-paral1.patch, right? Yes. :) Thanks -Lv (In reply to H Buus from comment #17) > I won't be able to do this until Wednesday. I assume I should revert the > ec-paral1.patch, right? No need to do further tests. We just decided to use the version 1 for the Linux upstream. Thanks -Lv OK. Let me know if you change your mind. Patch upstreamed: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=e1191bd4f Closing... Thanks |