Bug 196847
Description
Luya Tshimbalanga
2017-09-06 17:32:30 UTC
Created attachment 258235 [details] dmesg boot report Using https://bugzilla.kernel.org/show_bug.cgi?id=115021#c58 Boot with the parameter acpi.trace_state=enable dyndbg="file ec.c +p" Created attachment 258239 [details]
test hotkey functions
Test with kernel paramenter acpi.trace_state=enable dyndbg="file ec.c +p"
After rebooting without the parameter, the function keys are somehow fully restored. I don't know what exactly happened. I am closing this report as function keys works fine. [ 704.649095] PK-Backend[4110]: segfault at 1f4 ip 00007feb6ca1f7fa sp 00007feb5a011870 error 4 in libsolv.so.0[7feb6c9e3000+87000] [ 758.209062] [drm] probing gen 2 caps for device 1022:1425 = 733d01/e [ 758.209067] [drm] enabling PCIE gen 3 link speeds, disable with radeon.pcie_gen2=0 [ 759.576375] [drm] PCIE GART of 2048M enabled (table at 0x0000000000040000). [ 759.576648] radeon 0000:01:00.0: WB enabled [ 759.576652] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff92f238185c00 [ 759.576655] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff92f238185c04 [ 759.576657] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff92f238185c08 [ 759.576659] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff92f238185c0c [ 759.576661] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff92f238185c10 [ 759.787071] [drm] ring test on 0 succeeded in 1 usecs [ 759.787077] [drm] ring test on 1 succeeded in 1 usecs [ 759.787082] [drm] ring test on 2 succeeded in 1 usecs [ 759.787091] [drm] ring test on 3 succeeded in 4 usecs [ 759.787098] [drm] ring test on 4 succeeded in 4 usecs [ 759.787150] [drm] ib test on ring 0 succeeded in 0 usecs [ 759.787180] [drm] ib test on ring 1 succeeded in 0 usecs [ 759.787208] [drm] ib test on ring 2 succeeded in 0 usecs [ 759.787224] [drm] ib test on ring 3 succeeded in 0 usecs [ 759.787238] [drm] ib test on ring 4 succeeded in 0 usecs [ 814.563767] cheese[4343]: segfault at 1b ip 00007f1d1fd59f10 sp 00007ffebb484780 error 4 in libwayland-client.so.0.3.0[7f1d1fd53000+d000] Yes, it looks the GPE enabled in the new log is exactly the ECDT GPE#. [ 0.229608] ACPI: \: Used as first EC [ 0.229609] ACPI: \: GPE=0x15, EC_CMD/EC_SC=0x66, EC_DATA=0x62 ^^^^^^^^ [ 0.229610] ACPI: \: Used as boot ECDT EC to handle transactions [ 0.272663] ACPI : EC: EC_SC(R) = 0x08 SCI_EVT=0 BURST=0 CMD=1 IBF=0 OBF=0 [ 0.272665] ACPI: \_SB_.PCI0.SBRG.EC0_: GPE=0x15, EC_CMD/EC_SC=0x66, EC_DATA=0x62 ^^^^^^^^ [ 0.272667] ACPI: \_SB_.PCI0.SBRG.EC0_: Used as boot ECDT EC to handle transactions and events What we've done recently is just changing the timing of acpi_ec_ecdt_start(); For DSDT EC, we'll derive ECDT GPE, so your platform might not be affected. Maybe I've forgotton the root cause, feel free to re-open if it reproduces. Thanks Lv Created attachment 258241 [details]
test result with function keys stopped working
I was able to reproduce the bug. It occurs when suspending and resuming the session.
Reopening this bug as it occurs after resuming the session by either manually suspending or closing the laptop lid for a while. This sounds a different known bug (should be low reproduce ratio) that can be covered by these patches: https://patchwork.kernel.org/patch/9931477/ https://patchwork.kernel.org/patch/9931481/ https://patchwork.kernel.org/patch/9931473/ Can you give them a try? Thanks Lv (In reply to Lv Zheng from comment #8) > This sounds a different known bug (should be low reproduce ratio) that can > be covered by these patches: > > https://patchwork.kernel.org/patch/9931477/ After 8 hours building, the first patch above solved the issue related to resuming session causing disabled function keys. Test done with patched kernel-4.13.0 > https://patchwork.kernel.org/patch/9931481/ > https://patchwork.kernel.org/patch/9931473/ I was unable to use the above patches as the built failed. The result will be published shortly. Created attachment 258283 [details]
dmesg boot report update
Updated boot sequence with patched kernel-4.13.0
Created attachment 258285 [details]
test result with restored functional keys
Resulted test with patched kernel-4.13.0
After a while, it looks like the issue got resolved with https://patchwork.kernel.org/patch/9931477/ I don't know it I correctly set the status as there wasn't any way to confirm the test. I just found out the first patch worked with the latest stable 4.12.11. I will post the result after the build completed. I notice Hans follows this bug and wonder if he can use the patch on Fedora kernel until it officially lands. (In reply to Luya Tshimbalanga from comment #9) > (In reply to Lv Zheng from comment #8) > > This sounds a different known bug (should be low reproduce ratio) that can > > be covered by these patches: > > > > https://patchwork.kernel.org/patch/9931477/ OK, thanks for the test. I'll update the patch with your tested-by. > > After 8 hours building, the first patch above solved the issue related to > resuming session causing disabled function keys. Test done with patched > kernel-4.13.0 > > > https://patchwork.kernel.org/patch/9931481/ > > https://patchwork.kernel.org/patch/9931473/ > > I was unable to use the above patches as the built failed. > > The result will be published shortly. Which kernel version are you using? I can help to port the patches for you. Thanks Lv (In reply to Lv Zheng from comment #14) > > > > After 8 hours building, the first patch above solved the issue related to > > resuming session causing disabled function keys. Test done with patched > > kernel-4.13.0 > > > > > https://patchwork.kernel.org/patch/9931481/ > > > https://patchwork.kernel.org/patch/9931473/ > > > > I was unable to use the above patches as the built failed. > > > > The result will be published shortly. > > Which kernel version are you using? > I can help to port the patches for you. Currently 4.13.0 and in the process of building 4.12.11 to sync with Fedora version. Both patches failed on both kernel version according to the build report. IMO, the fix fixes possible regressions triggered by: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4c237371 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9c40f956 In your case, it fixes regression caused by commit 9c40f956. > Currently 4.13.0 and in the process of building 4.12.11 to sync with Fedora
> version. Both patches failed on both kernel version according to the build
> report.
OK, I'll check with the 4.13.0 or 4.12.11.
(In reply to Lv Zheng from comment #16) > IMO, the fix fixes possible regressions triggered by: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/ > ?id=4c237371 > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/ > ?id=9c40f956 > > In your case, it fixes regression caused by commit 9c40f956. Thank you for the explanation. I will look for the other revised patches for testing. In order to be clear, could upload the logs here: And upload dmesg output of the bad/good kernels (post resume logs). 1. build bad/good kernels with CONFIG_DYNAMIC_DEBUG enabled; 2. boot the kernel with "dyndbg="file drivers/acpi/ec.c +p"; 3. dmesg -c > boot-good/bad.log; 4. suspend/resume, then demsg -c > suspend-bood/bad.log; 5. echo "begin pressing keys" > press-good/bad.log, pressing keys, then dmesg -c >> press-good/bad.log, echo "end pressing keys" >> press-good/bad.log. 6. Upload the compressed boot/suspend/press-good/bad.log files here. Maybe I'm wrong with the culprit, so if possible, could you bisect the culprit out and report it here? Thanks and best regards Lv "dyndbg="file drivers/acpi/ec.c +p" Should be: "dyndbg='file drivers/acpi/ec.c +p'". Sorry for the mistake. Created attachment 258293 [details]
[PATCH v5 1/4] ACPI / EC: Fix regression related to the triggering source of the post-resume EC event handling
[PATCH v5 1/4] ACPI / EC: Fix regression related to the triggering source of the post-resume EC event handling
Can you try if only this change can fix your issue?
Probably this isn't a regression but a requirement to improve a previous fix.
Created attachment 258333 [details]
boot with patch v5-1-4 kernel 4.12.12
Created attachment 258335 [details]
Pressed Fn keys with patched kernel 4.12.12
Unfortunately the patches broke the Fn keys functions. The previous patch from 258285 seems the correct one
It looks that your test is: 1. Press key before the suspend 2. Release key during the resume It seems the test is not valid. Let me ask questions. A. I can see "EC started" in the log. It should happen in acpi_ec_resume(), done by acpi_ec_enable_event(). The condition to trigger it is: EC must be started by acpi_ec_start(). I can see that EC is started in the log: [ 385.723151] ACPI : EC: Starting EC [ 385.723152] ACPI : EC: EC started But I couldn't see "Event unblocked" in the log, it should happen when acpi_ec_resume() is invoked. So could you wait for a while after being resumed and see if the function key press can be recovered. Wait for a while and store the log and upload the log here. $ dmesg > resume-bad.log. B. Also I couldn't see "Polling enabled" in the log. If you still cannot see "Event unblocked" in your new test. This probably is because you don't have the following commit in your code base, causing EC event flooding: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1ab69f27 If this commit is there, after "EC started" occurred, when SCI_EVT is detected in advance_transaction(), "Polling enabled" will appear in the log. So could you please use latest upstream kernel and try again? I'd say, it's sad you didn't use the latest upstream kernel? You are creating a bug and fake proofs using intermediate code base. There are many EC bugs I've already fixed several years ago. But upstreaming all of the fixes require me to find proofs for each of them. Thus not all of them are in the upstream kernel. (Trust me, merging this series, then all of them are upstreamed and all of the issues are gone). Please don't use intermediate code base to reproduce and report old issues. Thanks Lv Removing the regression mark as this report doesn't seem to be a regression. Regression can only happen to Samsung models. Some of the old bug fixes are mutual exclusive as I failed to upstream my final solution at the first time and tried a lot of approaches to fix one platform with the effort not breaking the others or not breaking the others a lot. So you'd better to try latest upstream kernel: $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Or linux-next branch of: $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git $ git checkout linux-next Where more EC fixes are upstreamed. Thanks in advance. It seems a fix to your issue should be: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1ab69f27 And no further patches are required. So the latest upstream kernel should be OK for you. > So the latest upstream kernel should be OK for you. Probably still requires attachment 258293 [details] if using latest kernel doesn't help. :) Created attachment 258353 [details]
[PATCH] ACPI: EC: Fix an EC event IRQ storming issue
Back ported commit.
On top of 4.12.9 kernel, please: 1. apply the following patches: attachment 258353 [details] attachment 258293 [details] 2. enable CONFIG_DYNAMIC_DEBUG, and rebuild the kernel; 3. boot the kernel with "dyndbg='file drivers/acpi/ec.c +p'", and dmesg -c > boot-4.12-bad.log after login; 4. echo "begin pressing keys" > press-4.12-bad.log, press the keys, and suspend the system; 5. resume the system, release the keys, dmesg -c >> press-4.12-bad.log and echo "end pressing keys" >> press-4.12-bad.log; 6. wait a while, press the key, release the key, and dmesg -c > press-again-4.12-bad.log You can upload the following test result here: 1. boot-4.12-bad.log 2. press-4.12-bad.log 3. press-again-4.12-bad.log Thanks Lv (In reply to Lv Zheng from comment #24) > > A. I can see "EC started" in the log. > It should happen in acpi_ec_resume(), done by acpi_ec_enable_event(). > The condition to trigger it is: EC must be started by acpi_ec_start(). > I can see that EC is started in the log: > [ 385.723151] ACPI : EC: Starting EC > [ 385.723152] ACPI : EC: EC started > But I couldn't see "Event unblocked" in the log, it should happen when > acpi_ec_resume() is invoked. > So could you wait for a while after being resumed and see if the function > key press can be recovered. > Wait for a while and store the log and upload the log here. > $ dmesg > resume-bad.log. Understood. I will try again with the rebuilt 4.12.12. > > B. Also I couldn't see "Polling enabled" in the log. > If you still cannot see "Event unblocked" in your new test. > This probably is because you don't have the following commit in your code > base, causing EC event flooding: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/ > ?id=1ab69f27 > If this commit is there, after "EC started" occurred, when SCI_EVT is > detected in advance_transaction(), "Polling enabled" will appear in the log. > So could you please use latest upstream kernel and try again? > > I'd say, it's sad you didn't use the latest upstream kernel? > You are creating a bug and fake proofs using intermediate code base. The custom kernel I use for the test is derived from Fedora (the distribution I run) which, in my knowledge, follow upstream as possible. You can view the details on http://copr-dist-git.fedorainfracloud.org/cgit/luya/kernel-acpi-ec/kernel.git/commit/?id=815214181377eb0e08efe9d4f8802ad4ac12070f I would have no idea I used intermediate code base. > There are many EC bugs I've already fixed several years ago. > But upstreaming all of the fixes require me to find proofs for each of them. > Thus not all of them are in the upstream kernel. > (Trust me, merging this series, then all of them are upstreamed and all of > the issues are gone). > Please don't use intermediate code base to reproduce and report old issues. I understand how difficult it is to upstream all fixes. I followed your instructions as possible as I am unfamiliar with the kernel. For those I thank you for the effort. (In reply to Lv Zheng from comment #30) > On top of 4.12.9 kernel, please: > > 1. apply the following patches: > attachment 258353 [details] > attachment 258293 [details] > 2. enable CONFIG_DYNAMIC_DEBUG, and rebuild the kernel; > 3. boot the kernel with "dyndbg='file drivers/acpi/ec.c +p'", and dmesg -c > > boot-4.12-bad.log after login; > 4. echo "begin pressing keys" > press-4.12-bad.log, press the keys, and > suspend the system; > 5. resume the system, release the keys, dmesg -c >> press-4.12-bad.log and > echo "end pressing keys" >> press-4.12-bad.log; > 6. wait a while, press the key, release the key, and dmesg -c > > press-again-4.12-bad.log > > You can upload the following test result here: > 1. boot-4.12-bad.log > 2. press-4.12-bad.log > 3. press-again-4.12-bad.log > > Thanks > Lv Should I built with 4.12.12 instead? > Should I built with 4.12.12 instead? I'm working on upstream community kernels (rafael's linux-pm.git and torvalds's linux.git), not working on stable kernels, especially I don't know what 4.12.12 means to your distribution vendor, so it's hard for me to answer. But if you can apply attachment 258353 [details] and attachment 258293 [details] on top of your current working kernel and successfully build ec.o (make drivers/acpi/ec.o), then you can stay with your current working kernel to perform the test. If you failed to do that, you can try 4.12.12 instead. If it still failed, you can: $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git $ git reset --hard v4.12 And apply the patches on top of this kernel. I've confirmed that the patches can be applied on top of this kernel. And according to your phenomenen, applying the 2 patches on top of this kernel could be sufficiently fix the issue IMO. Thanks and best regards Lv Created attachment 258365 [details]
boot-4.12.12 bad with 2 patches
Created attachment 258367 [details]
press-4.12.12 bad with 2 patches
I realized I forgot putting ">>" for dmesg -c >> press-4.12.9 on the earlier result
Created attachment 258369 [details]
press-again-4.12.12-bad with 2 patches
This file is very much empty with no input from ec after resuming. Perhaps using one patch may help.
I am building the latest upstream kernel with no patches https://copr.fedorainfracloud.org/coprs/luya/kernel-acpi-ec/build/602036/ I will post results following steps from comment #30 Created attachment 258391 [details]
boot with latest upstream kernel no patches applied
Created attachment 258393 [details]
press input with latest upstream kernel no patches applied
Fn input before suspend/resume
Created attachment 258395 [details]
press again input with latest upstream kernel no patches applied
Unfortunately, resuming broke Fn input. I will try with one of suggested patches.
I still cannot see "event unblocked" in your tests... If the test result is correct, I might be missing something, and there is really a bug related. Let me think about it and prepare a debugging patch for you. Thanks Lv Hi, Luya attachment 258353 [details] actually contains 2 fixes. attachment 258293 [details] splits its first part into 1/5 and posted here, and second part into 2/5 and I didn't post it here but would send it to the community. They really are irrelevant and I do not want to put them together for reasons. For issue fixed by 1/5, the issue is actually "polling IRQ after enabling edge triggered IRQ to avoid IRQ miss and futher EC event stuck". Exactly the sympton of this bug and Samsung quirks, right? This issue can be fixed by "commit 9c40f956" and the SCI_EVT check mechanism implemented in advance_transaction(). However commit 9c40f956 is not aggressive enough to full fix the problem and thus the improvement of attachment 258293 [details]. With this issue fixed, we can remove quirk using commit 4c237371. While issue fixed by 1/5 is actually affecting the order of driver resume. I would just think some LID/thermal/battery event defer issue are directly related. We can also call it as a regression for very old kernels (probably 2.x may not see this issue). But the regression is made purposely in order to fix the conflict the issue under the situation that Linux contains an noirq stage. Furthermore, what I learned from HW guys is that it might not be a good idea to resume EC driver as early as possible as some platform logics may include a very long reset period and EC resume may dependent on other HW resume which owns high priority power wells to make it possible for EC to communicate with the underlying sensors. So yes, I'll also try to upstream patch 2/5, but it must be a split fix as it is experimental from my point of view. You don't need to use an unrelated bug to prove that the patch 2/5 is a mandatory regression fix IMO. :) Hopen you can understand the entire story. BTW, it's really weired "event unblocked" cannot be seen in your logs. It looks to me as long as there is "event blocked" and acpi_ec_resume() is executed after acpi_ec_unblock_transactions(), it surely can be seen. I need to provide a debugging patch then. Thanks Lv > While issue fixed by 1/5 is actually affecting the order of driver resume.
Should be While issue fixed by 2/5 is actually affecting the order of driver resume.
Sorry for the typo...
In attachment 258393 [details]:
1. In suspend path, __acpi_ec_disable_event() is invoked in acpi_ec_block_transactions() due to ec_freeze_events=N:
acpi_ec_block_transactions():
Line 873: [ 470.238042] ACPI: EC: Stopping EC
Line 874: [ 470.238043] ACPI: EC: event blocked
EC_FLAGS_QUERY_ENABLED will be cleared after this log entry.
Line 875: [ 470.238044] ACPI: EC: EC stopped
Both EC_FLAGS_STARTED/EC_FLAGS_STOPPED will be cleared after this log entry
2. In resume path:
acpi_ec_unblock_transactions():
Line 882: [ 470.272488] ACPI: EC: Starting EC
Line 883: [ 470.272488] ACPI: EC: EC started
EC_FLAGS_STARTED will be set after this log entry.
3. the following log indicates an end of driver resume, so acpi_ec_resume() should be invoked:
[ 472.806081] PM: resume devices took 2.467 seconds
in acpi_ec_resume(), the code path is:
acpi_ec_resume()
acpi_ec_enable_event()
if (acpi_ec_started()) <- surely TRUE as EC_FLAGS_STARTED is set and EC_FLAGS_STOPPED is cleared
__acpi_ec_enable_event() <- surely will be invoked
if (!test_and_set_bit(EC_FLAGS_QUERY_ENABLED, ...)) <- surely TRUE as EC_FLAGS_QUERY_ENABLED is cleared
ec_log_drv("event unblocked") <- surely we can see this log
Unless you have CONFIG_PM=n.
Could you check if CONFIG_PM is disabled for your kernel build?
(In reply to Lv Zheng from comment #42) <snip> > Hopen you can understand the entire story. > > BTW, it's really weired "event unblocked" cannot be seen in your logs. It > looks to me as long as there is "event blocked" and acpi_ec_resume() is > executed after acpi_ec_unblock_transactions(), it surely can be seen. > I need to provide a debugging patch then. > > Thanks > Lv Hi Lv, I do. Never thought such issue is a mess and you have to deal with it. Nevertheless, I will help testing so a better solution will be set. (In reply to Lv Zheng from comment #44) > Unless you have CONFIG_PM=n. > Could you check if CONFIG_PM is disabled for your kernel build? From the conf file, CONFIG_PM=y is set by default. http://copr-dist-git.fedorainfracloud.org/cgit/luya/kernel-acpi-ec/kernel.git/tree/kernel-x86_64.config?id=feab1df5763fb957c04f808c9cc7cb8f2ccb3193 The build with patch "ACPI / EC: Fix regression related to the triggering" attachment 258293 [details] just completed. http://copr-dist-git.fedorainfracloud.org/cgit/luya/kernel-acpi-ec/kernel.git/commit/?id=feab1df5763fb957c04f808c9cc7cb8f2ccb3193 I will post the result as soon as possible. Created attachment 258407 [details] boot with latest upstream kernel one patch applied Boot result with one patch http://copr-dist-git.fedorainfracloud.org/cgit/luya/kernel-acpi-ec/kernel.git/diff/ACPI-EC-Fix-regression-related-to-the-triggering.patch?id=feab1df5763fb957c04f808c9cc7cb8f2ccb3193 Created attachment 258409 [details] Input with one patch Test done with http://copr-dist-git.fedorainfracloud.org/cgit/luya/kernel-acpi-ec/kernel.git/diff/ACPI-EC-Fix-regression-related-to-the-triggering.patch?id=feab1df5763fb957c04f808c9cc7cb8f2ccb3193 Created attachment 258411 [details]
press-again with one patch
Test done with latest upstream kernel. I confirm CONFIG_PM is enabled by default.
Still no "event unblocked" in attachment 258409 [details] and attachment 258411 [details]. Really confusing: [ 436.059266] ACPI: EC: Stopping EC [ 436.059267] ACPI: EC: event blocked EC_FLAGS_QUERY_ENABLED cleared. [ 436.059268] ACPI: EC: EC stopped EC_FLAGS_STARTED/EC_FLAGS_STOPPED cleared. [ 436.089716] ACPI: EC: Starting EC [ 436.089717] ACPI: EC: EC started EC_FLAGS_STARTED set. [ 436.099988] ACPI: Waking up from system sleep state S3 Indicates a suspend to memory (S3). [ 440.218569] PM: resume devices took 4.062 seconds EC event handling should be resumed after this line, however we couldn't see "event unblocked". [ 446.634634] ACPI: EC: Polling enabled Because EC event is not resumed (no "event unblocked" log), EC event cannot be handled unless there is a transaction (EC event stuck problem). It looks like acpi_ec_resume() is not executed... We've confirmed on laptops in our hands, there is no such symptom in all of the platforms here... Let me prepare a test patch to find out the clue. Created attachment 258447 [details]
[PATCH] Debugging patch to confirm correct code paths
Then please try again after applying attachment 258293 [details] and attachment 258447 [details] (the debugging patch). Thanks in advance. Best regards Lv Could you resend the patch from attachment 258293 [details]? For some odd reasons, I hit this message below during compilation: Applying: ACPI / EC: Fix regression related to the triggering source of the post-resume EC event handling Applying: =================================================================== fatal: empty ident name (for <>) not allowed error: Bad exit status from /var/tmp/rpm-tmp.Y2BXZS (%prep) https://copr-be.cloud.fedoraproject.org/results/luya/kernel-acpi-ec/fedora-26-x86_64/00604114-kernel/builder-live.log As a result I am unable to build latest upstream kernel. Full details of patches: http://copr-dist-git.fedorainfracloud.org/cgit/luya/kernel-acpi-ec/kernel.git/commit/?id=675c89b427b8cc74566c356b08d4513363aa9ede please try to: $ git reset --hard Before patching the 2 patches. Thanks Lv I'm confused with the information: dbg-ec1.patch (renamed from Debugging-patch-to-confirm-correct-code-paths.patch) It should be attachment 258447 [details]. ec-poll1.patch What's this? It seems it has changed things a lot and probably could affect your code base. Is this applied? kernel-local kernel.spec I'm not sure if they are related. v5-1_4-ACPI_EC-Fix-regression-related-to-the-triggering-source-of-the-post-resume-EC-event-handling.patch Is it attachment 258293 [details]? Should you drop ec-poll1.patch before applying this? So I suggested to "git reset --hard" and try to apply the 2 patches again. Thanks and best regards Lv (In reply to Lv Zheng from comment #55) > What's this? > It seems it has changed things a lot and probably could affect your code > base. > Is this applied? > > kernel-local > kernel.spec > > I'm not sure if they are related. kernel-local is for custom kernel config and kernel.spec is just a set ot scripts including patches for building package for the vendor distribution. Both have no impact on the code base and can be safely ignored. > v5-1_4-ACPI_EC-Fix-regression-related-to-the-triggering-source-of-the-post- > resume-EC-event-handling.patch > Is it attachment 258293 [details]? Yes. > Should you drop ec-poll1.patch before applying this? I thought about that. Unfortunately, I hit another issue after getting latest upstream kernel. Applying: =================================================================== error: patch failed: drivers/acpi/ec.c:484 error: drivers/acpi/ec.c: patch does not apply Patch failed at 0072 =================================================================== The copy of the patch that failed is found in: .git/rebase-apply/patch When you have resolved this problem, run "git am --continue". If you prefer to skip this patch, run "git am --skip" instead. See https://copr-be.cloud.fedoraproject.org/results/luya/kernel-acpi-ec/fedora-26-x86_64/00605164-kernel/builder-live.log It looks like something changed inside drivers/acpi/ec.c file from the latest upstream kernel. See http://copr-dist-git.fedorainfracloud.org/cgit/luya/kernel-acpi-ec/kernel.git/tree/?id=6bb98e943ef1e34b41c94644743969041347df69 > So I suggested to "git reset --hard" and try to apply the 2 patches again. I followed the step but I kept hitting the same issue with ec-poll1.patch regardless. I don't know what caused the patch failing as it worked before. I updated to the latest upstream kernel. Doing a quick test, ec-poll1.patch successfully applied, unfortunately dbg-ec1.patch failed like on comment #56 despite doing "git reset --hard" prior to that. I am building with ec-poll1.patch to see if you can get the debugging info you need below: http://copr-dist-git.fedorainfracloud.org/cgit/luya/kernel-acpi-ec/kernel.git/commit/?id=862affa9d358a4b2bd06a1ce5699197380fa5d07 kernel-local and kernel.spec can be ignored. I will redo the steps from comment #33 and will post the result once the built completed. I figured out the issue on my side. The built system system dislike empty ident name <null> so I just copy the comment content from ec-poll1.patch to silence it. http://copr-dist-git.fedorainfracloud.org/cgit/luya/kernel-acpi-ec/kernel.git/diff/?id=92af3c0df5dc268e7c112764f81c59c031cc66d0 At least the build is working and I will post the result as soon as possible. It won't be a useful test if you cannot apply the dbg-ec1.patch... Is that possible to pull my tree? I have applied the 2 patches on top of latest upstream kernel: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git And pushed it here: https://github.com/zetalog/linux/tree/acpi-ec You can: $ git remote add linux-zetalog https://github.com/zetalog/linux $ git branch acpi-ec linux-zetalog/acpi-ec $ git checkout acpi-ec And test using this local git repository. Thanks and best regards Lv Created attachment 258553 [details]
boot with latest upstream kernel using dbg-ec patch
Boot result with both ec-poll1 dbg-ec patches applied.
Created attachment 258555 [details]
input with 2 patches applied including dbg-ec
Result with dbg-ec patch
Created attachment 258557 [details]
press-again-4.14.0-0-rc1-bad with 2 patches
Hopefully that helps
Looks a dead lock here related to CPU hot plug and acpi scan, you may need to reproduce it with pure kernels (no patching) and report it to Rafael: [ 335.404514] Disabling non-boot CPUs ... [ 335.422764] ====================================================== [ 335.422765] WARNING: possible circular locking dependency detected [ 335.422768] 4.14.0-0.rc1.git2.1.acpi_ec.fc26.x86_64 #1 Tainted: G OE [ 335.422769] ------------------------------------------------------ [ 335.422772] systemd-sleep/3452 is trying to acquire lock: [ 335.422773] ((complete)&st->done){+.+.}, at: [<ffffffff9c0b4669>] takedown_cpu+0x89/0xf0 [ 335.422782] but task is already holding lock: [ 335.422783] (sparse_irq_lock){+.+.}, at: [<ffffffff9c131077>] irq_lock_sparse+0x17/0x20 [ 335.422789] which lock already depends on the new lock. [ 335.422789] the existing dependency chain (in reverse order) is: [ 335.422790] -> #1 (sparse_irq_lock){+.+.}: [ 335.422797] __mutex_lock+0x86/0xa00 [ 335.422799] mutex_lock_nested+0x1b/0x20 [ 335.422802] irq_lock_sparse+0x17/0x20 [ 335.422805] irq_affinity_online_cpu+0x18/0xd0 [ 335.422807] cpuhp_invoke_callback+0xa7/0x830 [ 335.422808] -> #0 ((complete)&st->done){+.+.}: [ 335.422814] check_prev_add+0x351/0x700 [ 335.422816] __lock_acquire+0x1107/0x11d0 [ 335.422819] lock_acquire+0xa3/0x1f0 [ 335.422822] wait_for_completion+0x66/0x1d0 [ 335.422823] takedown_cpu+0x89/0xf0 [ 335.422825] cpuhp_invoke_callback+0xa7/0x830 [ 335.422826] cpuhp_down_callbacks+0x42/0x80 [ 335.422830] _cpu_down+0xc2/0x100 [ 335.422832] freeze_secondary_cpus+0xbd/0x390 [ 335.422835] suspend_devices_and_enter+0x2e0/0xc90 [ 335.422837] pm_suspend+0x340/0x3b0 [ 335.422839] state_store+0x85/0xf0 [ 335.422843] kobj_attr_store+0xf/0x20 [ 335.422846] sysfs_kf_write+0x42/0x60 [ 335.422849] kernfs_fop_write+0x134/0x1c0 [ 335.422852] __vfs_write+0x37/0x170 [ 335.422854] vfs_write+0xc6/0x1c0 [ 335.422856] SyS_write+0x58/0xc0 [ 335.422859] entry_SYSCALL_64_fastpath+0x1f/0xbe [ 335.422860] other info that might help us debug this: [ 335.422861] Possible unsafe locking scenario: [ 335.422862] CPU0 CPU1 [ 335.422862] ---- ---- [ 335.422863] lock(sparse_irq_lock); [ 335.422865] lock((complete)&st->done); [ 335.422866] lock(sparse_irq_lock); [ 335.422868] lock((complete)&st->done); [ 335.422869] *** DEADLOCK *** [ 335.422871] 8 locks held by systemd-sleep/3452: [ 335.422871] #0: (sb_writers#4){.+.+}, at: [<ffffffff9c2e84f3>] vfs_write+0x193/0x1c0 [ 335.422877] #1: (&of->mutex){+.+.}, at: [<ffffffff9c38a916>] kernfs_fop_write+0x106/0x1c0 [ 335.422882] #2: (kn->count#217){.+.+}, at: [<ffffffff9c38a91f>] kernfs_fop_write+0x10f/0x1c0 [ 335.422887] #3: (pm_mutex){+.+.}, at: [<ffffffff9c12474e>] pm_suspend+0xae/0x3b0 [ 335.422892] #4: (acpi_scan_lock){+.+.}, at: [<ffffffff9c56f297>] acpi_scan_lock_acquire+0x17/0x20 [ 335.422899] #5: (cpu_add_remove_lock){+.+.}, at: [<ffffffff9c0b6b6e>] freeze_secondary_cpus+0x2e/0x390 [ 335.422903] #6: (cpu_hotplug_lock.rw_sem){++++}, at: [<ffffffff9c1149ba>] percpu_down_write+0x2a/0x110 [ 335.422908] #7: (sparse_irq_lock){+.+.}, at: [<ffffffff9c131077>] irq_lock_sparse+0x17/0x20 [ 335.422912] stack backtrace: [ 335.422915] CPU: 3 PID: 3452 Comm: systemd-sleep Tainted: G OE 4.14.0-0.rc1.git2.1.acpi_ec.fc26.x86_64 #1 [ 335.422916] Hardware name: ASUSTeK COMPUTER INC. X550ZE/X550ZE, BIOS 208 10/15/2015 [ 335.422918] Call Trace: [ 335.422922] dump_stack+0x8e/0xd6 [ 335.422926] print_circular_bug+0x1f6/0x2e0 [ 335.422929] ? copy_trace+0x90/0x90 [ 335.422932] check_prev_add+0x351/0x700 [ 335.422937] __lock_acquire+0x1107/0x11d0 [ 335.422940] ? __lock_acquire+0x1107/0x11d0 [ 335.422942] ? copy_trace+0x90/0x90 [ 335.422946] lock_acquire+0xa3/0x1f0 [ 335.422948] ? takedown_cpu+0x89/0xf0 [ 335.422952] wait_for_completion+0x66/0x1d0 [ 335.422953] ? takedown_cpu+0x89/0xf0 [ 335.422956] ? stop_machine_cpuslocked+0x100/0x120 [ 335.422958] ? cpuhp_invoke_callback+0x830/0x830 [ 335.422960] takedown_cpu+0x89/0xf0 [ 335.422963] ? cpuhp_complete_idle_dead+0x20/0x20 [ 335.422965] cpuhp_invoke_callback+0xa7/0x830 [ 335.422969] cpuhp_down_callbacks+0x42/0x80 [ 335.422972] _cpu_down+0xc2/0x100 [ 335.422975] freeze_secondary_cpus+0xbd/0x390 [ 335.422979] suspend_devices_and_enter+0x2e0/0xc90 [ 335.422982] ? rcu_read_lock_sched_held+0x79/0x80 [ 335.422985] pm_suspend+0x340/0x3b0 [ 335.422988] state_store+0x85/0xf0 [ 335.422991] kobj_attr_store+0xf/0x20 [ 335.422993] sysfs_kf_write+0x42/0x60 [ 335.422996] kernfs_fop_write+0x134/0x1c0 [ 335.422999] __vfs_write+0x37/0x170 [ 335.423002] ? rcu_read_lock_sched_held+0x79/0x80 [ 335.423004] ? rcu_sync_lockdep_assert+0x2f/0x60 [ 335.423008] ? __sb_start_write+0x144/0x1a0 [ 335.423010] ? vfs_write+0x193/0x1c0 [ 335.423012] vfs_write+0xc6/0x1c0 [ 335.423015] SyS_write+0x58/0xc0 [ 335.423019] entry_SYSCALL_64_fastpath+0x1f/0xbe I cannot see "enter acpi_ec_suspend()" and "enter acpi_ec_resume()‘. So this is probably the cause. Please upload your kernel config file here (.config) and let me know how did you trigger the suspend? Thanks Lv I guess: Your platform only contain valid ECDT EC, and invalid DSDT EC. So we are actually working with ECDT EC. And there is no driver binding against ECDT EC, so the acpi_ec_suspend/resume() actually cannot be invoked. That explains why first version can fix your issue but second version cannot. IMO, the rest fix in the first version is still not a part of this issue, it's related to the timing of EC event handling enabling. Your bug is actually: ECDT device is not registered as an ACPI device, so there is no driver binding against it. Let me try to implement a correct fix for this issue and see what can happen to your platform. Thanks Lv Created attachment 258577 [details]
[PATCH] ACPI / EC: Add scan support for ECDT device to attach PM ops
Please try if this can fix the issue?
Created attachment 258589 [details] kernel-x86_64.config default (In reply to Lv Zheng from comment #64) > I cannot see "enter acpi_ec_suspend()" and "enter acpi_ec_resume()‘. > So this is probably the cause. > Please upload your kernel config file here (.config) and let me know how did > you trigger the suspend? Suspend is simply triggered by either Fn+F1 keys or closing the lid of the laptop. After the lockup occurred, I had to manually suspend from my desktop. (In reply to Lv Zheng from comment #65) > I guess: > > Your platform only contain valid ECDT EC, and invalid DSDT EC. > So we are actually working with ECDT EC. > And there is no driver binding against ECDT EC, so the > acpi_ec_suspend/resume() actually cannot be invoked. > That explains why first version can fix your issue but second version cannot. > > IMO, the rest fix in the first version is still not a part of this issue, > it's related to the timing of EC event handling enabling. > > Your bug is actually: > ECDT device is not registered as an ACPI device, so there is no driver > binding against it. > Let me try to implement a correct fix for this issue and see what can happen > to your platform. Thanks for the explanation, I changed the title to reflect the cause of the issue. Basically, a hardware vendor implementation problem hard to detect by normal procedure which makes me wonder if other ASUS X series are affected. Really a regression, then. Whatever the BIOS issue is, actually EC driver has something missing and triggered by an enhancement. The indication for the ECDT selection: [ 0.599731] ACPI: \_SB_.PCI0.SBRG.EC0_: Used as boot ECDT EC to handle transactions and events ^^^^ Created attachment 258601 [details]
boot with latest upstream kernel using ec-ecdt1.patch
Boot sequence with ec-ecdt1.patch
Created attachment 258603 [details]
press-again-4.14.0-0-rc2-good with ec-ecdt1.patch
Initial Fn inputs command with ec-ecdt1.patch
Created attachment 258605 [details]
press-4.14.0-0-rc2-good with ec-ecdt1.patch
Initial Fn input command with ec-ecdt1.patch
Created attachment 258607 [details]
press-again-4.14.0-0-rc2-good with ec-ecdt1.patch
Fn inputs command after resuming the laptop.
I went further by closing/opening the lid to verify suspend/resume will not cause another lock-up. It looks like the patch fixed the root cause of the issue.
Thanks for the report and the test. :) I'll do some local validation for the patch before sending it to the upstream. Thanks Lv You are very welcome Lv. I would like to also thank you for taking the time of providing the steps for testing and explaining about the issue. Those details were very helpful. It's my fault taking so long time to reach the root cause. If I checked the line in comment 70 earlier and remembered the details of your original bug, I should have root caused it earlier... Anyway, the upstream version of the patches is posted here: https://patchwork.kernel.org/patch/9971553/ https://patchwork.kernel.org/patch/9971557/ I did an improvement to allow "unbind, suspend, resume, bind" working in your case. # echo -n "LNXECDT:00" > /sys/bus/acpi/drivers/ec/unbind # echo -n "mem" > /sys/power/state # echo -n "LNXECDT:00" > /sys/bus/acpi/drivers/ec/bind Using attachment 258577 [details], your issue should still be able to reproduce after executing this sequence. Using the posted fix, event handling can be restored as long as the EC driver is bound. Thanks Lv Thanks again for the hard work. At least the issue is resolved for now while undiscovered problems lurk around. Let me know when the patches got applied on the mainline kernel. Luya |