Bug 196847

Summary: Boot EC Requirement - ECDT is not registered as an ACPI device, so there is no driver binding against it (Asus X550ZE)
Product: ACPI Reporter: Luya Tshimbalanga (luya)
Component: ECAssignee: Lv Zheng (lv.zheng)
Status: RESOLVED CODE_FIX    
Severity: high CC: jwrdegoede, lv.zheng
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.9 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: dmesg boot report
test hotkey functions
test result with function keys stopped working
dmesg boot report update
test result with restored functional keys
[PATCH v5 1/4] ACPI / EC: Fix regression related to the triggering source of the post-resume EC event handling
boot with patch v5-1-4 kernel 4.12.12
Pressed Fn keys with patched kernel 4.12.12
[PATCH] ACPI: EC: Fix an EC event IRQ storming issue
boot-4.12.12 bad with 2 patches
press-4.12.12 bad with 2 patches
press-again-4.12.12-bad with 2 patches
boot with latest upstream kernel no patches applied
press input with latest upstream kernel no patches applied
press again input with latest upstream kernel no patches applied
boot with latest upstream kernel one patch applied
Input with one patch
press-again with one patch
[PATCH] Debugging patch to confirm correct code paths
boot with latest upstream kernel using dbg-ec patch
input with 2 patches applied including dbg-ec
press-again-4.14.0-0-rc1-bad with 2 patches
[PATCH] ACPI / EC: Add scan support for ECDT device to attach PM ops
kernel-x86_64.config default
boot with latest upstream kernel using ec-ecdt1.patch
press-again-4.14.0-0-rc2-good with ec-ecdt1.patch
press-4.14.0-0-rc2-good with ec-ecdt1.patch
press-again-4.14.0-0-rc2-good with ec-ecdt1.patch

Description Luya Tshimbalanga 2017-09-06 17:32:30 UTC
This version of kernel-4.12.x series broke the hotkeys functionality compared to 4.11.x. The fix was previously done on https://bugzilla.kernel.org/show_bug.cgi?id=115021.
Also see https://patchwork.kernel.org/patch/9318649/
Comment 1 Luya Tshimbalanga 2017-09-07 01:38:06 UTC
Created attachment 258235 [details]
dmesg boot report

Using https://bugzilla.kernel.org/show_bug.cgi?id=115021#c58

Boot with the parameter acpi.trace_state=enable dyndbg="file ec.c +p"
Comment 2 Luya Tshimbalanga 2017-09-07 01:39:15 UTC
Created attachment 258239 [details]
test hotkey functions

Test with kernel paramenter acpi.trace_state=enable dyndbg="file ec.c +p"
Comment 3 Luya Tshimbalanga 2017-09-07 01:59:50 UTC
After rebooting without the parameter, the function keys are somehow fully restored. I don't know what exactly happened.
Comment 4 Luya Tshimbalanga 2017-09-07 02:09:20 UTC
I am closing this report as function keys works fine.

[  704.649095] PK-Backend[4110]: segfault at 1f4 ip 00007feb6ca1f7fa sp 00007feb5a011870 error 4 in libsolv.so.0[7feb6c9e3000+87000]
[  758.209062] [drm] probing gen 2 caps for device 1022:1425 = 733d01/e
[  758.209067] [drm] enabling PCIE gen 3 link speeds, disable with radeon.pcie_gen2=0
[  759.576375] [drm] PCIE GART of 2048M enabled (table at 0x0000000000040000).
[  759.576648] radeon 0000:01:00.0: WB enabled
[  759.576652] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff92f238185c00
[  759.576655] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff92f238185c04
[  759.576657] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff92f238185c08
[  759.576659] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff92f238185c0c
[  759.576661] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff92f238185c10
[  759.787071] [drm] ring test on 0 succeeded in 1 usecs
[  759.787077] [drm] ring test on 1 succeeded in 1 usecs
[  759.787082] [drm] ring test on 2 succeeded in 1 usecs
[  759.787091] [drm] ring test on 3 succeeded in 4 usecs
[  759.787098] [drm] ring test on 4 succeeded in 4 usecs
[  759.787150] [drm] ib test on ring 0 succeeded in 0 usecs
[  759.787180] [drm] ib test on ring 1 succeeded in 0 usecs
[  759.787208] [drm] ib test on ring 2 succeeded in 0 usecs
[  759.787224] [drm] ib test on ring 3 succeeded in 0 usecs
[  759.787238] [drm] ib test on ring 4 succeeded in 0 usecs
[  814.563767] cheese[4343]: segfault at 1b ip 00007f1d1fd59f10 sp 00007ffebb484780 error 4 in libwayland-client.so.0.3.0[7f1d1fd53000+d000]
Comment 5 Lv Zheng 2017-09-07 02:42:30 UTC
Yes, it looks the GPE enabled in the new log is exactly the ECDT GPE#.

[    0.229608] ACPI: \: Used as first EC
[    0.229609] ACPI: \: GPE=0x15, EC_CMD/EC_SC=0x66, EC_DATA=0x62
                        ^^^^^^^^
[    0.229610] ACPI: \: Used as boot ECDT EC to handle transactions

[    0.272663] ACPI : EC: EC_SC(R) = 0x08 SCI_EVT=0 BURST=0 CMD=1 IBF=0 OBF=0
[    0.272665] ACPI: \_SB_.PCI0.SBRG.EC0_: GPE=0x15, EC_CMD/EC_SC=0x66, EC_DATA=0x62
                                           ^^^^^^^^
[    0.272667] ACPI: \_SB_.PCI0.SBRG.EC0_: Used as boot ECDT EC to handle transactions and events

What we've done recently is just changing the timing of acpi_ec_ecdt_start();
For DSDT EC, we'll derive ECDT GPE, so your platform might not be affected.

Maybe I've forgotton the root cause, feel free to re-open if it reproduces.

Thanks
Lv
Comment 6 Luya Tshimbalanga 2017-09-07 05:13:20 UTC
Created attachment 258241 [details]
test result with function keys stopped working

I was able to reproduce the bug. It occurs when suspending and resuming the session.
Comment 7 Luya Tshimbalanga 2017-09-07 05:14:56 UTC
Reopening this bug as it occurs after resuming the session by either manually suspending or closing the laptop lid for a while.
Comment 8 Lv Zheng 2017-09-07 07:11:24 UTC
This sounds a different known bug (should be low reproduce ratio) that can be covered by these patches:

https://patchwork.kernel.org/patch/9931477/
https://patchwork.kernel.org/patch/9931481/
https://patchwork.kernel.org/patch/9931473/

Can you give them a try?

Thanks
Lv
Comment 9 Luya Tshimbalanga 2017-09-09 03:35:05 UTC
(In reply to Lv Zheng from comment #8)
> This sounds a different known bug (should be low reproduce ratio) that can
> be covered by these patches:
> 
> https://patchwork.kernel.org/patch/9931477/

After 8 hours building, the first patch above solved the issue related to resuming session causing disabled function keys. Test done with patched kernel-4.13.0

> https://patchwork.kernel.org/patch/9931481/
> https://patchwork.kernel.org/patch/9931473/

I was unable to use the above patches as the built failed.

The result will be published shortly.
Comment 10 Luya Tshimbalanga 2017-09-09 03:36:46 UTC
Created attachment 258283 [details]
dmesg boot report update

Updated boot sequence with patched kernel-4.13.0
Comment 11 Luya Tshimbalanga 2017-09-09 03:38:58 UTC
Created attachment 258285 [details]
test result with restored functional keys

Resulted test with patched kernel-4.13.0
Comment 12 Luya Tshimbalanga 2017-09-09 19:45:29 UTC
After a while, it looks like the issue got resolved with  https://patchwork.kernel.org/patch/9931477/

I don't know it I correctly set the status as there wasn't any way to confirm the test.
Comment 13 Luya Tshimbalanga 2017-09-10 18:43:12 UTC
I just found out the first patch worked with the latest stable 4.12.11. I will post the result after the build completed.
I notice Hans follows this bug and wonder if he can use the patch on Fedora kernel until it officially lands.
Comment 14 Lv Zheng 2017-09-11 02:18:18 UTC
(In reply to Luya Tshimbalanga from comment #9)
> (In reply to Lv Zheng from comment #8)
> > This sounds a different known bug (should be low reproduce ratio) that can
> > be covered by these patches:
> > 
> > https://patchwork.kernel.org/patch/9931477/

OK, thanks for the test.
I'll update the patch with your tested-by.

> 
> After 8 hours building, the first patch above solved the issue related to
> resuming session causing disabled function keys. Test done with patched
> kernel-4.13.0
> 
> > https://patchwork.kernel.org/patch/9931481/
> > https://patchwork.kernel.org/patch/9931473/
> 
> I was unable to use the above patches as the built failed.
> 
> The result will be published shortly.

Which kernel version are you using?
I can help to port the patches for you.

Thanks
Lv
Comment 15 Luya Tshimbalanga 2017-09-11 02:33:43 UTC
(In reply to Lv Zheng from comment #14)
> > 
> > After 8 hours building, the first patch above solved the issue related to
> > resuming session causing disabled function keys. Test done with patched
> > kernel-4.13.0
> > 
> > > https://patchwork.kernel.org/patch/9931481/
> > > https://patchwork.kernel.org/patch/9931473/
> > 
> > I was unable to use the above patches as the built failed.
> > 
> > The result will be published shortly.
> 
> Which kernel version are you using?
> I can help to port the patches for you.

Currently 4.13.0 and in the process of building 4.12.11 to sync with Fedora version. Both patches failed on both kernel version according to the build report.
Comment 16 Lv Zheng 2017-09-11 02:35:40 UTC
IMO, the fix fixes possible regressions triggered by:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4c237371
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9c40f956

In your case, it fixes regression caused by commit 9c40f956.
Comment 17 Lv Zheng 2017-09-11 02:37:41 UTC
> Currently 4.13.0 and in the process of building 4.12.11 to sync with Fedora
> version. Both patches failed on both kernel version according to the build
> report.

OK, I'll check with the 4.13.0 or 4.12.11.
Comment 18 Luya Tshimbalanga 2017-09-11 02:43:17 UTC
(In reply to Lv Zheng from comment #16)
> IMO, the fix fixes possible regressions triggered by:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=4c237371
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=9c40f956
> 
> In your case, it fixes regression caused by commit 9c40f956.

Thank you for the explanation. I will look for the other revised patches for testing.
Comment 19 Lv Zheng 2017-09-11 05:22:08 UTC
In order to be clear, could upload the logs here:
And upload dmesg output of the bad/good kernels (post resume logs).
1. build bad/good kernels with CONFIG_DYNAMIC_DEBUG enabled;
2. boot the kernel with "dyndbg="file drivers/acpi/ec.c +p";
3. dmesg -c > boot-good/bad.log;
4. suspend/resume, then demsg -c > suspend-bood/bad.log;
5. echo "begin pressing keys" > press-good/bad.log, pressing keys, then dmesg -c >> press-good/bad.log, echo "end pressing keys" >> press-good/bad.log.
6. Upload the compressed boot/suspend/press-good/bad.log files here.

Maybe I'm wrong with the culprit, so if possible, could you bisect the culprit out and report it here?

Thanks and best regards
Lv
Comment 20 Lv Zheng 2017-09-11 05:23:20 UTC
"dyndbg="file drivers/acpi/ec.c +p" Should be: "dyndbg='file drivers/acpi/ec.c +p'".

Sorry for the mistake.
Comment 21 Lv Zheng 2017-09-11 09:31:33 UTC
Created attachment 258293 [details]
[PATCH v5 1/4] ACPI / EC: Fix regression related to the triggering source of the post-resume EC event handling

[PATCH v5 1/4] ACPI / EC: Fix regression related to the triggering source of the post-resume EC event handling

Can you try if only this change can fix your issue?
Probably this isn't a regression but a requirement to improve a previous fix.
Comment 22 Luya Tshimbalanga 2017-09-12 16:57:04 UTC
Created attachment 258333 [details]
boot with patch v5-1-4 kernel 4.12.12
Comment 23 Luya Tshimbalanga 2017-09-12 17:01:29 UTC
Created attachment 258335 [details]
Pressed Fn keys with patched kernel 4.12.12

Unfortunately the patches broke the Fn keys functions. The previous patch from 258285 seems the correct one
Comment 24 Lv Zheng 2017-09-13 01:50:21 UTC
It looks that your test is:
1. Press key before the suspend
2. Release key during the resume

It seems the test is not valid.
Let me ask questions.

A. I can see "EC started" in the log.
It should happen in acpi_ec_resume(), done by acpi_ec_enable_event().
The condition to trigger it is: EC must be started by acpi_ec_start().
I can see that EC is started in the log:
[  385.723151] ACPI : EC: Starting EC
[  385.723152] ACPI : EC: EC started
But I couldn't see "Event unblocked" in the log, it should happen when acpi_ec_resume() is invoked.
So could you wait for a while after being resumed and see if the function key press can be recovered.
Wait for a while and store the log and upload the log here.
$ dmesg > resume-bad.log.

B. Also I couldn't see "Polling enabled" in the log.
If you still cannot see "Event unblocked" in your new test.
This probably is because you don't have the following commit in your code base, causing EC event flooding:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1ab69f27
If this commit is there, after "EC started" occurred, when SCI_EVT is detected in advance_transaction(), "Polling enabled" will appear in the log.
So could you please use latest upstream kernel and try again?

I'd say, it's sad you didn't use the latest upstream kernel?
You are creating a bug and fake proofs using intermediate code base.

There are many EC bugs I've already fixed several years ago.
But upstreaming all of the fixes require me to find proofs for each of them.
Thus not all of them are in the upstream kernel.
(Trust me, merging this series, then all of them are upstreamed and all of the issues are gone).
Please don't use intermediate code base to reproduce and report old issues.

Thanks
Lv
Comment 25 Lv Zheng 2017-09-13 02:02:04 UTC
Removing the regression mark as this report doesn't seem to be a regression. Regression can only happen to Samsung models.
Comment 26 Lv Zheng 2017-09-13 02:05:46 UTC
Some of the old bug fixes are mutual exclusive as I failed to upstream my final solution at the first time and tried a lot of approaches to fix one platform with the effort not breaking the others or not breaking the others a lot.

So you'd better to try latest upstream kernel:
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Or linux-next branch of:
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git
$ git checkout linux-next

Where more EC fixes are upstreamed.

Thanks in advance.
Comment 27 Lv Zheng 2017-09-13 04:55:14 UTC
It seems a fix to your issue should be:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1ab69f27
And no further patches are required.
So the latest upstream kernel should be OK for you.
Comment 28 Lv Zheng 2017-09-13 04:59:14 UTC
> So the latest upstream kernel should be OK for you.
Probably still requires attachment 258293 [details] if using latest kernel doesn't help. :)
Comment 29 Lv Zheng 2017-09-13 05:04:50 UTC
Created attachment 258353 [details]
[PATCH] ACPI: EC: Fix an EC event IRQ storming issue

Back ported commit.
Comment 30 Lv Zheng 2017-09-13 05:12:52 UTC
On top of 4.12.9 kernel, please:

1. apply the following patches:
   attachment 258353 [details]
   attachment 258293 [details]
2. enable CONFIG_DYNAMIC_DEBUG, and rebuild the kernel;
3. boot the kernel with "dyndbg='file drivers/acpi/ec.c +p'", and dmesg -c > boot-4.12-bad.log after login;
4. echo "begin pressing keys" > press-4.12-bad.log, press the keys, and suspend the system;
5. resume the system, release the keys, dmesg -c >> press-4.12-bad.log and echo "end pressing keys" >> press-4.12-bad.log;
6. wait a while, press the key, release the key, and dmesg -c > press-again-4.12-bad.log

You can upload the following test result here:
1. boot-4.12-bad.log
2. press-4.12-bad.log
3. press-again-4.12-bad.log

Thanks
Lv
Comment 31 Luya Tshimbalanga 2017-09-13 06:02:08 UTC
(In reply to Lv Zheng from comment #24)
> 
> A. I can see "EC started" in the log.
> It should happen in acpi_ec_resume(), done by acpi_ec_enable_event().
> The condition to trigger it is: EC must be started by acpi_ec_start().
> I can see that EC is started in the log:
> [  385.723151] ACPI : EC: Starting EC
> [  385.723152] ACPI : EC: EC started
> But I couldn't see "Event unblocked" in the log, it should happen when
> acpi_ec_resume() is invoked.
> So could you wait for a while after being resumed and see if the function
> key press can be recovered.
> Wait for a while and store the log and upload the log here.
> $ dmesg > resume-bad.log.

Understood. I will try again with the rebuilt 4.12.12.

> 
> B. Also I couldn't see "Polling enabled" in the log.
> If you still cannot see "Event unblocked" in your new test.
> This probably is because you don't have the following commit in your code
> base, causing EC event flooding:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=1ab69f27
> If this commit is there, after "EC started" occurred, when SCI_EVT is
> detected in advance_transaction(), "Polling enabled" will appear in the log.
> So could you please use latest upstream kernel and try again?
> 
> I'd say, it's sad you didn't use the latest upstream kernel?
> You are creating a bug and fake proofs using intermediate code base.

The custom kernel I use for the test is derived from Fedora (the distribution I run) which, in my knowledge, follow upstream as possible. You can view the details on 
http://copr-dist-git.fedorainfracloud.org/cgit/luya/kernel-acpi-ec/kernel.git/commit/?id=815214181377eb0e08efe9d4f8802ad4ac12070f

I would have no idea I used intermediate code base.

> There are many EC bugs I've already fixed several years ago.
> But upstreaming all of the fixes require me to find proofs for each of them.
> Thus not all of them are in the upstream kernel.
> (Trust me, merging this series, then all of them are upstreamed and all of
> the issues are gone).
> Please don't use intermediate code base to reproduce and report old issues.

I understand how difficult it is to upstream all fixes. I followed your instructions as possible as I am unfamiliar with the kernel. For those I thank you for the effort.
Comment 32 Luya Tshimbalanga 2017-09-13 06:10:11 UTC
(In reply to Lv Zheng from comment #30)
> On top of 4.12.9 kernel, please:
> 
> 1. apply the following patches:
>    attachment 258353 [details]
>    attachment 258293 [details]
> 2. enable CONFIG_DYNAMIC_DEBUG, and rebuild the kernel;
> 3. boot the kernel with "dyndbg='file drivers/acpi/ec.c +p'", and dmesg -c >
> boot-4.12-bad.log after login;
> 4. echo "begin pressing keys" > press-4.12-bad.log, press the keys, and
> suspend the system;
> 5. resume the system, release the keys, dmesg -c >> press-4.12-bad.log and
> echo "end pressing keys" >> press-4.12-bad.log;
> 6. wait a while, press the key, release the key, and dmesg -c >
> press-again-4.12-bad.log
> 
> You can upload the following test result here:
> 1. boot-4.12-bad.log
> 2. press-4.12-bad.log
> 3. press-again-4.12-bad.log
> 
> Thanks
> Lv

Should I built with 4.12.12 instead?
Comment 33 Lv Zheng 2017-09-13 07:13:44 UTC
> Should I built with 4.12.12 instead?

I'm working on upstream community kernels (rafael's linux-pm.git and torvalds's linux.git), not working on stable kernels, especially I don't know what 4.12.12 means to your distribution vendor, so it's hard for me to answer.

But if you can apply attachment 258353 [details] and attachment 258293 [details] on top of your current working kernel and successfully build ec.o (make drivers/acpi/ec.o), then you can stay with your current working kernel to perform the test.

If you failed to do that, you can try 4.12.12 instead.

If it still failed, you can:
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
$ git reset --hard v4.12
And apply the patches on top of this kernel.
I've confirmed that the patches can be applied on top of this kernel.
And according to your phenomenen, applying the 2 patches on top of this kernel could be sufficiently fix the issue IMO.

Thanks and best regards
Lv
Comment 34 Luya Tshimbalanga 2017-09-14 06:07:26 UTC
Created attachment 258365 [details]
boot-4.12.12 bad with 2 patches
Comment 35 Luya Tshimbalanga 2017-09-14 06:09:18 UTC
Created attachment 258367 [details]
press-4.12.12 bad with 2 patches

I realized I forgot putting ">>" for dmesg -c >> press-4.12.9 on the earlier result
Comment 36 Luya Tshimbalanga 2017-09-14 06:12:28 UTC
Created attachment 258369 [details]
press-again-4.12.12-bad with 2 patches

This file is very much empty with no input from ec after resuming. Perhaps using one patch may help.
Comment 37 Luya Tshimbalanga 2017-09-14 06:32:49 UTC
I am building the latest upstream kernel with no patches
https://copr.fedorainfracloud.org/coprs/luya/kernel-acpi-ec/build/602036/

I will post results following steps from comment #30
Comment 38 Luya Tshimbalanga 2017-09-14 22:37:07 UTC
Created attachment 258391 [details]
boot with latest upstream kernel no patches applied
Comment 39 Luya Tshimbalanga 2017-09-14 22:39:02 UTC
Created attachment 258393 [details]
press input with latest upstream kernel no patches applied

Fn input before suspend/resume
Comment 40 Luya Tshimbalanga 2017-09-14 22:42:11 UTC
Created attachment 258395 [details]
press again input with latest upstream kernel no patches applied

Unfortunately, resuming broke Fn input. I will try with one of suggested patches.
Comment 41 Lv Zheng 2017-09-15 03:34:33 UTC
I still cannot see "event unblocked" in your tests...
If the test result is correct, I might be missing something, and there is really a bug related.
Let me think about it and prepare a debugging patch for you.

Thanks
Lv
Comment 42 Lv Zheng 2017-09-15 05:14:39 UTC
Hi, Luya

attachment 258353 [details] actually contains 2 fixes. attachment 258293 [details] splits its first part into 1/5 and posted here, and second part into 2/5 and I didn't post it here but would send it to the community. They really are irrelevant and I do not want to put them together for reasons.

For issue fixed by 1/5, the issue is actually "polling IRQ after enabling edge triggered IRQ to avoid IRQ miss and futher EC event stuck". Exactly the sympton of this bug and Samsung quirks, right? This issue can be fixed by "commit 9c40f956" and the SCI_EVT check mechanism implemented in advance_transaction().
However commit 9c40f956 is not aggressive enough to full fix the problem and thus the improvement of attachment 258293 [details]. With this issue fixed, we can remove quirk using commit 4c237371.

While issue fixed by 1/5 is actually affecting the order of driver resume. I would just think some LID/thermal/battery event defer issue are directly related. We can also call it as a regression for very old kernels (probably 2.x may not see this issue). But the regression is made purposely in order to fix the conflict the issue under the situation that Linux contains an noirq stage.
Furthermore, what I learned from HW guys is that it might not be a good idea to resume EC driver as early as possible as some platform logics may include a very long reset period and EC resume may dependent on other HW resume which owns high priority power wells to make it possible for EC to communicate with the underlying sensors.
So yes, I'll also try to upstream patch 2/5, but it must be a split fix as it is experimental from my point of view. You don't need to use an unrelated bug to prove that the patch 2/5 is a mandatory regression fix IMO. :)

Hopen you can understand the entire story.

BTW, it's really weired "event unblocked" cannot be seen in your logs. It looks to me as long as there is "event blocked" and acpi_ec_resume() is executed after acpi_ec_unblock_transactions(), it surely can be seen.
I need to provide a debugging patch then.

Thanks
Lv
Comment 43 Lv Zheng 2017-09-15 05:16:52 UTC
> While issue fixed by 1/5 is actually affecting the order of driver resume.
Should be While issue fixed by 2/5 is actually affecting the order of driver resume.

Sorry for the typo...
Comment 44 Lv Zheng 2017-09-15 06:08:32 UTC
In attachment 258393 [details]:

1. In suspend path, __acpi_ec_disable_event() is invoked in acpi_ec_block_transactions() due to ec_freeze_events=N:

acpi_ec_block_transactions():
	Line 873: [  470.238042] ACPI: EC: Stopping EC
	Line 874: [  470.238043] ACPI: EC: event blocked
EC_FLAGS_QUERY_ENABLED will be cleared after this log entry.
	Line 875: [  470.238044] ACPI: EC: EC stopped
Both EC_FLAGS_STARTED/EC_FLAGS_STOPPED will be cleared after this log entry

2. In resume path:
acpi_ec_unblock_transactions():
	Line 882: [  470.272488] ACPI: EC: Starting EC
	Line 883: [  470.272488] ACPI: EC: EC started
EC_FLAGS_STARTED will be set after this log entry.

3. the following log indicates an end of driver resume, so acpi_ec_resume() should be invoked:
[  472.806081] PM: resume devices took 2.467 seconds
in acpi_ec_resume(), the code path is:
acpi_ec_resume()
  acpi_ec_enable_event()
    if (acpi_ec_started()) <- surely TRUE as EC_FLAGS_STARTED is set and EC_FLAGS_STOPPED is cleared
      __acpi_ec_enable_event() <- surely will be invoked
        if (!test_and_set_bit(EC_FLAGS_QUERY_ENABLED, ...)) <- surely TRUE as EC_FLAGS_QUERY_ENABLED is cleared
          ec_log_drv("event unblocked") <- surely we can see this log

Unless you have CONFIG_PM=n.
Could you check if CONFIG_PM is disabled for your kernel build?
Comment 45 Luya Tshimbalanga 2017-09-15 16:35:43 UTC
(In reply to Lv Zheng from comment #42)
<snip> 

> Hopen you can understand the entire story.
> 
> BTW, it's really weired "event unblocked" cannot be seen in your logs. It
> looks to me as long as there is "event blocked" and acpi_ec_resume() is
> executed after acpi_ec_unblock_transactions(), it surely can be seen.
> I need to provide a debugging patch then.
> 
> Thanks
> Lv

Hi Lv,

I do. Never thought such issue is a mess and you have to deal with it. Nevertheless, I will help testing so a better solution will be set.

(In reply to Lv Zheng from comment #44)
> Unless you have CONFIG_PM=n.
> Could you check if CONFIG_PM is disabled for your kernel build?

From the conf file,  CONFIG_PM=y is set by default.
http://copr-dist-git.fedorainfracloud.org/cgit/luya/kernel-acpi-ec/kernel.git/tree/kernel-x86_64.config?id=feab1df5763fb957c04f808c9cc7cb8f2ccb3193

The build with patch "ACPI / EC: Fix regression related to the triggering"  attachment 258293 [details] just completed. 
http://copr-dist-git.fedorainfracloud.org/cgit/luya/kernel-acpi-ec/kernel.git/commit/?id=feab1df5763fb957c04f808c9cc7cb8f2ccb3193

I will post the result as soon as possible.
Comment 48 Luya Tshimbalanga 2017-09-15 17:06:00 UTC
Created attachment 258411 [details]
press-again with one patch

Test done with latest upstream kernel. I confirm CONFIG_PM is enabled by default.
Comment 49 Lv Zheng 2017-09-18 01:08:01 UTC
Still no "event unblocked" in attachment 258409 [details] and attachment 258411 [details].
Really confusing:

[  436.059266] ACPI: EC: Stopping EC
[  436.059267] ACPI: EC: event blocked
 EC_FLAGS_QUERY_ENABLED cleared.
[  436.059268] ACPI: EC: EC stopped
 EC_FLAGS_STARTED/EC_FLAGS_STOPPED cleared.
[  436.089716] ACPI: EC: Starting EC
[  436.089717] ACPI: EC: EC started
 EC_FLAGS_STARTED set.
[  436.099988] ACPI: Waking up from system sleep state S3
 Indicates a suspend to memory (S3).
[  440.218569] PM: resume devices took 4.062 seconds
 EC event handling should be resumed after this line, however we couldn't see "event unblocked".
[  446.634634] ACPI: EC: Polling enabled
 Because EC event is not resumed (no "event unblocked" log), EC event cannot be handled unless there is a transaction (EC event stuck problem).

It looks like acpi_ec_resume() is not executed...
We've confirmed on laptops in our hands, there is no such symptom in all of the platforms here...

Let me prepare a test patch to find out the clue.
Comment 50 Lv Zheng 2017-09-18 01:31:42 UTC
Created attachment 258447 [details]
[PATCH] Debugging patch to confirm correct code paths
Comment 51 Lv Zheng 2017-09-18 01:32:46 UTC
Then please try again after applying attachment 258293 [details] and attachment 258447 [details] (the debugging patch). Thanks in advance.

Best regards
Lv
Comment 52 Luya Tshimbalanga 2017-09-18 09:12:42 UTC
Could you resend the patch from attachment 258293 [details]?
For some odd reasons, I hit this message below during compilation:
Applying: ACPI / EC: Fix regression related to the triggering source of the post-resume EC event handling
Applying: ===================================================================
fatal: empty ident name (for <>) not allowed
error: Bad exit status from /var/tmp/rpm-tmp.Y2BXZS (%prep)

https://copr-be.cloud.fedoraproject.org/results/luya/kernel-acpi-ec/fedora-26-x86_64/00604114-kernel/builder-live.log

As a result I am unable to build latest upstream kernel.
Comment 54 Lv Zheng 2017-09-18 22:33:01 UTC
please try to:

$ git reset --hard

Before patching the 2 patches.

Thanks
Lv
Comment 55 Lv Zheng 2017-09-20 00:41:08 UTC
I'm confused with the information:

dbg-ec1.patch (renamed from Debugging-patch-to-confirm-correct-code-paths.patch)

It should be attachment 258447 [details].
 
ec-poll1.patch

What's this?
It seems it has changed things a lot and probably could affect your code base.
Is this applied?
 
kernel-local
kernel.spec

I'm not sure if they are related.
 
v5-1_4-ACPI_EC-Fix-regression-related-to-the-triggering-source-of-the-post-resume-EC-event-handling.patch
  
Is it attachment 258293 [details]?
Should you drop ec-poll1.patch before applying this?
So I suggested to "git reset --hard" and try to apply the 2 patches again.

Thanks and best regards
Lv
Comment 56 Luya Tshimbalanga 2017-09-20 03:30:00 UTC
(In reply to Lv Zheng from comment #55)
> What's this?
> It seems it has changed things a lot and probably could affect your code
> base.
> Is this applied?
>  
> kernel-local
> kernel.spec
> 
> I'm not sure if they are related.

kernel-local is for custom kernel config and kernel.spec is just a set ot scripts including patches for building package for the vendor distribution. Both have no impact on the code base and can be safely ignored.


> v5-1_4-ACPI_EC-Fix-regression-related-to-the-triggering-source-of-the-post-
> resume-EC-event-handling.patch
> Is it attachment 258293 [details]?

Yes.

> Should you drop ec-poll1.patch before applying this?

I thought about that. Unfortunately, I hit another issue after getting latest upstream kernel.

Applying: ===================================================================
error: patch failed: drivers/acpi/ec.c:484
error: drivers/acpi/ec.c: patch does not apply
Patch failed at 0072 ===================================================================
The copy of the patch that failed is found in: .git/rebase-apply/patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.

See https://copr-be.cloud.fedoraproject.org/results/luya/kernel-acpi-ec/fedora-26-x86_64/00605164-kernel/builder-live.log

It looks like something changed inside drivers/acpi/ec.c file from the latest upstream kernel.

See http://copr-dist-git.fedorainfracloud.org/cgit/luya/kernel-acpi-ec/kernel.git/tree/?id=6bb98e943ef1e34b41c94644743969041347df69


> So I suggested to "git reset --hard" and try to apply the 2 patches again.

I followed the step but I kept hitting the same issue with ec-poll1.patch regardless. I don't know what caused the patch failing as it worked before.
Comment 57 Luya Tshimbalanga 2017-09-21 05:51:11 UTC
I updated to the latest upstream kernel. Doing a quick test, ec-poll1.patch successfully applied, unfortunately dbg-ec1.patch failed like on comment #56 despite doing "git reset --hard" prior to that. 

I am building with ec-poll1.patch to see if you can get the debugging info you need below:
http://copr-dist-git.fedorainfracloud.org/cgit/luya/kernel-acpi-ec/kernel.git/commit/?id=862affa9d358a4b2bd06a1ce5699197380fa5d07

kernel-local and kernel.spec can be ignored.

I will redo the steps from comment #33 and will post the result once the built completed.
Comment 58 Luya Tshimbalanga 2017-09-21 06:00:33 UTC
I figured out the issue on my side. The built system system dislike empty ident name <null> so I just copy the comment content from ec-poll1.patch to silence it. 

http://copr-dist-git.fedorainfracloud.org/cgit/luya/kernel-acpi-ec/kernel.git/diff/?id=92af3c0df5dc268e7c112764f81c59c031cc66d0

At least the build is working and I will post the result as soon as possible.
Comment 59 Lv Zheng 2017-09-21 08:36:18 UTC
It won't be a useful test if you cannot apply the dbg-ec1.patch...
Is that possible to pull my tree?

I have applied the 2 patches on top of latest upstream kernel: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
And pushed it here:
https://github.com/zetalog/linux/tree/acpi-ec

You can:
$ git remote add linux-zetalog https://github.com/zetalog/linux
$ git branch acpi-ec linux-zetalog/acpi-ec
$ git checkout acpi-ec

And test using this local git repository.

Thanks and best regards
Lv
Comment 60 Luya Tshimbalanga 2017-09-21 17:55:20 UTC
Created attachment 258553 [details]
boot with latest upstream kernel using dbg-ec patch

Boot result with both ec-poll1 dbg-ec patches applied.
Comment 61 Luya Tshimbalanga 2017-09-21 23:40:32 UTC
Created attachment 258555 [details]
input with 2 patches applied including dbg-ec

Result with dbg-ec patch
Comment 62 Luya Tshimbalanga 2017-09-21 23:41:41 UTC
Created attachment 258557 [details]
press-again-4.14.0-0-rc1-bad with 2 patches

Hopefully that helps
Comment 63 Lv Zheng 2017-09-25 02:54:05 UTC
Looks a dead lock here related to CPU hot plug and acpi scan, you may need to reproduce it with pure kernels (no patching) and report it to Rafael:
[  335.404514] Disabling non-boot CPUs ...

[  335.422764] ======================================================
[  335.422765] WARNING: possible circular locking dependency detected
[  335.422768] 4.14.0-0.rc1.git2.1.acpi_ec.fc26.x86_64 #1 Tainted: G           OE  
[  335.422769] ------------------------------------------------------
[  335.422772] systemd-sleep/3452 is trying to acquire lock:
[  335.422773]  ((complete)&st->done){+.+.}, at: [<ffffffff9c0b4669>] takedown_cpu+0x89/0xf0
[  335.422782] 
               but task is already holding lock:
[  335.422783]  (sparse_irq_lock){+.+.}, at: [<ffffffff9c131077>] irq_lock_sparse+0x17/0x20
[  335.422789] 
               which lock already depends on the new lock.

[  335.422789] 
               the existing dependency chain (in reverse order) is:
[  335.422790] 
               -> #1 (sparse_irq_lock){+.+.}:
[  335.422797]        __mutex_lock+0x86/0xa00
[  335.422799]        mutex_lock_nested+0x1b/0x20
[  335.422802]        irq_lock_sparse+0x17/0x20
[  335.422805]        irq_affinity_online_cpu+0x18/0xd0
[  335.422807]        cpuhp_invoke_callback+0xa7/0x830
[  335.422808] 
               -> #0 ((complete)&st->done){+.+.}:
[  335.422814]        check_prev_add+0x351/0x700
[  335.422816]        __lock_acquire+0x1107/0x11d0
[  335.422819]        lock_acquire+0xa3/0x1f0
[  335.422822]        wait_for_completion+0x66/0x1d0
[  335.422823]        takedown_cpu+0x89/0xf0
[  335.422825]        cpuhp_invoke_callback+0xa7/0x830
[  335.422826]        cpuhp_down_callbacks+0x42/0x80
[  335.422830]        _cpu_down+0xc2/0x100
[  335.422832]        freeze_secondary_cpus+0xbd/0x390
[  335.422835]        suspend_devices_and_enter+0x2e0/0xc90
[  335.422837]        pm_suspend+0x340/0x3b0
[  335.422839]        state_store+0x85/0xf0
[  335.422843]        kobj_attr_store+0xf/0x20
[  335.422846]        sysfs_kf_write+0x42/0x60
[  335.422849]        kernfs_fop_write+0x134/0x1c0
[  335.422852]        __vfs_write+0x37/0x170
[  335.422854]        vfs_write+0xc6/0x1c0
[  335.422856]        SyS_write+0x58/0xc0
[  335.422859]        entry_SYSCALL_64_fastpath+0x1f/0xbe
[  335.422860] 
               other info that might help us debug this:

[  335.422861]  Possible unsafe locking scenario:

[  335.422862]        CPU0                    CPU1
[  335.422862]        ----                    ----
[  335.422863]   lock(sparse_irq_lock);
[  335.422865]                                lock((complete)&st->done);
[  335.422866]                                lock(sparse_irq_lock);
[  335.422868]   lock((complete)&st->done);
[  335.422869] 
                *** DEADLOCK ***

[  335.422871] 8 locks held by systemd-sleep/3452:
[  335.422871]  #0:  (sb_writers#4){.+.+}, at: [<ffffffff9c2e84f3>] vfs_write+0x193/0x1c0
[  335.422877]  #1:  (&of->mutex){+.+.}, at: [<ffffffff9c38a916>] kernfs_fop_write+0x106/0x1c0
[  335.422882]  #2:  (kn->count#217){.+.+}, at: [<ffffffff9c38a91f>] kernfs_fop_write+0x10f/0x1c0
[  335.422887]  #3:  (pm_mutex){+.+.}, at: [<ffffffff9c12474e>] pm_suspend+0xae/0x3b0
[  335.422892]  #4:  (acpi_scan_lock){+.+.}, at: [<ffffffff9c56f297>] acpi_scan_lock_acquire+0x17/0x20
[  335.422899]  #5:  (cpu_add_remove_lock){+.+.}, at: [<ffffffff9c0b6b6e>] freeze_secondary_cpus+0x2e/0x390
[  335.422903]  #6:  (cpu_hotplug_lock.rw_sem){++++}, at: [<ffffffff9c1149ba>] percpu_down_write+0x2a/0x110
[  335.422908]  #7:  (sparse_irq_lock){+.+.}, at: [<ffffffff9c131077>] irq_lock_sparse+0x17/0x20
[  335.422912] 
               stack backtrace:
[  335.422915] CPU: 3 PID: 3452 Comm: systemd-sleep Tainted: G           OE   4.14.0-0.rc1.git2.1.acpi_ec.fc26.x86_64 #1
[  335.422916] Hardware name: ASUSTeK COMPUTER INC. X550ZE/X550ZE, BIOS 208 10/15/2015
[  335.422918] Call Trace:
[  335.422922]  dump_stack+0x8e/0xd6
[  335.422926]  print_circular_bug+0x1f6/0x2e0
[  335.422929]  ? copy_trace+0x90/0x90
[  335.422932]  check_prev_add+0x351/0x700
[  335.422937]  __lock_acquire+0x1107/0x11d0
[  335.422940]  ? __lock_acquire+0x1107/0x11d0
[  335.422942]  ? copy_trace+0x90/0x90
[  335.422946]  lock_acquire+0xa3/0x1f0
[  335.422948]  ? takedown_cpu+0x89/0xf0
[  335.422952]  wait_for_completion+0x66/0x1d0
[  335.422953]  ? takedown_cpu+0x89/0xf0
[  335.422956]  ? stop_machine_cpuslocked+0x100/0x120
[  335.422958]  ? cpuhp_invoke_callback+0x830/0x830
[  335.422960]  takedown_cpu+0x89/0xf0
[  335.422963]  ? cpuhp_complete_idle_dead+0x20/0x20
[  335.422965]  cpuhp_invoke_callback+0xa7/0x830
[  335.422969]  cpuhp_down_callbacks+0x42/0x80
[  335.422972]  _cpu_down+0xc2/0x100
[  335.422975]  freeze_secondary_cpus+0xbd/0x390
[  335.422979]  suspend_devices_and_enter+0x2e0/0xc90
[  335.422982]  ? rcu_read_lock_sched_held+0x79/0x80
[  335.422985]  pm_suspend+0x340/0x3b0
[  335.422988]  state_store+0x85/0xf0
[  335.422991]  kobj_attr_store+0xf/0x20
[  335.422993]  sysfs_kf_write+0x42/0x60
[  335.422996]  kernfs_fop_write+0x134/0x1c0
[  335.422999]  __vfs_write+0x37/0x170
[  335.423002]  ? rcu_read_lock_sched_held+0x79/0x80
[  335.423004]  ? rcu_sync_lockdep_assert+0x2f/0x60
[  335.423008]  ? __sb_start_write+0x144/0x1a0
[  335.423010]  ? vfs_write+0x193/0x1c0
[  335.423012]  vfs_write+0xc6/0x1c0
[  335.423015]  SyS_write+0x58/0xc0
[  335.423019]  entry_SYSCALL_64_fastpath+0x1f/0xbe
Comment 64 Lv Zheng 2017-09-25 02:56:41 UTC
I cannot see "enter acpi_ec_suspend()" and "enter acpi_ec_resume()‘.
So this is probably the cause.
Please upload your kernel config file here (.config) and let me know how did you trigger the suspend?

Thanks
Lv
Comment 65 Lv Zheng 2017-09-25 06:55:44 UTC
I guess:

Your platform only contain valid ECDT EC, and invalid DSDT EC.
So we are actually working with ECDT EC.
And there is no driver binding against ECDT EC, so the acpi_ec_suspend/resume() actually cannot be invoked.
That explains why first version can fix your issue but second version cannot.

IMO, the rest fix in the first version is still not a part of this issue, it's related to the timing of EC event handling enabling.

Your bug is actually:
ECDT device is not registered as an ACPI device, so there is no driver binding against it.
Let me try to implement a correct fix for this issue and see what can happen to your platform.

Thanks
Lv
Comment 66 Lv Zheng 2017-09-25 10:04:45 UTC
Created attachment 258577 [details]
[PATCH] ACPI / EC: Add scan support for ECDT device to attach PM ops

Please try if this can fix the issue?
Comment 67 Luya Tshimbalanga 2017-09-25 17:13:19 UTC
Created attachment 258589 [details]
kernel-x86_64.config default

(In reply to Lv Zheng from comment #64)
> I cannot see "enter acpi_ec_suspend()" and "enter acpi_ec_resume()‘.
> So this is probably the cause.
> Please upload your kernel config file here (.config) and let me know how did
> you trigger the suspend?

Suspend is simply triggered by either Fn+F1 keys or closing the lid of the laptop. After the lockup occurred, I had to manually suspend from my desktop.
Comment 68 Luya Tshimbalanga 2017-09-25 17:21:47 UTC
(In reply to Lv Zheng from comment #65)
> I guess:
> 
> Your platform only contain valid ECDT EC, and invalid DSDT EC.
> So we are actually working with ECDT EC.
> And there is no driver binding against ECDT EC, so the
> acpi_ec_suspend/resume() actually cannot be invoked.
> That explains why first version can fix your issue but second version cannot.
> 
> IMO, the rest fix in the first version is still not a part of this issue,
> it's related to the timing of EC event handling enabling.
> 
> Your bug is actually:
> ECDT device is not registered as an ACPI device, so there is no driver
> binding against it.
> Let me try to implement a correct fix for this issue and see what can happen
> to your platform.

Thanks for the explanation, I changed the title to reflect the cause of the issue. Basically, a hardware vendor implementation problem hard to detect by normal procedure which makes me wonder if other ASUS X series are affected.
Comment 69 Lv Zheng 2017-09-25 22:44:49 UTC
Really a regression, then.
Whatever the BIOS issue is, actually EC driver has something missing and triggered by an enhancement.
Comment 70 Lv Zheng 2017-09-25 22:48:09 UTC
The indication for the ECDT selection:
[    0.599731] ACPI: \_SB_.PCI0.SBRG.EC0_: Used as boot ECDT EC to handle transactions and events
                                                        ^^^^
Comment 71 Luya Tshimbalanga 2017-09-26 04:39:25 UTC
Created attachment 258601 [details]
boot with latest upstream kernel using ec-ecdt1.patch

Boot sequence with ec-ecdt1.patch
Comment 72 Luya Tshimbalanga 2017-09-26 04:40:43 UTC
Created attachment 258603 [details]
press-again-4.14.0-0-rc2-good with ec-ecdt1.patch

Initial Fn inputs command with ec-ecdt1.patch
Comment 73 Luya Tshimbalanga 2017-09-26 04:42:22 UTC
Created attachment 258605 [details]
press-4.14.0-0-rc2-good with ec-ecdt1.patch

Initial Fn input command with ec-ecdt1.patch
Comment 74 Luya Tshimbalanga 2017-09-26 04:58:32 UTC
Created attachment 258607 [details]
press-again-4.14.0-0-rc2-good with ec-ecdt1.patch

Fn inputs command after resuming the laptop. 
I went further by closing/opening the lid to verify suspend/resume will not cause another lock-up. It looks like the patch fixed the root cause of the issue.
Comment 75 Lv Zheng 2017-09-26 06:03:02 UTC
Thanks for the report and the test. :)

I'll do some local validation for the patch before sending it to the upstream.

Thanks
Lv
Comment 76 Luya Tshimbalanga 2017-09-26 06:25:39 UTC
You are very welcome Lv. I would like to also thank you for taking the time of providing the steps for testing and explaining about the issue. Those details were very helpful.
Comment 77 Lv Zheng 2017-09-27 02:39:02 UTC
It's my fault taking so long time to reach the root cause.
If I checked the line in comment 70 earlier and remembered the details of your original bug, I should have root caused it earlier...

Anyway, the upstream version of the patches is posted here:
https://patchwork.kernel.org/patch/9971553/
https://patchwork.kernel.org/patch/9971557/

I did an improvement to allow "unbind, suspend, resume, bind" working in your case.

# echo -n "LNXECDT:00" > /sys/bus/acpi/drivers/ec/unbind
# echo -n "mem" > /sys/power/state
# echo -n "LNXECDT:00" > /sys/bus/acpi/drivers/ec/bind

Using attachment 258577 [details], your issue should still be able to reproduce after executing this sequence.
Using the posted fix, event handling can be restored as long as the EC driver is bound.

Thanks
Lv
Comment 78 Luya Tshimbalanga 2017-09-27 06:31:56 UTC
Thanks again for the hard work. At least the issue is resolved for now while undiscovered problems lurk around.
Let me know when the patches got applied on the mainline kernel.

Luya