Bug 196833

Summary: Change in 4.11 drivers/acpi/ec.c re-creates bug 44161
Product: ACPI Reporter: Alistair Hamilton (ahpatent)
Component: ECAssignee: Lv Zheng (lv.zheng)
Status: RESOLVED CODE_FIX    
Severity: normal CC: lv.zheng, rui.zhang
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.11 onwards Tree: Mainline
Regression: No
Attachments: attachment-32255-0.html
attachment-2201-0.html

Description Alistair Hamilton 2017-09-05 15:23:37 UTC
Bug 44161 addressed a problem relating to some Samsung laptops. This was resolved with a patch to drivers/acpi/ec.c in version 4.0 or thereabouts.

However, this code added by this patch was removed from ec.c in 4.11, and the problem has returned, exactly as described in 44161.

There is nothing more than I can add that was not covered in detain in 44161.
Comment 1 Lv Zheng 2017-09-06 02:39:26 UTC
I don't know if it is actually solved in version 4.0 as not all of the commits have been upstreamed.
Now could you give the following 3 patches a try:

https://patchwork.kernel.org/patch/9931477/
https://patchwork.kernel.org/patch/9931481/
https://patchwork.kernel.org/patch/9931473/

Thanks
Lv
Comment 2 Lv Zheng 2017-09-06 02:39:53 UTC
And also tell me if applying only this patch could fully solve the issue:
https://patchwork.kernel.org/patch/9931477/
Comment 3 Lv Zheng 2017-09-06 02:50:11 UTC
> Bug 44161 addressed a problem relating to some Samsung laptops. This was
> resolved with a patch to drivers/acpi/ec.c in version 4.0 or thereabouts.

If you mean the CLEAR_ON_RESUME quirk?

> However, this code added by this patch was removed from ec.c in 4.11, and the
> problem has returned, exactly as described in 44161.

Yes, it is removed because we are using 2 mechansims to solve the same issue:
1. Enable IRQ after noirq stage, if the GPE silicon is good, the event will be triggered again.
2. Polling EC GPE once after enabling the event. However, I did this in a safer way, only poll GPEs when an event is already noticed.

It should work for most of the platforms requiring the quirk.

https://patchwork.kernel.org/patch/9931477/
Just stop doing things in a safer way, but doing things aggressively.

Thanks
Lv
Comment 4 Lv Zheng 2017-09-06 02:54:41 UTC
IMO, the quirk introduced in bug 44161 just reduces the reproduce ratio of the problem.
The final solution in that bug is actually not upstreamed:
https://bugzilla.kernel.org/attachment.cgi?id=155851

So comes out the approaches mentioned in comment 3.
The approaches are closer to original quirk and safer for other platforms.

The following 2 patches are the new form of the final solution:
https://patchwork.kernel.org/patch/9931481/
https://patchwork.kernel.org/patch/9931473/

Thanks
Lv
Comment 5 Lv Zheng 2017-09-11 08:34:51 UTC
Ping...
What's the blocking issue for the test?

Thanks
Lv
Comment 6 Alistair Hamilton 2017-09-14 21:39:59 UTC
Created attachment 258389 [details]
attachment-32255-0.html

Do you want *me* to do this? I only reported the bug as an and user.

I actually had a few lines of my code included in the kernel going back to
about 2002, but I have not compiled a kernel for about a decade. I tried to
apply the patches into the current Fedora kernel source (because my Samsung
machine has Fedora installed) but they failed.n

If no-one else can do it, I will try to figure out how to build and install
a patched kernel, but it will take me a while. Problem is, the only
affected machine is in constant use, and has a work-around using the
program mentioned before and a hacked systemd script.

On 11 September 2017 at 09:34, <bugzilla-daemon@bugzilla.kernel.org> wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=196833
>
> --- Comment #5 from Lv Zheng (lv.zheng@intel.com) ---
> Ping...
> What's the blocking issue for the test?
>
> Thanks
> Lv
>
> --
> You are receiving this mail because:
> You reported the bug.
Comment 7 Lv Zheng 2017-09-15 04:36:14 UTC
You should know, if you want the bug to be fixed, you need to provide validation support here.

It's very simple to build and test a kernel.

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
$ apply the patches
$ cd linux
$ cp /boot/config-xxx (your current config) ./.config

Then you can type "make help" to see the build targets for you.

Simply you can:

$ git make oldconfig
$ git make -j64
$ sudo make modules_install install

Reboot the kernel, you'll see the new kernel in your grub2 menu.
Comment 8 Lv Zheng 2017-09-15 04:38:34 UTC
NOTE you are reporting a regression for a removed quirk as we have already root caused the regression and is able to provide a final solution as a replacement of the quirk (though may still not perfect).
So it is not good to restore the quirk back and hide the root cause again.

Please help to complete the validations.

Thanks
Lv
Comment 9 Alistair Hamilton 2017-09-19 15:43:00 UTC
OK. Sorry for being stupid about this.

I have the Samsung machine in my possession now for "maintenance" and will be trying to test the fixes in the next couple of days.

Alistair
Comment 10 Lv Zheng 2017-09-20 00:16:00 UTC
Great. Thanks for the help. :)
Comment 11 Lv Zheng 2017-09-20 01:02:34 UTC
Let me comment more

A. To correct my mistake:
Sorry, I have a mistake:
> $ git make oldconfig
> $ git make -j64
> $ sudo make modules_install install
should be:
$ make oldconfig
$ make -j64
$ sudo make modules_install install

B. To know the replicdation ration of the issue:
IMO, you should have very low reproduce ratio for the regression.
As the following conditions should be met:
1. An EC events (causing SCI_EVT flagged) occurred between 
   acpi_ec_unblock_transaction() and acpi_ec_resume();
2. No EC transactions occurred between
   acpi_ec_unblock_transaction() and acpi_ec_resume();
3. Samsung platforms.
Only in this case, our original root cause fix is not sufficient
to handle what was dealt by the removed quirk.
And it's a very short period between acpi_ec_unblock_transaction()
and acpi_ec_resume(), probably only several msecs.
So let me ask:
Could you let me know the reproduce ratio of the issue?

B. To capture some bug related information:
If you are not able to rebuild the kernel right now, then you may
still able to provide useful information for us. If you have
CONFIG_DYNAMIC_DEBUG enabled for your current running kernel, you
still can help to offer some information.
So Could help me to capture what's going on when the bug occurs
using the following steps.
1. Boot the kernel with "dyndbg='file drivers/acpi/ec.c +p'"
   (I hope you needn't rebuild the kernel with CONFIG_DYNAMIC_DEBUG=y);
2. Perform a suspend/resume and confirm the bug is triggered;
3. Upload the dmesg output here.
If you cannot do 2 due to lacking of CONFIG_DYNAMIC_DEBUG=y,
you still need to learn kernel build, and stay in the kernel build
directory:
$ make menuconfig
Find DYNAMIC_DEBUG, select it (you can type "\" to enter a search CUI to find its location), and exit the menuconfig CUI.
$ make -j64
$ sudo make modules_install install

D. To know if the better fix is working for you:
attachment 258293 [details] should be a complete replacement of the removed
quirk. If you can apply this fix on top of your kernel, rebuild
the kernel and try it. You can help me to confirm if the bug is
fully fixed by the attachment 258293 [details].

Thanks and best regards
Lv
Comment 12 Alistair Hamilton 2017-09-22 20:46:11 UTC
Created attachment 258575 [details]
attachment-2201-0.html

Hi, Lv

I was just about to respond to you. I wanted to give the machine a good
test before posting again.

I compiled and installed the kernel with the three patches linked in
comment 1 (9931477, 9931481, 9931473)
<https://patchwork.kernel.org/patch/9931477/> a few days ago. I figured out
the correct make and install sequence :-) (This takes me back years - very
nostalgic - except that I could compile now using 8 cores rather than one
or two, when I last did it!) Is 258293 the same as these three?

[alistair@juno ~]$ uname -a
Linux juno.ty-eurgain 4.14.0-rc1+ #1 SMP Wed Sep 20 00:00:48 BST 2017
x86_64 x86_64 x86_64 GNU/Linux

I have suspended and resumed the machine many times (>30) and booted into
Windows 10 and back again. I made sure that lots of PM events would occur
when the machine was suspended (lid open/close, power on/off, battery
discharge) which seems to be the trigger for the problem.

There has been no instance of any problem with suspend or resume on either
OS. I can also confirm that I have removed the user-space workaround that I
had implemented. I also rebooted into the old Fedora kernel (with no
userspace workaround) to confirm that the problem re-appeared. So it seems
that this good behaviour is all down to the new kernel.

I have not noticed any other problems arising from these changes.

Note that this machine (Samsung Series 9 NP900X3B) has never automatically
resumed on lid-open other than with the original Win7 factory install - the
power button always has to be pressed. (Not just Linux - it failed to
resume with a Win 7 re-install and with Win 10.)

So, from my point of view, the bug is solved. Is there any other sort of
testing that I can do or are there any logs that you would like me to post?

Thanks for your help and thanks for getting me back into looking into the
kernel again after so many years away!

Alistair

On 20 September 2017 at 02:02, <bugzilla-daemon@bugzilla.kernel.org> wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=196833
>
> Lv Zheng (lv.zheng@intel.com) changed:
>
>            What    |Removed                     |Added
> ------------------------------------------------------------
> ----------------
>             Summary|[possible regression]       |Change in 4.11
>                    |Change in 4.11              |drivers/acpi/ec.c
>                    |drivers/acpi/ec.c           |re-creates bug 44161
>                    |re-creates bug 44161        |
>
> --- Comment #11 from Lv Zheng (lv.zheng@intel.com) ---
> Let me comment more
>
> A. To correct my mistake:
> Sorry, I have a mistake:
> > $ git make oldconfig
> > $ git make -j64
> > $ sudo make modules_install install
> should be:
> $ make oldconfig
> $ make -j64
> $ sudo make modules_install install
>
> B. To know the replicdation ration of the issue:
> IMO, you should have very low reproduce ratio for the regression.
> As the following conditions should be met:
> 1. An EC events (causing SCI_EVT flagged) occurred between
>    acpi_ec_unblock_transaction() and acpi_ec_resume();
> 2. No EC transactions occurred between
>    acpi_ec_unblock_transaction() and acpi_ec_resume();
> 3. Samsung platforms.
> Only in this case, our original root cause fix is not sufficient
> to handle what was dealt by the removed quirk.
> And it's a very short period between acpi_ec_unblock_transaction()
> and acpi_ec_resume(), probably only several msecs.
> So let me ask:
> Could you let me know the reproduce ratio of the issue?
>
> B. To capture some bug related information:
> If you are not able to rebuild the kernel right now, then you may
> still able to provide useful information for us. If you have
> CONFIG_DYNAMIC_DEBUG enabled for your current running kernel, you
> still can help to offer some information.
> So Could help me to capture what's going on when the bug occurs
> using the following steps.
> 1. Boot the kernel with "dyndbg='file drivers/acpi/ec.c +p'"
>    (I hope you needn't rebuild the kernel with CONFIG_DYNAMIC_DEBUG=y);
> 2. Perform a suspend/resume and confirm the bug is triggered;
> 3. Upload the dmesg output here.
> If you cannot do 2 due to lacking of CONFIG_DYNAMIC_DEBUG=y,
> you still need to learn kernel build, and stay in the kernel build
> directory:
> $ make menuconfig
> Find DYNAMIC_DEBUG, select it (you can type "\" to enter a search CUI to
> find
> its location), and exit the menuconfig CUI.
> $ make -j64
> $ sudo make modules_install install
>
> D. To know if the better fix is working for you:
> attachment 258293 [details] should be a complete replacement of the removed
> quirk. If you can apply this fix on top of your kernel, rebuild
> the kernel and try it. You can help me to confirm if the bug is
> fully fixed by the attachment 258293 [details].
>
> Thanks and best regards
> Lv
>
> --
> You are receiving this mail because:
> You reported the bug.
Comment 13 Lv Zheng 2017-09-25 03:18:14 UTC
Thanks for the confirmation.
Your test is reasonable and match my theory.
Thanks for the help.
I'll mark this as resolved and close it after having the fixes upstreamed.

Thanks
Lv
Comment 14 Lv Zheng 2017-09-25 05:49:05 UTC
Hi,

You've tested against this commit:
https://patchwork.kernel.org/patch/9931477/
Actually I'll split it into 2 fixes.
1. unconditionally re-checks SCI_EVT as attachment 258293 [details], and
2. enables EC event handling earlier.

As far as I know, there are EC firmware taking very long time to complete its initialization, and hence the 2nd fix is risky as it can trigger timing issues inside of EC FW (race between EC FW initialization and the 1st ACPI EC transaction). I'd rather to make it a seperate patch so that the 1st one won't be affected by the possible regressions.
Comment 15 Zhang Rui 2017-09-25 06:32:22 UTC
(In reply to Lv Zheng from comment #14)
> Hi,
> 
> You've tested against this commit:
> https://patchwork.kernel.org/patch/9931477/

This patch actually contains two fixes, thus it won't be targeted for upstream kernel as we want to fix only one problem in one patch.

> Actually I'll split it into 2 fixes.
> 1. unconditionally re-checks SCI_EVT as attachment 258293 [details],

please test this patch instead.
Comment 16 Lv Zheng 2017-09-25 08:33:01 UTC
Yes, https://patchwork.kernel.org/patch/9931477/ actually contains a change useful for bug 196847 which should have a different root cause but similar sympton.
Thus it's better to confirm if attachment 258293 [details] is the root cause fix for this issue.