Bug 196833
Summary: | Change in 4.11 drivers/acpi/ec.c re-creates bug 44161 | ||
---|---|---|---|
Product: | ACPI | Reporter: | Alistair Hamilton (ahpatent) |
Component: | EC | Assignee: | Lv Zheng (lv.zheng) |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | lv.zheng, rui.zhang |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 4.11 onwards | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
attachment-32255-0.html
attachment-2201-0.html |
Description
Alistair Hamilton
2017-09-05 15:23:37 UTC
I don't know if it is actually solved in version 4.0 as not all of the commits have been upstreamed. Now could you give the following 3 patches a try: https://patchwork.kernel.org/patch/9931477/ https://patchwork.kernel.org/patch/9931481/ https://patchwork.kernel.org/patch/9931473/ Thanks Lv And also tell me if applying only this patch could fully solve the issue: https://patchwork.kernel.org/patch/9931477/ > Bug 44161 addressed a problem relating to some Samsung laptops. This was > resolved with a patch to drivers/acpi/ec.c in version 4.0 or thereabouts. If you mean the CLEAR_ON_RESUME quirk? > However, this code added by this patch was removed from ec.c in 4.11, and the > problem has returned, exactly as described in 44161. Yes, it is removed because we are using 2 mechansims to solve the same issue: 1. Enable IRQ after noirq stage, if the GPE silicon is good, the event will be triggered again. 2. Polling EC GPE once after enabling the event. However, I did this in a safer way, only poll GPEs when an event is already noticed. It should work for most of the platforms requiring the quirk. https://patchwork.kernel.org/patch/9931477/ Just stop doing things in a safer way, but doing things aggressively. Thanks Lv IMO, the quirk introduced in bug 44161 just reduces the reproduce ratio of the problem. The final solution in that bug is actually not upstreamed: https://bugzilla.kernel.org/attachment.cgi?id=155851 So comes out the approaches mentioned in comment 3. The approaches are closer to original quirk and safer for other platforms. The following 2 patches are the new form of the final solution: https://patchwork.kernel.org/patch/9931481/ https://patchwork.kernel.org/patch/9931473/ Thanks Lv Ping... What's the blocking issue for the test? Thanks Lv Created attachment 258389 [details] attachment-32255-0.html Do you want *me* to do this? I only reported the bug as an and user. I actually had a few lines of my code included in the kernel going back to about 2002, but I have not compiled a kernel for about a decade. I tried to apply the patches into the current Fedora kernel source (because my Samsung machine has Fedora installed) but they failed.n If no-one else can do it, I will try to figure out how to build and install a patched kernel, but it will take me a while. Problem is, the only affected machine is in constant use, and has a work-around using the program mentioned before and a hacked systemd script. On 11 September 2017 at 09:34, <bugzilla-daemon@bugzilla.kernel.org> wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=196833 > > --- Comment #5 from Lv Zheng (lv.zheng@intel.com) --- > Ping... > What's the blocking issue for the test? > > Thanks > Lv > > -- > You are receiving this mail because: > You reported the bug. You should know, if you want the bug to be fixed, you need to provide validation support here. It's very simple to build and test a kernel. $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git $ apply the patches $ cd linux $ cp /boot/config-xxx (your current config) ./.config Then you can type "make help" to see the build targets for you. Simply you can: $ git make oldconfig $ git make -j64 $ sudo make modules_install install Reboot the kernel, you'll see the new kernel in your grub2 menu. NOTE you are reporting a regression for a removed quirk as we have already root caused the regression and is able to provide a final solution as a replacement of the quirk (though may still not perfect). So it is not good to restore the quirk back and hide the root cause again. Please help to complete the validations. Thanks Lv OK. Sorry for being stupid about this. I have the Samsung machine in my possession now for "maintenance" and will be trying to test the fixes in the next couple of days. Alistair Great. Thanks for the help. :) Let me comment more A. To correct my mistake: Sorry, I have a mistake: > $ git make oldconfig > $ git make -j64 > $ sudo make modules_install install should be: $ make oldconfig $ make -j64 $ sudo make modules_install install B. To know the replicdation ration of the issue: IMO, you should have very low reproduce ratio for the regression. As the following conditions should be met: 1. An EC events (causing SCI_EVT flagged) occurred between acpi_ec_unblock_transaction() and acpi_ec_resume(); 2. No EC transactions occurred between acpi_ec_unblock_transaction() and acpi_ec_resume(); 3. Samsung platforms. Only in this case, our original root cause fix is not sufficient to handle what was dealt by the removed quirk. And it's a very short period between acpi_ec_unblock_transaction() and acpi_ec_resume(), probably only several msecs. So let me ask: Could you let me know the reproduce ratio of the issue? B. To capture some bug related information: If you are not able to rebuild the kernel right now, then you may still able to provide useful information for us. If you have CONFIG_DYNAMIC_DEBUG enabled for your current running kernel, you still can help to offer some information. So Could help me to capture what's going on when the bug occurs using the following steps. 1. Boot the kernel with "dyndbg='file drivers/acpi/ec.c +p'" (I hope you needn't rebuild the kernel with CONFIG_DYNAMIC_DEBUG=y); 2. Perform a suspend/resume and confirm the bug is triggered; 3. Upload the dmesg output here. If you cannot do 2 due to lacking of CONFIG_DYNAMIC_DEBUG=y, you still need to learn kernel build, and stay in the kernel build directory: $ make menuconfig Find DYNAMIC_DEBUG, select it (you can type "\" to enter a search CUI to find its location), and exit the menuconfig CUI. $ make -j64 $ sudo make modules_install install D. To know if the better fix is working for you: attachment 258293 [details] should be a complete replacement of the removed quirk. If you can apply this fix on top of your kernel, rebuild the kernel and try it. You can help me to confirm if the bug is fully fixed by the attachment 258293 [details]. Thanks and best regards Lv Created attachment 258575 [details] attachment-2201-0.html Hi, Lv I was just about to respond to you. I wanted to give the machine a good test before posting again. I compiled and installed the kernel with the three patches linked in comment 1 (9931477, 9931481, 9931473) <https://patchwork.kernel.org/patch/9931477/> a few days ago. I figured out the correct make and install sequence :-) (This takes me back years - very nostalgic - except that I could compile now using 8 cores rather than one or two, when I last did it!) Is 258293 the same as these three? [alistair@juno ~]$ uname -a Linux juno.ty-eurgain 4.14.0-rc1+ #1 SMP Wed Sep 20 00:00:48 BST 2017 x86_64 x86_64 x86_64 GNU/Linux I have suspended and resumed the machine many times (>30) and booted into Windows 10 and back again. I made sure that lots of PM events would occur when the machine was suspended (lid open/close, power on/off, battery discharge) which seems to be the trigger for the problem. There has been no instance of any problem with suspend or resume on either OS. I can also confirm that I have removed the user-space workaround that I had implemented. I also rebooted into the old Fedora kernel (with no userspace workaround) to confirm that the problem re-appeared. So it seems that this good behaviour is all down to the new kernel. I have not noticed any other problems arising from these changes. Note that this machine (Samsung Series 9 NP900X3B) has never automatically resumed on lid-open other than with the original Win7 factory install - the power button always has to be pressed. (Not just Linux - it failed to resume with a Win 7 re-install and with Win 10.) So, from my point of view, the bug is solved. Is there any other sort of testing that I can do or are there any logs that you would like me to post? Thanks for your help and thanks for getting me back into looking into the kernel again after so many years away! Alistair On 20 September 2017 at 02:02, <bugzilla-daemon@bugzilla.kernel.org> wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=196833 > > Lv Zheng (lv.zheng@intel.com) changed: > > What |Removed |Added > ------------------------------------------------------------ > ---------------- > Summary|[possible regression] |Change in 4.11 > |Change in 4.11 |drivers/acpi/ec.c > |drivers/acpi/ec.c |re-creates bug 44161 > |re-creates bug 44161 | > > --- Comment #11 from Lv Zheng (lv.zheng@intel.com) --- > Let me comment more > > A. To correct my mistake: > Sorry, I have a mistake: > > $ git make oldconfig > > $ git make -j64 > > $ sudo make modules_install install > should be: > $ make oldconfig > $ make -j64 > $ sudo make modules_install install > > B. To know the replicdation ration of the issue: > IMO, you should have very low reproduce ratio for the regression. > As the following conditions should be met: > 1. An EC events (causing SCI_EVT flagged) occurred between > acpi_ec_unblock_transaction() and acpi_ec_resume(); > 2. No EC transactions occurred between > acpi_ec_unblock_transaction() and acpi_ec_resume(); > 3. Samsung platforms. > Only in this case, our original root cause fix is not sufficient > to handle what was dealt by the removed quirk. > And it's a very short period between acpi_ec_unblock_transaction() > and acpi_ec_resume(), probably only several msecs. > So let me ask: > Could you let me know the reproduce ratio of the issue? > > B. To capture some bug related information: > If you are not able to rebuild the kernel right now, then you may > still able to provide useful information for us. If you have > CONFIG_DYNAMIC_DEBUG enabled for your current running kernel, you > still can help to offer some information. > So Could help me to capture what's going on when the bug occurs > using the following steps. > 1. Boot the kernel with "dyndbg='file drivers/acpi/ec.c +p'" > (I hope you needn't rebuild the kernel with CONFIG_DYNAMIC_DEBUG=y); > 2. Perform a suspend/resume and confirm the bug is triggered; > 3. Upload the dmesg output here. > If you cannot do 2 due to lacking of CONFIG_DYNAMIC_DEBUG=y, > you still need to learn kernel build, and stay in the kernel build > directory: > $ make menuconfig > Find DYNAMIC_DEBUG, select it (you can type "\" to enter a search CUI to > find > its location), and exit the menuconfig CUI. > $ make -j64 > $ sudo make modules_install install > > D. To know if the better fix is working for you: > attachment 258293 [details] should be a complete replacement of the removed > quirk. If you can apply this fix on top of your kernel, rebuild > the kernel and try it. You can help me to confirm if the bug is > fully fixed by the attachment 258293 [details]. > > Thanks and best regards > Lv > > -- > You are receiving this mail because: > You reported the bug. Thanks for the confirmation. Your test is reasonable and match my theory. Thanks for the help. I'll mark this as resolved and close it after having the fixes upstreamed. Thanks Lv Hi, You've tested against this commit: https://patchwork.kernel.org/patch/9931477/ Actually I'll split it into 2 fixes. 1. unconditionally re-checks SCI_EVT as attachment 258293 [details], and 2. enables EC event handling earlier. As far as I know, there are EC firmware taking very long time to complete its initialization, and hence the 2nd fix is risky as it can trigger timing issues inside of EC FW (race between EC FW initialization and the 1st ACPI EC transaction). I'd rather to make it a seperate patch so that the 1st one won't be affected by the possible regressions. (In reply to Lv Zheng from comment #14) > Hi, > > You've tested against this commit: > https://patchwork.kernel.org/patch/9931477/ This patch actually contains two fixes, thus it won't be targeted for upstream kernel as we want to fix only one problem in one patch. > Actually I'll split it into 2 fixes. > 1. unconditionally re-checks SCI_EVT as attachment 258293 [details], please test this patch instead. Yes, https://patchwork.kernel.org/patch/9931477/ actually contains a change useful for bug 196847 which should have a different root cause but similar sympton. Thus it's better to confirm if attachment 258293 [details] is the root cause fix for this issue. |