Bug 82611 - Almost all ACPI notifications delayed by more than 1sec
Summary: Almost all ACPI notifications delayed by more than 1sec
Status: CLOSED CODE_FIX
Alias: None
Product: ACPI
Classification: Unclassified
Component: EC (show other bugs)
Hardware: x86-64 Linux
: P1 normal
Assignee: Lv Zheng
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-08-17 15:12 UTC by Aleksandr Mezin
Modified: 2014-10-28 00:23 UTC (History)
2 users (show)

See Also:
Kernel Version: 3.16
Tree: Mainline
Regression: Yes


Attachments
dmesg (17.18 KB, application/x-gzip)
2014-08-19 22:01 UTC, Aleksandr Mezin
Details
Dmesg, pressed "brightness up" and then "brightness down" (24.93 KB, application/x-gzip)
2014-08-19 22:01 UTC, Aleksandr Mezin
Details
dmesg, older ec driver (17.22 KB, application/x-gzip)
2014-08-19 22:02 UTC, Aleksandr Mezin
Details
dmesg, old driver, pressed "bightness up", then "brightness down" (17.76 KB, application/x-gzip)
2014-08-19 22:03 UTC, Aleksandr Mezin
Details
acpidump (114.13 KB, application/x-gzip)
2014-08-19 22:03 UTC, Aleksandr Mezin
Details
dmidecode (10.01 KB, application/octet-stream)
2014-08-19 22:04 UTC, Aleksandr Mezin
Details
[PATCH] ACPI/EC: Add support to disallow QR_EC issued when SCI_EVT isn't set. (2.24 KB, patch)
2014-08-20 03:07 UTC, Lv Zheng
Details | Diff
[PATCH] ACPI/EC: Add support to disallow QR_EC issued before completing a previous QR_EC. (1.72 KB, patch)
2014-08-20 03:11 UTC, Lv Zheng
Details | Diff
[PATCH] ACPI/EC: Add support to disallow QR_EC issued when SCI_EVT isn't set. (2.27 KB, patch)
2014-08-20 03:22 UTC, Lv Zheng
Details | Diff
dmidecode (10.01 KB, application/octet-stream)
2014-10-27 14:22 UTC, Aleksandr Mezin
Details

Description Aleksandr Mezin 2014-08-17 15:12:42 UTC
Almost all notifications (ac state change, battery state change, brightness) happen after long delay (1 sec or more).

First commit is:
c0d653412fc8450370167a3268b78fc772ff9c87
ACPI / EC: Fix race condition in ec_transaction_completed()
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=c0d653412fc8450370167a3268b78fc772ff9c87
Comment 1 Lv Zheng 2014-08-19 00:08:50 UTC
Hi,

Please provide some information for this report.

1. Please uncomment "#define DEBUG" in drivers/acpi/ec.c.
2. Please boot the kernel and upload the dmesg output as "boot.log" here.
3. Perform a state change, for example, unplug AC, and upload the dmesg output as "change.log" here.
4. It could be better you can use a non affected kernel, and do the same thing, upload the both dmesg outputs here.
5. upload the dmidecode and acpidump output here.

Thanks
-Lv
Comment 2 Aleksandr Mezin 2014-08-19 22:01:13 UTC
Created attachment 147361 [details]
dmesg
Comment 3 Aleksandr Mezin 2014-08-19 22:01:54 UTC
Created attachment 147371 [details]
Dmesg, pressed "brightness up" and then "brightness down"
Comment 4 Aleksandr Mezin 2014-08-19 22:02:29 UTC
Created attachment 147381 [details]
dmesg, older ec driver
Comment 5 Aleksandr Mezin 2014-08-19 22:03:18 UTC
Created attachment 147391 [details]
dmesg, old driver, pressed "bightness up", then "brightness down"
Comment 6 Aleksandr Mezin 2014-08-19 22:03:41 UTC
Created attachment 147401 [details]
acpidump
Comment 7 Aleksandr Mezin 2014-08-19 22:04:08 UTC
Created attachment 147411 [details]
dmidecode
Comment 8 Lv Zheng 2014-08-20 02:01:56 UTC
I checked the log. The firmware refuses to respond QR_EC when SCI_EVT isn't set. Thus in the work queue item that has issued such QR_EC, timeout has to be waited, which makes the queued _Qxx method delayed.

Learned from this issue, I can find the following problems in the firmware and the driver:
1. The firmware doesn't respond to QR_EC when SCI_EVT isn't set.
2. The QR_EC shouldn't be issued in the same work queue as the work item to evaluate _Qxx method, and it's better to issue each _Qxx from one thread
3. The QR_EC shouldn't be issued without SCI_EVT set (though this is not spec compliant)

For issue 1, this is a firmware bug. In ACPI specificiation 5.1, 12.3.5 Query Embedded Controller, QR_EC (0x84): The query value of zero is reserved for a spurious query result and indicates “no outstanding event.” So the Linux EC driver expectes that the firmware is always able to respond QR_EC, when there is no event in the queue, the firmware should respond 0x00.

For issue 2, we've already a plan to change the QR_EC work item, issuing QR_EC from an IRQ handling thread so that the Linux EC driver can be more robust. But so far we haven't any plan to issue each _Qxx method evaluation from seperate thread. This is better to be fixed in ACPICA, tuning executer to be able to work in asynchronous mode so that we don't need to create a thread for each _Lxx/_Exx/_Qxx method evaluation.

For issue 3, this is an existing issue. Linux EC driver originally will issue QR_EC from 2 contexts, and issuing QR_EC without checking SCI_EVT bit, so there are possiblities the 2nd QR_EC will be issued without SCI_EVT bit set. The bisected commit just makes the occurrance rate from "possible to happen" to "sure to happen". And in the driver the CLEAR_ON_RESUME quirk is highly relying on this behavior.

Since:
1. This is basically a firmware bug
2. Linux EC driver hasn't handled this case in the original code
3. The bisected commit is a good fix for good firmware and has been tested by many reporters
Thus I cannot agree that this is a regression.

Though I ate the rule that we cannot break any old working system, but "thinking every user breakage as regression without checking whether the commit is an improvement" is not my cake as this case is an obvious proof: malicious firmware/hardware hackers can always create malfunctioning cases to highlight a good thing as a "REGRESSION".

Also we need to have improvements merged though they might break old working systems because:
1. Because of communication cost, it's very hard to sell code to the maintainers. Maintainers cannot check in bunch of improvements at once, so some solutions have to be split and pushed to the upstream piece by piece.
2. Some old bugs are fixed in a wrong way, and when we see such suspicious code that is conflict with other bugs, we have to fix other root caused bugs first. If it triggers old diseases, then it's just a good chance for us to be able to find the real causes for that suspicious code. In theroy, such bad code is "software entropy" which can finally make other bugs very hard to be root caused, thus they need to be cleaned up first.

For this case, we just need to enhance our code to deal with this malfunctioning firmware.

Let me post something for you to test.
Comment 9 Lv Zheng 2014-08-20 03:07:38 UTC
Created attachment 147451 [details]
[PATCH] ACPI/EC: Add support to disallow QR_EC issued when SCI_EVT isn't set.

This is the fix for this issue.

Please apply this patch and perform a test.
You'd better also perform a suspend/resume test, during the suspending/resuming period, you can do a robustness test by plugging/unplugging AC frequently or switching backlight brighness frequently.

To suspend:
# sudo echo mem > /sys/power/state
After suspending, push specific button to wake the system up.
Comment 10 Lv Zheng 2014-08-20 03:11:08 UTC
Created attachment 147461 [details]
[PATCH] ACPI/EC: Add support to disallow QR_EC issued before completing a previous QR_EC.

This is also a useful cleanup for this issue, so it's better you can also perform a test for this patch to see if your platform can work with only this patch applied. We can learn more firmware behaviours according to your test result.

Please apply this patch and perform a test without applying attachment 147451 [details].
You'd better also perform a suspend/resume test, during the suspending/resuming period, you can do a robustness test by plugging/unplugging AC frequently or switching backlight brighness frequently.
Comment 11 Lv Zheng 2014-08-20 03:22:21 UTC
Created attachment 147481 [details]
[PATCH] ACPI/EC: Add support to disallow QR_EC issued when SCI_EVT isn't set.

This is an update of 14751, which is better.
Please use it instead of 14751.

Sorry for the noise.
Comment 12 Lv Zheng 2014-08-20 04:39:32 UTC
(In reply to Lv Zheng from comment #11)
> Created attachment 147481 [details]
> [PATCH] ACPI/EC: Add support to disallow QR_EC issued when SCI_EVT isn't set.
> 
> This is an update of 14751, which is better.
> Please use it instead of 14751.

The attachment ID was wrong.
This is an update of attachment 147451 [details].
Sorry for the mistake.
Comment 13 Lv Zheng 2014-08-20 04:49:15 UTC
To clarify:

1. Please apply attachment 147481 [details] without attachment 147461 [details] applied to build kernel 1, and perform a test using kernel 1 to see if the delay can be solved.
2. Please apply attachment 147461 [details] without attachment 147481 [details] applied to build kernel 2, and perform a test using kernel 2 to see if the delay can be solved.

It's better you can perform a suspend/resume test for both kernels. During the suspending/resuming period, you can do a robustness test by plugging/unplugging AC frequently or switching backlight brightness frequently.

To suspend:
# sudo echo mem > /sys/power/state
After suspending, push specific button to wake the system up.
Comment 14 Aleksandr Mezin 2014-08-20 18:03:28 UTC
147481: no delays, suspend/resume works
Comment 15 Aleksandr Mezin 2014-08-20 18:29:53 UTC
However, when I tried to compile new kernel (running kernel with 147481 applied), "make" segfaulted. And then "mount" segfaulted. And I wasn't able to enter "acpica" subdirectory, because bash crashed.

147461: no delays, suspend/resume works too. But without crashes and segfaults so far
Comment 16 Lv Zheng 2014-08-21 01:57:30 UTC
(In reply to Alexander Mezin from comment #15)
> However, when I tried to compile new kernel (running kernel with 147481
> applied), "make" segfaulted. And then "mount" segfaulted. And I wasn't able
> to enter "acpica" subdirectory, because bash crashed.
> 
> 147461: no delays, suspend/resume works too. But without crashes and
> segfaults so far

Each patch fixes one issue existing before the bisected commit.

For this case, 147481 was the right fix, it fixes things in the state machine code which exactly corresponds to this root cause.

147461 was an experiment to see if the patchset I've already posted in the ACPI mailing list can fix this automatically. The patchset hasn't been merged just because of reviews pending. Using this fix, the driver still have chances to issue QR_EC when SCI_EVT isn't set. If a platform doesn't clear SCI_EVT right after the completion of QR_EC, then this issue still can be seen on such platform. And this commit can't fix the same problem for CLEAR_ON_RESUME quirk.

As a conclusion, 147481 is the required fix and 147461 can just tune the possiblities lower and make things better. Thus we need to make 147481 working.

We reviewed the code, it couldn't be a trigger for your segfaults.

So let me ask you to do more tests:
1. Could you upload a dmesg for the segfaulted system so that I can learn what has happened in the kernel?
2. If you recompile the same kernel without any of the patches applied, will you still see the segfaults?
2. Also please retry the test by disabling EC debugging messages - commenting "#define DEBUG" in drivers/acpi/ec.c. Does the new system still suffers from segfault?

Thanks and best regards
-Lv
Comment 17 Aleksandr Mezin 2014-08-21 03:35:56 UTC
Yes, segfaults are unrelated. I got them with non-patched kernel too.
They happen after hibernation. From traces it looks like the problem is somewhere in block layer.
I should report another bug, I think.
Both of these patches work fine.
Comment 18 Lv Zheng 2014-08-21 04:52:29 UTC
(In reply to Alexander Mezin from comment #17)
> Yes, segfaults are unrelated. I got them with non-patched kernel too.
> They happen after hibernation. From traces it looks like the problem is
> somewhere in block layer.
> I should report another bug, I think.
> Both of these patches work fine.

Sounds good.
I'll send the 2 patches upstream.
Thanks for your report and test.

Best regards
-Lv
Comment 19 Lv Zheng 2014-09-09 13:58:12 UTC
Upstreamed.
Commit IDs are as follows:
3afcf2ece453e1a8c2c6de19cdf06da3772a1b08
558e4736f2e1b0e6323adf7a5e4df77ed6cfc1a4
Comment 20 Aleksandr Mezin 2014-09-23 18:06:43 UTC
I've switched to 3.14.x LTS kernel recently. And I've found that the problem is still present there. Patches that exposed the problem were backported to 3.14, but fixes weren't. Can fixes be backported too?
Comment 21 Lv Zheng 2014-09-24 05:34:40 UTC
They are already marked as stable material:
 Cc: 3.16+ <stable@vger.kernel.org> # 3.16+
So the back porting will automatically happen sooner or later.
Actions will be taken by the stable maintainers.

Thanks and best regards
-Lv
Comment 22 Lv Zheng 2014-10-27 05:18:42 UTC
Hi, Alexander

It is reported at https://bugzilla.kernel.org/show_bug.cgi?id=44161,
saying Samsung firmware behavior is conflict with Acer firmware behavior.

So could you provide a dmidecode here?
Then if we determine to make Samsung behavior the default way, we will need the DMI information to make a quirk for Acer behavior.
Comment 23 Aleksandr Mezin 2014-10-27 14:22:42 UTC
Created attachment 155361 [details]
dmidecode

It was attached to the bug already...

By the way, these patches still aren't in 3.14.x
Comment 24 Lv Zheng 2014-10-28 00:23:08 UTC
(In reply to Alexander Mezin from comment #23)
> Created attachment 155361 [details]
> dmidecode
> 
> It was attached to the bug already...

Ah, yes. It's in the comment 7.
Appologies and thanks for your help.

> 
> By the way, these patches still aren't in 3.14.x

Maybe you need to ping 3.14.x stable maintainers for this.

Best regards
-Lv

Note You need to log in before you can comment on or make changes to this bug.