Bug 11089

Summary: Asus Eee PC hotkeys stop working if pressed quickly
Product: Drivers Reporter: Alan Jenkins (alan-jenkins)
Component: PlatformAssignee: Alan Jenkins (alan-jenkins)
Status: CLOSED DUPLICATE    
Severity: normal CC: jean.bruenn, yakui.zhao
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.26 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: dmesg showing GPE storm message
Workaround - ugly but functional
Cleaned up fixes (squashed into one patch)
Patch 1/4 : Don't issue the burst disable command if EC exits the burst mode
Patch 2/4: Clear the query_pending bit only after processing EC notification event
Patch 3/4: Simplify EC working flowchart and always enable EC GPE
patch 4/4: Add some udelay in EC GPE handler to avoid EC GPE interrupt storm

Description Alan Jenkins 2008-07-15 07:13:24 UTC
Latest working kernel version: 2.6.26 (with in-tree eeepc-laptop module)
Earliest failing kernel version: 2.6.25.3 - with out-of-tree eee-acpi module
Distribution: Ubuntu 8.10
Hardware Environment: Asus Eee-PC 701 4G
Problem Description: ACPI hotkeys behave strangely, then stop working altogether.

Status: Ongoing investigations by interested reporter :-).
Severity: I can no longer rely on my laptop hotkeys, which I use regularly.

Steps to reproduce:
1. Boot Eee-PC
2. Run acpi_listen to observe the keycodes
3. Start pressing hotkeys

Observations:
1. The keycodes do not appear immediately - sometimes there is a small delay, no more than a second.  Presumably they're being polled at regular intervals.

2. If I don't wait for the keycode to appear before pressing a key again, the keycodes get out of sync:

    I press key 12 twice in quick succession, and it only appears once.  Then I press key 30, but keycode 12 appears again.  Then I go back and press key 12, and keycode 30 appears.  And so on.

3. If I cause the keycodes to get too much out of sync, by repeatedly pressing them too quickly, they stop working altogether.  This takes about 16 repetitions.

Probable cause:
- I don't believe the switch from eeepc-acpi to eeepc-latop is significant.
- I believe it is triggered by something that shows up in dmesg as:
"ACPI: EC: GPE storm detected, disabling EC GPE"
- Therefore the hotkeys are now being polled instead of interrupt driven
- I believe this reveals a bug where hotkeys are buffered (in hardware?).  It seems that when two hotkey events arrive between polling intervals, only one event is removed from the buffer.  When the buffer overflows, it breaks.
Comment 1 Alan Jenkins 2008-07-15 07:15:33 UTC
This is related to #1752, which is also caused by "ACPI: EC: GPE storm detected, disabling EC GPE"
Comment 2 Alan Jenkins 2008-07-15 07:21:38 UTC
Created attachment 16821 [details]
dmesg showing GPE storm message
Comment 3 Alan Jenkins 2008-07-15 07:32:14 UTC
(In reply to comment #1)
> This is related to #1752, which is also caused by "ACPI: EC: GPE storm
> detected, disabling EC GPE"

Sorry, that's not right.  I confused two different bugs that looked a bit like
mine.  I think this is really a new bug.
Comment 4 Alan Jenkins 2008-07-15 11:35:07 UTC
Created attachment 16825 [details]
Workaround - ugly but functional

I tested the attached patch and it fixes the bug.

Explanation: the buffer in question was indeed a hardware (EC) buffer.  Unfortunately it looks like my EC is clears the SMI_EVT bit after every query, even if there are more events waiting in the buffer.  So the workaround is to repeat the query until it fails, instead of stopping when the SMI_EVT bit becomes zero.

I'm now writing a more tasteful patch with more lines and less goto statements.
Comment 5 Andrew Morton 2008-07-15 12:32:34 UTC
> Latest working kernel version: 2.6.26 (with in-tree eeepc-laptop module)
> Earliest failing kernel version: 2.6.25.3 - with out-of-tree eee-acpi module

I assume these are reversed.

I marked it as a regression.
Comment 6 Alan Jenkins 2008-07-15 15:27:47 UTC
(In reply to comment #5)
> > Latest working kernel version: 2.6.26 (with in-tree eeepc-laptop module)
> > Earliest failing kernel version: 2.6.25.3 - with out-of-tree eee-acpi
> module
> 
> I assume these are reversed.
> 
> I marked it as a regression.

Correct.  Thanks.  I've now submitted a cleaned-up patch to Alexey Starikovskiy (ACPI EC maintainer).
Comment 7 Alan Jenkins 2008-07-17 09:46:15 UTC

*** This bug has been marked as a duplicate of bug 10919 ***
Comment 8 Alan Jenkins 2008-07-17 09:47:49 UTC
It turns out the fix for bug 10919 also resolves this issue.
Comment 9 Alan Jenkins 2008-07-19 05:07:00 UTC
Created attachment 16886 [details]
Cleaned up fixes (squashed into one patch)

This is the latest version of my GPE fixes as submitted to the mailing lists, rolled into a single patch.  I'm trying to solicit testers to increase confidence in the last change, which removes GPE polling.
Comment 10 Jean-Michel Bruenn 2008-08-06 15:02:47 UTC
This patch is fixing another problem i explained in BUG 10724 (
System is unstable with EC GPE storm detected) but it brings up a new bug. I tested this patch on an Acer Aspire 5520G and i'm very happy now because i can use the 2.6.26.1 Kernel - Anyway: Booting up is still hanging, EXCEPT i press a keyboard key - like the power button (i need to press it several times, seems the box is only "working" if it gets a new interrupt or something like that) anyway - after the system booted the first parts of the kernel (i need to press such keys till it loads the filesystem drivers) everything is working fine.

without this patch the system is hanging at the boot and i'm not able to use this kernel. 
Comment 11 Alan Jenkins 2008-08-07 02:39:50 UTC
Thanks for your report Jean.  I should update you that the patch is in limbo at the moment (see below).

The good news is Alexey did start work on a new GPE fix which avoids GPE polling altogether.  GPE polling has several disadvantages, and it sounds like it may be breaking more hardware than it fixes.

The new fix is bug 10919.  Can you retest 2.6.26.1 after applying the following two patches together:
1. Alexey's new fix <http://bugzilla.kernel.org/attachment.cgi?id=16880&action=view>
2. My correction for something I think Alexey overlooked :-) <http://bugzilla.kernel.org/attachment.cgi?id=16948&action=view>

They work on my EeePC, I think they'll fix all your regressions, and I'm optimistic that they won't cause any more problems :-).
Comment 12 Jean-Michel Bruenn 2008-08-07 13:32:34 UTC
i can do, but i don't have an acer one or eeepc - so i don't know whether you're happy with my result - Anyway. As you may have read, i had some problems getting the 2.6.26* kernels booting on my box cause of a gpe storm detected (box is hanging at that line) with your patch i tried i can boot like explained, anyway - i tested that patch at 2.6.26.1 by manually editing the source .. ;) i'll test the  two new patches with 2.6.26.1 in a few hours.

thanks so far ;)
Comment 13 Jean-Michel Bruenn 2008-08-08 13:29:13 UTC
well, tested this patches with 2.6.26.2 - same results, the system is only booting if i press a key resulting in an acpi event (for example the power button - i have to press that every one or two seconds.. so maybe 20 or 30 times to have the system booted) otherwise the system is doing nothing. Thats only for booting up - when booted everything is working as expected.
Comment 14 Jean-Michel Bruenn 2008-08-08 15:12:40 UTC
hello again.

i tried now something (maybe stupid) i replaced the ec.c from 2.6.26.2 with the ec.c from 2.6.25.8 (the for me working release) - i have the same issues, like with using your patches - So i think i have two problems:

1.) the GPE Storm - if i don't touch the ec.c in the 2.6.26* tree i cannot boot - if i use your patch, a boot is possible. - This problem is ec.c related.
2.) the "button" problem - i need to use a key resulting in an acpi event (like for example the power button) - otherwise nothing happens, the kernel seems to wait for an acpi event/interrupt, without, nothing happens. This problem is _not_ ec.c related.

So, what to do now? Any information i could provide? (I'm using an acer aspire 5520 G Notebook, without battery plugged in, 32 bit kernel)
Comment 15 Alan Jenkins 2008-08-09 01:49:38 UTC
Trying an older ec.c was a completely valid idea; the file is nice and self contained.  That's a good solid data point you got us.

However I do think your two problems are related.  Perhaps this is a general GPE issue - ec.c is not the only user of GPEs.

This is going way off-topic so I think you should create a new bug.  Explain what you've done so far - you should be able to just copy&paste your existing comments and remove any redundancy.

My next step would be to try getting a calltrace at the point the kernel hangs.  This would help identify the problem, and if you do it for both problems we can see if they're the same.

There's a softlockup detector option under the Kernel Hacking menu, CONFIG_DETECT_SOFTLOCKUP.  IIRC that can take 61 seconds (in the hung state) before it triggers.

However you're more likely to get results using the NMI watchdog. Documentation/nmi-watchdog.txt explains how to make sure it's configured in and enable it using a boot option.  Apparently it triggers after only 5 seconds of lockup.  It isn't enabled by default because it generates lots of interrupts which could reduce battery life.

Calltraces can be quite verbose.  Unless you're already using a high resolution framebuffer console, I recommend booting with vga=ask and finding the largest usable text video mode.  Otherwise you can miss important details off the top of the screen.

Secondly, it will have lots of hexadecimal numbers in it :-).  They're useless to me as a newbie.  All I'm interested in is the line saying which process it is, and the function names in the call trace.  But if you can capture the entire calltrace with a serial console, or a digital camera, then it's worth doing.
Comment 16 Alan Jenkins 2008-08-09 02:15:22 UTC
Ah.  There is already one general GPE fix in the pipeline.  The title "ACPICA: Fix for hang on GPE method invocation" sounds like it could well be relevant!

The above commit has been merged in 2.6.27-rc2.  Can you please test if this solves your problem?
Comment 17 ykzhao 2008-09-07 22:46:29 UTC
Created attachment 17667 [details]
Patch 1/4 : Don't issue the burst disable command if EC exits the burst mode
Comment 18 ykzhao 2008-09-07 22:48:40 UTC
Created attachment 17668 [details]
Patch 2/4: Clear the query_pending bit only after processing EC notification event
Comment 19 ykzhao 2008-09-07 22:50:26 UTC
Created attachment 17669 [details]
Patch 3/4: Simplify EC working flowchart and always enable EC GPE
Comment 20 ykzhao 2008-09-07 22:51:44 UTC
Created attachment 17670 [details]
patch 4/4: Add some udelay in EC GPE handler to avoid EC GPE interrupt storm
Comment 21 ykzhao 2008-09-07 22:54:35 UTC
Hi, Alan
    Do you have opportunity to try the attached four patches on the latest kernel(2.6.27-rc5) and see whether the system can work well?
    Thanks.
Comment 22 Alan Jenkins 2008-09-08 01:40:43 UTC
Yup, I still have it and will test these.

I would appreciate a little more explanation / references - I stopped keeping track of what was going on, after Alex's patch to not automatically disable interrupt mode.

I don't think my system is very interesting though.  Now I know I damaged my hardware and need to boot with noapic, the only thing we expect to go wrong is dropping events.  The code that caused events to be dropped (by only querying one event every 0.5 seconds) was pretty damn obvious.

I do have access to identical hardware, not my own, which is running 2.6.25.3 and does not appear to have been damaged.  I am not going to touch the undamaged system until I have verified the new code doesn't drop events and then audited it myself.  Even then I will wait until it hits mainline.

patch 1 - typo, I think "exists" is supposed to be "exits".

patch 2 - I'm not convinced.  I can see how this is appropriate on the hardware described.  But how is it supposed to work on correctly implemented hardware?  If there are two events pending at a time, the interrupt for the second will fire straight after the query transaction, before PENDING is cleared.  Could you make this safer by re-reading the status register and checking for SCI_EVT after you clear PENDING?

patch 3 - it looks like you remove all usage of WAIT_GPE but forget to remove the definition?  Also ec_*poll*?
Comment 23 Alan Jenkins 2008-09-08 02:27:52 UTC
Nope, doesn't work.

- No error messages.
- double-pressing _seemed_ to work (without dropping events), but I only did it once
- then holding down an auto-repeat hotkey generated ~7 events, before breaking.

After breakage, the system is otherwise useable, but ACPI hotkey events stop working (and presumably anything else that goes through the EC).

Please advise how I can help further.
Comment 24 Alan Jenkins 2008-09-08 03:09:27 UTC
Ah, I know why this happens...

I think it is mentioned above that my hardware has an interesting bug.  When more than one event is pending, and a query is executed, the SCI_EVT bit is cleared - even though only one of the several events has been de-queued.  This is why the queue fills up and then breaks.  It will happen whenever the queue contains more than one event.

This is why I was trying to merge a patch to query repeatedly until the query returns 0, regardless of whether the first query clears SCI_EVT or not.  Alex acknowledged this was desirable, but it has been dropped.  From my POV, I strongly recommend this change in the name of robustness.  The problem is that it has not been verified to work on _other_ buggy hardware (#9998).

Your patch switches my hardware from interrupt-driven transactions to polling-driven transactions.  Polling is slower, which allows a new event to arrive before the current one has been de-queued.  (The original problem with notification polling is just a more extreme version of this).
Comment 25 ykzhao 2008-09-08 18:32:48 UTC
Hi, Alan
   Thanks for the test and explanation.
   For the patch 1: what you said is right. It is typo. "Exists" should be "exits".
   For the patch 2: On some platforms before the notification event is processed, the SCI_EVT bit is always set. If it is cleared too early, maybe the same notification event will be processed twice. IMO this is not reasonable. But for some platforms(for example: your machine) before the notification event is processed, the SCI_EVT bit is already cleared. And when another notification event is processed again, the SCI_EVT is set again. But as the previous event is not processed, it will be ignored. Maybe your suggestion is right. Maybe OS should continue to query the EC notification until the query returns 0. But I am not sure whether other system will be affected by this.
   For the patch 3: Yes.The EC driver is changed from interrupt-driven to polling-driven. But it will be safe and stable for most system. In fact the only problem for your system is to query notification events repeatedly until the query returns 0.

   Thanks again.
Comment 26 Alan Jenkins 2008-09-09 02:29:19 UTC
(In reply to comment #25)
> Hi, Alan
>    Thanks for the test and explanation.
>    For the patch 1: what you said is right. It is typo. "Exists" should be
> "exits".
>    For the patch 2: On some platforms before the notification event is
> processed, the SCI_EVT bit is always set. If it is cleared too early, maybe
> the
> same notification event will be processed twice. IMO this is not reasonable.

Yes, I agreed that a fix seems appropriate for these platforms.

> But for some platforms(for example: your machine) before the notification
> event
> is processed, the SCI_EVT bit is already cleared. And when another
> notification
> event is processed again, the SCI_EVT is set again. But as the previous event
> is not processed, it will be ignored. Maybe your suggestion is right. Maybe
> OS
> should continue to query the EC notification until the query returns 0. But I
> am not sure whether other system will be affected by this.

That's not what I was saying here.

Consider an EC which complies completely with the ACPI spec.  It clears SCI_EVT immediately after a query (assuming no more events are queued). It doesn't wait for the event to be processed.  Now look what can happen with your patch:

A new event arrives while the OS is still processing the last one.  The EC sets SCI_EVT and fires a notification interrupt.  Since the PENDING bit is set, we ignore this interrupt.

The OS finishes processing the event.  It clears the PENDING bit.  However SCI_EVT is still set.  The new event is still pending.  It will not be processed until the next interrupt - whenever that is.

So I don't see how this patch is safe *on a 100% compliant platform*.  It could lead to events being delayed for an indefinite period.  Which is a bug.

My proposed solution was that after you finish event processing and clear the PENDING bit, you should then test the status register for SCI_EVT.  Obviously if SCI_EVT is set you should then make sure the new event gets queried and processed.

>    For the patch 3: Yes.The EC driver is changed from interrupt-driven to
> polling-driven. But it will be safe and stable for most system. In fact the
> only problem for your system is to query notification events repeatedly until
> the query returns 0.

I understand that.  I was just wondering that the patch seems to leave some dead code.  E.g. it looks like the definition of EC_FLAGS_GPE_WAIT is preserved.  But EC_FLAGS_GPE_WAIT was only used for interrupt-driven transactions.  Since the patch removes interrupt driven transactions, the flag should be removed also.
Comment 27 ykzhao 2008-09-09 08:09:06 UTC
Thanks for the discussion.
 
>Consider an EC which complies completely with the ACPI spec.  It clears
>>SCI_EVT immediately after a query (assuming no more events are queued). 
    Can you point out in which section of ACPI spec 3.0b the above definition is found? As is pointed out in my patch, maybe the same EC notification event will be processed twice if the query_pending bit is cleared too early. If EC works as what you said, my patch is unreasonable. Maybe the EC notification event will be delay.
    Your suggestion will be appropriate for your system. But maybe some system will be affected by this. For example: On my laptop when issuing the query command, a non-zero query event is returned but it can't be processed.(There is no corresponding ACPI _Qxx object). At the same time the SCI_EVT bit won't be cleared. In such case OS can't exit the function of acpi_ec_query_handler, which causes that the acpid kernel thread can't work well. 

   For the patch 3 now it is for the test purpose. If it is appropriate for most laptops, I will refresh it and delete the useless source code. Thanks for your suggestion again.
   Thanks.

      
Comment 28 Alan Jenkins 2008-09-09 12:55:58 UTC
(In reply to comment #27)
> Thanks for the discussion.
> 
> >Consider an EC which complies completely with the ACPI spec.  It clears
> >SCI_EVT immediately after a query (assuming no more events are queued).
>     Can you point out in which section of ACPI spec 3.0b the above definition
> is found?

/me reads

Ah.  No, I now see this is left unspecified.

> As is pointed out in my patch, maybe the same EC notification event
> will be processed twice if the query_pending bit is cleared too early. If EC
> works as what you said, my patch is unreasonable. Maybe the EC notification
> event will be delay.

It does, I have debug output to prove it :-).  You can see the SCI_EVT flag being cleared as soon as the EC receives the query command.

#define ACPI_EC_FLAG_SCI	0x20

<excerpt from my system, attached @ http://bugzilla.kernel.org/attachment.cgi?id=16929&action=view>

[   15.131654] ACPI: EC: ~~~> interrupt
[   15.131660] ACPI: EC: ---> status = 0x28
[   15.131672] ACPI: EC: ---> status = 0x28
[   15.131677] ACPI: EC: transaction start
[   15.131682] ACPI: EC: <--- command = 0x84
[   15.133551] ACPI: EC: ~~~> interrupt
[   15.133560] ACPI: EC: ---> status = 0x09
[   15.133581] ACPI: EC: ~~~> interrupt
[   15.133587] ACPI: EC: ---> status = 0x09
[   15.133610] ACPI: EC: ~~~> interrupt
[   15.133616] ACPI: EC: ---> status = 0x09
[   15.133624] ACPI: EC: ---> status = 0x09
[   15.133631] ACPI: EC: ---> data = 0x5d
[   15.133637] ACPI: EC: transaction end

>     Your suggestion will be appropriate for your system. But maybe some
>     system
> will be affected by this. For example: On my laptop when issuing the query
> command, a non-zero query event is returned but it can't be processed.(There
> is
> no corresponding ACPI _Qxx object). At the same time the SCI_EVT bit won't be
> cleared. In such case OS can't exit the function of acpi_ec_query_handler,
> which causes that the acpid kernel thread can't work well. 

That's sound reasoning.  I think you will have to get very tricksy to fix both cases.  I guess it's not too hard to detect whether laptop clears SCI_EVT like mine or like yours, and act accordingly.  But as you put it, it gives us complex flowcharts.
 
>    For the patch 3 now it is for the test purpose. If it is appropriate for
> most laptops, I will refresh it and delete the useless source code. Thanks
> for
> your suggestion again.
>    Thanks.
Comment 29 ykzhao 2008-09-09 20:48:52 UTC
> Ah.  No, I now see this is left unspecified.
If it is unspecified, maybe both our understandings are correct. But neither of them can explain what happens on some laptops.
> > As is pointed out in my patch, maybe the same EC notification event
> > will be processed twice if the query_pending bit is cleared too early. If
> EC
> > works as what you said, my patch is unreasonable. Maybe the EC notification
> > event will be delay.
> 
> It does, I have debug output to prove it :-).  You can see the SCI_EVT flag
> being cleared as soon as the EC receives the query command.
> 
> #define ACPI_EC_FLAG_SCI        0x20
> 
> <excerpt from my system, attached @
> http://bugzilla.kernel.org/attachment.cgi?id=16929&action=view>
From the attached log it seems that the SCI_EVT bit will be cleared immediately on your machine after issuing the query command. 
   But I see the following cases on some laptops.
   a. Before issuing the query command, the SCI_EVT bit is already cleared.
   b. After processing the notification event, the SCI_EVT bit is still set.
   c. Only after processing the notification event, the SCI_EVT bit is cleared. It means that the SCI_EVT bit is still set after issuing the query command.

   It looks so confusing. And I don't know which solution will be better.