Bug 8757

Summary: ACPI Exception breaking Thinkpad T43, R51, since 2.6.22-rc (ec_intr related)
Product: ACPI Reporter: Brice Goglin (Brice.Goglin)
Component: ECAssignee: Alexey Starikovskiy (astarikovskiy)
Status: CLOSED PATCH_ALREADY_AVAILABLE    
Severity: high CC: acpi-bugzilla, hmh, ml.alek, reg
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.22 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: Possible solution for bug #8757
ACPICA: fix AML mutex re-entrancy

Description Brice Goglin 2007-07-14 16:48:06 UTC
Most recent kernel where this bug did not occur: 2.6.21
Distribution: Debian testing/unstable
Hardware Environment: Thinkpad T43
Problem Description:

Problem discussed a little bit during 2.6.22 release cycle in the thread starting at:
  http://marc.info/?l=linux-acpi&m=118176713206066&w=2
without getting any useful feedback. The whole dmesg and config is available there, I can repost them here if needed. I am reporting the bug here now since it still occurs with the final 2.6.22.

For several weeks, I experienced multiple times the following problem. Suddenly (I have no idea how to reproduce it), the internal keyboard and mouse become unresponsive and Gnome applets complain that they cannot read events ACPI anymore.

/var/log/messages shows the following errors:
  [ 8379.800000] ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
  [ 8379.800000] ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.LPC_.EC__.AC__._PSR] (Node dfeda9f0), AE_TIME
  [ 8379.800000] ACPI Exception (ac-0095): AE_TIME, Error reading AC Adapter state [20070126]
  [ 8380.300000] ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
  [ 8380.300000] ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.LPC_.EC__.BAT0._BST] (Node dfedab58), AE_TIME
  [ 8380.300000] ACPI Exception (battery-0206): AE_TIME, Evaluating _BST [20070126]
  [ 8409.552000] ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
  [ 8409.552000] ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.LPC_.EC__.AC__._PSR] (Node dfeda9f0), AE_TIME
  [ 8409.552000] ACPI Exception (ac-0095): AE_TIME, Error reading AC Adapter state [20070126]
  [ 8410.052000] ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
  [ 8410.052000] ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.LPC_.EC__.BAT0._BST] (Node dfedab58), AE_TIME
  [ 8410.052000] ACPI Exception (battery-0206): AE_TIME, Evaluating _BST [20070126]
  [ 8439.552000] ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
  [ 8439.552000] ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.LPC_.EC__.AC__._PSR] (Node dfeda9f0), AE_TIME
  [ 8439.552000] ACPI Exception (ac-0095): AE_TIME, Error reading AC Adapter state [20070126]
  [ 8440.052000] ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
  [ 8440.052000] ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.LPC_.EC__.BAT0._BST] (Node dfedab58), AE_TIME
  [ 8440.052000] ACPI Exception (battery-0206): AE_TIME, Evaluating _BST [20070126]
  [ 8469.560000] ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
  [ 8469.560000] ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.LPC_.EC__.AC__._PSR] (Node dfeda9f0), AE_TIME
  [ 8469.560000] ACPI Exception (ac-0095): AE_TIME, Error reading AC Adapter state [20070126]
  [ 8470.060000] ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
  [ 8470.060000] ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.LPC_.EC__.BAT0._BST] (Node dfedab58), AE_TIME
  [ 8470.060000] ACPI Exception (battery-0206): AE_TIME, Evaluating _BST [20070126]
  [ 8499.560000] ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
  [ 8499.560000] ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.LPC_.EC__.AC__._PSR] (Node dfeda9f0), AE_TIME
  [ 8499.560000] ACPI Exception (ac-0095): AE_TIME, Error reading AC Adapter state [20070126]
  [ 8500.060000] ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
  [ 8500.060000] ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.LPC_.EC__.BAT0._BST] (Node dfedab58), AE_TIME
  [ 8500.060000] ACPI Exception (battery-0206): AE_TIME, Evaluating _BST [20070126]
  [ 8529.560000] ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
  [ 8529.560000] ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.LPC_.EC__.AC__._PSR] (Node dfeda9f0), AE_TIME
  [ 8529.560000] ACPI Exception (ac-0095): AE_TIME, Error reading AC Adapter state [20070126]
  [ 8530.064000] ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
  [ 8530.064000] ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.LPC_.EC__.BAT0._BST] (Node dfedab58), AE_TIME
  [ 8530.064000] ACPI Exception (battery-0206): AE_TIME, Evaluating _BST [20070126]
  [ 8559.560000] ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
  [ 8559.560000] ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.LPC_.EC__.AC__._PSR] (Node dfeda9f0), AE_TIME
  [ 8559.560000] ACPI Exception (ac-0095): AE_TIME, Error reading AC Adapter state [20070126]
  [ 8560.060000] ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
  [ 8560.060000] ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.LPC_.EC__.BAT0._BST] (Node dfedab58), AE_TIME
  [ 8560.060000] ACPI Exception (battery-0206): AE_TIME, Evaluating _BST [20070126]
  [ 8589.560000] ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
  [ 8589.560000] ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.LPC_.EC__.AC__._PSR] (Node dfeda9f0), AE_TIME
  [ 8589.560000] ACPI Exception (ac-0095): AE_TIME, Error reading AC Adapter state [20070126]
  [ 8590.060000] ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
  [ 8590.060000] ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.LPC_.EC__.BAT0._BST] (Node dfedab58), AE_TIME
  [ 8590.060000] ACPI Exception (battery-0206): AE_TIME, Evaluating _BST [20070126]
  [ 8619.560000] ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
  [ 8619.560000] ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.LPC_.EC__.AC__._PSR] (Node dfeda9f0), AE_TIME
  [ 8619.560000] ACPI Exception (ac-0095): AE_TIME, Error reading AC Adapter state [20070126]
  [ 8620.060000] ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
  [ 8620.060000] ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.LPC_.EC__.BAT0._BST] (Node dfedab58), AE_TIME
  [ 8620.060000] ACPI Exception (battery-0206): AE_TIME, Evaluating _BST [20070126]
  [ 8649.560000] ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
  [ 8649.560000] ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.LPC_.EC__.AC__._PSR] (Node dfeda9f0), AE_TIME
  [ 8649.560000] ACPI Exception (ac-0095): AE_TIME, Error reading AC Adapter state [20070126]
  [ 8650.060000] ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
  [ 8650.060000] ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.LPC_.EC__.BAT0._BST] (Node dfedab58), AE_TIME
  [ 8650.060000] ACPI Exception (battery-0206): AE_TIME, Evaluating _BST [20070126]

2 seconds before that, I also get:
  tp-fancontrol[4355]: Shutting down, switching to automatic fan control
I don't whether it could be a cause of the problem or a consequence of the same bug (I'd say the latter).

Removing/reinserting as many ACPI modules as possible does not seem to help. Only reboot helps. But, apart from these input devices, the machine seems to work fine (logging by ssh shows everything ok apart from ACPI stuff). Plugging a USB mouse works fine, but the internal mouse/keyboard remains disabled until reboot.

Also, once I did 'sudo reboot' through ssh, the machine failed to re-enter the BIOS menu, and even the power button failed to shutdown for real (I had to unplug the battery and AC adapter).

The problem appeared first when I tried a 2.6.22-rcX (probably around rc3 since earliers rc did not work for unrelated reasons). It usually takes a couple hours/days of random work to reproduce the problem. Going back to 2.6.21 fixes the problem (did not reproduce in 2 weeks). Booting the final 2.6.22 tonight lead to the problem again within 5 hours.

Steps to reproduce:
Unknown, wait hours/days. It makes git bisect pretty hard to use...

Cheers,
Brice
Comment 1 Len Brown 2007-07-25 19:16:03 UTC
can you reproduce this without loading the thinkpad platform driver?
Comment 2 Brice Goglin 2007-07-25 23:48:58 UTC
I guess that's CONFIG_THINKPAD_ACPI=n and/or not loading thinkpad_acpi, right ? Yes, I'll try that in the near future.

For now, I am trying to reproduce with 2.6.23-rc1 (with THINKPAD_ACPI=m and  THINKPAD_ACPI_INPUT_ENABLED=y). It's been working fine 3 days so it might be already fixed.

I'll report back within a week or so, it is supposed to be enough to reproduce the problem.

Thanks.
Comment 3 Henrique de Moraes Holschuh 2007-07-27 10:13:31 UTC
If it was fixed, I'd very much appreciate knowing WHAT fixed it... we can't have 2.6.22 broken for thinkpads, be it my fault (b0rkage in thinkpad-acpi) or not :-)

Please do check if 2.6.22 behaves when thinkpad-acpi is not loaded.

You can try to force the bug to show up by telling thinkpad-acpi to much around lots of stuff in the thinkpad (read/set brightness, volume, read thermal, do fan control, etc), and also check suspend-to-RAM or to-disk.

There is *one* thing I noticed in thinkpad-acpi, which is new (but I don't know if it is present in 2.6.22):  the thinkpad DSDT is causing ACPI to call the hotkey_notify method of thinkpad-acpi *VERY* early during resume.  This doesn't cause problems in 2.6.21 and a T43, but I don't know yet about 2.6.22.

Len, what are the do's and don'ts for ACPICA NOTIFY handler callbacks?  Might thinkpad-acpi doing something not welcome inside such a handler (it calls ACPI methods, for example) cause this bug to show up?
Comment 4 Brice Goglin 2007-07-27 11:04:28 UTC
Ok I now running 2.6.22 with thinkpad-acpi *not* loaded. I'll report what happens in the next days.

2.6.23-rc1 didn't expose the problem within 4-5 days, which seems to mean the bug is fixed there.
Comment 5 Brice Goglin 2007-08-01 14:31:11 UTC
4 days on 2.6.22 without loading thinkpad-acpi loaded, no problem. So the problem looks only 2.6.22 and thinkpad-acpi related.

I could revert some patches from 2.6.23-rc1's thinkpad-acpi to locate the fix. But given how long it takes to reproduce the problem or not, git-bisect is going to be very long :) So I'd appreciate a short list of patches/commits to try.
Comment 6 Henrique de Moraes Holschuh 2007-08-01 15:41:05 UTC
git log --pretty=oneline -- drivers/misc/thinkpad_acpi.c drivers/misc/thinkpad_acpi.h drivers/misc/Kconfig

will give you a very handy list of interesting commits.  But git bissect can be told about the files it should pay attention to, and in this case these are "drivers/misc/thinkpad_acpi.c" and "drivers/misc/thinkpad_acpi.h".

Anyway, if you want to keep 2.6.22 in good health and just rewind thinkpad-acpi bit by bit, please use this git tree:

git://repo.or.cz/linux-2.6/linux-acpi-2.6/ibm-acpi-2.6.git

In the master branch, you will find a release/ directory.  There you will find (unless I screwed up), all patches that went in mainline for thinkpad-acpi up to 2.6.23-rc1.  That makes it easy to revert them one-by-one, using patch -R.

One particularly interesting starting point is:
2.6.22-rc1/0033-ACPI-thinkpad-acpi-register-with-the-device-model.patch

There is something else you could do, since you report that it is fixed in 2.6.23-rc1.  You could get the latest thinkpad-acpi tree for 2.6.22 and apply it.  If it fixes the problem, then we know it is a thinkpad-acpi isolated issue.  

You can find the patches at http://sourceforge.net/project/showfiles.php?group_id=117042&package_id=230205
Comment 7 Henrique de Moraes Holschuh 2007-08-01 15:43:22 UTC
Oh, that reminds me. In the ibm-acpi git tree I gave you the pointer to, you can just use the release/2.6.22 branch to get 2.6.22.1 with only the thinkpad-acpi patches, and use that for bissecting.
Comment 8 Henrique de Moraes Holschuh 2007-08-02 08:59:18 UTC
Well, I have a theory.  Two, actually.

Thinkpad-acpi, just like almost everything else sysfs and procfs, does not attempt to protect the system against a local DoS.  Yet, anyway.

But thinkpad-acpi *can*, as most of the ACPI system, DoS the embedded controller.   And userland can cause it to happen by doing extremely stupid things, such as while true ; do cat /sys/bus/platform/devices/thinkpad_acpi/* ; done.

And the thinkpad EC *has* built-in anti-DoS capabilities.  If you are unfortunately enough to trigger it, it will *stop talking to you* for a while (or until a reboot, or whatever. Nobody really did a strong reverse engineering of that).

Intel's powertop (google for it) might help you track down if something is bothering thinkpad-acpi too much.  It does *not* do any polling by itself yet, so if it shows up in powertop at all, it means userspace is pestering it, and maybe it might be pestering it too much, and thus pestering the EC.

There is also another possibility.  If *anything* accesses the ACPI EC interface out-of-sync, it can go up in flames.  Now, thinkpad-acpi does NOT do that by itself, it calls ACPICA for all EC access, so it should be always synchronized.  But if some sort of locking is not exactly right at early resume, it might well be triggering the bug.  thinkpad-acpi will cause AML method executions at early resume.

Anyway, I wait your reply about the sourceforge patch.
Comment 9 Brice Goglin 2007-08-03 10:11:13 UTC
I am actually not looking at the sourceforge patch. I found 21 commits involving thinkpad-acpi in 2.6.23-rc1's patch. I am bisecting/reverting them to find the one which fixes the problem.

The status of this bisecting is the following:

2.6.23-rc1 with 2.6.22's thinkpad-acpi
  => BUG (so the bug is a thinkpad-acpi isolated issue)
2.6.23-rc1 zith 2.6.22's thinkpad-acpi + patch
 b964b437601a0e7d09896d5d9a85c83643e94f41
 ACPI: thinkpad-acpi: add DMI-based modalias
  => ??
2.6.23-rc1 zith 2.6.22's thinkpad-acpi + patches up to
 94954cc60194796fb257802f6f65d79553c9a8ca
 ACPI: thinkpad-acpi: remove all uneeded initializers
  => ??
2.6.23-rc1 zith 2.6.22's thinkpad-acpi + patches up to
 ae92bd17ff703b3703562148c73b4d6833e6a326
 ACPI: thinkpad-acpi: enable more hotkeys
  => ??
2.6.23-rc1 zith 2.6.22's thinkpad-acpi + patches up to
 9b010de59cb6dcab7e167dd2a0fa5d3b31447fea
 ACPI: thinkpad-acpi: export hotkey maximum masks
  => ??
2.6.23-rc1 zith 2.6.22's thinkpad-acpi + patches up to
 74941a69afcc06722685d492784414ec042ab492
 ACPI: thinkpad-acpi: export to sysfs the state of the radio slider switch
  => NO BUG within 10 hours

I should be done within the end of the week or so.

By the way, doing
  while true ; do cat /sys/bus/platform/devices/thinkpad_acpi/* ; done
does not seem to help reproducing the bug. Playing with sound volume/light/brightness/battery/ac power does not either.
Comment 10 Brice Goglin 2007-08-05 00:38:56 UTC
The bisect tends to prove that the fixer is the first commit regarding thinkpad-acpi in 2.6.23-rc1:

commit b964b437601a0e7d09896d5d9a85c83643e94f41
    ACPI: thinkpad-acpi: add DMI-based modalias

Is it possible? It seems strange to me since it just looks like something that would autoload thinkpad-acpi earlier during my boot. Or maybe the fact that the module is loaded very early prevents something to become wrong and mess up things later.

I am still running this git-bisect step in case the bug takes time to occur... If so, I'll go back in the bisect and wait longer at each step...
Comment 11 Henrique de Moraes Holschuh 2007-08-05 07:08:51 UTC
Please do.  If the thinkpad-acpi bissect is finding the very first 2.6.23-rc1 commit, it means the fix is actually somewhere else, and we would have to do the very painful full-kernel bissect between 2.6.22 and 2.6.23-rc1 :-(

It would also mean ACPI itself is broken, not thinkpad-acpi.  Which is a lot more worrisome.

I have found someone else which has this problem (AE_TIME, ACPI EC communications hung), but with very different behaviour, in a different ThinkPad model.  Let's see if I can get him to add more data to this bug report, it might help us track it down.
Comment 12 Henrique de Moraes Holschuh 2007-08-05 07:11:14 UTC
Never mind the 2.6.23-rc1 part of my comment, might be a major thinko.  I need to know exactly what tree you are bissecting...  which one?
Comment 13 Brice Goglin 2007-08-05 08:37:44 UTC
I am bisecting between 2.6.23-rc1 (seems to work) and 2.6.23-rc1 with thinkpad-acpi from 2.6.22 (reproduced the bug quickly).

So the later is 2.6.23-rc1 with the following commits reverted (21 commits touching thinkpad-acpi in between 2.6.22 and 2.6.23-rc1):
b964b437601a0e7d09896d5d9a85c83643e94f41
94954cc60194796fb257802f6f65d79553c9a8ca
ae92bd17ff703b3703562148c73b4d6833e6a326
9b010de59cb6dcab7e167dd2a0fa5d3b31447fea
74941a69afcc06722685d492784414ec042ab492
94b08713186cc47a5c367a866cc0a0a762721455
7f5d1cd6287b7b29d210f85e2343207ac4310da2
6a38abbf2b68e37493f2d5e8702b895a6c23ba0f
1a343760b516ca5466d201bec32b1794858b18a5
e295e8508c1dd56e06c73e78a2f67f2eb563e74f
5c29d58f471099401513e2e567f6c28001bb0f13
741553c2d29b4075d636a38792c05cd2fc62bd8a
996fba08db7faf46b1a674957f60cd772ecd29ec
86cc9445e86bef9da44f933e3849e6eb43cbf626
d5a2f2f1d68e2da538ac28540cddd9ccc733b001
24d3b77467b6aaf59e38dce4aa86d05541858195
edf0e0e56904f794c97ca6c4562d8256e3d8d8e3
a8fba3da3d11d808137be7ebeb3b6938a42f011f
3d6f99ca00ccf861305fd8630a21f2e696886708
c78d5c96bb65b71a54b7551b404fbaf4763ed6e4
f432255e936a892a6896e5032e2b4897423076f2

I am still doing some tests since I have no easy way to reproduce the problem and I can't be 100% sure that I didn't miss a bug by chance at any step of the bisection...

But, so far, I never reproduced the bug as long as at least the first one of the above commits was still applied/not-reverted:
b964b437601a0e7d09896d5d9a85c83643e94f41
  ACPI: thinkpad-acpi: add DMI-based modalias
It means that it works as long as I don't completely revert thinkpad-acpi back to 2.6.22 and that the above commits fixes the problem.
Comment 14 Tanguy Ortolo 2007-08-05 09:57:21 UTC
Hello, I think I am HMH's someone else. :-)
I experiment the almost same problem than Brice Goglin.

With 2.6.22 (vanilla + disk-protect patch), with or without thinkpad-acpi modules loaded, after a random amount of time, my ThinkPad R51 (2888 CTO)'s keyboard, touchpad and trackpoint freeze, and Gnome's sensors applet says it cannot read neiter ACPI nor ThinkPad-ACPI sensors anymore. If I am typing a letter when it happens, this letter is then repeated as if I was keeping it pressed.

In the kernel messages, I see nothing abnormal before the hangup. When it happens, and after, I just lines (arround one quartet of lines per minute) of :
ACPI: EC: acpi_ec_wait timeout, status = 10, expect_event = 2
ACPI: EC: write_cmd timeout, command = 128
ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
ACPI Exception (dswexec-0462): AE_TIME, While resolving operands for [OpcodeName unavailable] [20070126]
ACPI Error (psparse-0537): Method parse/execution failed [\_TZ_.THM0._TMP] (Node df6d73d8), AE_TIME
ACPI: EC: acpi_ec_wait timeout, status = 10, expect_event = 2
ACPI: EC: input buffer is not empty, aborting transaction
ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
ACPI Exception (dswexec-0462): AE_TIME, While resolving operands for [OpcodeName unavailable] [20070126]
ACPI Error (psparse-0537): Method parse/execution failed [\_TZ_.THM0._TMP] (Node df6d73d8), AE_TIME
ACPI: EC: acpi_ec_wait timeout, status = 10, expect_event = 2
ACPI: EC: input buffer is not empty, aborting transaction
ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
ACPI Exception (dswexec-0462): AE_TIME, While resolving operands for [OpcodeName unavailable] [20070126]
ACPI Error (psparse-0537): Method parse/execution failed [\_TZ_.THM0._TMP] (Node df6d73d8), AE_TIME
ACPI: EC: acpi_ec_wait timeout, status = 10, expect_event = 2
ACPI: EC: input buffer is not empty, aborting transaction
ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.LPC_.EC__.AC__._PSR] (Node df6d2270), AE_TIME
ACPI Exception (ac-0095): AE_TIME, Error reading AC Adapter state [20070126]
ACPI: EC: acpi_ec_wait timeout, status = 10, expect_event = 2
ACPI: EC: input buffer is not empty, aborting transaction
ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
ACPI Exception (dswexec-0462): AE_TIME, While resolving operands for [OpcodeName unavailable] [20070126]
ACPI Error (psparse-0537): Method parse/execution failed [\_TZ_.THM0._TMP] (Node df6d73d8), AE_TIME
ACPI: EC: acpi_ec_wait timeout, status = 10, expect_event = 2
ACPI: EC: input buffer is not empty, aborting transaction
ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.LPC_.EC__.BAT0._BST] (Node df6d20c0), AE_TIME
ACPI Exception (battery-0206): AE_TIME, Evaluating _BST [20070126]
ACPI: EC: acpi_ec_wait timeout, status = 10, expect_event = 2
ACPI: EC: input buffer is not empty, aborting transaction

My BIOS : 1.29 1VET71WW
My EC : 1.04 1VHT28WW
Comment 15 Henrique de Moraes Holschuh 2007-08-05 11:12:54 UTC
Maybe I am going insane, but what happens if you boot with a known-affected kernel (say, 2.6.22.1) but with the ec_intr=0 kernel command line parameter?
Comment 16 Brice Goglin 2007-08-06 09:12:39 UTC
Ok, I will test ec_intr=0 later. For now, I want to be sure the bug does not occur when b964b437601a0e7d09896d5d9a85c83643e94f41 is applied (I am still talking about the same first thinkpad-acpi commit in 2.6.23-rc1).

I have reproduced the bug again with 2.6.23-rc1 + thinkpad-acpi from 2.6.22 (without the above commit then). It is still the only case where I reproduced the bug in my bisect. But given how long it took to reproduce it this time (15hours, while only 1 hour the first time), I reapplied b964b437601a0e7d09896d5d9a85c83643e94f41 on top of this and will run it for a couple days before being sure this commit is the fix.

Brice
Comment 17 Brice Goglin 2007-08-10 17:01:26 UTC
I didn't reproduce the bug with b964b437601a0e7d09896d5d9a85c83643e94f41 applied during several days. Since ithe bug happened twice without this patch in about 24 hours total, it really seems to be the fix.

Running 2.6.22 with ec_intr=0 for 36 hours already, the bug didn't occur so far. Keep testing.
Comment 18 Filip Wojciechowski 2007-08-12 18:50:16 UTC
I decided to comment on this bug, even though I can't say for sure that what I'm experiencing is the same thing, or related to the problem discussed here, but I think it is...  First of all, my hardware is Toshiba Satellite A30-213, an oldish P4 laptop with Phoenix (not Toshiba) bios.  I'm not using toshiba_acpi module (it is incompatible with my bios) - only standard ACPI modules.  The kernel is pretty much the default Arch Linux kernel but with Suspend2 patch added and with some minor configuration changes - the problem also occurs with the default arch linux 2.6.22.1 and 2.6.22.2 kernels.

My problem is certainly ACPI related and it also only occurs in 2.6.22 kernels (both 2.6.22.1 and 2.6.22.2) - 2.6.21 works without problems, and I don't remember experiencing anything like this with any previous kernel.  The symptoms are similar - ACPI fails hanging all ACPI related software (conky, suspend, etc.), when I try running "acpi -b" or "acpi -V", for example, the terminal will hang and has to be killed manually (ctrl-c doesn't work).  Doing something like "cat /proc/acpi/battery/BAT1/state" results in the same kind of lockup.  Reloading ACPI modules manually doesn't help.

[Another problem I've been experiencing with this 2.6.22 kernel has to do with clocksource - I have to run the kernel with "clocksource=pit" option, otherwise I get temporary soft lockups every now and then.  At first I thought that the problems were related, but now I don't think so - the ACPI problem appears even when using pit clock and in spite of the fact that the soft lockups don't happen anymore.  I thought I'd mention this for the sake of completeness though.]

I am able to reproduce the bug fairly easily, basically all I have to do is restart the computer instead of shutting it down completely and then powering it back on. The bug generally occurs within the first couple of minutes from reboot  - pretty much whenever the computer is not shutdown completely (although not always).  If it doesn't occur then it doesn't seem to occur at all until another reboot.

When the problem does occur, the following can be found in /var/log/messages.log:

----
Aug 13 03:21:00 toshiba ACPI Error (dswload-0774): [PBST] Namespace lookup failure, AE_ALREADY_EXISTS
Aug 13 03:21:00 toshiba ACPI Exception (psloop-0225): AE_ALREADY_EXISTS, During name lookup/catalog [20070126]
Aug 13 03:21:00 toshiba ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.LPCB.BAT1._BST] (Node c18a3b80), AE_ALREADY_EXISTS
Aug 13 03:21:00 toshiba ACPI: Marking method _BST as Serialized
Aug 13 03:21:00 toshiba ACPI Exception (battery-0206): AE_ALREADY_EXISTS, Evaluating _BST [20070126]
Aug 13 03:21:00 toshiba ACPI Error (psargs-0355): [PBST] Namespace lookup failure, AE_NOT_FOUND
Aug 13 03:21:00 toshiba ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PCI0.LPCB.BAT1._BST] (Node c18a3b80), AE_NOT_FOUND
Aug 13 03:21:00 toshiba ACPI Exception (battery-0206): AE_NOT_FOUND, Evaluating _BST [20070126]
----

Interestingly, adding "ec_intr=0" *seems* to fix the problem for me.  I added it to my kernel command line and I rebooted three times without the problem appearing.  I removed it from the kernel command line and, sure enough, the problem appeared on the second try.  I added it back and right now, after two more reboots, everything is working perfectly.  It's a bit late right now, but I will do more testing tomorrow and I'll report if I discover something interesting.

This is my first comment on the kernel bugzilla so please bear with my relative lack of experience.  I'd be happy to help out any way I can, although you might have to be a bit explicit at first when telling me what you need from me.
Comment 19 Tanguy Ortolo 2007-08-24 15:33:58 UTC
(In reply to comment #15)
> Maybe I am going insane, but what happens if you boot with a known-affected
> kernel (say, 2.6.22.1) but with the ec_intr=0 kernel command line parameter?

I have ran it for 5 hours without any problem. I will see tomorrow morning if it will have frozen. :-)
Comment 20 Tanguy Ortolo 2007-08-25 01:15:39 UTC
It seems to work well with ec_intr=0: my ThinkPad has ran 14 hours without problems.
Comment 21 Henrique de Moraes Holschuh 2007-08-25 12:22:06 UTC
This bug is apparently caused by something weird related to EC interrupts, either the EC register handling or the EC GPE handling.

It is not an thinkpad-acpi bug, although it appears thinkpad-acpi access patterns can trigger the bug.

Other notebooks and boxes that do heavy EC register access are probably also affected. Likely it depends on interaction with the EC firmware or ACPI BIOS.

Thus, I am reassigning it back to ACPICA-core.
Comment 22 Brice Goglin 2007-08-25 15:40:41 UTC
For the record, I have been running latest 2.6.23-rc kernels for a while now, and the bug never occurred again on my T43.
Comment 23 Henrique de Moraes Holschuh 2007-08-25 17:59:12 UTC
Yes, that's important information.  However, it would be *extremely* desireable to know exactly what fixed this issue, to get this fix into the stable 2.6.22.y branch.
Comment 24 Tanguy Ortolo 2007-08-26 05:57:19 UTC
Actually, even with ec_intr=0, I still get those freezes...
Comment 25 Henrique de Moraes Holschuh 2007-08-26 06:07:26 UTC
Pity, but I still maintain that the bissected commit can't be the culprit, so IMHO something very fishy is going on.

A lot of GPE fixes and other such stuff got in 2.6.23-rc, it is not unlikely that one of them is the real fix :(
Comment 26 Flávio Martins 2007-09-13 03:36:43 UTC
Created attachment 12814 [details]
Possible solution for bug #8757

When setting the _BST method serialized set the status as AE_OK.
This helps in my testing. The messages appear the first time, and then the _BST method gets serialized. No problems after that.
I tested this on an affected ACER/COMPAL laptop.
PS: I don't usually work with kernel/acpi code.
Comment 27 Flávio Martins 2007-09-13 06:37:34 UTC
Created attachment 12818 [details]
ACPICA: fix AML mutex re-entrancy

I've tested this one and it seems to have fixed things for me too.
git commit c0d127b56937c3e72c2b1819161d2f6718eee877
My hack above should be ignored then.
Comment 28 Flávio Martins 2007-09-13 09:24:12 UTC
Well, back to my humble hack. Disregard my previous comment.
There was not much testing involved.
Comment 29 Flávio Martins 2007-09-13 09:24:43 UTC
Comment on attachment 12818 [details]
ACPICA: fix AML mutex re-entrancy

Doesn't really help here.
Comment 30 Tanguy Ortolo 2007-10-16 09:54:27 UTC
I tried kernel 2.6.23, and I do not have that problem anymore.

Yet, it still appears with 2.6.23, when I apply an external disk head park patch. When I had this problem on 2.6.22, I was also using that patch, may it be the cause of this problem, or just a catalyst?
Comment 31 Brice Goglin 2007-10-16 10:01:28 UTC
> Yet, it still appears with 2.6.23, when I apply an external disk head park
> patch. When I had this problem on 2.6.22, I was also using that patch, may it
> be the cause of this problem, or just a catalyst?

It cannot be the cause. I observed the problem a lot on a totally unpatched 2.6.22 kernel on my T43 (and also on a R52 with Debian 2.6.22 kernel). I never reproduced the problem with 2.6.23 on any machine, but I still don't know how the problem got fixed.
Comment 32 Tanguy Ortolo 2007-10-17 14:34:28 UTC
Well, without that patch, hdapsd does not start, and I think the hdaps driver causes lots of interrupts, may it be linked?
Comment 33 Alexis Guillaume 2007-10-19 01:32:53 UTC
Since I stopped automatically loading the hdaps module at startup, even though I never used hdaps except from time to time to awe some friends, that awful seems to be gone. 
My thinkpad is a t42 and is running a 2.6.22-14 ubuntu kernel.
I hope this is not just plain luck....
Comment 34 Henrique de Moraes Holschuh 2007-10-19 05:40:02 UTC
Here is some data about HDAPS really work, that is ignored by the module in the Linux kernel.

The thinkpad embedded controller is a Renesas H8S/2140 series chip.  There are three communication channels to it, one which emulates the keyboard controller, one that is the standard ACPI EC interface, and a third one that is used by HDAPS, smart battery queries, and "application level" access to the EC (to be used by IBM's ThinkVantage suite, most probably).

The Renesas EC serves these channels as *interrupts*.  It is all in the Renesas documentation (which is public).  If IBM implemented the entire command handling of one of the channels in the interrupt handler, that means a lot of activity in that channel will interfere with the other activities of the EC.  It will "hog the CPU" (in this case, the EC), if you will.  If they implemented all three channels entirely inside the interrupt handlers... well, things will be even worse :-)

So, if removing HDAPS is making a difference for you, that means the EC is being overworked somehow by either ACPI or HDAPS modules (yes, it could be either).

Please try the out-of-tree driver tp_smapi's version of HDAPS (google for tp_smapi).  It is a damn great deal more intelligent on the way it talks to the Renesas EC, so it is a lot nicer to the EC if something weird is happening.

I have a T43, and 2.6.22 with tp_smapi's HDAPS in full use does not cause any trouble at all.  And the EC runs in interrupt mode just fine.
Comment 35 Toralf Förster 2007-11-05 11:35:46 UTC
At my T41 I experienced strange things with a 2.6.22-gentoo-r9-pppoe kernel and hdaps built as a modul.
The ACPI event for pressing <Fn>+<F3|12> didn't worked anymore, the <Fn>+<F4> event was mapped to "button/sleep SLPB 00000080" instead to "ibm/hotkey HKEY 00000080 00001003" and so on. Kicking off the hdaps module from the kernel config removed this strange things.
Comment 36 Len Brown 2008-01-08 23:53:26 UTC
Filip: (comment #18)
> Toshiba Satellite A30-213

Please open a new bug report if you can reproduce the failure
with 2.6.23 or later.

Brice: (comment #22)
> For the record, I have been running latest 2.6.23-rc
> kernels for a while now, and the bug never occurred again on my T43.

Tanguy: (comment #30)
> I tried kernel 2.6.23, and I do not have that problem anymore.

Alexis: (comment #33)
> Since I stopped automatically loading the hdaps module at startup...
> that awful seems to be gone. 
> My thinkpad is a t42 and is running a 2.6.22-14 ubuntu kernel.

Toralf (comment #35)
> 2.6.22-gentoo-r9-pppoe
> Kicking off the hdaps module from the kernel config
> removed this strange things.

Okay, 2.6.22 had some mystery problem, likely related to the EC.
2.6.22 ec_intr=1 seemed to help, but the problem was gone in 2.6.23.
2.6.22 users of the hdaps driver had trouble -- dunno if that is
still true in 2.6.23 kernels; but the original problem seems to be gone.

closed.