Bug 11884
Description
Alex Shumakovitch
2008-10-28 23:01:36 UTC
Created attachment 18487 [details]
dmesg output
Created attachment 18488 [details]
lspci -v output
Created attachment 18489 [details]
acpidump output
Alex,
Could you please check if 2.6.28-rc2 works any better? It looks like the ACPI interpreter does not handle your DSDT correctly, leaving the EC uninitialized.

(In reply to comment #4)
> Alex,
> Could you please check if 2.6.28-rc2 works any better? It looks like ACPI
> interpreter does not handle your DSDT correctly, leaving EC uninitialized.

2.6.28-rc2 freezes on boot after the
[ 1.295918] ACPI: Thermal Zone [DTSZ] (48 C)
message, that is, right before EC initialization, and produces "BUG: soft lockup" errors approximately every minute. This is exactly the behaviour that I've had with kernels < 2.6.27.

By the way, kernel compilation resulted in
Oct 29 11:53:29 helix kernel: [ 3272.768498] BUG: soft lockup - CPU#1 stuck for 61s! [gcc:3212]
after approximately one hour. I don't know whether this is related to my ACPI troubles :-(

--- Alex.

I don't see anything obviously wrong with the DSDT. It loads here correctly, and the EC _REG method works without error. Of course, this is not running on the actual hardware, so the behavior could be different.

Yes, this sounds familiar -- there was a bug #11418 sounding very similar... "hpet=disable" helped there...

I've tried to boot 2.6.28-rc2 with "hpet=disable", but this didn't make a difference. In any case, the patch resolving bug 11418 is already incorporated into the kernel, so it doesn't seem to be related.

On a side note, this laptop is listed as "SuSE Linux Enterprise Desktop 10 Certified" on HP's web page. I've tried to boot it with OpenSuSE 11.0, but the live CD froze as well. I don't really know whether there is any difference between the "enterprise" and "open" kernels, and I'm a bit reluctant to give all my personal information just to download the trial version of the Enterprise Desktop. Does it make sense to try?

Thanks,
--- Alex.

Probably it will be the same -- HP is known to be easy with Linux certification on notebooks. Did you try nohz=off and highres=off too?
You could enable kernel debugging of locks in the kernel config -- it may give us a clue on what is wrong...

nohz=off and highres=off make no difference for 2.6.28-rc2.

Which options in the kernel should I enable to debug locks? I believe I have most of them, since there is plenty of output after the soft lockup occurs. Unfortunately, I have no way to capture it during boot, since the laptop lacks a serial port.

With the 2.6.27.4 kernel, I now suspect that the problem is related to video, since all lockups during kernel compilation that I've seen (a couple already) occurred when the screen saver was trying to kick in. Does it do anything ACPI-related? I've just compiled the kernel two times in a row from the text VC without problems.

Anyway, I'll attach the complete error message produced after a gcc soft lockup in a second.

Thanks,
--- Alex.

Created attachment 18506 [details]
GCC soft lockup error messages while compiling the kernel
Please let me know if this is the kind of output that you
are looking for. I'm not good at kernel debugging, sorry.
I can confirm this problem, as I have the same system regarding software and hardware.

Alex, it's always possible to take a screenshot with any digital camera; there are plenty of them around these days...

Michael,
There are two problems here, probably related -- one is the soft lockups, the other is the absent ACPI information. Which one bothers you?

Apparently both. However, I didn't use the patch stated in your first post. I also tried out the Release Candidate of Ubuntu 8.10, which features the 2.6.27 kernel; there I get permanent soft lockups on both CPUs, which leaves the system unbootable. This seems quite strange to me. One wild guess is that this problem could be related to the architecture, because for my Debian system I use amd64, while when trying Ubuntu I used i386.

Created attachment 18515 [details]
try the custom DSDT

It seems that this is an obvious BIOS bug. PU1T is defined as follows:
>Name (PU1T, Package (0x01)
>{
>    Package (0x08)
>    {
>        0x00, 0x0B, 0x1F, 0x29, 0x33, 0x3D, 0x51, 0x51
>    }
>})
But it is accessed by "DerefOf (Index (PU1T, 0x01))" in PSWT, which is called by PRIT. In that case the OS reports the following error messages and the EC device can't be initialized correctly.
>ACPI Error (dswstate-0097): Result stack is empty! State=f74a5e00 [20080609]
>[ 0.248422] ACPI Exception (dsutils-0645): AE_AML_NO_RETURN_VALUE, Missing or null operand [20080609]
>[ 0.248578] ACPI Exception (dsutils-0762): AE_AML_NO_RETURN_VALUE, While creating Arg 0 [20080609]

Will you please try the custom DSDT and see whether the problem still exists? In the custom DSDT the error about PU1T is corrected.
How to use the custom DSDT can be found in http://www.lesswatts.org/projects/acpi/faq.php

Thanks.

Created attachment 18517 [details]
without additional parameters
Created attachment 18518 [details]
dmesg with parameter acpi_no_auto_ssdt
In both cases, with and without acpi_no_auto_ssdt, the folder /proc/acpi/battery/ is now available; however, the parse exception still occurs. Additionally, with acpi_no_auto_ssdt the CPU frequency scaling stopped working. I tried with the 2.6.27.2 kernel because I had it locally available. I will try again later with 2.6.27.4.

Created attachment 18523 [details]
dmesg on 2.6.27.4 kernel
Now I tried out the custom DSDT with the 2.6.27.4 kernel; however, there is no improvement over 2.6.27.2.
Created attachment 18524 [details]
My dmesg for 2.6.27.4 with patched DSDT
Not much difference for me either. And I don't even have the
/proc/acpi/battery directory.
I'm now going to try to compile 2.6.28-rc2 with this patch to see whether
it helps there.
(In reply to comment #15)
Actually, access to PU1T should be prohibited by CUZO[] being set to all 0xFF in RETD, called from _SB.INI. So the call to _SB.INI _must_ be the first function to be called. If any other INI function is called before it -- it will fail miserably.
We could try to pre-init CUZO with {0xFF,0xFF,0xFF,0xFF,0xFF,0xFF} instead of {} and see what happens...

Created attachment 18525 [details]
Soft lockup during boot process with 2.6.28-rc2 (screen shot 1)
OK, I've tried 2.6.28-rc2 with the DSDT patch. No difference.
I'm going to attach a couple of screen shots of the BUG: soft lockup
messages. The problem, of course, is that the complete error
message doesn't fit into one screen, even when I try vga=0x305
boot option.
Created attachment 18526 [details]
Soft lockup during boot process with 2.6.28-rc2 (screen shot 2)
Created attachment 18527 [details]
Soft lockup during boot process with 2.6.28-rc2 (screen shot 3)
Created attachment 18528 [details]
Enable/disable GPE under spinlock
Thanks for the screenshots. Do you have Kernel hacking -> Lock debugging: prove locking correctness turned on? If not, could you try to run with it?
Could you please check this patch?
Also, it is possible to enable the DEBUG mode of the EC driver by uncommenting #define DEBUG at the very beginning of drivers/acpi/ec.c
This patch is for 2.6.28-rc2, right? 3 chunks were rejected for 2.6.27.4.

bugme-daemon@bugzilla.kernel.org wrote:
> This patch is for 2.6.28-rc2, right? 3 chunks were rejected for
> 2.6.27.4.
Right. The EC driver is quite different in rc2.

Created attachment 18531 [details]
Configuration file for the 2.6.28-rc2 kernel
OK, now the booting just stopped after the first "ACPI: Thermal Zone"
message with no output at all over the next 10 minutes. I'm attaching
my kernel config file for you to check that all the right options are
selected. I will now try to compile 2.6.27.4 with "Lock debugging:
prove locking correctness" activated and DEBUG in ec.c uncommented.
Well, these changes (kernel option and DEBUG in ec.c) kill my 2.6.27.4 kernel as well. At the same place. Very strange.

Created attachment 18544 [details]
Soft lockup during boot process with 2.6.28-rc2 and patch from #25 applied (screen shot 4)
I've now built 2.6.28-rc2 with patch from #25, but without additional
kernel debugging options. The screen shots are attached (sorry
for the quality --- the lighting was really bad this time)
What is interesting though is that now it appears to be only
one type of soft lockups, while in the past there were two
distinct ones with visually different error messages (compare
#23 and #24) that appeared one after another with irregular
intervals. That's the only difference that I could notice.
Created attachment 18545 [details]
Soft lockup during boot process with 2.6.28-rc2 and patch from #25 applied (screen shot 5)
Created attachment 18546 [details]
Soft lockup during boot process with 2.6.28-rc2 and patch from #25 applied (screen shot 6)
Time to sleep in this part of the world now ;-)
Just wanted to add: I can confirm this bug on my machine (also a 2730p). Flashing the BIOS to F04 (F02 is the original version) did not change anything. Maybe you can report the fixed DSDT (once it works) back to HP for a future BIOS update?

Please look here: http://anholt.livejournal.com/40006.html

Wow! This DOES solve all the problems that I was experiencing. I did play with BIOS options, but it never occurred to me that this one might be at fault. Simply amazing.

What would be the proper course of action now? Is it possible to make a workaround in the kernel around this option (it's actually quite a useful one), or should one simply turn it off and not bother?

Thanks a lot for the help.
--- Alex.

Alex, which kernel and patch(es) did you use? For 2.6.27.4 and the custom DSDT the issue still exists.

I first tried 2.6.28-rc2 and, when everything worked (with lots of seemingly harmless debugging messages from ec.c), switched back to 2.6.27.4 with the custom DSDT and no other patches (except the intel-agp one, of course). I have the F04 version of the BIOS though. Could this be the culprit?

I've compiled the kernel twice in a row to test the fan throttling (worked great) without any side effects and tested suspend/resume as well.

The only thing that doesn't work so far is adjusting the screen brightness (I swear it worked at the beginning, but I don't remember with which kernel) and the built-in speakers and mics.

I got it working with 2.6.28-rc2. I still have the old BIOS version. I think the intel-agp patch is doing the trick with 2.6.27.4. I haven't used it so far.

I am trying to collect all information about the 2730p at: http://www.linlap.com/wiki/HP+EliteBook+2730P
Please add your knowledge.

(In reply to comment #37)
> I first tried 2.6.28-rc2 and when everything worked
So the problem cannot be reproduced in 2.6.28-rc2, right?
> (with lots of
> seemingly harmless debugging messages from ec.c)
This is another problem; please attach the dmesg output.
>
> The only thing that doesn't work so far is adjusting of the screen
> brightness (I swear it worked at the beginning, but I don't remember
> with which kernel) and built-in speakers and mics.
>
You can open another bug report, and attach the output of "grep . /sys/firmware/acpi/interrupts/*" both before and after pressing the power button.

It seems that this issue is related to the EC.
From the acpidump we can see that there is an error in evaluating the _REG object of the EC device, which prevents the EC device from being initialized correctly.
>[ 0.781695] ACPI Error (psparse-0530): Method parse/execution failed [\_SB_.PCI0.LPCB.EC0_.ECRI] (Node ffff88013ba6b9d0), AE_AML_NO_RETURN_VALUE
>[ 0.781975] ACPI Error (psparse-0530): Method parse/execution failed [\_SB_.PCI0.LPCB.EC0_._REG] (Node ffff88013ba6b9b0), AE_AML_NO_RETURN_VALUE
As the _REG object can't be executed successfully, the EC GPE handler will be uninstalled. But the ECRG flag is still set and the EC space handler isn't uninstalled, which means the EC internal registers will still be accessed from AML code. (The ECRG flag is set in the _REG object and indicates that the EC operation region is already accessible.)
>if (ACPI_FAILURE(status)) {
>        acpi_remove_gpe_handler(NULL, ec->gpe, &acpi_ec_gpe_handler);
>        return -ENODEV;
>}
In that case, as the EC device is not initialized correctly, the following warning messages will be reported.
>[ 11.260018] ACPI: EC: acpi_ec_wait timeout, status = 0xff, event = "b1=0"
>[ 11.260082] ACPI: EC: input buffer is not empty, aborting transaction
>[ 11.260147] ACPI Exception (evregion-0419): AE_TIME, Returned by Handler for [EmbeddedControl] [20080609]
>[ 11.260314] ACPI Error (psparse-0530): Method parse/execution failed [\_SB_.PCI0.LPCB.EC0_.BTIF] (Node ffff88013ba6db10), AE_TIME
>[ 11.260616] ACPI Error (psparse-0530): Method parse/execution failed [\_SB_.BTIF] (Node ffff88013ba72f10), AE_TIME
Thanks.

EC is not properly initialized because of the buggy _REG.
Re-assign to EC category.

Created attachment 19029 [details]
patch: Remove EC space handler explicitly when failing in _REG object
Will you please try the attached debug patch and see whether the following message still exists?
>ACPI Error (psparse-0530): Method parse/execution failed
[\_SB_.BAT0._BIF] (Node f7446ba8), AE_TIME
Please enable "CONFIG_ACPI_PROCFS_POWER" in kernel configuration.
Thanks.
Created attachment 19087 [details]
debug patch to find out where PSWT fails

>ACPI Error (dswstate-0097): Result stack is empty! State=f74a5e00 [20080609]
>ACPI Exception (dsutils-0645): AE_AML_NO_RETURN_VALUE, Missing or null operand
>ACPI Exception (dsutils-0762): AE_AML_NO_RETURN_VALUE, While creating Arg 0
>ACPI Error (psparse-0530): Method parse/execution failed [\_TZ_.PSWT] (Node f743b978), AE_AML_NO_RETURN_VALUE

Alex, let's try to find out where PSWT fails (I think it's the key point of this bug).
Would you please help to test the attached debug patch? And then attach the dmesg with this patch applied.
Thanks

Yakui and Lin,

I didn't have a chance to try your patches yet. Sorry. I will have a go at them tomorrow. Which kernel do I have to apply them to, by the way? 2.6.27.7?

--- Alex.

> I will have a go at them tomorrow. Which kernel do I have
> to apply them to, by the way? 2.6.27.7?
2.6.27.7 is OK
Created attachment 19118 [details]
dmesg of the vanilla 2.6.27.7 kernel with the "fan while AC is on" BIOS option disabled

As it turns out, 2.6.27.7 does _not_ boot with the "fan while AC is on" BIOS option enabled (soft lockup). Booting the kernel with this option disabled doesn't produce any remarkable output. No additional debug messages from the patch in #44 were spotted. I'm sure that the code was compiled in because of this:

sudo strings /proc/kcore | grep 1188
Bug 11884 debug begin
Bug 11884 debug end

Anyway, dmesg's for the vanilla and patched kernels are attached. I'll try to apply this patch to 2.6.27.4 tomorrow. This is the latest version that I know of that boots with that BIOS option enabled.

Thanks,
--- Alex.

Created attachment 19119 [details]
dmesg of the 2.6.27.7 kernel with patch from #44 applied and with the "fan while AC is on" BIOS option disabled
Created attachment 19133 [details]
dmesg of the 2.6.27.4 kernel with patch from #44 applied and with the "fan while AC is on" BIOS option enabled
OK, here is the output for the 2.6.27.4 kernel. What next?
--- Alex.
Created attachment 19134 [details]
dmesg of the 2.6.27.4 kernel with patch from #44 applied and with the "fan while AC is on" BIOS option disabled
(In reply to comment #49)
> Created an attachment (id=19133) [details]
> dmesg of the 2.6.27.4 kernel with patch from #44 applied and with the "fan
> while AC is on" BIOS option enabled
>
> OK, here is the output for the 2.6.27.4 kernel. What next?
>
> --- Alex.
>
Thanks Alex,
Sorry for the delay, I just came back from vacation.

Method(PSWT)
{
    ...
    //PSWT fails here
    Store (DerefOf (Index (CUZO, 0x00)), Local0)
    ...
}

Created attachment 19222 [details]
debug patch
Store (DerefOf (Index (CUZO, 0x00)), Local0)
DerefOf Op pushes an object
--> something goes wrong here: the object is popped before the Store Op is executed
Store Op pops the object
Alex, please apply this debug patch to see who pops the object.
I believe what is happening here is that the CUZO Package object has not been initialized and the DerefOf is failing because of this.

Looking at the DSDT, CUZO is initialized in the RETD method. This method is called from two places: _SB_._INI and HWAK.

So it would appear that \_SB_.PCI0.LPCB.EC0_._REG is being called before _SB_._INI is called, and CUZO is uninitialized.

A quick workaround would be to statically initialize CUZO:

    Name (CUZO, Package (0x06) {0xFF,0xFF,0xFF,0xFF,0xFF,0xFF})

The real fix will be to figure out why the Embedded Controller _REG method is being called before the _INI methods are run. (Also, a better error code in this case would be appropriate.)

Here is my test code that reproduces the problem. If MAIN is run first, it will fail. If INI is run before MAIN, MAIN will not fail.

DefinitionBlock ("", "DSDT", 1, "Intel", "Test", 1)
{
    Name (CUZO, Package (0x06) {})

    Method (MAIN, 0, NotSerialized)
    {
        Store (DerefOf (Index (CUZO, 0x00)), Local0)
        Return ()
    }

    Method (INI)
    {
        Store (0x00, Local0)
        While (LLess (Local0, 0x06))
        {
            Store (0xFF, Index (CUZO, Local0))
            Increment (Local0)
        }
    }
}

(In reply to comment #53)
> A quick workaround would be to statically initialize CUZO:
>
> Name (CUZO, Package (0x06) {0xFF,0xFF,0xFF,0xFF,0xFF,0xFF})
>
Alex, would you please help to test this quick workaround?
See http://www.lesswatts.org/projects/acpi/overridingDSDT.php for info about custom DSDT.

Created attachment 19245 [details]
dmesg of the 2.6.27.4 kernel with patches from #44 and #52 applied and with the "fan while AC is on" BIOS option enabled
There was some debugging output with the BIOS option both enabled
and disabled (next attachment) this time.
Concerning the "quick workaround", where in my DSDT.dsl file do
I have to add this line? Sorry, I'm just afraid to mess things up.
--- Alex.
Created attachment 19246 [details]
dmesg of the 2.6.27.4 kernel with patches from #44 and #52 applied and with the "fan while AC is on" BIOS option disabled
(In reply to comment #55)
> Concerning the "quick workaround", where in my DSDT.dsl file do
> I have to add this line? Sorry, I'm just afraid to mess things up.

Apply the patch below to your DSDT.dsl:

--- orig.DSDT.dsl 2008-12-11 09:35:04.000000000 +0800
+++ DSDT.dsl 2008-12-11 09:35:32.000000000 +0800
@@ -875,7 +875,7 @@ DefinitionBlock ("DSDT.aml", "DSDT", 1,
     Name (OSTH, 0x00)
     Name (LARE, Package (0x06) {})
     Name (LARP, Package (0x06) {})
-    Name (CUZO, Package (0x06) {})
+    Name (CUZO, Package (0x06) {0xFF,0xFF,0xFF,0xFF,0xFF,0xFF})
     Mutex (THER, 0x00)
     Name (THSC, 0x3D)
     Name (THOS, 0x00)

Created attachment 19257 [details]
dmesg of the 2.6.27.4 kernel with patches from #44 and #52 applied, custom DSDT from #57 and with the "fan while AC is on" BIOS option enabled
This "quick workaround" seems to be working!!! I managed to boot both 2.6.27.4 and 2.6.27.7 (dmesg in the next message) kernels without error messages.
So what should be the "right solution" then? Patch DSDT? Disable BIOS option?
Bug HP to fix their BIOS?
Thanks a lot!!
--- Alex.
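
For anyone following the CUZO discussion from #53: the failure and the effect of the static initialization can be imitated outside ACPI. What follows is only a rough Python sketch, not ACPICA code. The Package class, the AmlNoReturnValue exception and the function names retd and pswt are made up for illustration. It assumes that a Package declared with a size but no initializer has uninitialized elements, and that dereferencing such an element is what fails.

```python
class AmlNoReturnValue(Exception):
    """Hypothetical stand-in for the AE_AML_NO_RETURN_VALUE error seen in dmesg."""


class Package:
    """Toy model of an ASL Package: fixed size, elements may be uninitialized."""

    def __init__(self, size, initial=()):
        if len(initial) > size:
            raise ValueError("initializer longer than package")
        # Elements without an initializer stay uninitialized (None).
        self.elements = list(initial) + [None] * (size - len(initial))

    def deref(self, index):
        """Imitates DerefOf (Index (pkg, index))."""
        value = self.elements[index]
        if value is None:
            raise AmlNoReturnValue(f"element {index} is uninitialized")
        return value


def retd(cuzo):
    """Imitates the loop that fills CUZO with 0xFF (RETD in the DSDT, INI in the test code)."""
    for i in range(len(cuzo.elements)):
        cuzo.elements[i] = 0xFF


def pswt(cuzo):
    """Imitates the failing statement: Store (DerefOf (Index (CUZO, 0x00)), Local0)."""
    return cuzo.deref(0)


if __name__ == "__main__":
    empty = Package(6)                    # Name (CUZO, Package (0x06) {})
    try:
        pswt(empty)
    except AmlNoReturnValue as err:
        print("PSWT before RETD fails:", err)

    patched = Package(6, [0xFF] * 6)      # the statically initialized CUZO
    print("PSWT with patched CUZO:", hex(pswt(patched)))
```

In this model the statically initialized package makes pswt independent of whether retd has run yet, which is all the workaround relies on. It does nothing about the ordering itself.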
Created attachment 19258 [details]
dmesg of the 2.6.27.7 kernel with patches from #44 and #52 applied, custom DSDT from #57 and with the "fan while AC is on" BIOS option enabled
(In reply to comment #58)
> So what should be the "right solution" then? Patch DSDT? Disable BIOS option?
> Bug HP to fix their BIOS?

As Bob mentioned at #53, "The real fix will be to figure out why the Embedded Controller _REG method is being called before the _INI methods are run."
Thanks for the test.

void __init acpi_early_init(void)
{
        .....
        status = acpi_ec_ecdt_probe();
        /* Ignore result. Not having an ECDT is not fatal. */

        status = acpi_initialize_objects(ACPI_FULL_INITIALIZATION);
        .....
}

acpi_ec_ecdt_probe
 -> ec_install_handlers
  -> acpi_install_address_space_handler
   -> acpi_ev_execute_reg_methods
    -> _REG method is called

acpi_initialize_objects
 -> acpi_ns_initialize_devices
  -> acpi_ns_init_one_device
   -> _INI method is called

This is why the _REG method is called before the _INI method.

Created attachment 19310 [details]
limit workarounds for ASUS
Please check if this patch works without modified DSDT.
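
The call chains in #60 come down to an ordering problem, which a small model makes easier to see. This is a hedged Python sketch only. The function boot, the install_ec_early switch and the cuzo_ready flag are invented for illustration, and the comments merely point at the kernel functions named in #60. It is not a description of what the attached patch changes, and the assumption that the EC handlers could instead be installed after object initialization is mine.

```python
def boot(install_ec_early):
    """Return (order in which the toy _REG/_INI methods ran, result of _REG)."""
    trace = []
    state = {"cuzo_ready": False}

    def sb_ini():
        # Models \_SB._INI calling RETD, which fills CUZO.
        trace.append("_INI")
        state["cuzo_ready"] = True

    def ec_reg():
        # Models EC0._REG, which ends up dereferencing CUZO via PSWT.
        trace.append("_REG")
        return "ok" if state["cuzo_ready"] else "AE_AML_NO_RETURN_VALUE"

    reg_status = None
    if install_ec_early:
        # acpi_ec_ecdt_probe -> ... -> _REG method is called
        reg_status = ec_reg()

    # acpi_initialize_objects -> ... -> _INI method is called
    sb_ini()

    if not install_ec_early:
        # Assumption: handlers are installed at some later point instead.
        reg_status = ec_reg()

    return trace, reg_status


if __name__ == "__main__":
    print(boot(install_ec_early=True))
    print(boot(install_ec_early=False))
```

With install_ec_early set, the toy _REG runs against an unfilled CUZO and reports the same error name as in the logs. Without it, _INI has already run and _REG succeeds.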
Yes, this patch does solve the problem (at least, for 2.6.27.7). Amazing!

Do I understand correctly that the bug in ec.c was that despite the comment "We really need to limit this workaround, the only ASUS, which needs it ...", the check was never performed? I wonder how this can be related to the fan being always on or off.

Anyway, thanks a lot!!
--- Alex.

At the time this comment was written, I was trying to limit early EC registration by looking at the presence of the EC._INI field -- which is quite rare. Now your machine has it too, but does not want early registration -- thus we need to add one more check, "ASUS only".
Relation to fan -- if we fail to properly init the EC driver and device, it may decide to keep the fan in the state set by BIOS, or something safe... you never know what a BIOS engineer thinks :)

Created attachment 19835 [details]
patch vs 2.6.29-rc1
refreshed patch applied to acpi tree
patch in comment #65 shipped in linux-2.6.29-rc2
closed