Most recent kernel where this bug did *NOT* occur:
Distribution: Debian sid
Hardware Environment: Acer Travelmate 4001 lmi, Intel Centrino 1.5 GHz
Software Environment:
Problem Description: When the thermal module is loaded, I get plenty of problems: kernel bugs, kernel oopses and sometimes a kernel panic. Not loading the thermal module seems to solve the problem, but I am not totally sure that it is enough to keep my computer from freezing. I have attached an extract of the kernel log. It contains around 2000 lines of ACPI-related errors!
Steps to reproduce: simply use the computer while the thermal module is loaded.
Created attachment 11307 [details] kernel log extract
Does this happen in some earlier release, e.g 2.6.21-rc3?
It has happened at least since 2.6.21-rc4. I have not tested earlier pre-releases of 2.6.21, but I had never managed to get an extract of a kernel log until now.
It could be that I'm seeing the same thing on my Acer TM660 laptop. I only get a partial trace, also in kacpid. The bug triggers when compiling, that is, with the CPU running at maximum speed and load. It seems to happen when the fan starts running to cool the CPU down. I have not yet tried the final release, only -rc7. The same does not happen with 2.6.20.1.

I have ACPI built in (thermal, with the sensor available and working with 2.6.20 and older releases):

#
# ACPI (Advanced Configuration and Power Interface) Support
#
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_SLEEP_PROC_FS=y
# CONFIG_ACPI_SLEEP_PROC_SLEEP is not set
# CONFIG_ACPI_PROCFS is not set
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_VIDEO=m
CONFIG_ACPI_FAN=y
# CONFIG_ACPI_DOCK is not set
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_THERMAL=y
# CONFIG_ACPI_ASUS is not set
# CONFIG_ACPI_IBM is not set
# CONFIG_ACPI_TOSHIBA is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_EC=y
CONFIG_ACPI_POWER=y
CONFIG_ACPI_SYSTEM=y
CONFIG_X86_PM_TIMER=y
# CONFIG_ACPI_CONTAINER is not set
# CONFIG_ACPI_SBS is not set

Oops: 0000 [#1]
Modules linked in: i915 drm cpufreq_conservative squashfs zlib_inflate loop nfs lockd nfs_acl sunrpc snd_pcm_oss snd_mixer_oss xfs usb_storage acerhk b44 nsc_ircc irda crc_ccitt sr_mod cdrom ehci_hcd snd_intel8x0m uhci_hcd usbcore snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_timer snd snd_page_alloc i2c_i801 sg evdev pcspkr
CPU: 0
EIP: 0060:[<c0104aa1>] Not tainted VLI
EFLAGS: 00010246 (2.6.21-rc7 #1)
EIP is at dump_trace+0x64/0xb0
eax: 00000000 ebx: dde01fe0 ecx: c038a8a8 edx: c03482c1
esi: 0001e000 edi: c034938d ebp: dde0010c esp: dde000fc
ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068
Process kacpid (pid: 28, ti=dddfe000 task=c1457a50 task.ti=dde00000)
Stack: dc31c3c0 c034938d ffff755c dde001ec dde00120 c0104baa c038a8a8 c034938d
       dde00194 dde0012c c0104bd2 c034938d dde00138 c0104cf6 dde00144 dde00184
       c02fef98 c034a130 c1457bdc ffff0005 0000001c 00000000 dde45f80 00000000
Call Trace:
It's indeed always happening when the CPU load is high. It also happens sometimes when hald is started during the boot process.
any difference if you boot with "ec_intr=0"?
The stack trace shows recursive notify handlers. Have seen these on HP nx6325 -- interesting to find them on Acer as well. Please attach the output from acpidump.

Unclear, however, why this code is running with interrupts off.

BUG: scheduling while atomic: kacpid/0xce0b2a80/24
[<c0104e3a>] show_trace_log_lvl+0x1a/0x30
[<c01054e2>] show_trace+0x12/0x20
[<c0105596>] dump_stack+0x16/0x20
[<c02c33ac>] __sched_text_start+0x43c/0x5c0
[<c02c3b98>] schedule_timeout+0x48/0xc0
[<c021d4ad>] acpi_ec_wait+0xf1/0x151
[<c021d628>] acpi_ec_transaction+0x11b/0x1c5
[<c021d7c1>] acpi_ec_write+0x30/0x32
[<c021d8c5>] acpi_ec_space_handler+0x9c/0x163
[<c0208bfa>] acpi_ev_address_space_dispatch+0x16c/0x1b9
[<c020cc00>] acpi_ex_access_region+0x203/0x217
[<c020cd28>] acpi_ex_field_datum_io+0x114/0x1a5
[<c020d0d9>] acpi_ex_write_with_update_rule+0x110/0x118
[<c020d252>] acpi_ex_insert_into_field+0x171/0x29f
[<c020b940>] acpi_ex_write_data_to_field+0x20e/0x226
[<c020fd5c>] acpi_ex_store_object_to_node+0x70/0xa6
[<c020feff>] acpi_ex_store+0xe8/0x23d
[<c020dc1d>] acpi_ex_opcode_1A_1T_1R+0x3c0/0x53e
[<c020627f>] acpi_ds_exec_end_op+0xca/0x3db
[<c0215286>] acpi_ps_parse_loop+0x56f/0x715
[<c02146fd>] acpi_ps_parse_aml+0x68/0x246
[<c021597e>] acpi_ps_execute_method+0x11f/0x1c1
[<c0212bb8>] acpi_ns_evaluate+0xa0/0x100
[<c0212810>] acpi_evaluate_object+0x120/0x1c0
[<c021f8d1>] acpi_power_on+0xc2/0x110
[<c021fc3a>] acpi_power_transition+0x78/0xf7
[<c021b949>] acpi_bus_set_power+0xe8/0x185
[<e15c31dc>] acpi_thermal_active+0x6a/0xe5 [thermal]
[<e15c34e6>] acpi_thermal_check+0x28f/0x39b [thermal]
[<e15c399f>] acpi_thermal_notify+0x39/0x61 [thermal]
[<c0209758>] acpi_ev_queue_notify_request+0xd9/0xf4
[<c020f9c8>] acpi_ex_opcode_2A_0T_0R+0x68/0x98
[<c020627f>] acpi_ds_exec_end_op+0xca/0x3db
[<c0215286>] acpi_ps_parse_loop+0x56f/0x715
[<c02146fd>] acpi_ps_parse_aml+0x68/0x246
[<c021597e>] acpi_ps_execute_method+0x11f/0x1c1
[<c0212bb8>] acpi_ns_evaluate+0xa0/0x100
[<c0212810>] acpi_evaluate_object+0x120/0x1c0
[<c021f8d1>] acpi_power_on+0xc2/0x110
[<c021fc3a>] acpi_power_transition+0x78/0xf7
[<c021b949>] acpi_bus_set_power+0xe8/0x185
[<e15c31dc>] acpi_thermal_active+0x6a/0xe5 [thermal]
[<e15c34e6>] acpi_thermal_check+0x28f/0x39b [thermal]
[<e15c399f>] acpi_thermal_notify+0x39/0x61 [thermal]
[<c0209758>] acpi_ev_queue_notify_request+0xd9/0xf4
[<c020f9c8>] acpi_ex_opcode_2A_0T_0R+0x68/0x98
[<c020627f>] acpi_ds_exec_end_op+0xca/0x3db
[<c0215286>] acpi_ps_parse_loop+0x56f/0x715
[<c02146fd>] acpi_ps_parse_aml+0x68/0x246
[<c021597e>] acpi_ps_execute_method+0x11f/0x1c1
[<c0212bb8>] acpi_ns_evaluate+0xa0/0x100
[<c0212810>] acpi_evaluate_object+0x120/0x1c0
[<c021f8d1>] acpi_power_on+0xc2/0x110
[<c021fc3a>] acpi_power_transition+0x78/0xf7
[<c021b949>] acpi_bus_set_power+0xe8/0x185
[<e15c31dc>] acpi_thermal_active+0x6a/0xe5 [thermal]
[<e15c34e6>] acpi_thermal_check+0x28f/0x39b [thermal]
[<e15c399f>] acpi_thermal_notify+0x39/0x61 [thermal]
[<c0209758>] acpi_ev_queue_notify_request+0xd9/0xf4
[<c020f9c8>] acpi_ex_opcode_2A_0T_0R+0x68/0x98
[<c020627f>] acpi_ds_exec_end_op+0xca/0x3db
[<c0215286>] acpi_ps_parse_loop+0x56f/0x715
[<c02146fd>] acpi_ps_parse_aml+0x68/0x246
[<c021597e>] acpi_ps_execute_method+0x11f/0x1c1
[<c0212bb8>] acpi_ns_evaluate+0xa0/0x100
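The repetition in the trace above can be made explicit by counting how often each function shows up. The following is only a minimal sketch, not an existing kernel tool: the helper name `count_frames` and the abbreviated `SAMPLE` input are illustrative assumptions, and the regular expression assumes the `[<address>] symbol+0xOFF/0xLEN` frame layout seen in this report.

```python
import re
from collections import Counter

# Matches frames such as "[<e15c34e6>] acpi_thermal_check+0x28f/0x39b [thermal]"
# and captures only the symbol name.
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+([A-Za-z_]\w*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def count_frames(trace: str) -> Counter:
    """Count how many times each function appears in an oops call trace."""
    return Counter(FRAME_RE.findall(trace))


# Abbreviated, illustrative input: the notify -> check -> active chain,
# repeated three times as in the trace quoted in this report.
SAMPLE = """
[<e15c31dc>] acpi_thermal_active+0x6a/0xe5 [thermal]
[<e15c34e6>] acpi_thermal_check+0x28f/0x39b [thermal]
[<e15c399f>] acpi_thermal_notify+0x39/0x61 [thermal]
[<c0209758>] acpi_ev_queue_notify_request+0xd9/0xf4
""" * 3

counts = count_frames(SAMPLE)
# Any symbol that occurs more than once is a hint at nested (recursive) calls.
repeated = {name: n for name, n in counts.items() if n > 1}
print(repeated)
```

Fed with a full trace saved from dmesg, a function such as acpi_thermal_notify appearing several times on one stack is the signature of the nested notify handlers mentioned above.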
Created attachment 11326 [details] Output of acpidump
Using ec_intr=0 doesn't solve the problem at all. I even get a kernel panic almost every time during startup, when hald starts. It shows plenty of errors with "acpi_ns_evaluate", which seem similar to the ones listed in the log I have posted. The last message I see is "Bad EIP value".
Could you please post whole .config?
Created attachment 11327 [details] Kernel configuration file Here it is
Here is part of the message I see when there is a kernel panic:

EIP 0060:[<c012072d>] Tainted P VLI
EIP is at run_timer_softirq+0x14d/0x160
...
EIP [<c012072d>] run_timer_softirq+0x14d/0x160 SS:ESP 0068:c14b81bc
...
general protection fault 0000
EIP is at complete+0xa/0x40
Process kacpid
...

Please let me know if you want more details. The kernel is tainted because I use the hsf drivers. However, I have already tested without them on release candidates of 2.6.21, and it made no difference. I can test again without the hsf drivers loaded if you want. Thanks for your help.
Your config mentions suspend2, could you please try without it?
Yet another problem. This is not a kernel panic, but the computer is frozen anyway when it occurs:

EIP 0060 [<c0104dca>] not tainted VLI
EFLAGS 00010246 [2.6.21.1 #8]
EIP is at dump_trace+0x6a/0xc0
process kacpid (pid:24, ti=c14b8000 task dfe1e070 task.ti=c14b8000
...
general protection fault: 0000 #2
EIP 0060 [<c01164fa>] not tainted
EIP is at complete +0xa/0x40
process kacpid (pid:24, ti=c14b6000 task dfe1e070 task.ti=c14b8010
So, I have tried without suspend2. I almost thought the problem was gone, but that is not the case. It takes longer for the problem to show up, but it occurs anyway.
ok, do you have any other off-tree patches applied?
I use the vesafb-tng and fbsplash patches made by Spock. In fact, I have already tested release candidates of 2.6.21 without any extra patches, and the problem occurs there too.
then let's see your config/dmesg from those vanilla kernels.
Created attachment 11328 [details] dmesg of kernel 2.6.21 (official vanilla sources)
Created attachment 11329 [details] kernel configuration 2.6.21, official vanilla sources
Created attachment 11330 [details] ACPI errors occuring before the kernel panic
I have now recompiled a pure vanilla kernel, and this time the problem occurs at most 2 minutes after the end of the boot process. It always ends in a kernel panic. Shortly before that, I get the same errors as in the first kernel log extract (see new attachment). Then the kernel panic occurs like this:

die+0xe/0x1d0
do_page_fault +0x227/+0x610
error_code +0x74/+0x7c
_wake_up +0x22/+0x30
_queue_work +0x26/+0x40
kblockd_schedule_work 0xf/0x20
blk_unplug_timeout +0xb/+0x10
run_timer_softirq +0xb/+0x10
_do_softirq +0x42/+0x90
do_softirq +0x2a/+0x30
irq_exit +0x2/+0x40
do_IRQ +0x45/+0x80
common_interrupt +0x23/+0x28
cpu_idle +0x39/+0x50
start_kernel
EIP [<c14b7fe3>] 0xc147bfe3 SS:ESP 0068:c037d80
Kernel panic: not syncing: fatal exception in interrupt.
Interesting point: with the following sequence the crash no longer happens (immediately) under heavy CPU load alone, when the temperature reaches the threshold at which the fan starts:

boot without thermal (built as module)
modprobe thermal
rmmod thermal
modprobe thermal

But later, while running, the system crashed anyway. (gkrellm was in D state, and the system crashed during or at the end of the sysrq+t I issued to determine why it was stuck in that state; I assume some procfs reading.) Same partial trace as I usually get, but preceded by the recursive notifications that Fran
Could you please try to set the stack debugging options:

# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_4KSTACKS is not set
Created attachment 11331 [details] Undo sync execution of Notify Please try to apply this patch and see if problem goes away...
Created attachment 11332 [details] Execute Notify on other thread If above patch helps, please try if things still work with this patch applied.
It seems to work much better with both patches you posted. I was able to compile a kernel and to build a kernel-header package, which I had never been able to do before with kernel 2.6.21. Previously, I always got a kernel panic during one of those two operations. Also, I no longer get a kernel panic as early as 2 minutes after bootup. So I hope everything works fine now. But could you explain what your two patches do? Thanks for your help, Fran
Well, there is a long story in bug 5534; take a look if you are interested. Basically, the Notify operator of the AML interpreter needs to execute some arbitrary C code, which may call the AML interpreter again. The latest HP notebooks were known to deadlock if we schedule notify execution after the code that issues it. Thus the patch you just applied was invented (it executes notify on a separate thread). At some point it was decided that having another thread is dangerous, and that executing notify in place (thus having several AML interpreters on a single stack) is less dangerous. Thus the patch you just reverted. In your case it seems to overflow the stack, which was predicted as one of the dangers of that patch.
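The trade-off described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not kernel code: `Interpreter`, `thermal_handler` and `max_nesting` are hypothetical names, and a simple work queue stands in for the separate thread. It compares how deeply handlers nest when a Notify runs in place with how deeply they nest when it is deferred.

```python
from collections import deque


class Interpreter:
    """Toy stand-in for an AML interpreter (hypothetical, not ACPICA).

    Notify handlers run either in place, nested on the caller's stack,
    or deferred through a queue that is drained later.
    """

    def __init__(self, in_place: bool):
        self.in_place = in_place
        self.pending = deque()
        self.depth = 0       # current nesting of handler calls
        self.max_depth = 0   # deepest nesting observed

    def notify(self, handler, remaining):
        if self.in_place:
            self._run(handler, remaining)          # nested on the current stack
        else:
            self.pending.append((handler, remaining))  # executed later

    def _run(self, handler, remaining):
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)
        handler(self, remaining)
        self.depth -= 1

    def drain(self):
        # Stand-in for the worker that executes deferred notifies.
        while self.pending:
            handler, remaining = self.pending.popleft()
            self._run(handler, remaining)


def thermal_handler(interp, remaining):
    # Hypothetical handler: handling one notify triggers another one.
    if remaining > 0:
        interp.notify(thermal_handler, remaining - 1)


def max_nesting(in_place: bool, chain_length: int = 50) -> int:
    interp = Interpreter(in_place)
    interp.notify(thermal_handler, chain_length)
    interp.drain()
    return interp.max_depth


print(max_nesting(in_place=True))   # nesting grows with the chain length
print(max_nesting(in_place=False))  # nesting stays flat
```

In this model the in-place variant nests one level per chained notify, which is how a fixed-size kernel stack can run out, while the deferred variant stays at a single level. The model ignores locking, so it says nothing about the deadlock risk that motivated the change in the first place.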
So, if I have understood correctly, the way notifications from AML code are executed suits some notebooks but not others. So what do you plan to do now? Could it become a kernel configuration option, so that we can choose how they are executed?
If it works for you with both patches applied, then there is no problem, as these two patches just change the behavior to the one that is already known to work for HP. Feel free to mark the bug resolved (not yet closed) if you think that these two patches together solve your issue. Then it will be the job of Len Brown (the Linux ACPI maintainer) to move these patches to the mainline kernel and mark the bug closed.
I have read the thread about bug #5534, and I am afraid I have the same problem with temperature management. When I run an intensive application (partimage or a kernel compile, for example), the fan makes a lot of noise but the computer doesn't seem to cool down. If I run cat /proc/acpi/thermal_zone/THRM/polling_frequency, I get "<polling disabled>" as output. Is that normal? Thanks for your help, Fran
If your fan spins, then you don't have a thermal management problem. "polling disabled" is the default value, meaning that the embedded controller notifies us when the temperature changes, instead of us polling it at some interval. If you want to see something different, write a value in seconds to this file.
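For illustration, reading and writing that file from a script could look like the sketch below. The path is the one quoted in this report, and the zone name (THRM) differs between machines; the helper names `read_polling` and `set_polling` are assumptions made for this example. Writing to the real file requires root, and the procfs thermal interface is only present on kernels that still provide it.

```python
from pathlib import Path

# Path as quoted in this report; the zone name (THRM) is machine specific.
POLLING_FILE = Path("/proc/acpi/thermal_zone/THRM/polling_frequency")


def read_polling(path: Path = POLLING_FILE) -> str:
    """Return the current contents of the polling_frequency file."""
    return path.read_text().strip()


def set_polling(seconds: int, path: Path = POLLING_FILE) -> None:
    """Write a polling interval in seconds, as described in the comment above."""
    if seconds < 0:
        raise ValueError("polling interval must not be negative")
    path.write_text(f"{seconds}\n")
```

A call such as set_polling(10) followed by read_polling() would then show whether the kernel accepted the new interval.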
So, I mark the bug as resolved.
Thanks for report and testing.
One question I still have: do I need to keep the kernel options you suggested (CONFIG_DEBUG_STACKOVERFLOW, CONFIG_DEBUG_STACK_USAGE or CONFIG_4KSTACKS)? I guess the DEBUG options are no longer needed, but is it preferable to use 4 KB kernel stacks?
OK, using both patches fixes the crashes for me as well. Fran
So I think I can mark this bug as verified. Thanks for your help.
It is worse than just a simple stack overflow; it is close to an infinite loop.

From examining the stack trace and the DSDT, it looks like this machine is falling into an "infinite" loop via the following sequence of events:

A temperature EC event starts the whole thing going.
  Linux acpi_ec_gpe_query runs
  Invoke _Q81 (temp is falling)
  Notify (THRM, 0x80)
  Perform thermal_check
  Invoke active thermal state handler
  Attempt to turn off a fan
  Invoke _OFF method for fan
  Invoke THRM._SCP
  Notify (THRM, 0x81)

The Notify (THRM, 0x81) causes a call to thermal_check (in the Linux thermal notify handler), and we end up in an infinite loop (or at least we spin through this sequence enough times that a stack fault occurs before some event terminates it).

I think it's a bit early to close this bug.
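The sequence described above can be written down as a small "who invokes whom" table and checked for a cycle. This is a sketch for illustration only: the step names are labels taken from the description, `CALLS` and `find_cycle` are hypothetical names, and nothing here inspects a real DSDT.

```python
# Each step of the described sequence maps to the step it triggers next.
CALLS = {
    "acpi_ec_gpe_query": "_Q81",
    "_Q81": "Notify(THRM, 0x80)",
    "Notify(THRM, 0x80)": "thermal_check",
    "thermal_check": "active thermal state handler",
    "active thermal state handler": "fan._OFF",
    "fan._OFF": "THRM._SCP",
    "THRM._SCP": "Notify(THRM, 0x81)",
    "Notify(THRM, 0x81)": "thermal_check",  # back into the notify handler
}


def find_cycle(start: str) -> list:
    """Follow the chain from `start`; return the repeating part, or []."""
    seen = []
    node = start
    while node in CALLS and node not in seen:
        seen.append(node)
        node = CALLS[node]
    if node in seen:
        return seen[seen.index(node):]
    return []


cycle = find_cycle("acpi_ec_gpe_query")
print(" -> ".join(cycle + [cycle[0]]))
```

The walk returns to thermal_check after five steps, which matches the loop described above. With in-place notify execution every trip around that loop adds more frames to the same stack.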
downstream report https://bugs.gentoo.org/176615
The patches in comments #25 and comment #26 went upstream today, and should thus appear in the next upstream snapshot after 2.6.21-git13. closed.
Created attachment 11482 [details] Remove recursion from thermal notify Please test this patch against clean 2.6.21.
It also works well if I use only the last patch you sent. I don't know if it's related to this bug, but I still think there is a problem with fan and temperature management in kernel 2.6.21. If I use the computer normally, without running applications that require a lot of CPU, the fan starts working very hard and becomes extremely noisy after a while, even if the temperature is not extremely high. In fact, the temperature reaches 65
Thanks once again for testing.
What's the way forward here then? I see that the patch in comment #25 is reported to break an HP laptop (see header of commit 40d07080e585396dc58bc64befa1de0695318b3b). Now that an independent fix has been produced (comment #41), is that patch going to be re-applied to fix the HP laptop? I see that the patch in comment #41 isn't in the ACPI git tree yet, but it's only been a few days, I'm probably just being impatient :) From the perspective of fixing Gentoo's 2.6.21 kernel, which patches should we backport, from the choices: comment #25, comment #26, comment #41? We'd like to match upstream as closely as possible.
It's not just Acers, I have what appears to be the same problem on a Gateway 600YG2 laptop. The patches from comment #25 and comment #26 fixed it.
Alexey, any news on the patch? Has it been submitted to Len? Which ones should we consider backporting for distro kernels? Thanks in advance.
Daniel, yes, Len is aware of all the patches in this bug report. I don't know if or when he is going to push them upstream. I'd recommend the patch from #41 for stable kernels, as it is much less intrusive. Regards, Alex.
re: patch in comment #41 to remove _TMP call from trip-point change notify. On the HP nx6325 before the patch, the fan turns completely off when the temperature drops below the lowest trip-point 40, and then turns on again when it rises above the modified lowest trip point 45. However, after this patch is applied on 2.6.22-rc3, the fan never turns completely off. Instead it continues running after the temperature drops below 45, and drives the temperature of TZ1 all the way down to 32.
Len, could you try to apply it to 2.6.21, i.e. to sync version of Notify? Also, is it possible to add some printk in thermal notify to see the order of notifies in nx6325?
Created attachment 11684 [details] Do not do acpi_thermal_check recursively/in parallel Len, Please check if this version works better
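The title of the attachment above suggests a guard against re-entering the thermal check. The sketch below shows one generic way to build such a guard with a non-blocking lock. It is an assumption for illustration and not a reproduction of the attached patch: `ThermalZone`, `check` and the counters are hypothetical names.

```python
import threading


class ThermalZone:
    """Toy model of a thermal zone whose check must not run re-entrantly."""

    def __init__(self):
        self._lock = threading.Lock()  # deliberately not re-entrant
        self.checks_run = 0
        self.checks_skipped = 0

    def check(self, notify_again: int = 0) -> None:
        # If a check is already in progress (nested call or another thread),
        # do not start a second one.
        if not self._lock.acquire(blocking=False):
            self.checks_skipped += 1
            return
        try:
            self.checks_run += 1
            if notify_again > 0:
                # Models a Notify that arrives while this check is running.
                self.check(notify_again - 1)
        finally:
            self._lock.release()


zone = ThermalZone()
zone.check(notify_again=3)
print(zone.checks_run, zone.checks_skipped)
```

In this model the nested call returns immediately instead of recursing, so the chain stops after one level. A real driver would also have to decide what to do with the notification it skipped, for example by re-running the check once the current one finishes; this sketch does not attempt that.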
Len, I know you're a busy person, just a reminder: the above patch is awaiting your testing on your nx6325. I'm still interested in backporting these fixes to 2.6.21 but it seems it is not fully settled in 2.6.22-rc yet. Thanks.
Just a note: I have a Gateway 450ROG that is also affected by this bug in 2.6.21.5. I applied the patches in comment #25 and comment #26, and so far it appears very stable.
any chance you could try patch from #50?
Hello Alexey, I reverted the #25 and #26 patches on my 2.6.21.5 source and applied the patch from comment #50. I've been running with this for a couple of hours, and the system appears just as stable as with the other patches. Will post back if anything bad happens.
A note also from my side. I was facing this problem on my old benq joybook 5000U running any default 2.6.21 fedora 7 kernels. After upgrading to the developmental kernels 2.6.22 the bug went away. Stable now for couple of days. Thanks, good job!
patch from comment #50 applied to acpi-test. I'll try it on my nx6325 when i get home.
in the name of bug #3686, the patch in comment #50 shipped in linux-2.6.24-rc1. The HP nx6325 fan works properly, including turning off completely when temperature drops to 40C. closed.