Most recent kernel where this bug did not occur: none, as far as I can tell
Distribution: Debian (amd64 port)
Hardware Environment: HP nx6125 (AMD Turion ML 34, ATI Radeon express 200M chipset, onboard ATI X300)
Software Environment: Kernel 2.6.13.4 (with Matthew Garrett's double timer patch applied), Debian amd64 (testing/unstable), WM = KDE 3.4.2. Also tried 2.6.14 (boot with no_timer_check) with same results
Problem Description: ACPI thermal events rarely get processed, especially under moderate to high CPU load. This results in *no* or erratic fan use and potential damage to the machine/electronics. However, if the CPU temperature exceeds a thermal trip point and one then issues a cat /proc/acpi/thermal_zone/TZ?/temperature or an acpi -t, then, after a brief machine pause, the thermal event is processed by the kernel and the fans respond. This can be observed by stopping acpid and doing a cat /proc/acpi/event, which gives the most graphic evidence. A further and more detailed description/diagnosis of the problem can be found here ==> http://lists.debian.org/debian-amd64/2005/10/msg01002.html
Steps to reproduce: With a warm processor < 58 degrees C (less than first thermal trip point), run glxgears and wait about a minute or so. Your fan will 90% of the time not kick in. Then execute an acpi -t or a cat /proc/acpi/thermal_zone/TZ?/temperature and almost immediately you will observe that (i) at least one of your thermal trip points has been exceeded and (ii) as a response to the cat command, the fans immediately turn on. Visual evidence can be had by first, before you do anything, stopping acpid and doing a cat /proc/acpi/event (as root). Then do the above procedure. You will observe no thermal event register *until* you do the cat or acpi -t.
Distribution: Ubuntu Breezy (amd64 port) Hardware Environment: HP nx6125 (AMD Turion ML 34, ATI Radeon express 200M chipset, onboard ATI X300) Software Environment: Kernel 2.6.12-9-amd64-generic (without Matthew Garrett's double timer patch applied; I had it while running Breezy Colony 4. On Colony 5 I can use my laptop without the patch), Breezy AMD64, WM = Gnome 2.12. I have the same laptop and the same problem with the CPU fan. It works properly if I don't run demanding applications on my laptop. The fan starts to cool around 58
Small update: I have just downloaded and compiled kernel 2.6.14.1 and the bug is still present in that version.
cool failure -- good job isolating it
Please attach dmesg and acpidump output.
Created attachment 6533 [details] acpidump
Created attachment 6534 [details] dmesg (kernel 2.6.14.1)
Created attachment 6535 [details] dmidecode (hp nx6125)
Created attachment 6536 [details] lspci -v (hp nx6125)
Created attachment 6537 [details] cat /proc/interrupts (hp nx6125)
Please send the output from 'cat /proc/acpi/thermal_zone/TZ*/*' command and the '.config' file from your kernel.
Created attachment 6635 [details] .config (kernel 2.6.14.1)
Created attachment 6636 [details] cat /proc/acpi/thermal_zone/TZ1/* Note: I currently have "fan always on when A/C plugged in" set in the bios. This sets the first thermal trip point to 16 C. Without this bios setting the first thermal trip is set at 58 C and when the CPU temp reaches this value it drops to 50 C.
Created attachment 6637 [details] cat /proc/acpi/thermal_zone/TZ2/* See comments for cat /proc/acpi/thermal_zone/TZ1/*
Created attachment 6638 [details] cat /proc/acpi/thermal_zone/TZ3/* See comments for cat /proc/acpi/thermal_zone/TZ1/*
Created attachment 6644 [details] polling freq selection patch There is no asynchronous notification for thermal events on your system, so the system should poll the thermal zone state at a polling frequency. This can be either the '_TZP' value from the DSDT or an OS-provided value. If '_TZP' evaluates to zero, polling is disabled. There was a bug which disabled polling if no '_TZP' was provided by the DSDT. The attached patch should fix this issue.
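For anyone following along, the selection logic described above could be sketched roughly like this. It is an illustrative Python model, not the kernel code: the function name and the 5-second OS default are assumptions made for the example, and the tenths-of-a-second unit for _TZP is my reading of the ACPI specification.

```python
from typing import Optional

# Assumed OS-provided default in seconds (hypothetical constant chosen
# for this sketch; it is not taken from the patch itself).
OS_DEFAULT_POLL_SECONDS = 5.0


def select_polling_interval(tzp_tenths: Optional[int]) -> Optional[float]:
    """Return the polling interval in seconds, or None if polling is off.

    tzp_tenths is the value _TZP evaluates to (assumed to be in tenths of
    a second), or None when the DSDT provides no _TZP object at all.
    """
    if tzp_tenths is None:
        # No _TZP in the DSDT: fall back to the OS value rather than
        # disabling polling, which is the case the patch is meant to fix.
        return OS_DEFAULT_POLL_SECONDS
    if tzp_tenths == 0:
        # _TZP evaluates to zero: polling is explicitly disabled.
        return None
    return tzp_tenths / 10.0
```

The point of separating "no _TZP" from "_TZP is zero" is that only the second case should turn polling off.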
Distribution: Gentoo Hardware: HP nx6125 (AMD Sempron) Kernel: 2.6.14 with Gentoo's patchset r2 Same problem here, but I don't have the 64-bit version of the laptop. Note that also the battery state doesn't update after the cpu temperature reaches the 58
Thanks Konstantin, the patch works for me! :)
Did the battery behavior also change? Could you please attach /var/log/messages, /var/log/boot.msg, the output from 'cat /proc/acpi/ac_adapter/*/*' and 'cat /proc/acpi/battery/*/*', dmesg and acpidump (available in pmtools here: http://ftp.kernel.org/pub/linux/kernel/people/lenb/acpi/utils/)? Also, could you explain the battery behavior more clearly, i.e. when it is being updated, what state it is in, etc.?
Konstantin, many thanks for the patch. I am compiling the kernel right now with the patch applied and will let you know what I find. However, I am surprised that you say that the DSDT provides no asynchronous method of notification of thermal events, because right now during the kernel compilation my second trip point (65 C) was exceeded and the second fan came on (or first fan sped up?). I was just sitting here watching the compilation and the fan suddenly blew harder. I quickly did an acpi -t to see what the temp was and it was 63 C -- I guess it had just come down from 65 C--the second trip point. So, as my earlier experiments seem to show... sometimes it works and sometimes it doesn't. (Right now I'm on A/C and I've set bios to have the fan on when on A/C. My first trip point is 16 C and my second is at 65 C. The 65 C trip was exceeded during the compilation and the fan turned up a notch. Here's the output of my current acpi -t Battery 1: charged, 94% Thermal 1: active[2], 63.0 degrees C Thermal 2: ok, 54.0 degrees C Thermal 3: ok, 22.0 degrees C ) I want to stress again: this occurred *without* the patch applied and using the unpatched 2.6.14.1 kernel (so, it seems that there are asynchronous notifications... sometimes). As I'm typing this... fan just turned down a notch without any intervention (no acpi -t to trigger it). The second trip point had been re-set to 59 C when the 65 C was exceeded. My current acpi -t: Battery 1: charged, 94% Thermal 1: active[3], 59.0 degrees C Thermal 2: ok, 53.0 degrees C Thermal 3: ok, 22.0 degrees C Yep, now I'm certain--there are asynchronous notifications...
Dear Konstantin,
Results of patch test... (failure, unfortunately)
Kernel 2.6.14.1 with polling patch applied. Unplugged A/C (on battery), so first thermal trip is at 58 C, 2nd at 65 C. I ran glxgears for a few minutes and the fans didn't come on. So I issued an acpi -t and here are the results:
richm@dilbert:~$ acpi -t
Battery 1: discharging, 87%, 02:02:57 remaining
Thermal 1: ok, 68.0 degrees C
Thermal 2: ok, 51.0 degrees C
Thermal 3: ok, 28.0 degrees C
Notice that the state of Thermal 1 is registered as ok even though *two* thermal trip points have been exceeded (58 and 65 C). Otherwise the influence of the patch can be seen now with
$ cat /proc/acpi/thermal_zone/TZ?/*
<setting not supported>
cooling mode: active
polling frequency: 5 seconds
state: ok
temperature: 53 C
critical (S5): 95 C
passive: 88 C: tc1=1 tc2=2 tsp=100 devices=0xffff810037fc8dc0
active[0]: 80 C: devices=0xffff810037f93f00
active[1]: 75 C: devices=0xffff810037f93dc0
active[2]: 65 C: devices=0xffff810037f93cc0
active[3]: 58 C: devices=0xffff810037f93bc0
<setting not supported>
cooling mode: passive
polling frequency: 5 seconds
state: ok
temperature: 51 C
critical (S5): 100 C
passive: 90 C: tc1=1 tc2=2 tsp=300 devices=0xffff810037fc8dc0
<setting not supported>
cooling mode: passive
polling frequency: 5 seconds
state: ok
temperature: 27 C
critical (S5): 100 C
passive: 60 C: tc1=1 tc2=2 tsp=300 devices=0xffff810037fc8dc0
The polling frequency is set to 5 seconds as it defaults to with the new patch. However, please see my earlier post that I wrote during the kernel compile (while on A/C). It seems that asynchronous notification events were being generated and correctly interpreted.... (much to my surprise!)
Richard
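To make the "state: ok although trip points are exceeded" check less manual, something like the following could be used on the cat output. This is only a sketch under assumptions: the function name is made up, it expects the text of a single thermal zone in the format shown above, and it treats a trip point as exceeded when the temperature is greater than or equal to it.

```python
import re
from typing import List

# Matches lines such as "active[2]: 65 C: devices=..." or "critical (S5): 95 C".
TRIP_RE = re.compile(r"(critical \(S5\)|passive|active\[\d+\]):\s+(\d+)\s*C")
TEMP_RE = re.compile(r"temperature:\s+(\d+)\s*C")


def exceeded_trip_points(zone_text: str) -> List[str]:
    """List the trip points whose threshold the current temperature has reached."""
    temp_match = TEMP_RE.search(zone_text)
    if temp_match is None:
        raise ValueError("no 'temperature:' line found")
    temperature = int(temp_match.group(1))
    return [
        name
        for name, threshold in TRIP_RE.findall(zone_text)
        if temperature >= int(threshold)
    ]
```

If this returns a non-empty list while the zone still reports "state: ok", the thermal event has presumably not been processed yet.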
Further test: Just booted back into my unpatched 2.6.14.1 kernel. Ran glxgears and exceeded a trip point. Fans didn't come on--no surprise. Did an acpi -t to trigger fans into action and waited to see if the fans would turn off automatically when the temp dropped below the re-set trip point (50 C). After a while, and with no intervention (didn't do any acpi -t), fans stopped. After fans stopped, quickly did an acpi -t and the temp was 50 C. So, as I've suggested elsewhere, this behaviour is asymmetric. The fan is less likely to turn on when a trip point is exceeded, but more likely to turn off when the CPU temp drops below a trip point. With the patch applied, the fans only responded to trip points (turning fans on and off) when I issued an acpi -t, so the behaviour is worse in the case with the patch applied. Hope that this helps... Richard
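For reference, the behaviour one would expect if every thermal event were processed is a simple hysteresis: fan on at the 58 C trip point, trip point re-set to 50 C, fan off once the temperature is back down there. A minimal Python model of that expectation follows. The function name is hypothetical, the two thresholds are the ones quoted in this thread, and this is not how the firmware or kernel actually implements it.

```python
from typing import Iterable, List


def expected_fan_states(
    temps: Iterable[int], trip_on: int = 58, trip_off: int = 50
) -> List[bool]:
    """Return the expected fan state (True = on) after each temperature reading."""
    fan_on = False
    states = []
    for temp in temps:
        if not fan_on and temp >= trip_on:
            # Trip point reached: fan turns on, trip point is re-set lower.
            fan_on = True
        elif fan_on and temp <= trip_off:
            # Temperature fell back to the re-set trip point: fan turns off.
            fan_on = False
        states.append(fan_on)
    return states
```

Comparing observed fan behaviour against this model is one way to see the asymmetry: the "off" transition tends to match, the "on" transition often does not.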
Sorry for my early response, of course it doesn't work ;) Well, the battery state: It stops updating after the temperature of TZ1 reaches the first trip point, and updates again after a "cat /proc/acpi/thermal_zone/TZ1/temperature" I'm not sure about the point it stops updating (but I don't have a counter-example) /proc/acpi/ac_adapter/C173/state: state: off-line
Created attachment 6645 [details] acpidump
Created attachment 6646 [details] dmesg
Created attachment 6647 [details] /var/log/messages the whole day, the earlier sessions were without the patch
COMMENTS ON BATTERY ISSUE: Carsten recently brought up the issue of the battery and I'm worried that this could sideline the more important issue of the thermal events (notifications). My interpretation of the battery issue is the following (I am no expert on ACPI, so I stand to be corrected): When the ACPI subsystem generates a thermal notification it appears not to be processed by the kernel. It appears (and I'm guessing here) that it "blocks" the ACPI subsystem (perhaps a queue?). When you thereafter issue a cat /proc/acpi/thermal_zone/TZ1/temperature then there is a perceptible pause, like you're bringing the ACPI subsystem out of a "hung state". After this pause, the thermal event is processed (unblocking the event queue?) and the battery events are once again processed. The fact that, as Carsten says, it updates again after a cat /proc/acpi/thermal_zone/TZ1/temperature seems to strongly support this claim. Anyway, my naive two cents worth. I hope that the kind folk at Intel who are spending their time trying to fix this will take note. I believe that if we fix the thermal event notifications, the battery notifications will automatically work.
Yesterday (Germany) I noticed that my fans didn't run, even if I read the temperature from thermal_zone/TZ1/temperature or did an acpi -t. The laptop had already been running for about 5 hours. Strange... First everything works as expected, after a while you have to activate the fans with acpi -t, now nothing works... :( Thermal 1: active[3], 61.0 degrees C The battery state did update. I think it's time to describe my wlan problem... Don't know if any of you had the same issue, but the wlan chip stops working after 15 minutes without sending any packets, so I made a cronjob that sends a ping every 10 minutes (strange solution, but it works ^^)... Without doing so, you can't send anything. iwconfig shows that the card is not associated, and reassociating doesn't work. Now you can reload the ndiswrapper module a few times and nothing changes and you don't get any errors. After a while, every time you load ndiswrapper you get ACPI: PCI interrupt for device 0000:02:02.0 disabled after a second. The module gets unloaded. Looks like some kind of powersaving ^^ Pressing the wlan button while everything is still working disables the card (without any messages, except a button event). To reactivate it you have to press the button again and reload ndiswrapper, and it works as expected. Yesterday there was the same problem again: you couldn't reload the module, and the card stopped working after the fans stopped working.
Something to add: the system didn't use 100% cpu after reading the temperature yesterday as the fans stopped working completely. And I'm booting with ec_burst=1 Aaah, just saw a new behavior: I let the fans run for a while, then I wanted to stop them with another acpi -t, but that told me that the cpu had 74
Carsten, It is really hard to work on bugs if more than one issue is reported, so could you please open new bugs for the other issues you have and keep this bug purely ACPI thermal related?
There were some issues reported for booting with ec_burst=1 and using spinlocks. Here is the patch which replaces spinlocks with semaphores: https://sourceforge.net/project/showfiles.php?group_id=129330 You could try it to see whether you get the same behaviour. Note also that this patch is described as a temporary solution.
Created attachment 6677 [details] DSDT debug patch Here is the DSDT patch for the DSDT table, available here: http://bugzilla.kernel.org/attachment.cgi?id=6533 It adds debug prints from _ACx, _ALx, Notify (... , 0x80/0x81) methods for thermal zone TZ1. It would be interesting to observe the system's behavior under different circumstances: with polling turned on/off, with ec_burst=1/0, using spinlocks/semaphores.
Created attachment 6678 [details] polling freq selection patch (with debug) Another patch for the same purpose - it adds event notification debug info to the thermal subsystem.
First results with the latest patch (attachment 6678 [details]) applied. (Machine on battery: trips at 58, 65, etc deg C) I ran tail -f /var/log/syslog. Then I started glxgears and watched. The appearance of thermal events in the syslog did not seem to correlate with fan usage. I don't like polling because it results in worse behaviour than the asynchronous mode, so I echoed 0 into each of /proc/acpi/thermal_zone/TZ?/polling_frequency, to disable polling and ran the test again. This time didn't observe thermal events until I did an acpi -t (by which time my CPU was frying -- 71 deg C!). Stopped glxgears and watched. Saw thermal event register in the logs and at the same time fan turned off. Anyway, here's the part of my syslog showing this state of affairs.
--------------------------begin---------------------------------------------
Nov 26 22:27:24 localhost kernel: APIC error on CPU0: 00(40)
Nov 26 22:27:52 localhost kernel: APIC error on CPU0: 40(40)
Nov 26 22:28:20 localhost kernel: ------------------ Got thermal event 0x81
Nov 26 22:28:20 localhost kernel: ------------------ Got thermal event 0x81
Nov 26 22:29:28 localhost kernel: ------------------ Got thermal event 0x80
Nov 26 22:29:28 localhost last message repeated 156 times
Nov 26 22:29:28 localhost kernel: ------------------ Got thermal event 0x81
Nov 26 22:29:52 localhost kernel: ------------------ Got thermal event 0x80
Nov 26 22:29:52 localhost last message repeated 49 times
Nov 26 22:29:52 localhost kernel: ------------------ Got thermal event 0x81
Nov 26 22:30:42 localhost kernel: APIC error on CPU0: 40(40)
Nov 26 22:31:45 localhost kernel: ------------------ Got thermal event 0x80
Nov 26 22:31:45 localhost last message repeated 399 times
Nov 26 22:31:45 localhost kernel: ------------------ Got thermal event 0x81
Nov 26 22:32:49 localhost kernel: ------------------ Got thermal event 0x80
Nov 26 22:32:49 localhost last message repeated 39 times
Nov 26 22:32:49 localhost kernel: ------------------ Got thermal event 0x81
Nov 26 22:33:11 localhost kernel: ------------------ Got thermal event 0x80
Nov 26 22:33:11 localhost last message repeated 14 times
Nov 26 22:33:11 localhost kernel: ------------------ Got thermal event 0x81
Nov 26 22:36:28 localhost kernel: APIC error on CPU0: 40(40)
Nov 26 22:36:32 localhost kernel: ------------------ Got thermal event 0x80
Nov 26 22:36:32 localhost last message repeated 164 times
Nov 26 22:36:32 localhost kernel: ------------------ Got thermal event 0x81
Nov 26 22:37:05 localhost kernel: ------------------ Got thermal event 0x80
Nov 26 22:37:05 localhost last message repeated 44 times
Nov 26 22:37:05 localhost kernel: ------------------ Got thermal event 0x81
Nov 26 22:37:51 localhost kernel: ------------------ Got thermal event 0x80
Nov 26 22:37:51 localhost last message repeated 8 times
Nov 26 22:37:51 localhost kernel: ------------------ Got thermal event 0x81
Nov 26 22:39:03 localhost kernel: ------------------ Got thermal event 0x80
Nov 26 22:39:03 localhost last message repeated 8 times
Nov 26 22:39:50 localhost last message repeated 5 times
Nov 26 22:39:50 localhost kernel: ------------------ Got thermal event 0x81
Nov 26 22:40:36 localhost kernel: ------------------ Got thermal event 0x80
Nov 26 22:40:36 localhost last message repeated 8 times
Nov 26 22:40:36 localhost kernel: ------------------ Got thermal event 0x81
-------------------end-----------------------------------------------------
The second from last event 0x80 corresponded to the fans turning off, then I got an 0x80 (x8). It seems like a lot of events are being generated without being processed. Can anybody understand this? -- Richard
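To get a feel for how many notifications arrive in each burst, the syslog excerpt can be tallied with a few lines of Python. This is a rough sketch with a made-up function name. It assumes the debug patch's "Got thermal event 0x.." message format and syslog's "last message repeated N times" compression, and attributes the repeats to the event line immediately before them. As far as I know, 0x80 is the ACPI "temperature changed" notification and 0x81 is "trip points changed".

```python
import re
from collections import Counter
from typing import Iterable

EVENT_RE = re.compile(r"Got thermal event (0x[0-9a-fA-F]+)")
REPEAT_RE = re.compile(r"last message repeated (\d+) times")


def count_thermal_events(lines: Iterable[str]) -> Counter:
    """Count thermal events per notify value, expanding syslog repeat lines."""
    counts: Counter = Counter()
    last_event = None  # event value of the most recent message, if it was one
    for line in lines:
        event = EVENT_RE.search(line)
        if event:
            last_event = event.group(1)
            counts[last_event] += 1
            continue
        repeat = REPEAT_RE.search(line)
        if repeat:
            # Repeats only count if the repeated message was a thermal event.
            if last_event is not None:
                counts[last_event] += int(repeat.group(1))
        else:
            # Some unrelated message (e.g. an APIC error) intervened.
            last_event = None
    return counts
```

Feeding it the lines between the begin and end markers gives the totals per notify value, which makes the imbalance between 0x80 and 0x81 easy to see.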
Please try the spinlock patch from here: https://sourceforge.net/project/showfiles.php?group_id=129330
Tried the spinlock patch on a vanilla 2.6.14.1 kernel. I patched the kernel with patch -p1 < acpi-ec-nospinlock-2.6.14.diff, as explained in the README. I did not use *any other* patches. My boot options were: disable_timer_pin_1 and ec_burst=0. I was on battery, so that my trips were at 58C, 65C, etc. RESULT: Failure. Doing the standard test with glxgears permitted the CPU temp to rise above 58C without a hint of movement from the fans. Let the CPU cool and repeated the test. Failure again. Running acpi -t triggers fan response, as before. Now I think my machine needs a rest from this thermal overload ;-) Anyone else want to try some patches....? -- Richard
There are bios upgrades available at HP's website. The latest BIOS revision is F.09 (see http://h18007.www1.hp.com/support/files/hpcpqnk/us/locate/64_6170.html). I was wondering if anyone on the CC has tried this upgrade and whether it still has the problem. (Perhaps they could report...) To help make further progress, could I ask those reading this to please check their BIOS version (f10 on boot) and post it together with their CPU type (AMD 64/Sempron) and whether this thermal bug is present as follows (this is my data) BIOS: F.07, CPU: Turion ML-34, BUG: present This should help us determine if the bug is specific to a particular BIOS version. Thanks for your cooperation. -- Richard
I use F.09 bios version but the bug is still present
Sorry, I forgot: BIOS: F.09, CPU: Turion ML-34, BUG: present
BIOS: F.07, CPU: Turion ML-40, BUG: present
BIOS: F.07, CPU: Turion ML-34, BUG: present
This from the syslog could also be relevant:
Nov 29 10:46:08 localhost kernel: warning: many lost ticks.
Nov 29 10:46:08 localhost kernel: Your time source seems to be instable or some driver is hogging interupts
Nov 29 10:46:08 localhost kernel: rip acpi_ec_read+0xc4/0xe5
Jfp, yes, I suspect that the timer problem is related to our thermal problem. This issue has now been fixed with a patch from andi kleen (see bug #3927) for kernel 2.6.15rc2. I haven't tried it yet, but I'm hoping that the patch will be incorporated into 2.6.15 (has anybody tried this patch?). -- Richard
Some additional information: I've compiled a good few kernels over the past few days and in each case my fan usage has been working on a regular basis (while on A/C). Let me qualify... I'm on A/C, so my trips are at 16 C and 65 C, etc. My fan is always blowing while on A/C (set in BIOS). During the long compilations my CPU temp inevitably reaches 65C, but then my fan always turns up a notch without any intervention (no acpi -t required). So, it seems that in this case the fans/thermal events are working... Hope that helps.
I use 2.6.14, bios is F.09. Sometimes I get no thermal events even with acpi -t but only unplugging and re-plugging ac connector... It seems this behavior is random... I'll do more tests...
Distribution: Slackware 10.2 System: AMD 64 Turion Mobile ML-28 F0.7 Funny behaviour on my box: when I set "Fan Always on while on AC Power", the fan turns off after booting Linux 2.6.13. I'll confirm the erratic behaviour: sometimes the events are processed, sometimes not. The AC Power problem is most concerning. Has anyone been able to get a system working 100%?
Last week I noticed that the acpi -t operation that I scheduled no longer works (it runs, but it has no influence on the CPU fan). Let me try to explain. I cannot work with my laptop without acpi -t because of our fan problem. I noticed that the temperature of my laptop was stable at 78
Problem present here as well on ubuntu 5.10 (amd64), kernel 2.6.12-10. I do not have the BIOS set to have the fan always on, and without that setting, the fan does sometimes kick in by itself, and sometimes does not. Today I've tried with kernel 2.6.15-8 and it does not improve matters (rather, my fan does not run at all anymore, so maybe I'm having more problems now - nevertheless, I can confirm the observation that the fan does kick in by itself on low to moderate loads).
Created attachment 6823 [details] patch to avoid redundant thermal notifications The purpose of this patch is to avoid handling redundant notifications. It allows temperature notifications to be handled no more often than once per second.
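The idea of handling at most one notification per second can be illustrated with a small throttle. This is a generic Python sketch of the technique with hypothetical names, not a transcription of the attached kernel patch; timestamps are passed in explicitly so the behaviour is easy to check.

```python
from typing import Optional


class NotificationThrottle:
    """Accept an event only if the previous accepted one is old enough."""

    def __init__(self, min_interval: float = 1.0) -> None:
        self.min_interval = min_interval
        self._last_handled: Optional[float] = None

    def should_handle(self, now: float) -> bool:
        """Return True if an event arriving at time `now` should be handled."""
        if (
            self._last_handled is not None
            and now - self._last_handled < self.min_interval
        ):
            # Too soon after the last handled event: skip it as redundant.
            return False
        self._last_handled = now
        return True
```

With a burst of several hundred events in the same second, only the first would be handled and the rest skipped.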
A temporary fix would of course be to boot the kernel with acpi=off. The BIOS remains in control of the fan, though it will remain on constantly. (This was suggested to me by a friend of mine.) I have tried it and it works perfectly.
Hi, My nx6125 (amd64) runs on Gentoo 2005.1 with kernel-2.6.14. bootparams: disable_timer_pin_1 no_timer_check In BIOS I have selected fan always on when AC connected. Now I had used cron job to periodically run acpi -t but my fans didn't work at all (besides the one that is selected in BIOS). Then i turned off cron and after a while I noticed that my fan just turned on when I exceeded 65 degrees Celsius. I thought this was random so I restarted and started playing ET and once again my fans started - without need to run acpi -t. Though sometimes they don't turn off (but that's the smallest problem at least my laptop won't get fried).
bios updated to F.0D, nothing changes...
> My nx6125 (amd64) runs on Gentoo 2005.1 with kernel-2.6.14. bootparams:
> disable_timer_pin_1 no_timer_check
Hi, can you tell me what disable_timer_pin_1 does on bootup? Thanks.
Well, I found disable_timer_pin_1 as a solution to the double timer speed. I really don't know what the difference is between no_timer_check and that option, but I found an example somewhere where both were used simultaneously, so I appended exactly the same boot options.
Tested patch in #48 with kernel 2.6.15-rc6 using the glxgears test. First with polling enabled (uggh!), which is the default behaviour of the patch. Then I echoed 0 into /proc/acpi/thermal_zone/TZ[1-3]/polling_frequency to turn off polling and re-ran glxgears. In both tests the thermal trip point of 58C was exceeded without fans turning on. Here's an excerpt from my syslog...
==============================================================================
Dec 25 21:07:54 localhost kernel: ------------------ Got thermal event 0x81
Dec 25 21:07:54 localhost kernel: ------------------ Got thermal event 0x81
Dec 25 21:08:13 localhost kernel: APIC error on CPU0: 00(40)
Dec 25 21:08:49 localhost kernel: ------------------ Got thermal event 0x80
Dec 25 21:08:49 localhost kernel: ------------------ SKIP thermal event 0x80
Dec 25 21:08:49 localhost last message repeated 161 times
Dec 25 21:08:49 localhost kernel: ------------------ Got thermal event 0x81
Dec 25 21:10:11 localhost kernel: ------------------ Got thermal event 0x80
Dec 25 21:10:11 localhost last message repeated 443 times
Dec 25 21:10:11 localhost kernel: ------------------ Got thermal event 0x81
Dec 25 21:11:38 localhost kernel: APIC error on CPU0: 40(40)
Dec 25 21:11:44 localhost kernel: ------------------ Got thermal event 0x80
Dec 25 21:12:10 localhost last message repeated 266 times
Dec 25 21:12:10 localhost kernel: ------------------ Got thermal event 0x81
Dec 25 21:12:33 localhost kernel: ------------------ Got thermal event 0x80
Dec 25 21:12:33 localhost last message repeated 33 times
Dec 25 21:12:33 localhost kernel: ------------------ Got thermal event 0x81
=============================================================================
The line that reads "Dec 25 21:08:49 localhost last message repeated 161 times" occurred when I issued my first acpi -t (polling enabled).
The line that reads "Dec 25 21:12:10 localhost last message repeated 266 times" occurred when I issued an acpi -t during the second test (polling disabled). In both cases the trip point of 58C was exceeded without any response from the fans (until I issued an acpi -t). It seems that many thermal notifications aren't being skipped by the patch and are still being sent in "bursts".
Adding debug code to the kernel, it seems that acpi_ev_queue_notify_request never gets called for thermal events on this machine (using 2.6.15). The behaviour is identical with and without an enabled apic, so it's nothing to do with the timer problem. The obvious question now is, why are these thermal events not getting through?
Although I am not able to reliably replicate the behaviour, I am now certain that I have observed the same behaviour on several occasions running Windows XP (!). The last time I observed it was after a warm reboot; the fan ran during boot and never stopped afterwards, even when the exhaust air was getting really cold. Other information: I am using the rmclock utility provided by rightmark.org to do frequency changing in Windows, rather than the official AMD driver. I do not know if this has anything to do with it - as said, I haven't been able to replicate the behaviour and thus can't test swapping the AMD and the rightmark drivers. In Windows XP, I cannot probe the ACPI temps - using a utility called Speedfan (from almico.com), the ACPI temps never get updated unless a thermal trip point is passed. The SMBus temps do get updated, but I'm not sure these refer to the same thermal zones. The only utility that does provide ACPI temps is the Dashboard utility provided on the AMD website - this however installs its own low-level drivers, which you can see if you try to run the program after installing it but before the required reboot. Hope it helps, please let me know if I can provide further information.
This bug can be worked around by patching the DSDT (at least for me). The two changes I have made:
1. Enabling thermal zone polling by adding a _TZP method for the main thermal zone.
2. In the _ON method of power resources associated with the fan (C25C to C25F), there is code that seems to check the host OS:
    Method (_ON, 0, NotSerialized)
    {
        If (LNot (LGreater (\C008 (), 0x03))) /* OS != WinXP */
        {
            C256 (0x08, 0x32)
        }
        Else /* OS == WinXP */
        {
            If (LGreater (DerefOf (Index (C252, 0x00)), C258 (C248, 0x00)))
            {
                C256 (0x08, 0x32)
            }
        }
    }
For some reason (bug or feature?), Linux is identified as Windows XP. This causes another check involving the current temperature and a trip point to be made. Removing both checks and calling C256 directly seems to fix the problem of the fan refusing to start blowing after a random amount of time (as reported in comment #27). What I still don't understand is why the temperature check failed after some time, causing the fan not to restart. Could this be a synchronization issue like a race condition? This does not solve the problem of unprocessed ACPI events, though.
The patched DSDT for BIOS F.09: http://acpi.sourceforge.net/dsdt/view.php?id=525
And for BIOS F.0D: http://acpi.sourceforge.net/dsdt/view.php?id=561
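To spell out the branch structure of the _ON method quoted above, here is the same decision written as a small Python function. All names are hypothetical stand-ins: os_id represents the result of \C008 (), table_first the value of DerefOf (Index (C252, 0x00)), and threshold the result of C258 (C248, 0x00). What those objects really hold is not known from the DSDT alone; the "current temperature versus trip point" reading is only an interpretation.

```python
def fan_on_is_called(os_id: int, table_first: int, threshold: int) -> bool:
    """Return True if the original _ON method would reach C256 (0x08, 0x32)."""
    if not os_id > 0x03:
        # LNot (LGreater (os_id, 0x03)): the "OS != WinXP" branch,
        # where the fan is switched on unconditionally.
        return True
    # The "OS == WinXP" branch: an additional comparison gates the call.
    return table_first > threshold


def patched_fan_on_is_called() -> bool:
    """The workaround removes both checks, so C256 is always called."""
    return True
```

Since Linux ends up in the second branch, the fan only starts when the extra comparison succeeds, which would explain a fan that refuses to start whenever that comparison fails.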
>For some reason (bug or feature?), Linux is identified as Windows XP. Feature. This allows Linux to execute the AML code down what is often the *only* tested path.
Hi, NX6125, ML-40; upgraded my BIOS from F.09 to F.0D; formerly both with Gentoo AMD64 and openSuse 10.1beta2 x86_64 the fan would turn on and off properly when running a "watch cat acpi -t" in a console, and even turn on and off accidentally if not. After the BIOS upgrade, the fan starts running dependent on CPU temperature during system startup and will never alter its speed, whatever I do. My kernel boot options are "noapic nolapic all-generic-ide". This leads to one recommendation and one question: I recommend not upgrading the BIOS to F.0D except for testing purposes, and I wonder if anyone can provide me with a BIOS downgrade to F.09 or F.07, preferably the Windows softpaq SP31482.exe. HP has deleted this older version from their website/ftp. Regards, Markus
Hi, Markus, I've also upgraded from F.07 to F.0D, but it is now working no better and no worse than the old version. I.e. under heavy CPU load, with "watch 'acpi -t'" running, there is the normal hysteresis of the fan speed between 58 and 65 degrees. But without the watch command, the system misses the trip point at 65. However, I did have a problem at first because I had forgotten to remove a patched version of the old DSDT which I had placed in my initrd for debugging - maybe that is the problem? Peter
Alas, I spoke too soon. Even with "watch acpi -t" running in a terminal, sometimes the fans won't turn on. My system is an NX6125 with Turion ML-40 running a self-compiled kernel-2.6.14-1.1653_1.rhfc4.cubbi_swsusp2.src.rpm (that is, Fedora 4 kernel with Software Suspend 2 patches). Also some debugging patch from this thread. I now get, in /var/log/messages, stuff like this:
Feb 16 18:18:31 ceiriog su(pam_unix)[3200]: session opened for user root by (uid=1002)
Feb 16 18:19:28 ceiriog kernel: ------------------ Got thermal event 0x80
Feb 16 18:19:28 ceiriog last message repeated 8 times
Feb 16 18:19:28 ceiriog kernel: ------------------ Got thermal event 0x81
Feb 16 18:20:33 ceiriog kernel: acpi_power-0435 [77] power_transition : acpi_power_on failed (-8)
Feb 16 18:20:33 ceiriog kernel: acpi_power-0459 [77] power_transition : Error transitioning device [C272] to D0
Feb 16 18:20:33 ceiriog kernel: acpi_bus-0266 [76] bus_set_power : Error transitioning device [C272] to D0
Feb 16 18:20:33 ceiriog kernel: acpi_thermal-0652 [75] thermal_active : Unable to turn cooling device [ffff8100016219f0] 'on'
Feb 16 18:20:34 ceiriog kernel: ------------------ Got thermal event 0x80
Feb 16 18:20:34 ceiriog last message repeated 2 times
Feb 16 18:20:34 ceiriog kernel: ------------------ Got thermal event 0x81
Feb 16 18:21:55 ceiriog kernel: ------------------ Got thermal event 0x80
Feb 16 18:21:55 ceiriog last message repeated 8 times
The line with acpi_power_on failed (-8) was added by me in drivers/acpi/power.c so I could see exactly what error code was returned by the acpi_power_on call. If it means anything to anyone, please tell...
Just at the moment, I can control the fan C273 by echo 0 > /proc/acpi/fan/C273/state, echo 3 > /proc/acpi/fan/C273/state. The former command turns the fan ON, the latter turns it OFF. Echoing 1 or 2 to this file causes error messages like
Feb 16 18:24:56 ceiriog kernel: acpi_bus-0216 [150] bus_set_power : Device does not support D1
So evidently the integers here represent the ACPI device levels D0 to D3, and the intermediate levels are not supported. However, the same trick with the other fan devices C270 - C272 now does nothing. I don't think this problem is new since I can see it in my backup /var/log/messages.{1,2,3}. And in those, the fan numbers are C260 - C263, which indicates they were from the original BIOS (F.07). Has anyone else seen this problem? I think it may require CONFIG_ACPI_DEBUG enabled in your kernel config to see this... I don't know whether this is a NEW bug or related to the missing of thermal events; if you like I will open a new bug for it. Regards, Peter
Sorry, you may also want to know my kernel command line options: they are ro root=LABEL=/ no_timer_check=0 rhgb quiet enforcing=0 noapic nolapic (rhgb is redhat-specific "graphical boot", enforcing=0 is for SELinux) Peter
Silly me: after upgrading my BIOS I forgot to re-install the patched DSDT: http://acpi.sourceforge.net/dsdt/view.php?id=561 As stated before in this thread, this is necessary for control of the fans. Otherwise the fan power resource _ON method does nothing (the code being conditional on some OS checks). Peter
Just compiled 2.6.16-rc5, which, incidentally, fixes the double timer problem on the hp nx6125 (see bug #3927). More significantly, I noticed the following messages in the output of dmesg, which I cannot recall seeing with previous kernels, and I thought they may be relevant....
    ACPI Error (evgpeblk-0284): Unknown GPE method type: C265 (name not of form _Lxx or _Exx) [20060127]
    ACPI Error (evgpeblk-0284): Unknown GPE method type: C266 (name not of form _Lxx or _Exx) [20060127]
--Richard
OK, I've been tearing out my remaining hair and have spent the last couple of weekends trying to track down the problem, though I knew absolutely zilch about ACPI when I started. However, here is my speculation, based on reading the DSDT, available at http://acpi.sourceforge.net/dsdt/view.php?id=558, and the kernel source. I am hoping that some kernel expert will jump in and tell me if I'm right or wrong. Asynchronous notification of thermal events is handled by a level-triggered interrupt which causes execution of the ACPI control method _L19 (defined by AML code in the DSDT). This method enters a loop which polls the temperature sensor via the SMBus (I assume this is something similar to http://www.maxim-ic.com/quick_view2.cfm/qv_pk/2408). If the status bits of the temperature sensor indicate a trip point has been exceeded a thermal Notify() event is generated. At the end of the loop, the control method relinquishes control via a Sleep() call for 100 microseconds before polling again. In this interval, one would hope that the OS would take control and process the outstanding Notify() events. HOWEVER, as far as I can tell from the kernel source, both the _L19 interrupt and the Notify handlers are run from a single workqueue in the thread known as "kacpid". This means the Notify() events do not get processed until the _L19 method has completed. They just pile up in the queue. Presumably reading the temperature (using cat /proc/acpi/thermal_zone/*/temperature or acpi -t) causes the queue to be flushed immediately. The question is: does the DSDT or the kernel behaviour better represent the ACPI spec? According to the ACPI spec: "When a control method does block, the operating software can initiate or continue the execution of a different control method. A control method can only assume that access to global objects is exclusive for any period the control method does not block.". 
I take this to mean that it would be acceptable to process the thermal Notify() events as they occur, and before the interrupt handler _L19 returns. This would presumably require a separate kernel thread. Does this make sense?
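A toy Python model may make the hypothesis above easier to follow. It is not kernel code: the single deque stands in for the kacpid work queue, the string "_L19" for the GPE method, and external_read_at is an invented knob for the moment someone runs acpi -t. The assumption that an outside temperature read is what ends the _L19 loop is part of the speculation in this comment, not an established fact.

```python
from collections import deque


def run_single_queue(external_read_at=None, max_polls=50):
    """Toy model: one worker drains one queue, so nothing queued by _L19
    can run until _L19 itself has returned."""
    queue = deque(["_L19"])   # the single work queue
    status_bit = True         # sensor says "trip point exceeded"
    handled = []              # (polls completed when handled, event)
    polls = 0
    while queue:
        item = queue.popleft()
        if item == "_L19":
            # max_polls only keeps the simulation finite
            while status_bit and polls < max_polls:
                queue.append("Notify(TZ1, 0x80)")
                polls += 1
                # Sleep() would go here; in this model no other queue item runs
                if external_read_at is not None and polls == external_read_at:
                    status_bit = False  # assumed effect of an outside read
        else:
            handled.append((polls, item))
    return polls, handled
```

Calling run_single_queue(external_read_at=3) returns three Notify entries, all handled only after the third poll, which mirrors the "pile up, then flush" behaviour described above.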
OK, I had another look at this this morning. I think I understand most of it. All of this refers to the nx6125 BIOS version F.0D and Linux kernel 2.6.16-rc5. In addition to the GPE and Notify events, the thermal zone polling (_TZP) is also done from the same single-threaded workqueue. This explains why enabling TZP does not solve the problem. Once the _L19 control method is entered it blocks processing of all other events. However, the userspace workaround "watch acpi -t" (or "watch cat /proc/acpi/thermal_zone/*/temperature") works, because it forces action immediately without queueing it on the kacpid workqueue. Reading the temperature sensor status bits has the effect of clearing them, so on the next loop _L19 returns, and then the Notify events get processed. I do not fully understand the quote from the ACPI spec in comment 65, nor this from the ACPI CA Programmer's Reference: "Because of the constraints of the ACPI specification, there is a major limitation on the concurrency that can be achieved within the AML interpreter portion of the subsystem. The specification states that at most one control method can be actually executing AML code at any given time. If a control method blocks (an event that can occur only under a few limited conditions), another method may begin execution. However, 'it can be said that the specification precludes the concurrent execution of control methods'. Therefore, the AML interpreter itself is essentially a single-threaded component of the ACPI subsystem. Serialization of both internal and external requests for execution of control methods is performed and managed by the front-end of the interpreter." However, the DSDT for the HP nx6125 seems to expect a cooperative multithreading model in which one control method can relinquish control using Sleep() and wait for another to complete. I am not a computer scientist, but a mathematical physicist and part time system administrator. 
I hope my analysis will be useful to someone, but I don't have the time or the skills to take this further. It would be helpful to have some comment from the people who wrote the ACPI spec (HP/Intel/MS/Phoenix/Toshiba) as to what level of concurrency is really expected from the OS. P.S. I notice that acpi_os_queue_for_execution() takes a "priority" argument which is unused. What is the intention of this? So, I am relinquishing control here and hoping someone else will run: Sleep(100000000....)
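The cooperative model that the DSDT seems to expect can be sketched with Python generators, where each yield plays the role of Sleep(). This is an illustration under stated assumptions, not an ACPI implementation: the names l19, notify_handler and run_cooperative are invented, and the idea that handling the Notify clears the sensor condition is assumed for the sake of the example.

```python
from collections import deque


def notify_handler(sensor, log):
    """Stand-in for the driver-side Notify handler."""
    log.append("handler: thermal event processed")
    sensor["tripped"] = False  # assumption: handling the event clears the status
    yield


def l19(sensor, ready, log):
    """Stand-in for the _L19 polling loop; each yield models Sleep()."""
    while sensor["tripped"]:
        log.append("_L19: Notify queued")
        ready.append(notify_handler(sensor, log))
        yield  # give up the interpreter so another method may run
    log.append("_L19: returned")


def run_cooperative():
    sensor = {"tripped": True}
    ready, log = deque(), []
    ready.append(l19(sensor, ready, log))
    while ready:  # round-robin over runnable control methods
        method = ready.popleft()
        try:
            next(method)
            ready.append(method)
        except StopIteration:
            pass
    return log
```

In this model the handler runs while _L19 is sleeping, so _L19 sees the cleared status on its next pass and returns after a single poll.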
Looking at method _L19, the Sleep() operator should relinquish the processor and allow the Notify dispatch to occur. AcpiOsQueueForExecution should be using a different thread than the thread that executes the AML interpreter. The Notify dispatcher does not enter the AML interpreter mutex, so there should not be a deadlock. The actual handler for the Notify should be in a driver somewhere. I have found that the best method to determine exactly what is going on is to enable full debug tracing in the ACPICA subsystem (via acpi_dbg_level) and analyze the sequence of events.
Robert - are you suggesting that you'd like a full debug trace in order to examine this further? I can produce one of those without too much trouble, but it would be helpful to know how much debugging you'd like.
Richard, we have an nx6125 now and do not see any problems. We use the latest stable kernel, 2.6.16, based on your config file. So, please try 2.6.16 and post the results.
Well, the problem still remains, at least for me. This is how it can be verified: (1) turn on the computer on AC; (2) run some stupid CPU-eater like int main() { while(1); }; (3) unplug the notebook from AC and wait. Just a few minutes ago my CPU reached nearly 80 degrees C without any of the fans turning on. I use vanilla-sources-2.6.16 with gentoo patches applied. I don't use any modified DSDT and my BIOS version is F.0D.
I repeated your test step by step; unfortunately, the fan works here. Please attach your '.config' file; I want to reproduce your situation exactly.
Created attachment 7662 [details] kernel config (where fan still don't work) Here's my .config. Just to be 100% sure, I will check vanilla-sources without any patches and report here ASAP.
I can still reproduce the problem. A description of my kernel can be found at http://www.ceiriog.eclipse.co.uk/2.6.16-prw7 My kernel configuration (.config) can be found at http://www.ceiriog.eclipse.co.uk/dot-config-2.6.16-prw7 Additional patches mentioned in the kernel description can be found at http://gaugusch.at/acpi-dsdt-initrd-patches/acpi-dsdt-initrd-v0.7e-2.6.14.patch http://www.suspend2.net/ and in the directory http://www.ceiriog.eclipse.co.uk/patches I find that full ACPI debugging gives me too much information, so I have prepared a small patch which will show you the flow of control http://www.ceiriog.eclipse.co.uk/patches/acpi_events.patch In order to reproduce the problem reliably you need to turn thermal zone polling OFF with echo 0 > /proc/acpi/thermal_zone/TZ1/polling_frequency. You also need to turn off any application which is reading the temperature /proc/acpi/thermal_zone/TZ1/temperature. You might also want to enable some limited ACPI debugging with echo 0x0f > /proc/acpi/debug_level. Then run a CPU-intensive application (glxgears will do it eventually, so will a kernel compilation). You may monitor the temperature using watch "cat /proc/acpi/thermal_zone/*/temperature", but kill this process just before the temperature reaches 65C. Wait a while until you are confident that the temperature has exceeded 65C. Then do "acpi -t" or "cat /proc/acpi/thermal_zone/*/temperature" and you should find (1) the system seems to hang for a fraction of a second (or several seconds if it is a long time since the trip point was exceeded) and (2) you get something in the system log like http://www.ceiriog.eclipse.co.uk/patches/messages Note how the thermal notify events Notify (\_TZ.TZ1, 0x80) are repeatedly queued and never processed until after "acpi -t" causes the evaluation of _TMP. Peter Wainwright
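For anyone scripting the monitoring step of this procedure, here is a rough Python sketch. It assumes the temperature file contains a line of the form "temperature:             58 C"; that format is an assumption, not something quoted in this thread. The helper names and the 2-degree margin are likewise made up for illustration.

```python
import re

# Assumed file format, e.g. "temperature:             58 C" (not from the thread).
_TEMP_RE = re.compile(r"temperature:\s*(-?\d+)\s*C")


def parse_temperature(text: str) -> int:
    """Extract the Celsius reading from the contents of a temperature file."""
    match = _TEMP_RE.search(text)
    if match is None:
        raise ValueError(f"unrecognised temperature line: {text!r}")
    return int(match.group(1))


def read_zone(zone: str = "TZ1", root: str = "/proc/acpi/thermal_zone") -> int:
    """Read one thermal zone. Note that the read itself evaluates _TMP."""
    with open(f"{root}/{zone}/temperature") as f:
        return parse_temperature(f.read())


def should_stop_monitoring(temp_c: int, trip_c: int = 65, margin_c: int = 2) -> bool:
    """True once the reading is within margin_c degrees of the trip point."""
    return temp_c >= trip_c - margin_c
```

Since every read of the file acts as the workaround, a monitoring loop built on read_zone has to stop as soon as should_stop_monitoring returns True, otherwise the bug will not show.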
For me, the problem still exists even on a vanilla kernel.
Created attachment 7663 [details] Peter Wainwright's kernel configuration
Created attachment 7664 [details] Peter Wainwright's ACPI event tracking patch
Created attachment 7665 [details] Peter Wainwright's DSDT patch I have uploaded my config and my event tracking patch to bugzilla for reference. I forgot to mention that the /var/log/messages excerpt was created with a patched DSDT: I have also uploaded that patch (it is similar to the one in comment #31 but the symbols in my BIOS are numbered differently). This gives the ACPI Debug lines so you can see the relationship between the ACPI AML code and the kernel stuff. Vladimir, what BIOS version are you using? I am using F.0D, which is the latest I could find on the HP website (and at http://acpi.sourceforge.net/dsdt/view.php?id=558). If you have purchased your box very recently maybe it has a newer BIOS? Peter Wainwright
Created attachment 7666 [details] Peter Wainwright's DSDT patch Corrected upload
In response to #69: Tested vanilla kernel 2.6.16 with glxgears while on battery. The CPU temperature rose to 64 degrees without any response from fans, i.e., the problem still persists. BIOS: F.07 CPU: Turion ML-34 DISTRO: Debian amd64 Vladimir: I've included the distribution information above because, for example, SuSE seems to implement their own polling (at intervals of 2s) by default. As pointed out a bit earlier, you need to disable polling if enabled. However, my experiments with polling (under Debian) have all failed to improve matters. -Richard
Created attachment 7675 [details] config (2.6.16) My 2.6.16 kernel .config
There has been no significant movement on this bug for months. I can only conclude that the problem is only triggered by a few very sophisticated DSDTs (the HP nx6125 among them) or has remained unrecognized on other platforms. Nonetheless it seems to me that Linux does not correctly implement the ACPI spec. Therefore, I propose a solution with the patch attached. Interpretation of control methods called in response to GPE events or Notify events is confined to one single-threaded workqueue. It is true that the AML interpreter is essentially single-threaded, because it is protected by a mutex and therefore only one kernel thread can be executing AML code at one time. However, this does NOT mean that the execution of different control methods should not overlap. The ACPI spec allows for the transfer of control between one method and another when the AML calls Sleep, Acquire, Wait etc. (see the ACPI-CA reference). The way Sleep() is implemented in Linux, it calls schedule() and transfers control to other kernel threads: but any other control method which is queued in the kacpid thread itself will not be able to run until the currently executing control method is finished. On the HP nx6125 laptop this is essential, otherwise the ACPI subsystem will block, thermal events will not be processed, and the system will overheat. http://bugzilla.kernel.org/show_bug.cgi?id=5534 So, if any of you have an nx6125, or an ACPI bug which you think may be caused by the mechanism I have described, please try this patch. This is my first attempt at a serious kernel hack, so please forgive the state of it: this is work in progress. At least it should show one approach to the problem. I'm sure it has loads of problems with respect to locking, SMP, etc. which you will point out. In order to enable the new behaviour you need to write a positive integer (e.g. 10) to /proc/acpi/poolsize. 
Instead of executing all the GPE and Notify handlers in the single kacpid thread, we create a pool of worker threads and hand over the work to them. These can now execute concurrently, though access to the interpreter is still serialized by the use of a mutex. The thread pool is allowed to grow dynamically up to the maximum size which is set by the user by writing an integer to /proc/acpi/poolsize. If this integer is 0 the thread pool is not used at all and the old behaviour is used. You can also read /proc/acpi/poolsize to see the maximum pool size and the currently allocated threads. There is a field "jiffies" in the thread pool entry structure which is written when a thread finishes execution of a control method. My intention is that in future this will be used to reap unused threads. Of course, the user-configurable pool size may not be necessary. We might hard-code it. Or even allow the AML to create as many threads as necessary (assuming we trust the BIOS). Peter Wainwright (P.S. not the Apache/Perl expert). (copied to linux-acpi list).
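The thread-pool idea can be imitated in user space with a short Python sketch. It is only an analogy to the patch, not its code: ThreadPoolExecutor stands in for the worker pool, a threading.Lock for the interpreter mutex, and the function run with its parameters max_polls and sleep_s is invented for the example. With a pool of one the old behaviour appears, and with two or more the Notify handler gets to run while _L19 sleeps.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run(pool_size, max_polls=3, sleep_s=0.2):
    """Return the order of events for a given worker pool size."""
    interpreter = threading.Lock()   # stands in for the AML interpreter mutex
    handled = threading.Event()      # set once a Notify handler has run
    log = []

    def notify_handler():
        with interpreter:
            if not handled.is_set():
                log.append("Notify handled")
                handled.set()

    def l19(pool):
        with interpreter:
            polls = 0
            while not handled.is_set() and polls < max_polls:
                pool.submit(notify_handler)   # queue the Notify
                polls += 1
                interpreter.release()         # Sleep(): give up the interpreter
                handled.wait(sleep_s)
                interpreter.acquire()
            log.append("_L19 returned")

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        pool.submit(l19, pool).result()
    # leaving the "with" block waits for any handlers that are still queued
    return log
```

run(1) gives "_L19 returned" before "Notify handled", while run(2) gives the opposite order; the interpreter lock still ensures that only one of the two is inside the interpreter at any moment.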
Created attachment 7812 [details] Patch for multi-threaded execution of control methods This patch is applied to vanilla kernel 2.6.16
Reply-To: dagarlas@gmail.com I recompiled 2.6.16 and it seems OK (with nx6125, BIOS F.0D); however, I'll test more thoroughly this weekend. Just now I tried running glxgears, and the fan starts and stops correctly.
I'd better make clear that you use my patch at your own risk - it could hang your system, or at least make it unresponsive. It probably won't eat your data, but I can't give a warranty for that. It works for me in the normal mode of operation; however: This afternoon I resumed from Software Suspend 2 and tried to run glxgears again. I got a storm of ACPI events which made the system very unresponsive. I rebooted and tried again: then the thermal trip point did not trigger at all. Only power off and cold boot restored the normal function. I don't think the problem is necessarily in my patch, though. When I tried suspending an unpatched system the ACPI subsystem stopped responding to events entirely. I guess there is some state in the ACPI thermal subsystem which is not reset correctly by suspend. I will continue to test my patch this week without using suspend, and report the results.
Reply-To: dagarlas@gmail.com Don't worry, I won't complain... I'm used to this notebook's instability with Linux, I nearly hate it... However, now the fan seems OK. There's still a problem: sometimes it stops updating the battery level. I don't know if it's related to this bug. Sometimes a workaround is to plug in the AC and remove it, but it doesn't work every time...
Well Peter :) Big thanks, you are my saviour: the fan finally works... I'll run some more tests and post reports here :)
Reply-To: dagarlas@gmail.com OK, after some testing I experienced some problems. I set poolsize=10 automatically at startup, then: 1 - I start X and KDE, fan is off; 2 - run glxgears until the fan turns on; 3 - as the fan turns on I stop glxgears; 4 - as soon as the fan turns off I re-run glxgears (return to 2). After 2 loops I get 100% CPU usage, but top shows no process using such resources. The system is so slowed down that I need to reboot. I tried setting poolsize = 0 but I get the same behaviour (except for the working fan :-D). To restore the system I needed to recompile the kernel without the patch. So, what is using all my CPU?
To dagarlas@gmail.com: It would be helpful if you could compile with ACPI debug statements, and then run echo 0x480 > /proc/acpi/debug_layer followed by echo 0x0f > /proc/acpi/debug_level. It would also be helpful if you could use the patch which allows you to use a modified DSDT, and use the modified DSDT here: http://www.ceiriog.eclipse.co.uk/acpi/DSDT.aml (or compile the source http://www.ceiriog.eclipse.co.uk/acpi/DSDT.asl using the iasl compiler). Then you will get some quite verbose debugging information in /var/log/messages and on the console. If you post it here that may be helpful (you may trim the tail of the listing, since it sounds like your system has got stuck in a loop). Also, it would be helpful to see the output of ps -ef | grep kacpid if you can get it, and possibly (if you can get it) the output of cat /proc/acpi/poolsize just to see how many worker threads have been created (in normal operation there should be 2, if things run away there can be up to 10). By the way, did you really need to recompile the kernel without the patch? I always keep my old kernel(s) around when I am experimenting with kernel stuff. Use GRUB to boot the old kernel when the new kernel crashes! Peter
Well, I repeated the glxgears test as Anonymous Emailer ;P suggested, and each time the fan started working. Maybe it's about the patched DSDT?
Peter, in your config preemption is unset. Could you please check whether turning on preemption helps?
Alexey, I tried CONFIG_PREEMPT_VOLUNTARY and CONFIG_PREEMPT. As I expected, neither had any effect. Preemption only affects the transfer of control between threads. The cause of this bug is the single-threaded nature of kacpid. Preemption cannot convert a single thread (kacpid) into multiple threads. Peter
I think I have the same problem with Linux on my HP Compaq nx6125. It also happens frequently that ACPI just stops updating: for example, the battery level isn't updated, the fans don't turn on when the processor gets hot, and dynamic frequency changing stops working. I'm not sure if this is related to this bug, but if it is I would be more than glad to help with fixing it. Just give me instructions on what to do, as detailed as possible, because this is my first time doing anything other than configuring and compiling the kernel. Thanks!
Ok, this patch appears to help significantly. As Peter suggests, there are problems with suspend to disk - afterwards, I don't seem to get thermal events. Suspend to RAM works fine. The issue with suspend to disk also seems to occur on some other HP machines (such as the 6220) which don't suffer the original problem, so that may well be a separate bug. I'll try to narrow it down on my 6220. With a bit of luck that should give us a clue.
Anyway, here is the second version of my patch. It has been tested pretty thoroughly against 2.6.16.4, and applies cleanly to 2.6.16.11. As promised, the thread pool entries are now destroyed dynamically; if no ACPI events occur in 10 seconds, all the extra threads will disappear. I have added a kernel boot option "acpi_pool_size=". This is necessary because a thermal trip may occur between booting and running any init script which writes the /proc/acpi/pool_size (particularly if the system was warm to start with). Therefore, the thread pool needs to be enabled at boot time. I renamed /proc/acpi/poolsize to /proc/acpi/pool_size for greater readability and compatibility with other filenames. I downgraded the debugging messages to ACPI_DB_INFO level, so they are not printed by default. The patch is very robust for me now, so I don't think we should need them. Peter
Created attachment 7969 [details] Patch for multi-threaded execution of control methods (version 2)
There is a new BIOS version out, F.0E. Did anyone try it and check whether it fixes the problem? http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c00654176&jumpid=em_EL_Alerts/US/Apr06_ALL/Alerts
I've upgraded the BIOS to F.0E and it didn't change anything. The fan won't kick in by itself. acpi -t helps, but I get strange readings from Thermal 1: first 58 C, then 65+ (after issuing the command a second time), and then it drops below 50 within several seconds. Thermal 4 shows 50+ C after the first acpi command, then drops to 0 C in a few seconds and stays that way.
Peter, thank you for getting to the bottom of this. I haven't had time to try your patch, but I'll do so soon. Just a question: is there any risk of the patch having any adverse effects on other machines? The reason I ask is that if you don't think so, I'll try to lobby to have the patch included in the Ubuntu kernel, unless Matthew Garrett is already planning on putting it in, of course :).
My pleasure to be of assistance, Johan. The patch changes nothing (except at present it adds an ACPI_DEBUG_PRINT statement) unless the new behaviour is explicitly enabled using a boot option or by writing to a /proc file. So, it will not affect other machines. According to a report at https://launchpad.net/distros/ubuntu/+source/linux-source-2.6.15/+bug/35455, it looks like the first version of my patch is already in Ben Collins's git tree at rsync.kernel.org:/pub/scm/linux/kernel/git/bcollins/ubuntu-2.6.git :-) You don't let the grass grow under your feet, you Ubuntu folks... Peter
Hi, it seems that the patch works. However, I noticed that after resuming from suspend to RAM the thermal events don't get updated anymore, just as it was without the patch. If I run glxgears the fans don't kick in. If I then execute acpi -t they kick in if the temperature is high enough, but then they stay on even if the temperature drops below 50.
Sorry, I forgot the kernel parameter acpi_pool_size. Now the problem after resuming seems to be gone. Only the dimming problem remains, but that is not a real bug. Christian
Peter, Many, many thanks for all the research into this problem and the patches. I've just applied your version 2 to kernel 2.6.16.13 and it works very well, both on battery and on AC. Is there some recommended value of poolsize? Or will the value of 10 that you suggested in an earlier post work in most (all) instances? Richard Mace
The boot parameter acpi_pool_size=10, or echo 10 > /proc/acpi/pool_size, should be sufficient. The minimum necessary pool_size is probably 2. However, this number is just a limit on the number of threads which are created dynamically, and so it should never be reached on a well functioning system. There's no reason you shouldn't make it 99999, UNLESS you have some other ACPI bug which prevents the control methods from terminating properly. In that case a modest limit should prevent ACPI from using all your CPU time. Peter
Created attachment 8023 [details] patch to execute Notify handlers on a new thread Here is a simple patch to execute Notify handlers on a new thread immediately instead of adding them to the common workqueue. It cures my nx6125; please check on yours.
Created attachment 8025 [details] Updated patch with feedback from Andy Kleen Add do_exit() at the end of Notifier to exit cleanly from thread.
Alexey, Tested your patch mentioned in #105 applied to 2.6.16.13. Works fine. I don't have suspend working, so could not test that. Richard
Marking it resolved as it solves a problem with at least two machines.
Created attachment 8029 [details] Reworked patch with feedback from Robert Moore One more try
Hi, Peter's 2nd patch worked well for me too. In the proc fs I saw up to three entries. Alexey's 2nd version didn't work for me on my 64bit/SuSE 10: the fan stopped forever. However, after upgrading to 32bit SuSE10.1, Alexey's 2nd version seemed to work as well. But during suspend with 2.6.16.14 and Alexey's 2nd version I got a call stack (note: noapic set, otherwise suspend would hang):
    May 6 16:25:11 turion kernel: ieee1394: Node removed: ID:BUS[0-00:1023] GUID[00023f992975320a]
    May 6 16:26:30 turion kernel: Stopping tasks: =============================================================================================|
    May 6 16:26:30 turion kernel: Shrinking memory... done (55399 pages freed)
    May 6 16:26:30 turion kernel: pnp: Device 00:02 disabled.
    May 6 16:26:30 turion kernel: .........................................................................................................................
    May 6 16:26:30 turion kernel: Intel machine check architecture supported.
    May 6 16:26:30 turion kernel: Intel machine check reporting enabled on CPU#0.
    May 6 16:26:30 turion kernel: swsusp: Restoring Highmem
    May 6 16:26:30 turion kernel: Debug: sleeping function called from invalid context at mm/slab.c:2729
    May 6 16:26:30 turion kernel: in_atomic():0, irqs_disabled():1
    May 6 16:26:30 turion kernel: [<c0146dfc>] kmem_cache_alloc+0x1b/0x4f
    May 6 16:26:30 turion kernel: [<c01bea3b>] acpi_os_acquire_object+0xb/0x36
    May 6 16:26:30 turion kernel: [<c01d3b59>] acpi_ut_allocate_object_desc_dbg+0x10/0x3e
    May 6 16:26:30 turion kernel: [<c01d3b9c>] acpi_ut_create_internal_object_dbg+0x15/0x68
    May 6 16:26:30 turion kernel: [<c01cfeb1>] acpi_rs_set_srs_method_data+0x3d/0xb7
    May 6 16:26:30 turion kernel: [<c01d6ed7>] acpi_pci_link_set+0xf5/0x169
    May 6 16:26:30 turion kernel: [<c01d6f7f>] irqrouter_resume+0x34/0x52
    May 6 16:26:31 turion kernel: [<c01f7e43>] __sysdev_resume+0x11/0x53
    May 6 16:26:31 turion kernel: [<c01f7f83>] sysdev_resume+0x16/0x47
    May 6 16:26:31 turion kernel: [<c01fbb0e>] device_power_up+0x5/0xa
    May 6 16:26:30 turion ifplugd(eth0)[2730]: Link beat lost.
    May 6 16:26:31 turion kernel: [<c012bffd>] swsusp_suspend+0x6b/0x85
    May 6 16:26:32 turion ifplugd(eth0)[2730]: Link beat detected.
    May 6 16:26:32 turion kernel: [<c012ce18>] pm_suspend_disk+0x44/0xd3
    May 6 16:26:32 turion kernel: [<c012b604>] enter_state+0x50/0x16c
    May 6 16:26:32 turion kernel: [<c012b7a8>] state_store+0x88/0x95
    May 6 16:26:33 turion kernel: [<c012b720>] state_store+0x0/0x95
    May 6 16:26:33 turion kernel: [<c01796d6>] subsys_attr_store+0x1e/0x22
    May 6 16:26:33 turion kernel: [<c0179915>] sysfs_write_file+0x98/0xbe
    May 6 16:26:33 turion kernel: [<c017987d>] sysfs_write_file+0x0/0xbe
    May 6 16:26:33 turion kernel: [<c0149def>] vfs_write+0xa1/0x146
    May 6 16:26:33 turion kernel: [<c014a302>] sys_write+0x3c/0x63
    May 6 16:26:33 turion kernel: [<c0102a29>] syscall_call+0x7/0xb
    May 6 16:26:33 turion kernel: unexpected IRQ trap at vector 89
    May 6 16:26:33 turion kernel: unexpected IRQ trap at vector 89
    May 6 16:26:33 turion kernel: unexpected IRQ trap at vector 89
    May 6 16:26:33 turion kernel: unexpected IRQ trap at vector 89
    May 6 16:26:33 turion kernel: unexpected IRQ trap at vector 89
Unfortunately the fan stopped after resuming, but the rest of the system seems to be working. I haven't tested Alexey's 3rd version so far; I couldn't get it to compile yet. Regards, Markus
The 3rd patch from Alexey doesn't work either.
Alexey's 2nd patch worked until I tried to suspend (using Software Suspend 2): then it hung. Alexey's 3rd patch applies to my 2.6.16.11 kernel with one change (ec->burst to ec->intr) but then causes strange problems in places which seem to have little to do with ACPI: maybe memory corruption? I don't know why these patches fail, but you should check you don't call a function which can sleep from an invalid context (e.g. from an interrupt handler). This is a no-no according to Linux Device Driver Development and other kernel documentation. This is why my patch takes a safety-first approach and spawns new threads from the normal kacpid thread only. This may take a little longer (it is a 2-stage dispatch process) but at least I can be pretty certain that it does not sleep in an interrupt.
Created attachment 8085 [details] Hopefully fixed Alexey's 3rd patch Straight comparisons were used. It helped on my box. Possibly a compiler issue?
New AcpiOsExecute interface integrated and released in ACPICA version 20060512
Hi, Peter's 2nd patch worked for me too, but it seems that the patch does not work after a reboot. For example, if AC is on and the fan setting in the BIOS is "always on when on AC", then when you first boot your PC everything is OK. After a soft reboot, even if AC is on, the fan stops, and it seems that the patch is not working. If you issue acpi -t the fan starts, but even if the temperature grows above e.g. 65 degrees, the status of Thermal 1 is ok. After acpi -t, the state changes to active. The situation is the same as Christian (hofrichter@freenet.de) described. My firmware is F.0E, kernel 2.6.16.14, and acpi_pool_size is set.
Applied the patch in comment #112 to acpi-test; it should appear in the -mm tree shortly.
I tried Alexey Starikovskiy's patch, and it seems that suspend to RAM does not work anymore. Best regards, Christian
With Alexey Starikovskiy's version my laptop does not wake up from suspend. However, this works with the unpatched kernel and with Peter Wainwright's patch. On the other hand, I cannot resume from suspend to disk (suspend2) with Peter Wainwright's patch because I get the message "ACPI Exception AE_NO_MEMORY ... unable to queue handler for GPE ... Event disabled ". I get this message printed a few hundred times, and all I can do is press the power button. Is there a possibility to get rid of this message? Some kind of kernel option? I am using kernel 2.6.17-rc4 at the moment. As far as I can remember, it is possible with Alexey's version to resume from suspend to disk without this message. By the way, I am getting hundreds of these:
    [acpid]: notifying client 3147[0:0]
    May 22 16:59:48 [acpid]: notifying client 3177[0:0]
    May 22 16:59:48 [acpid]: notifying client 3374[0:0]
    May 22 16:59:48 [acpid]: completed event "processor C000 00000080 000000 00"
    ...
in my acpid log file. This is repeated about 30-40 times in a second (but not every second). What does it mean and how can I turn this off? I suppose it has something to do with the dynamic frequency change during workload. In case it is important, I am currently using SUSE 10.1, and the kernel option is 'noapic' for suspend. Hope this information helps. Regards, Christian
Running the SUSE10.1 default kernel (2.6.16.13-4-default) with the patch from comment #112 seems to fix the problem with thermal events, but now I see a problem with enabling fans. After a few transitions into active[3] state I get the following messages:
    21:12:12 kernel: ACPI Warning (acpi_power-0445): Transitioning device [C273] to D0 [20060127]
    21:12:12 kernel: ACPI Warning (acpi_bus-0267): Transitioning device [C273] to D0 [20060127]
    21:12:12 kernel: ACPI Warning (acpi_thermal-0644): Unable to turn cooling device [ffff810037fa56c0] 'on' [20060127]
After this the fan is no longer enabled when in state active[3], even when reading the thermal zone shows it as active. At this point, if there is a transition to active[2] the fan is still enabled, but after a few such transitions I see:
    21:21:35 kernel: ACPI Warning (acpi_power-0445): Transitioning device [C272] to D0 [20060127]
    21:21:35 kernel: ACPI Warning (acpi_bus-0267): Transitioning device [C272] to D0 [20060127]
    21:21:35 kernel: ACPI Warning (acpi_thermal-0644): Unable to turn cooling device [ffff810037fa57c0] 'on' [20060127]
and again the fan stops working for state active[2]. After getting the above messages, the fans will not turn on even after moving out of the active state and back. Only a reboot seems to fix this.
Oops, forgot the DSDT patch to fix fan errors. Bios version is F.0E, by the way.
Made changes to _ON method in DSDT from comment #57 (but didn't add polling), now fans and thermal events working OK.
Is there a DSDT patch for the F.0E BIOS? I could only find patches for the older F.0D version. Is it necessary to patch the BIOS with a new DSDT, or is it enough to apply the ACPI patch to the kernel? Did anyone succeed in using suspend to RAM with the ACPI patch?
Yes, I'm still using the second patch from Peter, without patching the DSDT and suspend works like a charm, no problems with the fan...
Created attachment 8221 [details] DSDT for F.0E nx6125 bios with _ON method patch There isn't a patch at acpi4linux for the F.0E BIOS yet. I found some instructions for extracting the current DSDT and made the changes to the _ON method from comment #57. I also changed line 6550 to fix an error (iasl version 20060512), by swapping in a value from the F.0D custom DSDT. In SUSE you can add the DSDT to initrd, so you don't need to recompile the kernel. This appears to fix a problem enabling the fan, which I see even with an unpatched kernel while running a script to poll thermal events. If you don't see a problem with turning on the fan, then I guess you don't need the patched DSDT. I haven't tried suspend, so it may or may not work, and the patch does not include other fixes from the F.0D custom version. As soon as a patch is available for F.0E at acpi4linux I would suggest using that, but here's the DSDT I'm using if anyone wants to give it a try.
Please ignore the patched DSDT in my previous comment. I've double-checked my BIOS version and it's F.0D, not F.0E. Apologies for the silly mistake and any inconvenience this may have caused. The changes in comment #57 do appear to fix the fan problem though, and the checks in the _ON method are still there in F.0E. If you see ACPI warnings about being unable to turn on the fan with F.0E, it's worth trying the changes.
Thought I'd add my two cents on the patch mentioned in comment #112. Been using it for a week or so now and have found it to be rock solid. My fans cycle perfectly, turning on and off as expected, with the correct hysteresis. This is highly reproducible and independent of earlier usage. I don't use suspend, so I cannot comment on that. BIOS F.0D; Distro: Debian; Kernel: 2.6.16.13 My DSDT (F.0D) is unpatched.
Richard Mace is right, the fan and temperature checking work correctly now with the patch in comment #112. However, I cannot get suspend to work correctly. Suspend, on the other hand, does work with Peter Wainwright's patch. I am now asking myself what the difference between the two versions is. As far as I know, Alexey's patch executes the ACPI events immediately. I think it will be hard to provide any log messages, as everything gets unmounted correctly when going to suspend to RAM. Then on reboot the screen stays black. I can hear the fan, so I conclude that the kernel is not working correctly on resume, because otherwise the fan would spin down immediately when the system wakes up.
With a SUSE10.1 default kernel (2.6.16.13), firmware (F.0D) and the comment #112 patch I still see the following log message after a while: 21:21:35 kernel: ACPI Warning (acpi_thermal-0644): Unable to turn cooling device [ffff810037fa57c0] 'on' [20060127] and then the fan stops working. I also get the message after a while with an unpatched kernel if I poll thermal events with a script. I had assumed that this was because I did not patch the DSDT, but now I am not sure if the #112 patch should be fixing this. Does anyone else see this behaviour after applying the patch?
In response to comment #127: Jim, I think that the patches found here apply to the official linux kernel (found at www.kernel.org). They may apply to other (modified) kernels, but then they aren't guaranteed to work as intended. At least that is my understanding...
I've been using the latest patch with 2.6.17-rc5 (vanilla) for 3 days without problems.
Forget about the suspend problem. I have reinstalled my distribution and now it works. I also see no ACPI warning message that the fan cannot be turned on. I am using kernel 2.6.17-rc5, so at least with this version I think it should work.
How can I turn off or decrease the debug level of acpid? The log file gets really huge. I am getting something like this every second:
notifying client 3073[0:0]
[acpid]: notifying client 3144[0:0]
[acpid]: notifying client 3320[0:0]
[acpid]: completed event "processor C000 00000080 000000 00"
The client numbers are the process IDs of hald-addon-acpi, powersaved and X. Christian
You can disable the acpid messages somehow here (latest SUSE): /etc/syslog-ng/syslog-ng.conf: filter f_acpid { match('^\[acpid\]:'); }; I am not familiar with this config... I hope this is enough.
I've been using the patch of comment #112 for some weeks now, and it works well without suspend. However, this evening after resuming from suspend-to-disk (the mainline kernel version) the machine overheated again. acpi -t repeatedly showed the wrong temperature 58C (which is the first trip point) although the machine was a lot hotter than that. The last thing in /var/log/messages: Jun 2 17:29:59 ceiriog kernel: osl-0925 [19719] os_wait_semaphore : Failed to acquire semaphore[ffff810017879e00|1|0], AE_TIME
Sometimes it happens that the fan stays on although "acpi -t" says that all fans should be inactive as the cpu temperature stays far below 50
I'm running Ubuntu kernel 2.6.15-23-amd64-k8, 2.6.15-23.39 to be exact. This kernel incorporates some version of the multithreaded ACPI patch, although I'm not sure which one. I see exactly the same behaviour that Peter reported in comment #133: everything works except that after a suspend/resume cycle, the fan won't turn on and the temperature reported is a constant 58 degrees.
Since at least 2 of us have the same problem after suspend, I'd like to REOPEN this bug.
Created attachment 8361 [details] Attempt to fix suspend-resume Please try suspend-resume with this patch. Only one string is different. Also patches to bugs #6455 and #6687 may help as well.
patch in comment #112 shipped in 2.6.17-git9 so we need an incremental patch for the resume issue for testers using the latest kernel.
I patched my Ubuntu kernel with Alexey's patch from comment #133 and also with the patch from http://bugzilla.kernel.org/show_bug.cgi?id=6455 and I can happily report that I now have a working fan after suspend/resume. Thank you!
Sorry, that should of course be comment #137.
Also, it looks like I may have spoken too soon. It still happens sometimes that the fan refuses to start after resuming. Conversely, sometimes the fan starts immediately after resume and won't turn off. Unfortunately I couldn't find anything in the logs that looks relevant.
Johan, You could try the last patch for bug #5000 - it adds suspend/resume support for fan/thermal subsystems.
On Wed, 12 Jul 2006, Linus Torvalds wrote: > > Any reason to not just revert it? The fundamental problems that it > introduces are obviously much worse than the fix. Ok, that commit b8d35192c55fb055792ff0641408eaaec7c88988 is definitely horribly horribly broken. I'm going to revert it, because the "fix" is much worse than the problem it fixes. Instead of a fan not coming on, I now have ten thousand threads killing the machine instead - and the fan _still_ doesn't come on.. The thread approach doesn't even fix the fundamental problem itself. It doesn't help to start a new thread, when the AML interpreter holds a semaphore over the sleep, causing the events to be serialized, and the thermal events to be delayed _anyway_. The only thing the threading causes is that it guarantees that the machine ends up being totally overwhelmed by the thousands of threads, all blocked on the same semaphore. I don't know what the solution should be, but in the meantime, the "fix" is definitely unacceptable. Linus
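The serialization argument above can be illustrated with a small user-space sketch. This is not kernel code: `interpreter_lock` and `handle_event` are hypothetical stand-ins for the AML interpreter semaphore and a notify handler, and the sleep duration is arbitrary. It only shows that one thread per event buys no concurrency when every handler must hold the same lock across its sleep.

```python
import threading
import time


def run_serialized_handlers(n_events, hold_seconds=0.001):
    """Spawn one thread per event; every handler needs the same lock.

    Returns (max handlers ever inside the lock at once, handlers completed).
    """
    interpreter_lock = threading.Lock()  # stand-in for the interpreter semaphore
    state_lock = threading.Lock()        # protects the counters below
    state = {"inside": 0, "max_inside": 0, "done": 0}

    def handle_event():
        with interpreter_lock:           # all but one thread block here
            with state_lock:
                state["inside"] += 1
                state["max_inside"] = max(state["max_inside"], state["inside"])
            time.sleep(hold_seconds)     # the lock is held across the sleep
            with state_lock:
                state["inside"] -= 1
                state["done"] += 1

    threads = [threading.Thread(target=handle_event) for _ in range(n_events)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["max_inside"], state["done"]
```

However many threads are started, only one handler is ever inside the lock, so the events are still processed one after another while the remaining threads sit blocked.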
FYI: This bug is also occurring on HP nx6115. I am using FEDORA CORE 5.
I'm starting to think that the notify handler should be executed synchronously in the same thread executing the _GPE method in order to prevent a flood of new GPEs. This will require additional investigation. ACPICA was modified in early 2001 to move execution of the notify handler to a new thread.
Created attachment 8601 [details] Limit number of concurrent threads Added limit on number of threads spawned by Notifies. Should make Linus' system at least debuggable.
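For readers unfamiliar with the idea, a cap on handler threads might look roughly like the sketch below. It is an illustration in user-space Python, not the attached patch: the class name `LimitedNotifier`, the `max_threads` parameter and the choice to queue excess events (rather than drop them or run them inline) are all assumptions.

```python
import threading


class LimitedNotifier:
    """Run at most `max_threads` handler threads; queue the rest."""

    def __init__(self, max_threads):
        self.max_threads = max_threads
        self.active = 0          # number of worker threads currently running
        self.deferred = []       # handlers waiting for a free worker
        self.lock = threading.Lock()

    def notify(self, handler):
        """Return True if a new thread was started, False if deferred."""
        with self.lock:
            if self.active >= self.max_threads:
                self.deferred.append(handler)
                return False
            self.active += 1
        threading.Thread(target=self._run, args=(handler,)).start()
        return True

    def _run(self, handler):
        # A worker keeps draining deferred handlers before it exits,
        # so queued events are never lost.
        while True:
            handler()
            with self.lock:
                if not self.deferred:
                    self.active -= 1
                    return
                handler = self.deferred.pop(0)
```

With a limit of 2, a burst of five events starts two threads and queues three, so a flood of notifications can no longer create thousands of blocked threads.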
I just tried 2.6.18-rc2. This kernel seemed to already have most of the ACPI patches - the only one I applied was the thread-limiting patch. Unfortunately I didn't have any luck. The temperature reading is still stuck at 58 degrees when resuming, so the fan never turns off.
FYI, this also happens to HP Compaq NX6120. I am using OpenSUSE 10.1. The fan does not turn on automatically until I do an acpi -t.
Still fails for me after a resume (both in-kernel swsusp and Software Suspend 2). I am now using kernel 2.6.18-rc2 with the multi-thread patch from here. I patched my DSDT with the following, which basically prints a message each time the loop in _L19 is executed and each time a _TMP method is called, and shows the value of the flag C176, which seems to control the resetting of the trip points. The result was that, after resume, when the machine got hot, the reported temperature stuck at 58 and the following was repeatedly sent to syslog:
Aug 6 17:27:39 ceiriog kernel: [ACPI Debug] String: [0x19] "DSDT _L19 Local0=00000010"
Aug 6 17:27:39 ceiriog kernel: [ACPI Debug] String: [0x20] "DSDT get_temp(00000000)=00000042"
Aug 6 17:27:39 ceiriog kernel: [ACPI Debug] String: [0x1D] "DSDT need_reset_trip=00000001"
Aug 6 17:27:39 ceiriog kernel: [ACPI Debug] String: [0x19] "DSDT _L19 Local0=00000010"
Aug 6 17:27:39 ceiriog kernel: [ACPI Debug] String: [0x20] "DSDT get_temp(00000000)=00000042"
Aug 6 17:27:39 ceiriog kernel: [ACPI Debug] String: [0x1D] "DSDT need_reset_trip=00000001"
So, we are still stuck in the damn _L19 loop: the NOTIFY events are being processed and the _TMP methods called, but this does not seem to switch on the fan or reset the trip points.
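When reading debug output like the lines above, it can help to decode the values. The following is a small sketch that assumes the numbers after `=` are hexadecimal; the helper name `decode_debug_line` is made up for illustration, and the comment does not state what unit `get_temp` uses, so the sketch only converts the number.

```python
import re

# Matches the quoted payload, e.g. "DSDT get_temp(00000000)=00000042"
DEBUG_RE = re.compile(r'"DSDT (?P<name>[^=]+)=(?P<value>[0-9A-Fa-f]+)"')


def decode_debug_line(line):
    """Return (name, integer value) for one debug line, or None if no match."""
    match = DEBUG_RE.search(line)
    if match is None:
        return None
    # Assumption: the value is printed in hexadecimal.
    return match.group("name"), int(match.group("value"), 16)
```

Under that assumption, `get_temp(00000000)=00000042` decodes to 66 and `_L19 Local0=00000010` to 16.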
Created attachment 8720 [details] Debugging patch for _L19 and _TMP in DSDT
You might want to have a look at: https://bugzilla.novell.com/show_bug.cgi?id=179702 - comment #47. Those HP BIOSes (including nx6125) have wrong OperationRegion declarations. Because of that the interpreter might read/write to random/wrong memory addresses. Maybe solving that one will also help here?
Judging by the previous comment, wouldn't that mean that the patch in: http://bugzilla.kernel.org/show_bug.cgi?id=6455 Is not resolving the core issue as well?
Please try doing "echo platform > /sys/power/disk" before suspending. GPE block state is not preserved across shutdown, so thermal events become disabled after resume if "cat /sys/power/disk" shows "shutdown". Thomas, this is a different issue.
Hi, I have the same problem. Suspending causes the fan to stay on all the time or even worse the fan will not turn on at all. This happens after resuming from suspend to disk or suspend to ram. Using "platform" instead of "shutdown" did not solve the problem. Kernel version is 2.6.18-rc4 with noapic at boot Alexey's patch for kernel 2.6.17 worked well. The newer kernel versions 2.6.18-rcX seem to have most of the patches applied already as the fan management works when booting. However there seems still to be the problem with resuming.
I installed the thread-limiting patch, and with it thermal management after suspend to RAM seems to work with noapic. The funny thing is that suspend to disk does not work, although it should be less problematic than suspend to RAM. When I suspend to disk I get something like "ACPI: Transitioning device [C258] to D3" right at the moment when the kernel enters the critical section and tries to copy to disk. The fan then blows like hell at the highest speed, although the CPU is pretty cool, and the system hangs. I don't know if this problem also persists with suspend2, as there is no patch yet for rc4.
Hi, I have the same issue on HPC nx6325 with the 2.6.18-rc6 kernel (64-bit), and Peter's patch from Comment #95 seems to fix it. I have observed additional symptoms: without the patch, kacpid constantly generates about 4% of CPU load (on one core) and it's impossible to unload any ACPI modules (the rmmod process gets stuck in TASK_UNINTERRUPTIBLE).
Of course I mean the same as described in the report and Comment #1.
Created attachment 8950 [details] don't defer release of global lock This one removes one user of the deferred queue, so it should become less busy and less deadlock-prone.
Created attachment 8951 [details] don't defer release of global lock This one removes one user of the deferred queue, so it should become less busy and less deadlock-prone.
Created attachment 8952 [details] create another workqueue for notify() execution 2.6.18-rc6 with this and global lock patches is able to do all the tricks including fan control before and after suspend to disk or memory. Tried with 'echo platform > /sys/power/disk' and "noapic" on command line on nx6125.
The patches from Comment #159 and Comment #160 fix the issue for me. Thanks a lot!
On the 2.6.18-rc5-mm1 kernel with the patches from Comment #159 and Comment #160 I'm seeing symptoms similar to those described in Comment #141. Unfortunately setting the suspend mode to 'platform' doesn't help and I cannot boot with 'noapic', because it's an SMP system. However, it seems that after a resume (from disk) the temperatures are read accurately, but for some reason the fans are out of control. It seems that the states of the fans are not read correctly too (eg. if two fans are on, the system reports only one etc.). IOW, this seems to be a separate problem.
It looks like on my box the fan(s) resume issue may be resolved by not loading the fan module from the initrd, so that the "resuming" kernel does not attempt to control the fans before the "restored" kernel takes over. I've tried it with 'echo reboot > /sys/power/disk', so it should also work with 'shutdown'. Interestingly, the trick doesn't work with 'platform'.
This is interesting indeed... I don't use initrd at all, maybe this is why I could not see the problem...
Following Comment #163: Unfortunately I spoke too soon. It sometimes works with 'reboot', but it doesn't work with 'shutdown'. Still the information that the suspend to RAM works but the suspend to disk doesn't (Comment #155) suggests that the ACPI suspend actually suspends the hardware in the prethaw phase (ie. suspend in the "resuming" kernel) which it shouldn't do. Or there is some state that we should save during the suspend to disk and we don't. For example, AFAICT, the ACPI spec says we should save the ACPI NVS regions, but we don't. Anyway the issue seems to be separate from the one this Bug has been opened for, so I think I'll open a new Bug for it.
I tried the two patches. - Suspend to ram works without problems - Suspend to disk shuts down and resumes but thermal management behaves strangely. The fans only kick in at 82
I have created a new bugzilla entry for the thermal issue after resume from disk (Bug #7122).
Apparently, vanilla 2.6.18-rc6-mm1 works on my box like 2.6.18-rc5-mm1 with the patches from Comment #159 and Comment #160. Also the thermal management problems after a resume from disk seem to be the same.
Patches from Comment #159 and Comment #160 applied to acpi-test.
Patches from Comment #159 and Comment #160 were pulled upstream immediately after 2.6.19-rc2.
Created attachment 9631 [details] Fix patch to work with Linus' Compaq n620c Add yield before any execution of deferred functions in order to give Notify() workqueue a chance to run.
patch from comment 171 applied to acpi-test
I wanted to file another bug, but this one looks very close to my problem: a simple "while true; do true; done" will overheat my CPU and result in a machine shutdown (hardware protection or something) because the fans are not started. But if I run "acpi -t", the fans kick in and don't stop, not even after I kill the script. I have to run another "acpi -t" to shut them down. I tried to do a "watch 'acpi -t'" to get my fans on and off when they have to, but it does not work like this. It appears I have to run the command only when needed. An old 2.6.15 ubuntu kernel had this fixed, but none of the vanilla kernels do (or at least not the ones I tried: 2.6.18 and 2.6.19).
So could you check that the latest patch helps?
Not quite. "acpi -t" still makes the rules (influences when the fans start/stop), however at some moment in time, when the temperature reaches 83 the fans do start, cool the CPU down to 73 and then stop. It's not quite "laptop behaviour", that's why I'll keep my CPU at 800MHz until a proper fix is made. It's quite interesting how a read from /proc/acpi/thermal_zone/TZ1/temperature gets things moving.
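For anyone scripting this check instead of running "acpi -t" by hand, here is a minimal sketch. It assumes the file contains a single line of the form `temperature:             58 C`, which is how these kernels usually present it; the function names are made up for illustration. As noted above, on affected machines the read itself is what gets the pending thermal event processed.

```python
import re


def parse_temperature(text):
    """Extract the integer temperature from the file contents.

    Assumes the layout 'temperature:   <number> C'.
    """
    match = re.search(r"temperature:\s*(-?\d+)\s*C", text)
    if match is None:
        raise ValueError("unexpected thermal zone format: %r" % text)
    return int(match.group(1))


def read_zone(path="/proc/acpi/thermal_zone/TZ1/temperature"):
    """Read and parse one thermal zone file (the path is taken from this report)."""
    with open(path) as zone_file:
        return parse_temperature(zone_file.read())
```

`read_zone()` only works on a machine that actually exposes that /proc file; `parse_temperature` can be tried on any string.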
please open a new bug then, and append output of dmesg and acpidump.
I think some good news is welcome, given the length of this "thread": it appears the patch from #171 fixed my "APIC error on CPU0: 40(40)" bug/feature. So there, something good did come out of it :) Ok, I'll open a new bug.
I didn't want to rush into things, so I made a lil' test: I powered off my laptop for an hour or so, let the things cool and the hardware come to its senses. When I booted back and did an "emerge --update --deep world" (which needless to say, is as hardcore as it looks), my fans started to work as expected (slow at 60+ degrees, full power at 80+ degrees). Don't know what to say. I guess the patch does work. I don't see it in 2.6.19 so I hope it will reach mainline eventually. Thanks for all the help, Mike.
Created attachment 9733 [details] Big comment and removal of void * casts Andrew Morton asked to make a description descriptive and drop all void * casts.
Created attachment 9746 [details] Change from sys_sched_yield() to cond_resched() Switching from sys_sched_yield() to cond_resched(). Please test.
The patch from #180 with the F.04 BIOS works fine; the thermal status is now updated immediately. I hope it gets into 2.6.20. Thanks for your good work.
The patch from comment 180 fixes the problem for me (F.0E BIOS). There is a small 'but', however. On a patched 2.6.19 kernel and on the Ubuntu 2.6.17 kernel (with one of the earlier patches applied) there is no hysteresis in fan behavior. The fans are turned on and off at 58 degrees, which makes them spin up and down very often. Appending the kernel options noapic and nolapic (as suggested on one forum) fixes the problem and the fans behave as expected, turning on at 58 and turning off at 50. This problem could be related. Does anyone have a similar problem, or is it just my kernel config (and Ubuntu)?
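To make the hysteresis point concrete, here is a small sketch using the two thresholds mentioned above (on at 58, off at 50). The function name and the exact comparisons at the boundaries (`>=` and `<=`) are assumptions for illustration; it models the expected behaviour only, not what the kernel or the DSDT actually does.

```python
def fan_states(temperatures, on_at=58, off_at=50, fan_on=False):
    """Return the fan state after each reading, with simple hysteresis.

    The fan switches on at or above `on_at` and switches off only at or
    below `off_at`; between the two thresholds it keeps its last state.
    """
    states = []
    for temperature in temperatures:
        if temperature >= on_at:
            fan_on = True
        elif temperature <= off_at:
            fan_on = False
        states.append(fan_on)
    return states
```

For readings that hover around the trip point, such as 57, 58, 57, 58, the fan switches on once and stays on. Calling the same function with `off_at=58` mimics the missing hysteresis, and the fan toggles on every reading.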
The patch from comment 180 was tested on an nx6325 box (BIOS F.02). Works fine. There is even a small drop in temperature at TZ1 (circa 2-3 C) when compared to the previous patch set. This might suggest a lower CPU workload. Thermal hysteresis functions as well.
I've tested the patch #180 (with a vanilla 2.6.19.1) on a brand new nx6325 (BIOS version F.04) and have something new: my fans don't start at all. I see the temperature rise, but nothing happens. Manually trying to switch them on returns no error (and no notice at all in the kernel log), but they stay off. Any ideas?
Sounds like you did not apply the patch. Also, take a look at 7122, it is related.
Created attachment 9947 [details] Update patch for 2.6.20
I've used the patch from Comment #179 for quite a long time with SUSE 10.1, but after I upgraded to OpenSUSE 10.2, it's no longer needed. It's even harmful, since with this patch applied acpid permanently takes about 20% of CPU time on my box (HPC nx6325).
Is "polling" mode enabled in 10.2? What is the value?
First, I really meant the patch from Comment #180 (attachment #9746). Sorry for the confusion. Secondly, I've found this in /etc/sysconfig/powersave/common:
## Path: System/Powermanagement/Powersave/General
## Type: integer(1:10000)
## Default: "333"
#
# The powersave daemon watches the usage of your CPU
# and other hardware concerning power consumption
# Please set the time in milliseconds for what the
# daemon should sleep before checking your system again
# Good values are about 200-1000(milliseconds)
#
POLLING_INTERVAL=""
so I guess the answer is 'yes', sort of.
Cool, so the system polls all the thermal zones every tick and it consumes less than 20% of CPU? Great results, I should say... Regarding your 179/180 typo -- do you see the difference between the two?
> Cool, so the system polls all the thermal zones every tick and it consumes less
> than 20% of CPU? Great results, I should say...
It's really below 1% on average and it's consistent with what gkrellm is showing. I don't know what powersaved is polling though, because it doesn't even show up in top's output.
> Regarding your 179/180 typo -- do you see the difference between the two?
Not quite. Still, I switched to the Comment #180 version soon after it had appeared, so I can't really say.
What happens if you kill powersaved or disable polling?
Without the patch nothing really happens after I stop powersaved. To test it with the patch applied I'd have to recompile the kernel. I can do this tomorrow.
comment #189: this polling interval is how often powersaved checks for CPU load if userspace governor is activated, default is ondemand governor. What you are searching for is: THERMAL_POLLING_FREQUENCY="" in /etc/sysconfig/powersave/thermal The default value is too low ("2") on 10.2, setting it to 10 or 20 should be ok. I still wonder why you get higher loads on acpid/powersaved/acpid_notify processes with the patch applied...
comment #187:
> I've used the patch from Comment #179 for quite a long time with SUSE 10.1,
> but after I've upgraded to OpenSUSE 10.2, it's no longer needed. It's even
> harmful, since with this patch applied acpid takes about 20% of CPU time on
> my box (HPC nx6325) permanently.
Do not make me nervous..., latest 10.2 has the patch applied (from comment #180), I just rechecked and I do not see any issues.
I have an nx6125 with the latest BIOS F.11. I've tried two different kernels: 1. The latest Ubuntu kernel 2.6.20-6.11, where the only patch related to this problem is the patch from comment #186. 2. 2.6.20-rc5 with all the patches from http://www.sisk.pl/kernel/patches/2.6.20-rc5/ With both of them, I get the same behaviour: Usually (but not always) the fan starts running at a low speed when booting and never turns off. If I run something cpu-intensive that brings the temperature up above the first trip point (57 degrees) and let it cool down again, the fan *does* turn off and operates normally afterwards. Thermal polling is disabled by default. Enabling thermal polling does not affect this problem. If I echo the value 3 into /proc/acpi/fan/*/state, the fan does turn off but after that, it is no longer automatically controlled and I can only start it again by echoing the value 0 into one of the /proc/acpi/fan/*/state files. I'm not alone with this problem, see https://launchpad.net/ubuntu/+source/linux-source-2.6.20/+bug/75398/comments/13 I'm attaching my DSDT.
Sorry, my statement in Comment #187 was actually premature. The patch from Comment #180 _is_ needed to make thermal management work after a fresh boot. I was confused by side-effects of some initialization problems (apparently, my box sometimes is not initialized properly, eg. the keyboard doesn't work after a power on) and there are memory effects that "survive" reboots and powering off. I guess the only way to get rid of them is to remove the battery. Referring to Comment #194: > The default value is too low ("2") on 10.2, setting it to 10 or 20 should be ok. I'll try. > I still wonder why you get higher loads on acpid/powersaved/acpid_notify processes with the patch applied... I've got them for a couple of times without the patch too. It seems to depend on how the box was initialized and what survived from the previous run. Sigh.
Created attachment 10320 [details] Gzipped DSDT for nx6125 BIOS F.11
I got a report that on an nx6125 the machine hangs for some seconds after some minutes (keyboard, mouse, nothing responds). This is with SLE10-SP1 (2.6.16-based kernel with the patch from comment #180 applied). It still could be something else, maybe it's not even ACPI, but it sounds related (possibly when battery and thermal info is accessed at the same time or similar?). I will try to have a look at that machine tomorrow. I wonder whether anyone else has experienced similar problems.
I've running your patched kernel from the suse ftp server (2.6.18.?-35 I think) since several days on a nx6125 without having troubles. I didn't test suspend but at least fan is working correctly except when booting with connected power supply. Then it is usually required to generate one acpi event by unplugging and plugging power again. Afterwards fan slows down and is regulated as expected. Markus
Sorry for polluting this bugzilla with html :-( Mea culpa!
No problem. Markus Walser: I've running your patched kernel from the suse ftp server (2.6.18.?-35 I think) since several days on a nx6125 without having troubles. I didn't test suspend but at least fan is working correctly except when booting with connected power supply. Then it is usually required to generate one acpi event by unplugging and plugging power again. Afterwards fan slows down and is regulated as expected. Such reports really help a lot! I'll try to reproduce; I expect one of Rafael's patches could help here. I'll give it a try as soon as I have the time... Any similar reports, especially with experiences with (ideally single) or several patches from Rafael or bug #7122 or others, and what effects they show, are very welcome. The hangs described in comment #198 seem to come from a recent C-state patch; I get C2 unsupported, but C3 as a valid state. This is a SUSE-only SLE10-SP1 Beta problem. Summary: Patch from comment #180 works fine for *a lot* of people and received huge testing.
Should kernel 2.6.20 already have all the patches? I'm asking because I just upgraded to that final version of kernel and it still happens that things like battery status stop getting updated from time to time on my HP Compaq nx6125 until I run "cat /proc/acpi/thermal_zone/TZ1/temperature" at which point kacpid starts using 100% of CPU and when it finishes the status is updated again.
No, they are not present in 2.6.20
Created attachment 10429 [details] synchronous execution of notify requires modifications to mutex, in the next patch
Created attachment 10430 [details] fix mutex reentrancy for method Creation of second thread for Notify execution is considered no-no, so here is an attempt to execute notify in-place. Please test.
patches in comment #205 and comment #204 applied to acpi-test.
Hi guys, When I try to apply the latest acpi-test (20070126-2.6.20), I get errors reporting that drivers/acpi/bay.c and drivers/misc/asus-laptop.c are missing. Sure enough, they're not there. If I apply the rc6 and/or rc7 patch first, it's fine, those files are created, but I get another failure on drivers/usb/misc/appledisplay.c. Is there a special order to these? I get the same errors on 2.6.20.1 sources. Anyway, just to give some more input on the nx6125. I have the Turion ML-34 with 2GB memory and F11 bios. I'm running Debian Etch i386 (20070214 snapshot) with the k7 kernel. With this kernel, the fourth trip point (80 centigrade) seems to make the first trip point (50 centigrade) fan active:
theluggage:~# acpi -t
Battery 1: charged, 96%
Thermal 1: active[3], 82.0 degrees C
Thermal 2: ok, 64.0 degrees C
Thermal 3: ok, 32.0 degrees C
Thermal 4: ok, 50.0 degrees C
and:
critical (S5): 95 C
passive: 88 C: tc1=1 tc2=2 tsp=100 devices=0xc1fde338
active[0]: 80 C: devices=0xc1fe46f8
active[1]: 75 C: devices=0xc1fe46a8
active[2]: 65 C: devices=0xc1fe4658
active[3]: 50 C: devices=0xc1fe4608
Hope this helps - let me know if I need to post any more info.
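A short sketch for checking a trip point listing like the one above against a temperature reading. It assumes the `active[N]: T C` layout shown in this comment; the function names are made up for illustration. It only lists which active trip points a temperature has reached and makes no claim about which index the kernel should report.

```python
import re


def parse_active_trips(text):
    """Return {index: temperature} for every 'active[N]: T C' entry."""
    return {
        int(index): int(temperature)
        for index, temperature in re.findall(r"active\[(\d+)\]:\s*(\d+)\s*C", text)
    }


def exceeded_trips(trips, temperature):
    """Return the sorted indices of the trip points at or below `temperature`."""
    return sorted(index for index, trip in trips.items() if temperature >= trip)
```

With the values from this comment, a reading of 82 C reaches all four active trip points (indices 0 to 3), while 60 C reaches only active[3].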
I think (not sure) that the 2 patches above are in kernel's git repository: http://www.kernel.org/pub/linux/kernel/v2.6/snapshots/patch-2.6.20-git15.log Commits: 5f7748cf91558a5026ded5be93c5bf6c1ac34edf c0d127b56937c3e72c2b1819161d2f6718eee877 The comments of these commits mention this bug. In order to fix this bug, besides these 2 patches, for nx6325, is there anything else needed?
should be enough
patches in comment #205 and comment #204 shipped in linux-2.6.20-rc1 Closed.
er, linux-2.6.21-rc1, that is.