Bug 5534 - No thermal events until acpi -t - HP nx6125
Status: CLOSED CODE_FIX
Product: ACPI
Classification: Unclassified
Component: Power-Fan
Hardware: i386 Linux
Importance: P2 high
Assigned To: Alexey Starikovskiy
Depends on:
Blocks: 5962
Reported: 2005-11-02 03:00 UTC by Richard Mace
Modified: 2007-02-20 22:38 UTC
CC List: 37 users

See Also:
Kernel Version: 2.6.14,2.6.13.4,2.6.12-1-amd64 (debian)
Tree: Mainline
Regression: ---


Attachments
acpidump (136.45 KB, text/plain), 2005-11-10 22:47 UTC, Richard Mace
dmesg (kernel 2.6.14.1) (17.20 KB, text/plain), 2005-11-10 22:50 UTC, Richard Mace
dmidecode (hp nx6125) (6.64 KB, text/plain), 2005-11-10 22:55 UTC, Richard Mace
lspci -v (hp nx6125) (7.24 KB, text/plain), 2005-11-10 22:56 UTC, Richard Mace
cat /proc/interrupts (hp nx6125) (575 bytes, text/plain), 2005-11-10 23:01 UTC, Richard Mace
.config (kernel 2.6.14.1) (56.76 KB, text/plain), 2005-11-21 11:49 UTC, Richard Mace
cat /proc/acpi/thermal_zone/TZ1/* (474 bytes, text/plain), 2005-11-21 11:55 UTC, Richard Mace
cat /proc/acpi/thermal_zone/TZ2/* (233 bytes, text/plain), 2005-11-21 11:57 UTC, Richard Mace
cat /proc/acpi/thermal_zone/TZ3/* (233 bytes, text/plain), 2005-11-21 11:58 UTC, Richard Mace
polling freq selection patch (1.04 KB, patch), 2005-11-22 07:02 UTC, Konstantin Karasyov
acpidump (136.45 KB, text/plain), 2005-11-22 13:14 UTC, Carsten Tschense
dmesg (12.86 KB, text/plain), 2005-11-22 13:15 UTC, Carsten Tschense
/var/log/messages (231.13 KB, text/plain), 2005-11-22 13:25 UTC, Carsten Tschense
DSDT debug patch (7.61 KB, patch), 2005-11-24 10:36 UTC, Konstantin Karasyov
polling freq selection patch (with debug) (1.45 KB, patch), 2005-11-24 10:41 UTC, Konstantin Karasyov
patch to avoid redundant thermal notifications (1.82 KB, patch), 2005-12-14 07:10 UTC, Konstantin Karasyov
kernel config (where fan still don't work) (32.45 KB, text/plain), 2006-03-25 00:39 UTC, Maciej Paszta
Peter Wainwright's kernel configuration (180 bytes, text/html), 2006-03-25 02:31 UTC, Peter Wainwright
Peter Wainwright's ACPI event tracking patch (186 bytes, patch), 2006-03-25 02:33 UTC, Peter Wainwright
Peter Wainwright's DSDT patch (186 bytes, patch), 2006-03-25 02:44 UTC, Peter Wainwright
Peter Wainwright's DSDT patch (7.61 KB, patch), 2006-03-25 02:51 UTC, Peter Wainwright
config (2.6.16) (55.47 KB, text/plain), 2006-03-25 09:49 UTC, Richard Mace
Patch for multi-threaded execution of control methods (6.97 KB, patch), 2006-04-08 03:04 UTC, Peter Wainwright
Patch for multi-threaded execution of control methods (version 2) (8.32 KB, patch), 2006-04-27 12:32 UTC, Peter Wainwright
patch to execute Notify handlers on a new thread (3.14 KB, patch), 2006-05-04 08:58 UTC, Alexey Starikovskiy
Updated patch with feedback from Andy Kleen (4.19 KB, patch), 2006-05-04 11:09 UTC, Alexey Starikovskiy
Reworked patch with feedback from Robert Moore (8.34 KB, patch), 2006-05-06 04:56 UTC, Alexey Starikovskiy
Hopefully fixed Alexey's 3rd patch (7.68 KB, patch), 2006-05-10 04:13 UTC, Konstantin Karasyov
DSDT for F.0E nx6125 bios with _ON method patch (26.53 KB, patch), 2006-05-28 07:48 UTC, Jim Needham
Attempt to fix suspend-resume (7.68 KB, patch), 2006-06-21 06:18 UTC, Alexey Starikovskiy
Limit number of concurrent threads (2.58 KB, patch), 2006-07-22 10:18 UTC, Alexey Starikovskiy
Debugging patch for _L19 and _TMP in DSDT (1.22 KB, patch), 2006-08-06 09:25 UTC, Peter Wainwright
don't defer release of global lock (863 bytes, patch), 2006-09-06 06:52 UTC, Alexey Starikovskiy
don't defer release of global lock (863 bytes, patch), 2006-09-06 06:54 UTC, Alexey Starikovskiy
create another workqueue for notify() execution (2.33 KB, patch), 2006-09-06 07:05 UTC, Alexey Starikovskiy
Fix patch to work with Linus' Compaq n620c (3.59 KB, patch), 2006-11-27 08:32 UTC, Alexey Starikovskiy
Big comment and removal of void * casts (5.86 KB, patch), 2006-12-04 09:20 UTC, Alexey Starikovskiy
Change from sys_sched_yield() to cond_resched() (5.63 KB, patch), 2006-12-06 10:32 UTC, Alexey Starikovskiy
Update patch for 2.6.20 (192 bytes, patch), 2006-12-25 06:43 UTC, Alexey Starikovskiy
Gzipped DSDT for nx6125 BIOS F.11 (27.57 KB, application/x-gzip), 2007-02-06 10:56 UTC, Johan Brannlund
syncronous execution of notify (879 bytes, patch), 2007-02-15 09:11 UTC, Alexey Starikovskiy
fix mutex reentrancy for method (7.97 KB, patch), 2007-02-15 09:14 UTC, Alexey Starikovskiy

Description Richard Mace 2005-11-02 03:00:03 UTC
Most recent kernel where this bug did not occur: none, as far as I can tell

Distribution: Debian (amd64 port)

Hardware Environment: HP nx6125 (AMD Turion ML 34, ATI Radeon express 200M
chipset, onboard ATI X300)

Software Environment: Kernel 2.6.13.4 (with Matthew Garrett's double timer patch
applied), Debian amd64 (testing/unstable), WM = KDE 3.4.2. Also tried 2.6.14
(boot with no_timer_check) with same results

Problem Description: ACPI thermal events rarely get processed, especially under
moderate to high CPU load. This results in *no* or erratic fan use and potential
damage to the machine/electronics. However, if the CPU temperature exceeds a
thermal trip point and then one issues a cat
/proc/acpi/thermal_zone/TZ?/temperature or an acpi -t, then, after a brief
machine pause, the thermal event is processed by the kernel and the fans
respond. This can be observed by stopping acpid and doing a cat
/proc/acpi/event, which gives the most graphic evidence. A further and more
detailed description/diagnosis of the problem can be found here ==>
http://lists.debian.org/debian-amd64/2005/10/msg01002.html

Steps to reproduce: With a warm processor < 58 degrees C (less than first
thermal trip point), run glxgears and wait about a minute or so. Your fan will
90% of the time not kick in. Then execute an acpi -t or a cat
/proc/acpi/thermal_zone/TZ?/temperature and almost immediately you will observe
that (i) at least one of your thermal trip points has been exceeded and (ii) as
a response to the cat command, the fans immediately turn on. Visual evidence can
be had by first, before you do anything, stopping acpid and doing a cat
/proc/acpi/event (as root). Then do the above procedure. You will observe no
thermal event register *until* you do the cat or acpi -t.
Comment 1 Alessandro 2005-11-02 03:41:25 UTC
Distribution: Ubuntu Breezy (amd64 port)

Hardware Environment: HP nx6125 (AMD Turion ML 34, ATI Radeon express 200M
chipset, onboard ATI X300)

Software Environment: Kernel 2.6.12-9-amd64-generic (without Matthew Garrett's
double timer patch applied; I had it applied on Breezy Colony 4, but on Colony 5 I can
use my laptop without the patch), Breezy AMD64, WM = Gnome 2.12.

I have the same laptop and the same problem with the CPU fan. It works
properly if I don't run demanding applications on my laptop. The fan starts to cool
around 58
Comment 2 Richard Mace 2005-11-09 05:48:39 UTC
Small update: I have just downloaded and compiled kernel 2.6.14.1 and the bug
is still present in that version.
Comment 3 Len Brown 2005-11-10 19:14:48 UTC
cool failure -- good job isolating it
Comment 4 Luming Yu 2005-11-10 19:17:42 UTC
Please attach dmesg and acpidump output.
Comment 5 Richard Mace 2005-11-10 22:47:20 UTC
Created attachment 6533 [details]
acpidump
Comment 6 Richard Mace 2005-11-10 22:50:38 UTC
Created attachment 6534 [details]
dmesg (kernel 2.6.14.1)
Comment 7 Richard Mace 2005-11-10 22:55:14 UTC
Created attachment 6535 [details]
dmidecode (hp nx6125)
Comment 8 Richard Mace 2005-11-10 22:56:36 UTC
Created attachment 6536 [details]
lspci -v (hp nx6125)
Comment 9 Richard Mace 2005-11-10 23:01:14 UTC
Created attachment 6537 [details]
cat /proc/interrupts (hp nx6125)
Comment 10 Vladimir Lebedev 2005-11-21 09:45:59 UTC
Please send the output from 'cat /proc/acpi/thermal_zone/TZ*/*' command and 
the '.config' file from your kernel.
Comment 11 Richard Mace 2005-11-21 11:49:37 UTC
Created attachment 6635 [details]
.config (kernel 2.6.14.1)
Comment 12 Richard Mace 2005-11-21 11:55:16 UTC
Created attachment 6636 [details]
cat /proc/acpi/thermal_zone/TZ1/*

Note: I currently have "fan always on when A/C plugged in" set in the bios.
This sets the first thermal trip point to 16 C. Without this bios setting the
first thermal trip is set at 58 C and when the CPU temp reaches this value it
drops to 50 C.
Comment 13 Richard Mace 2005-11-21 11:57:19 UTC
Created attachment 6637 [details]
cat /proc/acpi/thermal_zone/TZ2/*

See comments for cat /proc/acpi/thermal_zone/TZ1/*
Comment 14 Richard Mace 2005-11-21 11:58:39 UTC
Created attachment 6638 [details]
cat /proc/acpi/thermal_zone/TZ3/*

See comments for cat /proc/acpi/thermal_zone/TZ1/*
Comment 15 Konstantin Karasyov 2005-11-22 07:02:07 UTC
Created attachment 6644 [details]
polling freq selection patch

There is no asynchronous notification for thermal events on your system, so the
system should poll its state at a fixed polling frequency. That frequency can be
either the '_TZP' value from the DSDT or an OS-provided value. If '_TZP' evaluates
to zero then polling is disabled. There was a bug which disabled polling if no
'_TZP' was provided by the DSDT. The attached patch should fix this issue.
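
The selection rule described in this comment can be sketched roughly as follows. This is an illustration in Python, not the attached kernel patch: the function name `select_polling_seconds` and the `os_default_seconds` parameter are hypothetical, `None` is used here to stand for a DSDT with no '_TZP' object, and the sketch assumes '_TZP' is expressed in tenths of a second, as the ACPI specification defines it.

```python
from typing import Optional


def select_polling_seconds(tzp_tenths: Optional[int],
                           os_default_seconds: float) -> Optional[float]:
    """Return the polling period in seconds, or None if polling is disabled.

    tzp_tenths is the value _TZP evaluates to (tenths of a second), or None
    when the DSDT provides no _TZP at all (hypothetical encoding).
    """
    if tzp_tenths is None:
        # No _TZP in the DSDT: fall back to the OS-provided value rather
        # than disabling polling (the buggy behaviour described above).
        return os_default_seconds
    if tzp_tenths == 0:
        # _TZP explicitly evaluates to zero: polling is disabled.
        return None
    return tzp_tenths / 10.0
```

The point of the sketch is only that "no '_TZP'" and "'_TZP' is zero" are two different cases and should not be handled by the same branch.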
Comment 16 Carsten Tschense 2005-11-22 07:25:52 UTC
Distribution: Gentoo

Hardware: HP nx6125 (AMD Sempron)

Kernel: 2.6.14 with Gentoo's patchset r2

Same problem here, but I don't have the 64-bit version of the laptop.
Note that also the battery state doesn't update after the cpu temperature
reaches the 58
Comment 17 Carsten Tschense 2005-11-22 07:39:14 UTC
Thanks Konstantin, the patch works for me! :)
Comment 18 Vladimir Lebedev 2005-11-22 08:33:51 UTC
Did the battery behavior also change?

Could you please attach the /var/log/messages, /var/log/boot.msg, output 
from 'cat /proc/acpi/ac_adapter/*/*', 'cat /proc/acpi/battery/*/*', dmesg and 
acpidump, available in pmtools here: 
http://ftp.kernel.org/pub/linux/kernel/people/lenb/acpi/utils/

Also, could you explain the battery behavior more clearly, i.e. when it's being
updated, in what state it is, etc.?
Comment 19 Richard Mace 2005-11-22 11:35:07 UTC
Konstantin, many thanks for the patch. I am compiling the kernel right now with
the patch applied and will let you know what I find. However, I am surprised
that you say that the DSDT provides no asynchronous method of notification of
thermal events, because right now during the kernel compilation my second trip
point (65 C) was exceeded and the second fan came on (or first fan sped up?). I
was just sitting here watching the compilation and the fan suddenly blew harder.
I quickly did an acpi -t to see what the temp was and it was 63 C -- I guess it
had just come down from 65 C--the second trip point. So, as my earlier
experiments seem to show... sometimes it works and sometimes it doesn't. (Right
now I'm on A/C and I've set bios to have the fan on when on A/C. My first trip
point is 16 C and my second is at 65 C. The 65 C trip was exceeded during the
compilation and the fan turned up a notch. Here's the output of my current acpi
-t      
     Battery 1: charged, 94%
     Thermal 1: active[2], 63.0 degrees C
     Thermal 2: ok, 54.0 degrees C
     Thermal 3: ok, 22.0 degrees C
) I want to stress again: this occurred *without* the patch applied and using
the unpatched 2.6.14.1 kernel (so, it seems that there are asynchronous
notifications... sometimes). As I'm typing this... fan just turned down a notch
without any intervention (no acpi -t to trigger it). The second trip point had
been re-set to 59 C when the 65 C was exceeded. My current acpi -t:
     Battery 1: charged, 94%
     Thermal 1: active[3], 59.0 degrees C
     Thermal 2: ok, 53.0 degrees C
     Thermal 3: ok, 22.0 degrees C
Yep, now I'm certain--there are asynchronous notifications...
Comment 20 Richard Mace 2005-11-22 12:09:36 UTC
Dear Konstantin,

Results of patch test... (failure, unfortunately)

Kernel 2.6.14.1 with polling patch applied. Unplugged A/C (on battery), so first
thermal trip is at 58 C, 2nd at 65 C. I ran glxgears for a few minutes and the
fans didn't come on. So I issued an acpi -t and here are the results:
richm@dilbert:~$ acpi -t
     Battery 1: discharging, 87%, 02:02:57 remaining
     Thermal 1: ok, 68.0 degrees C
     Thermal 2: ok, 51.0 degrees C
     Thermal 3: ok, 28.0 degrees C

Notice that the state of Thermal 1 is registered as ok even though *two* thermal
trip points have been exceeded (58 and 65 C). Otherwise the influence of the
patch can be seen now with  

$ cat /proc/acpi/thermal_zone/TZ?/*
<setting not supported>
cooling mode:   active
polling frequency:       5 seconds
state:                   ok
temperature:             53 C
critical (S5):           95 C
passive:                 88 C: tc1=1 tc2=2 tsp=100 devices=0xffff810037fc8dc0
active[0]:               80 C: devices=0xffff810037f93f00
active[1]:               75 C: devices=0xffff810037f93dc0
active[2]:               65 C: devices=0xffff810037f93cc0
active[3]:               58 C: devices=0xffff810037f93bc0
<setting not supported>
cooling mode:   passive
polling frequency:       5 seconds
state:                   ok
temperature:             51 C
critical (S5):           100 C
passive:                 90 C: tc1=1 tc2=2 tsp=300 devices=0xffff810037fc8dc0
<setting not supported>
cooling mode:   passive
polling frequency:       5 seconds
state:                   ok
temperature:             27 C
critical (S5):           100 C
passive:                 60 C: tc1=1 tc2=2 tsp=300 devices=0xffff810037fc8dc0

The polling frequency is set to 5 seconds as it defaults to with the new patch.
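
To make the mismatch between the reported state and the trip points easier to check, here is a minimal Python sketch that parses the trip-point lines of the /proc output above and lists the ones a given temperature has reached. The function names and the `TZ1_SAMPLE` constant are illustrative only, and the parser assumes the exact line layout shown in this comment.

```python
import re
from typing import Dict, List, Tuple

# Matches lines such as "active[2]:   65 C: devices=..." or "critical (S5):  95 C"
TRIP_RE = re.compile(r"^(critical \(S5\)|passive|active\[\d+\]):\s+(\d+) C")

# Trip-point lines for TZ1, copied from the output in this comment.
TZ1_SAMPLE = """\
critical (S5):           95 C
passive:                 88 C: tc1=1 tc2=2 tsp=100 devices=0xffff810037fc8dc0
active[0]:               80 C: devices=0xffff810037f93f00
active[1]:               75 C: devices=0xffff810037f93dc0
active[2]:               65 C: devices=0xffff810037f93cc0
active[3]:               58 C: devices=0xffff810037f93bc0
"""


def parse_trip_points(zone_text: str) -> Dict[str, int]:
    """Map each trip-point name to its threshold in degrees C."""
    trips = {}
    for line in zone_text.splitlines():
        match = TRIP_RE.match(line.strip())
        if match:
            trips[match.group(1)] = int(match.group(2))
    return trips


def reached_trip_points(trips: Dict[str, int],
                        temperature_c: int) -> List[Tuple[str, int]]:
    """Return trip points at or below the temperature, lowest threshold first."""
    hit = [(name, limit) for name, limit in trips.items()
           if temperature_c >= limit]
    return sorted(hit, key=lambda item: item[1])
```

With the 68 C reading from `acpi -t` above, `reached_trip_points(parse_trip_points(TZ1_SAMPLE), 68)` returns `active[3]` (58 C) and `active[2]` (65 C), which is why a state of "ok" looks wrong.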

However, please see my earlier post that I wrote during the kernel compile
(while on A/C). It seems that asynchronous notifications events were being
generated and correctly interpreted.... (much to my surprise!)

Richard
Comment 21 Richard Mace 2005-11-22 12:42:49 UTC
Further test: Just booted back into my unpatched 2.6.14.1 kernel. Ran glxgears
and exceeded a trip point. Fans didn't come on--no surprise. Did an acpi -t to
trigger fans into action and waited to see if the fans would turn off
automatically when the temp dropped below the re-set trip point (50 C). After a
while, and with no intervention (didn't do any acpi -t), fans stopped. After
fans stopped, quickly did an acpi -t and the temp was 50 C. So, as I've
suggested elsewhere, this behaviour is asymmetric. The fan is less likely to
turn on when a trip point is exceeded, but more likely to turn off when the CPU
temp drops below a trip point. 

With the patch applied, the fans only responded to trip points (turning fans on
and off) when I issued an acpi -t, so the behaviour is worse in the case with
the patch applied. Hope that this helps...

Richard
Comment 22 Carsten Tschense 2005-11-22 13:13:42 UTC
Sorry for my early response, of course it doesn't work ;)

Well, the battery state:
It stops updating after the temperature of TZ1 reaches the first trip point, and
updates again after a "cat /proc/acpi/thermal_zone/TZ1/temperature"
I'm not sure about the point it stops updating (but I don't have a counter-example)

/proc/acpi/ac_adapter/C173/state:
state:                   off-line
Comment 23 Carsten Tschense 2005-11-22 13:14:43 UTC
Created attachment 6645 [details]
acpidump
Comment 24 Carsten Tschense 2005-11-22 13:15:28 UTC
Created attachment 6646 [details]
dmesg
Comment 25 Carsten Tschense 2005-11-22 13:25:30 UTC
Created attachment 6647 [details]
/var/log/messages

Covers the whole day; the earlier sessions were without the patch.
Comment 26 Richard Mace 2005-11-22 23:11:49 UTC
COMMENTS ON BATTERY ISSUE: Carsten recently brought up the issue of the battery
and I'm worried that this could sideline the more important issue of the thermal
events (notifications). 

My interpretation of the battery issue is the following (I am no expert on ACPI,
so I stand to be corrected): When the ACPI subsystem generates a thermal
notification it appears not to be processed by the kernel. It appears (and I'm
guessing here) that it "blocks" the ACPI subsystem (perhaps a queue?). When you
thereafter issue a cat /proc/acpi/thermal_zone/TZ1/temperature then there is a
perceptible pause, like you're bringing the ACPI subsystem out of a "hung
state". After this pause, the thermal event is processed (unblocking the event
queue?) and the battery events are once again processed. The fact that, as
Carsten says, it updates again after a cat
/proc/acpi/thermal_zone/TZ1/temperature seems to strongly support this claim.

Anyway, my naive two cents worth. I hope that the kind folk at Intel who are
spending their time trying to fix this will take note. I believe that if we fix
the thermal event notifications, the battery notifications will automatically work.
Comment 27 Carsten Tschense 2005-11-23 02:20:47 UTC
Yesterday (Germany) I noticed that my fans didn't run, even if I read the
temperature from thermal_zone/TZ1/temperature or did an acpi -t. The laptop had
already been running for about 5 hours.
Strange...
First everything works as expected, after a while you have to activate the fans
with acpi -t, now nothing works... :(

Thermal 1: active[3], 61.0 degrees C

The battery state did update.

I think it's time to describe my wlan problem...
Don't know if any of you had the same issue, but the wlan-chip stops working
after 15 minutes without sending any packets, I made a cronjob that sends a ping
every 10 minutes (strange solution, but it works ^^)... Without doing so, you
can't send anything. iwconfig shows that the card is not associated,
reassociating doesn't work. Now you can reload the ndiswrapper module a few
times and nothing changes and you don't get any errors. After a while every time
you load ndiswrapper you get
ACPI: PCI interrupt for device 0000:02:02.0 disabled
after a second. The module gets unloaded. Looks like some kind of powersaving ^^
Pressing the wlan-button while everything is still working disables the card
(without any messages, except a button-event). To reactivate you have to press
the button again and reload ndiswrapper, and it works as expected.
Yesterday there was the same problem again, you couldn't reload the module, the
card stopped working after the fans stopped working.
Comment 28 Carsten Tschense 2005-11-23 02:38:02 UTC
Something to add: the system didn't use 100% cpu after reading the temperature
yesterday as the fans stopped working completely.
And I'm booting with ec_burst=1

Aaah, just saw a new behavior: I let the fans running for a while, then I wanted
to stop them with another acpi -t, but that told me that the cpu had 74
Comment 29 Alexey Starikovskiy 2005-11-23 07:16:22 UTC
Carsten,
It is really hard to work on bugs if more than one issue is reported, so
could you please open new bugs for issues you have and keep this bug pure ACPI
thermal related?
Comment 30 Konstantin Karasyov 2005-11-24 10:33:22 UTC
There were some issues reported for booting with ec_burst=1 and using 
spinlocks. Here is the patch which replaces spinlocks with semaphores:
https://sourceforge.net/project/showfiles.php?group_id=129330

One could try it to see whether the same behaviour occurs. Note also that this
patch is described as a temporary solution.
Comment 31 Konstantin Karasyov 2005-11-24 10:36:19 UTC
Created attachment 6677 [details]
DSDT debug patch

Here is the DSDT patch for the DSDT table, available here:
http://bugzilla.kernel.org/attachment.cgi?id=6533

It adds debug prints from _ACx, _ALx, Notify (... , 0x80/0x81) methods for
thermal zone TZ1.

It would be interesting to observe system's behavior under different
circumstances: with polling turned on/off, with ec_burst=1/0, using
spinlocks/semaphores.
Comment 32 Konstantin Karasyov 2005-11-24 10:41:32 UTC
Created attachment 6678 [details]
polling freq selection patch (with debug)

Another patch for the same purpose - it adds event notification debug info to
the thermal subsystem.
Comment 33 Richard Mace 2005-11-26 12:47:28 UTC
First results with the latest patch (attachment 6678) applied. 
(Machine on battery: trips at 58, 65, etc deg C)

I ran tail -f /var/log/syslog. Then I started glxgears and watched. The
appearance of thermal events in the syslog did not seem to correlate with fan
usage. I don't like polling because it results in worse behaviour than the
asynchronous mode, so  I echoed 0 into each of
/proc/acpi/thermal_zone/TZ?/polling_frequency, to disable polling and ran the
test again. This time didn't observe thermal events until I did an acpi -t (by
which time my CPU was frying -- 71 deg C!). Stopped glxgears and watched. Saw
thermal event register in the logs and at the same time fan turned off. 
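
For anyone repeating this test, the "echo 0 into each polling_frequency" step can be sketched in Python as below. The helper name `set_polling_frequency` and its `base` parameter are hypothetical; the sketch assumes the /proc/acpi/thermal_zone layout used in this report and must run as root on a real system.

```python
import glob
import os
from typing import List


def set_polling_frequency(seconds: int,
                          base: str = "/proc/acpi/thermal_zone") -> List[str]:
    """Write the polling period to every TZ?/polling_frequency under base.

    Writing 0 disables polling, as done in the test above.
    Returns the list of files that were written.
    """
    written = []
    pattern = os.path.join(base, "TZ?", "polling_frequency")
    for path in sorted(glob.glob(pattern)):
        with open(path, "w") as handle:
            handle.write(f"{seconds}\n")
        written.append(path)
    return written
```

Calling `set_polling_frequency(0)` corresponds to the manual step described here; passing a different `base` makes it possible to try the helper against a scratch directory first.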

Anyway, here's the part of my syslog showing this state of affairs. 

--------------------------begin---------------------------------------------
Nov 26 22:27:24 localhost kernel: APIC error on CPU0: 00(40)
Nov 26 22:27:52 localhost kernel: APIC error on CPU0: 40(40)
Nov 26 22:28:20 localhost kernel: ------------------  Got thermal event 0x81
Nov 26 22:28:20 localhost kernel: ------------------  Got thermal event 0x81
Nov 26 22:29:28 localhost kernel: ------------------  Got thermal event 0x80
Nov 26 22:29:28 localhost last message repeated 156 times
Nov 26 22:29:28 localhost kernel: ------------------  Got thermal event 0x81
Nov 26 22:29:52 localhost kernel: ------------------  Got thermal event 0x80
Nov 26 22:29:52 localhost last message repeated 49 times
Nov 26 22:29:52 localhost kernel: ------------------  Got thermal event 0x81
Nov 26 22:30:42 localhost kernel: APIC error on CPU0: 40(40)
Nov 26 22:31:45 localhost kernel: ------------------  Got thermal event 0x80
Nov 26 22:31:45 localhost last message repeated 399 times
Nov 26 22:31:45 localhost kernel: ------------------  Got thermal event 0x81
Nov 26 22:32:49 localhost kernel: ------------------  Got thermal event 0x80
Nov 26 22:32:49 localhost last message repeated 39 times
Nov 26 22:32:49 localhost kernel: ------------------  Got thermal event 0x81
Nov 26 22:33:11 localhost kernel: ------------------  Got thermal event 0x80
Nov 26 22:33:11 localhost last message repeated 14 times
Nov 26 22:33:11 localhost kernel: ------------------  Got thermal event 0x81
Nov 26 22:36:28 localhost kernel: APIC error on CPU0: 40(40)
Nov 26 22:36:32 localhost kernel: ------------------  Got thermal event 0x80
Nov 26 22:36:32 localhost last message repeated 164 times
Nov 26 22:36:32 localhost kernel: ------------------  Got thermal event 0x81
Nov 26 22:37:05 localhost kernel: ------------------  Got thermal event 0x80
Nov 26 22:37:05 localhost last message repeated 44 times
Nov 26 22:37:05 localhost kernel: ------------------  Got thermal event 0x81
Nov 26 22:37:51 localhost kernel: ------------------  Got thermal event 0x80
Nov 26 22:37:51 localhost last message repeated 8 times
Nov 26 22:37:51 localhost kernel: ------------------  Got thermal event 0x81
Nov 26 22:39:03 localhost kernel: ------------------  Got thermal event 0x80
Nov 26 22:39:03 localhost last message repeated 8 times
Nov 26 22:39:50 localhost last message repeated 5 times
Nov 26 22:39:50 localhost kernel: ------------------  Got thermal event 0x81
Nov 26 22:40:36 localhost kernel: ------------------  Got thermal event 0x80
Nov 26 22:40:36 localhost last message repeated 8 times
Nov 26 22:40:36 localhost kernel: ------------------  Got thermal event 0x81
-------------------end-----------------------------------------------------
The second from last event 0x80 corresponded to the fans turning off, then I got
an 0x80 (x8). It seems like a lot of events are being generated without being
processed. Can anybody understand this?

--
Richard
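
To get a feel for how many notifications arrive in each burst, the syslog excerpt above can be tallied with a short Python sketch. It assumes the message text printed by the debug patch ("Got thermal event 0x..") and syslog's "last message repeated N times" lines; the function name is illustrative. As I understand the ACPI specification, 0x80 on a thermal zone means the temperature status changed and 0x81 means the trip points changed.

```python
import re
from collections import Counter
from typing import Iterable

EVENT_RE = re.compile(r"Got thermal event (0x[0-9a-fA-F]+)")
REPEAT_RE = re.compile(r"last message repeated (\d+) times")


def count_thermal_events(lines: Iterable[str]) -> Counter:
    """Count thermal events per notify code, expanding syslog repeat lines."""
    counts = Counter()
    last_event = None
    for line in lines:
        event = EVENT_RE.search(line)
        if event:
            last_event = event.group(1)
            counts[last_event] += 1
            continue
        repeat = REPEAT_RE.search(line)
        if repeat:
            # A repeat line refers to the message just before it; only count
            # it when that message was a thermal event.
            if last_event is not None:
                counts[last_event] += int(repeat.group(1))
            continue
        # Any other message breaks the chain.
        last_event = None
    return counts


SAMPLE = [
    "Nov 26 22:28:20 localhost kernel: ------------------  Got thermal event 0x81",
    "Nov 26 22:28:20 localhost kernel: ------------------  Got thermal event 0x81",
    "Nov 26 22:29:28 localhost kernel: ------------------  Got thermal event 0x80",
    "Nov 26 22:29:28 localhost last message repeated 156 times",
    "Nov 26 22:29:28 localhost kernel: ------------------  Got thermal event 0x81",
]
```

For the five sample lines (taken from the log above), the tally is 157 events with code 0x80 and 3 with code 0x81, all of the 0x80 ones logged within the same second.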
Comment 34 Konstantin Karasyov 2005-11-27 23:13:47 UTC
Please try the spinlock patch from here:
https://sourceforge.net/project/showfiles.php?group_id=129330
Comment 35 Richard Mace 2005-11-28 05:23:08 UTC
Tried the spinlock patch on a vanilla 2.6.14.1 kernel. I patched the kernel with
patch -p1 < acpi-ec-nospinlock-2.6.14.diff, as explained in the README. I did
not use *any other* patches. My boot options were: disable_timer_pin_1 and
ec_burst=0.
I was on battery, so that my trips were at 58C, 65C, etc.

RESULT: Failure. Doing the standard test with glxgears permitted the CPU temp to
rise above 58C without a hint of movement from the fans. Let the CPU cool and
repeated the test. Failure again. Running acpi -t triggers fan response, as before.

Now I think my machine needs a rest from this thermal overload ;-) Anyone else
want to try some patches....?
--
Richard
Comment 36 Richard Mace 2005-11-28 10:53:21 UTC
There are bios upgrades available at HP's website. The latest BIOS revision is
F.09 (see
http://h18007.www1.hp.com/support/files/hpcpqnk/us/locate/64_6170.html). I was
wondering if anyone on the CC has tried this upgrade and whether it still has
the problem. (Perhaps they could report...)

To help make further progress, could I ask those reading this to please check
their BIOS version (f10 on boot) and post it together with their CPU type (AMD
64/Sempron) and whether this thermal bug is present as follows (this is my data)

BIOS: F.07, CPU: Turion ML-34, BUG: present

This should help us determine if the bug is specific to a particular BIOS version.
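
Reports in this one-line format are easy to tally mechanically. The following Python sketch is only an illustration (the function name is hypothetical); the sample lines are my own report plus two replies in the same format posted in this thread.

```python
import re
from collections import Counter
from typing import Iterable

REPORT_RE = re.compile(r"BIOS:\s*([^,\s]+),\s*CPU:\s*(.+?),\s*BUG:\s*(\w+)")


def tally_reports(lines: Iterable[str]) -> Counter:
    """Count reports per (BIOS version, bug status) pair."""
    tally = Counter()
    for line in lines:
        match = REPORT_RE.search(line)
        if match:
            bios, _cpu, status = match.groups()
            tally[(bios, status.lower())] += 1
    return tally


REPORTS = [
    "BIOS: F.07, CPU: Turion ML-34, BUG: present",
    "BIOS: F.09, CPU: Turion ML-34, BUG: present",
    "BIOS: F.07, CPU: Turion ML-40, BUG: present",
]
```

If every BIOS version ends up with only "present" entries, the BIOS revision is unlikely to be the deciding factor.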

Thanks for your cooperation.
--
Richard
Comment 37 dagarlas 2005-11-28 11:26:43 UTC
I use F.09 bios version but the bug is still present
Comment 38 dagarlas 2005-11-28 11:28:08 UTC
Sorry, I forgot:

BIOS: F.09, CPU: Turion ML-34, BUG: present
Comment 39 jfp 2005-11-28 13:31:28 UTC
BIOS: F.07, CPU: Turion ML-40, BUG: present
Comment 40 Alessandro 2005-11-28 13:49:45 UTC
BIOS: F.07, CPU: Turion ML-34, BUG: present
Comment 41 jfp 2005-11-28 20:32:29 UTC
This from the syslog could also be relevant:
Nov 29 10:46:08 localhost kernel: warning: many lost ticks.
Nov 29 10:46:08 localhost kernel: Your time source seems to be instable or some
driver is hogging interupts
Nov 29 10:46:08 localhost kernel: rip acpi_ec_read+0xc4/0xe5
Comment 42 Richard Mace 2005-11-28 22:31:43 UTC
Jfp, yes, I suspect that the timer problem is related to our thermal problem.
This issue has now been fixed with a patch from Andi Kleen (see bug #3927) for
kernel 2.6.15rc2. I haven't tried it yet, but I'm hoping that the patch will be
incorporated into 2.6.15 (has anybody tried this patch?).
--
Richard
Comment 43 Richard Mace 2005-11-30 02:22:07 UTC
Some additional information: I've compiled a good few kernels over the past few
days and in each case my fan usage has been working on a regular basis (while on
A/C). Let me qualify... I'm on A/C, so my trips are at 16 C and 65 C, etc. My
fan is always blowing while on A/C (set in BIOS). During the long compilations
my CPU temp inevitably reaches 65C, but then my fan always turns up a notch
without any intervention (no acpi -t required). So, it seems that in this case
the fans/thermal events are working... Hope that helps.
Comment 44 dagarlas 2005-11-30 02:45:41 UTC
I use 2.6.14, bios is F.09. Sometimes I get no thermal events even with acpi -t
but only unplugging and re-plugging ac connector... It seems this behavior is
random... I'll do more tests...
Comment 45 Peter Juritz 2005-12-06 16:35:19 UTC
Distribution: Slackware 10.2
System: AMD 64 Turion Mobile ML-28 F0.7

Funny behaviour on my box: when I set "Fan Always on while on AC Power", the fan
turns off after booting Linux 2.6.13. I'll confirm the erratic behaviour:
sometimes the events are processed, sometimes not. The AC Power problem is
most concerning.
Has anyone been able to get a system working 100%?
Comment 46 Alessandro 2005-12-12 05:37:51 UTC
Last week I noticed that the acpi -t operation that I scheduled doesn't
work (it runs, but it has no influence on the CPU fan). Let me try to
explain. I cannot work without acpi -t on my laptop because of our fan problem.
I noticed that the temperature of my laptop was stable at 78
Comment 47 Yung-Chin Oei 2005-12-13 03:45:20 UTC
Problem present here as well on ubuntu 5.10 (amd64), kernel 2.6.12-10. I do not
have the BIOS set to have the fan always on, and without that setting, the fan
does sometimes kick in by itself, and sometimes does not.

Today I've tried with kernel 2.6.15-8 and it does not improve matters (rather,
my fan does not run at all anymore, so maybe I'm having more problems now -
nevertheless, I can confirm the observation that the fan does kick in by itself
on low to moderate loads).
Comment 48 Konstantin Karasyov 2005-12-14 07:10:34 UTC
Created attachment 6823 [details]
patch to avoid redundant thermal notifications

The purpose of this patch is to avoid handling redundant notifications.
It allows temperature notifications to be handled no more often than once per second.
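
The idea of handling at most one notification per second can be sketched as a small throttle. This is a Python illustration of the general technique only, not the attached kernel patch; the class name, method name and the explicit `now` argument are all hypothetical.

```python
from typing import Optional


class NotificationThrottle:
    """Accept at most one notification per min_interval_seconds."""

    def __init__(self, min_interval_seconds: float = 1.0) -> None:
        self.min_interval = min_interval_seconds
        self._last_handled: Optional[float] = None

    def should_handle(self, now: float) -> bool:
        """Return True if the notification arriving at time `now` is handled.

        Notifications that arrive less than min_interval after the last
        handled one are treated as redundant and skipped.
        """
        if (self._last_handled is not None
                and now - self._last_handled < self.min_interval):
            return False
        self._last_handled = now
        return True
```

In real use `now` would come from a monotonic clock such as `time.monotonic()`; taking it as a parameter keeps the sketch easy to test. Note that a throttle like this only reduces the handling work; it does not explain why the notifications arrive in bursts in the first place.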
Comment 49 Peter Juritz 2005-12-16 05:50:58 UTC
A temporary fix would of course be to boot the kernel with acpi=off. The BIOS
remains in control of the fan, though it will remain on constantly. (This was
suggested to me by a friend of mine.) I have tried it and it works perfectly.
Comment 50 Maciej Paszta 2005-12-17 03:34:59 UTC
Hi,
My nx6125 (amd64) runs on Gentoo 2005.1 with kernel-2.6.14. bootparams:
disable_timer_pin_1  no_timer_check
In BIOS I have selected fan always on when AC connected. Now I had used cron job
to periodically run acpi -t but my fans didn't work at all (besides the one that
is selected in BIOS). Then i turned off cron and after a while I noticed that my
fan just turned on when I exceeded 65 degrees Celsius. I thought this was random
so I restarted and started playing ET and once again my fans started - without
need to run acpi -t. Though sometimes they don't turn off (but that's the
smallest problem at least my laptop won't get fried). 
Comment 51 dagarlas 2005-12-18 05:07:09 UTC
bios updated to F.0D, nothing changes...
Comment 52 Alessandro 2005-12-18 06:35:21 UTC
> My nx6125 (amd64) runs on Gentoo 2005.1 with kernel-2.6.14. bootparams:
> disable_timer_pin_1  no_timer_check

Hi,

can you tell me what disable_timer_pin_1 on bootup does?

Thanks.
Comment 53 Maciej Paszta 2005-12-18 12:05:20 UTC
Well, I found disable_timer_pin_1 as a solution to double timer speed. I really
don't know what the difference is between no_timer_check and that option, but I
found an example somewhere where both were used simultaneously, so I appended
exactly the same boot options.
Comment 54 Richard Mace 2005-12-25 11:19:02 UTC
Tested patch in #48 with kernel 2.6.15-rc6 using the glxgears test. First with
polling enabled (uggh!), which is the default behaviour of the patch. Then I
echoed 0 into /proc/acpi/thermal_zone/TZ[1-3]/polling_frequency to turn off
polling and re-ran glxgears. In both tests the thermal trip point of 58C was
exceeded without fans turning on. Here's an excerpt from my syslog...
==============================================================================
Dec 25 21:07:54 localhost kernel: ------------------  Got thermal event 0x81
Dec 25 21:07:54 localhost kernel: ------------------  Got thermal event 0x81
Dec 25 21:08:13 localhost kernel: APIC error on CPU0: 00(40)
Dec 25 21:08:49 localhost kernel: ------------------  Got thermal event 0x80
Dec 25 21:08:49 localhost kernel: ------------------  SKIP thermal event 0x80
Dec 25 21:08:49 localhost last message repeated 161 times
Dec 25 21:08:49 localhost kernel: ------------------  Got thermal event 0x81
Dec 25 21:10:11 localhost kernel: ------------------  Got thermal event 0x80
Dec 25 21:10:11 localhost last message repeated 443 times
Dec 25 21:10:11 localhost kernel: ------------------  Got thermal event 0x81
Dec 25 21:11:38 localhost kernel: APIC error on CPU0: 40(40)
Dec 25 21:11:44 localhost kernel: ------------------  Got thermal event 0x80
Dec 25 21:12:10 localhost last message repeated 266 times
Dec 25 21:12:10 localhost kernel: ------------------  Got thermal event 0x81
Dec 25 21:12:33 localhost kernel: ------------------  Got thermal event 0x80
Dec 25 21:12:33 localhost last message repeated 33 times
Dec 25 21:12:33 localhost kernel: ------------------  Got thermal event 0x81
=============================================================================
The line that reads "Dec 25 21:08:49 localhost last message repeated 161 times"
occurred when I issued my first acpi -t (polling enabled).

The line that reads "Dec 25 21:12:10 localhost last message repeated 266 times"
occurred when I issued an acpi -t during the second test (polling disabled).

In both cases the trip point of 58C was exceeded without any response from the
fans (until I issued an acpi -t).

It seems that many thermal notifications aren't being skipped by the patch and
are still being sent in "bursts".
Comment 55 Matthew Garrett 2005-12-29 10:23:48 UTC
Adding debug code to the kernel, it seems that acpi_ev_queue_notify_request
never gets called for thermal events on this machine (using 2.6.15). The
behaviour is identical with and without an enabled apic, so it's nothing to do
with the timer problem. The obvious question now is, why are these thermal
events not getting through?
Comment 56 Yung-Chin Oei 2006-01-10 04:38:33 UTC
although not able to reliably replicate the behaviour, I am now certain that I
have observed the same behaviour on several occasions running Windows XP (!).
The last time I had observed it was after a warm reboot; the fan ran during boot
and never stopped afterwards even when exhaust air was getting really cold.

Other information: I am using the rmclock utility provided by rightmark.org to
do frequency changing in Windows, rather than the official AMD driver. I do not
know if this has anything to do with it - as said, I haven't been able to
replicate the behaviour and thus can't test swapping the AMD and the rightmark
drivers.

In Windows XP, I cannot probe the ACPI temps - using a utility called Speedfan
(from almico.com), the ACPI temps never get updated unless a thermal trip point
is passed. The SMBus temps do get updated, but I'm not sure these refer to the
same thermal zones.

The only utility that does provide ACPI temps is the Dashboard utility provided
on the AMD website - this however installs its own low-level drivers; which you
can see if you try to run the program after installing it but before the
required rebooting.

Hope it helps, please let me know if I can provide further information.
Comment 57 Sylvain Collange 2006-01-26 14:14:41 UTC
This bug can be worked around by patching the DSDT (at least for me).

The two changes I have done:
1. Enabling thermal zone polling by adding a _TZP method for the main thermal zone.

2. In the _ON method of power resources associated with the fan (C25C to C25F),
there is code that seems to check the host OS:
           Method (_ON, 0, NotSerialized)
           {
               If (LNot (LGreater (\C008 (), 0x03)))   /* OS != WinXP */
               {
                   C256 (0x08, 0x32)
               }
               Else                                    /* OS == WinXP */
               {
                   If (LGreater (DerefOf (Index (C252, 0x00)), C258 (C248, 0x00)))
                   {
                       C256 (0x08, 0x32)
                   }
               }
           }

For some reason (bug or feature?), Linux is identified as Windows XP.
This causes another check involving the current temperature and a trip point to
be made.

Removing both checks and calling C256 directly seems to fix the problem of fan
refusing to start blowing after a random amount of time (as reported in comment
#27).
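
To make the branch structure of the _ON method above easier to follow, here is a
minimal Python model of it. This is only a sketch: the function and parameter
names are invented for illustration, and which operand is the temperature and
which is the trip point is not decided here, the two values are simply compared
in the same order as in the AML.

```python
def original_on_calls_c256(os_level, c252_first, c258_result):
    """Model of the unpatched _ON: True when it would call C256 (0x08, 0x32).

    os_level    -- value returned by \\C008 () (meaning assumed from the
                   comments in the listing)
    c252_first  -- DerefOf (Index (C252, 0x00))
    c258_result -- C258 (C248, 0x00)
    """
    if not (os_level > 0x03):
        # LNot (LGreater (C008 (), 0x03)): C256 is called unconditionally.
        return True
    # Otherwise C256 is only called if the extra comparison holds.
    return c252_first > c258_result


def patched_on_calls_c256(os_level, c252_first, c258_result):
    """Model of the patched _ON: both checks removed, C256 always called."""
    return True
```

In this model the fan is only switched on in the second branch when the
comparison succeeds, which is the check that appears to fail after some time.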

What I still don't understand is why the temperature check did fail after some
time, causing the fan not to restart. Could this be a synchronization issue like
a race condition?

This does not solve the problem of unprocessed ACPI events, though.

The patched DSDT for BIOS F.09:
http://acpi.sourceforge.net/dsdt/view.php?id=525
And for BIOS F.0D:
http://acpi.sourceforge.net/dsdt/view.php?id=561
Comment 58 Robert Moore 2006-01-26 15:47:55 UTC
>For some reason (bug or feature?), Linux is identified as Windows XP.

Feature. This allows Linux to execute the AML code down what is often the 
*only* tested path.
Comment 59 Markus Thorsten Wenzel 2006-02-15 01:44:07 UTC
Hi,

NX6125, ML-40;
upgraded my BIOS from F.09 to F.0D; formerly both with Gentoo AMD64 and openSuse
10.1beta2 x86_64 the fan would turn on and off properly when running a "watch
cat acpi -t" in a console, and even turn on and off accidentally if not.

After BIOS upgrade, the fan starts running dependent on CPU temperature during
system startup and will never alter its speed, whatever I do. My kernel boot
options are "noapic nolapic all-generic-ide". 

This leads to one recommendation and one question: I recommend not upgrading the
BIOS to F.0D if not for testing purposes, and I wonder if anyone can provide me
with a BIOS downgrade to F.09 or F.07, preferably the Windows SoftPaq SP31482.exe.
HP has deleted this older version from their website/ftp.

Regards,
Markus
Comment 60 Peter Wainwright 2006-02-15 11:09:42 UTC
Hi,

Markus, I've also upgraded from F.07 to F.0D, but it is now working no better
and no worse than the old version. I.e. under heavy CPU load, with
"watch 'acpi -t'" running, there is the normal hysteresis of the fan speed
between 58 and 65 degrees. But without the watch command, the system misses
the trip point at 65.

However, I did have a problem at first because I had forgotten to remove a
patched version of the old DSDT which I had placed in my initrd for debugging -
maybe that is the problem?

Peter
Comment 61 Peter Wainwright 2006-02-16 11:00:26 UTC
Alas, I spoke too soon.  Even with "watch acpi -t" running in a terminal,
sometimes the fans won't turn on.
My system is an NX6125 with Turion ML-40
running a self-compiled kernel-2.6.14-1.1653_1.rhfc4.cubbi_swsusp2.src.rpm
(that is, Fedora 4 kernel with Software Suspend 2 patches).  Also some
debugging patch from this thread.  I now get, in /var/log/messages, stuff like this:


Feb 16 18:18:31 ceiriog su(pam_unix)[3200]: session opened for user root by
(uid=1002)
Feb 16 18:19:28 ceiriog kernel: ------------------  Got thermal event 0x80
Feb 16 18:19:28 ceiriog last message repeated 8 times
Feb 16 18:19:28 ceiriog kernel: ------------------  Got thermal event 0x81
Feb 16 18:20:33 ceiriog kernel: acpi_power-0435 [77] power_transition      :
acpi_power_on failed (-8)
Feb 16 18:20:33 ceiriog kernel: acpi_power-0459 [77] power_transition      :
Error transitioning device [C272] to D0
Feb 16 18:20:33 ceiriog kernel: acpi_bus-0266 [76] bus_set_power         : Error
transitioning device [C272] to D0
Feb 16 18:20:33 ceiriog kernel: acpi_thermal-0652 [75] thermal_active        :
Unable to turn cooling device [ffff8100016219f0] 'on'
Feb 16 18:20:34 ceiriog kernel: ------------------  Got thermal event 0x80
Feb 16 18:20:34 ceiriog last message repeated 2 times
Feb 16 18:20:34 ceiriog kernel: ------------------  Got thermal event 0x81
Feb 16 18:21:55 ceiriog kernel: ------------------  Got thermal event 0x80
Feb 16 18:21:55 ceiriog last message repeated 8 times


the line with acpi_power_on failed (-8) was added by me in drivers/acpi/power.c
so I could see exactly what error code was returned by the acpi_power_on
call.  If it means anything to anyone, please tell...

Just at the moment, I can control the fan C273 by
echo 0 > /proc/acpi/fan/C273/state,
echo 3 > /proc/acpi/fan/C273/state.
The former command turns the fan ON, the latter turns it OFF.
echoing 1 or 2 to this file causes error messages like

Feb 16 18:24:56 ceiriog kernel: acpi_bus-0216 [150] bus_set_power         :
Device does not support D1

So evidently the integers here represent the ACPI device levels D0 to D3,
and the intermediate levels are not supported.

However, the same trick with the other fan devices C270 - C272 now does nothing.


I don't think this problem is new since I can see it in my backup
/var/log/messages.{1,2,3}.  And in those, the fan numbers are C260 - C263, which
indicates they were from the original BIOS (F.07).  Has anyone else seen this
problem? I think it may require CONFIG_ACPI_DEBUG enabled in your kernel config
to see this...

I don't know whether this is a NEW bug or related to the missing of thermal
events; if you like I will open a new bug for it.


Regards,
Peter
Comment 62 Peter Wainwright 2006-02-16 11:03:20 UTC
Sorry, you may also want to know my kernel command line options: they are

ro root=LABEL=/ no_timer_check=0 rhgb quiet enforcing=0 noapic nolapic

(rhgb is redhat-specific "graphical boot", enforcing=0 is for SELinux)
Peter

Comment 63 Peter Wainwright 2006-02-16 13:23:03 UTC
Silly me: after upgrading my BIOS I forgot to re-install the patched DSDT:

http://acpi.sourceforge.net/dsdt/view.php?id=561

As stated before in this thread, this is necessary for control of the fans.
Otherwise the fan power resource
_ON method does nothing (the code being conditional on some OS checks).

Peter
Comment 64 Richard Mace 2006-02-28 12:34:22 UTC
Just compiled 2.6.16-rc5, which, incidentally, fixes the double timer problem on
the hp nx6125 (see bug # 3927). More significantly, I noticed the following
messages in the output of dmesg, which I cannot recall seeing with previous
kernels, and I thought they may be relevant....

ACPI Error (evgpeblk-0284): Unknown GPE method type: C265 (name not of form _Lxx
 or _Exx) [20060127]
ACPI Error (evgpeblk-0284): Unknown GPE method type: C266 (name not of form _Lxx
 or _Exx) [20060127]

--Richard
Comment 65 Peter Wainwright 2006-03-01 03:20:04 UTC
OK, I've been tearing out my remaining hair and have spent the last
couple of weekends trying to track down the problem, though I knew
absolutely zilch about ACPI when I started.

However, here is my speculation, based on reading the DSDT, available
at http://acpi.sourceforge.net/dsdt/view.php?id=558, and the kernel
source.

I am hoping that some kernel expert will jump in and tell
me if I'm right or wrong.

Asynchronous notification of thermal events is handled by a
level-triggered interrupt which causes execution of the ACPI control
method _L19 (defined by AML code in the DSDT).  This method enters a
loop which polls the temperature sensor via the SMBus (I assume this
is something similar to
http://www.maxim-ic.com/quick_view2.cfm/qv_pk/2408).  If the status
bits of the temperature sensor indicate a trip point has been exceeded
a thermal Notify() event is generated.

At the end of the loop, the control method relinquishes control via a
Sleep() call for 100 microseconds before polling again.  In this
interval, one would hope that the OS would take control and process
the outstanding Notify() events.

HOWEVER, as far as I can tell from the kernel source, both the _L19
interrupt and the Notify handlers are run from a single workqueue in
the thread known as "kacpid".  This means the Notify() events do not
get processed until the _L19 method has completed. They just pile
up in the queue.

Presumably reading the temperature (using cat
/proc/acpi/thermal_zone/*/temperature or acpi -t) causes the queue to
be flushed immediately.

The question is: does the DSDT or the kernel behaviour better
represent the ACPI spec?

According to the ACPI spec: "When a control method does block, the
operating software can initiate or continue the execution of a
different control method.  A control method can only assume that
access to global objects is exclusive for any period the control
method does not block.".  I take this to mean that it would be
acceptable to process the thermal Notify() events as they occur, and
before the interrupt handler _L19 returns.  This would presumably
require a separate kernel thread.

Does this make sense?
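
The suspected mechanism can be illustrated with a toy model. This is a sketch in
Python, not kernel code: the function name, the log strings and the
"polls_before_read" parameter are all invented, and the userspace read is
modelled simply as something that clears the sensor status bit after a given
number of polls.

```python
from collections import deque


def simulate_single_workqueue(polls_before_read):
    """Model one worker draining one FIFO queue; returns an ordered event log."""
    if polls_before_read < 1:
        raise ValueError("polls_before_read must be at least 1")
    log = []
    queue = deque(["_L19"])
    status_bit_set = True  # trip point exceeded, sensor flag latched
    while queue:
        job = queue.popleft()
        if job == "_L19":
            polls = 0
            while status_bit_set:
                # Notify() only queues work behind the job that is running now.
                queue.append("Notify(TZ1, 0x80)")
                log.append("queued Notify")
                polls += 1
                # Sleep() would yield here, but the only worker is still busy
                # with _L19, so nothing else on this queue can run.
                if polls == polls_before_read:
                    status_bit_set = False  # models a userspace "acpi -t"
                    log.append("userspace read clears status")
            log.append("_L19 returned")
        else:
            log.append("handled " + job)
    return log


log = simulate_single_workqueue(3)
```

In this model every Notify is handled only after "_L19 returned" appears in the
log, which matches the bursts of thermal events seen after running acpi -t.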



Comment 66 Peter Wainwright 2006-03-04 01:23:16 UTC
OK, I had another look at this this morning.  I think I understand
most of it.

All of this refers to the nx6125 BIOS version F.0D and linux
kernel 2.6.16-rc5.

In addition to the GPE and Notify events, the thermal
zone polling (_TZP) is also done from the same single threaded
workqueue.  This explains why enabling TZP does not solve the problem.
Once the _L19 control method is entered it blocks processing of all
other events.  However, the userspace workaround "watch acpi -t" (or
"watch cat /proc/acpi/thermal_zone/*/temperature") works, because it
forces action immediately without queueing it on the kacpid workqueue.
Reading the temperature sensor status bits has the effect of clearing
them, so on the next loop _L19 returns, and then the Notify events get
processed.

I do not fully understand the quote from the ACPI spec in comment 65,
nor this from the ACPI CA Programmer's Reference:

"Because of the constraints of the ACPI specification, there is a
major limitation on the concurrency that can be achieved within the
AML interpreter portion of the subsystem. The specification states
that at most one control method can be actually executing AML code at
any given time. If a control method blocks (an event that can occur
only under a few limited conditions), another method may begin
execution. However, it can be said that the specification precludes
the concurrent execution of control methods. Therefore, the AML
interpreter itself is essentially a single-threaded component of the
ACPI subsystem. Serialization of both internal and external requests
for execution of control methods is performed and managed by the
front-end of the interpreter."

However, the DSDT for the HP nx6125 seems to expect a cooperative
multithreading model in which one control method can relinquish
control using Sleep() and wait for another to complete.

I am not a computer scientist, but a mathematical physicist and part
time system administrator.  I hope my analysis will be useful to
someone, but I don't have the time or the skills to take this further.
It would be helpful to have some comment from the people who wrote the
ACPI spec (HP/Intel/MS/Phoenix/Toshiba) as to what level of
concurrency is really expected from the OS.

P.S. I notice that acpi_os_queue_for_execution() takes a "priority"
argument which is unused.  What is the intention of this?

So, I am relinquishing control here and hoping someone else will
run:

Sleep(100000000....)


Comment 67 Robert Moore 2006-03-08 15:16:26 UTC
Looking at method _L19, the Sleep() operator should relinquish the processor 
and allow the Notify dispatch to occur.

AcpiOsQueueForExecution should be using a different thread than the thread 
that executes the AML interpreter. The Notify dispatcher does not enter the 
AML interpreter mutex, so there should not be a deadlock. The actual handler 
for the Notify should be in a driver somewhere.

I have found that the best method to determine exactly what is going on is to 
enable full debug tracing in the ACPICA subsystem (via acpi_dbg_level) and 
analyze the sequence of events.

Comment 68 Matthew Garrett 2006-03-22 14:51:33 UTC
Robert - are you suggesting that you'd like a full debug trace in order to
examine this further? I can produce one of those without too much trouble, but
it would be helpful to know how much debugging you'd like.
Comment 69 Vladimir Lebedev 2006-03-24 02:25:46 UTC
Richard,

We have nx6125 now and do not see any problems, we use latest stable kernel 
2.6.16 based on your config file. So, please try 2.6.16 and post the results.





Comment 70 Maciej Paszta 2006-03-24 16:14:20 UTC
Well, the problem still remains, at least for me. This is how it can be verified:
Turn on the computer with AC.
Run some stupid cpu-eater like

int main() {
 while(1);
}

Unplug the notebook from AC... and wait. Just a few minutes ago my CPU reached
nearly 80 degrees C without any of the fans turning on. I use
vanilla-sources-2.6.16 with gentoo patches applied. I don't use any modified
DSDT and my BIOS version is F.0D.
Comment 71 Vladimir Lebedev 2006-03-24 23:12:57 UTC
Your test was repeated step by step; unfortunately the fan works. Please attach
your '.config' file; I want to reproduce exactly your situation.
Comment 72 Maciej Paszta 2006-03-25 00:39:05 UTC
Created attachment 7662 [details]
kernel config (where fan still don't work)

Here's my .config. Just to be 100% sure, I will check vanilla-sources without any
patches, and I will report here ASAP.
Comment 73 Peter Wainwright 2006-03-25 01:02:06 UTC
I can still reproduce the problem.

A description of my kernel can be found at
http://www.ceiriog.eclipse.co.uk/2.6.16-prw7

My kernel configuration (.config) can be found at
http://www.ceiriog.eclipse.co.uk/dot-config-2.6.16-prw7

Additional patches mentioned in the kernel description can be found at
http://gaugusch.at/acpi-dsdt-initrd-patches/acpi-dsdt-initrd-v0.7e-2.6.14.patch
http://www.suspend2.net/
and in the directory
http://www.ceiriog.eclipse.co.uk/patches

I find that full ACPI debugging gives me too much information, so I have
prepared a small patch which will show you the flow of control
http://www.ceiriog.eclipse.co.uk/patches/acpi_events.patch

In order to reproduce the problem reliably you need to turn thermal
zone polling OFF with
echo 0 > /proc/acpi/thermal_zone/TZ1/polling_frequency.
You also need to turn off any application which is reading the
temperature /proc/acpi/thermal_zone/TZ1/temperature.
You might also want to enable some limited ACPI debugging with
echo 0x0f > /proc/acpi/debug_level.

Then run a CPU-intensive application (glxgears will do it eventually,
so will a kernel compilation). You may monitor the temperature using
watch "cat /proc/acpi/thermal_zone/*/temperature", but kill this
process just before the temperature reaches 65C. Wait a while until
you are confident that the temperature has exceeded 65C. Then do
"acpi -t" or "cat /proc/acpi/thermal_zone/*/temperature" and you should
find (1) the system seems to hang for a fraction of a second (or several
seconds if it is a long time since the trip point was exceeded) and (2)
you get something in the system log like
http://www.ceiriog.eclipse.co.uk/patches/messages

Note how the thermal notify events Notify (\_TZ.TZ1, 0x80) are repeatedly
queued and never processed until after "acpi -t" causes the evaluation of
_TMP.
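
For convenience, the "turn polling OFF" step can be scripted for all thermal
zones at once. This is a minimal sketch in Python; the function name
disable_polling and its root parameter are made up for illustration, and it
assumes the /proc/acpi/thermal_zone layout used in the commands above. It must
be run as root to write to /proc.

```python
import glob
import os


def disable_polling(root="/proc/acpi/thermal_zone"):
    """Write 0 to polling_frequency in every zone under root.

    Returns the sorted list of files that were written.
    """
    changed = []
    pattern = os.path.join(root, "*", "polling_frequency")
    for path in sorted(glob.glob(pattern)):
        with open(path, "w") as handle:
            handle.write("0\n")  # 0 disables polling, as with "echo 0 > ..."
        changed.append(path)
    return changed
```

Calling disable_polling() with no argument is meant to do the same as running
the echo command above once per zone.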

Peter Wainwright

Comment 74 Maciej Paszta 2006-03-25 01:26:45 UTC
For me the problem still exists, even on a vanilla kernel.
Comment 75 Peter Wainwright 2006-03-25 02:31:45 UTC
Created attachment 7663 [details]
Peter Wainwright's kernel configuration
Comment 76 Peter Wainwright 2006-03-25 02:33:57 UTC
Created attachment 7664 [details]
Peter Wainwright's ACPI event tracking patch
Comment 77 Peter Wainwright 2006-03-25 02:44:27 UTC
Created attachment 7665 [details]
Peter Wainwright's DSDT patch

I have uploaded my config and my event tracking patch to bugzilla
for reference. I forgot to mention that the /var/log/messages
excerpt was created with a patched DSDT: I have also uploaded
that patch (it is similar to the one in comment #31 but the
symbols in my BIOS are numbered differently). This gives the
ACPI Debug lines so you can see the relationship between the
ACPI AML code and the kernel stuff. Vladimir, what
BIOS version are you using? I am using F.0D, which is the latest
I could find on the HP website
(and at http://acpi.sourceforge.net/dsdt/view.php?id=558). If
you have purchased your box very recently maybe it has a newer
BIOS?


Peter Wainwright
Comment 78 Peter Wainwright 2006-03-25 02:51:44 UTC
Created attachment 7666 [details]
Peter Wainwright's DSDT patch

Corrected upload
Comment 79 Richard Mace 2006-03-25 09:44:09 UTC
In response to #69: Tested vanilla kernel 2.6.16 with glxgears while on battery.
The CPU temperature rose to 64 degrees without any response from fans, i.e., the
problem still persists.

BIOS: F.07    CPU: Turion ML-34    DISTRO: Debian amd64

Vladimir: I've included the distribution information above because, for example,
SuSE seems to implement their own polling (at intervals of 2s) by default. As
pointed out a bit earlier, you need to disable polling if enabled. However, my
experiments with polling (under Debian) have all failed to improve matters.

-Richard
Comment 80 Richard Mace 2006-03-25 09:49:19 UTC
Created attachment 7675 [details]
config (2.6.16)

My 2.6.16 kernel .config
Comment 81 Peter Wainwright 2006-04-08 03:02:23 UTC
There has been no significant movement on this bug for months.
I can only conclude that the problem is only triggered by a few
very sophisticated DSDTs (the HP nx6125 among them) or has remained
unrecognized on other platforms. Nonetheless it seems to me that
Linux does not correctly implement the ACPI spec.

Therefore, I propose a solution with the patch attached.

Interpretation of control methods called in response to GPE events or
Notify events is confined to one single-threaded workqueue.

It is true that the AML interpreter is essentially single-threaded,
because it is protected by a mutex and therefore only one kernel thread
can be executing AML code at one time. However, this does NOT mean that
the execution of different control methods should not overlap. The ACPI
spec allows for the transfer of control between one method and another
when the AML calls Sleep, Acquire, Wait etc. (see the ACPI-CA
reference).

The way Sleep() is implemented in Linux, it calls schedule() and transfers
control to other kernel threads: but any other control method which is queued 
in the kacpid thread itself will not be able to run until the currently
executing control method is finished.

On the HP nx6125 laptop this overlap is essential; without it the ACPI subsystem
blocks, thermal events are not processed, and the system
overheats. http://bugzilla.kernel.org/show_bug.cgi?id=5534

So, if any of you have an nx6125, or an ACPI bug which you think may be
caused by the mechanism I have described, please try this patch.

This is my first attempt at a serious kernel hack, so please forgive
the state of it: this is work in progress. At least it should show one
approach to the problem. I'm sure it has loads of problems with respect
to locking, SMP, etc. which you will point out.

In order to enable the new behaviour you need to write a positive
integer (e.g. 10) to /proc/acpi/poolsize.

Instead of executing all the GPE and Notify handlers in the single
kacpid thread, we create a pool of worker threads and hand over the
work to them. These can now execute concurrently, though access to the
interpreter is still serialized by the use of a mutex. The thread pool
is allowed to grow dynamically up to the maximum size which is set by
the user by writing an integer to /proc/acpi/poolsize. If this integer
is 0 the thread pool is not used at all and the old behaviour is used.
You can also read /proc/acpi/poolsize to see the maximum pool size and
the currently allocated threads. There is a field "jiffies" in the
thread pool entry structure which is written when a thread finishes
execution of a control method. My intention is that in future this will
be used to reap unused threads.

Of course, the user-configurable pool size may not be necessary. We
might hard-code it. Or even allow the AML to create as many threads as
necessary (assuming we trust the BIOS).
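
The idea can be sketched in userspace terms. The following Python model is not
the kernel patch: MethodPool, run_demo and the log strings are invented names,
a pool limit of 1 stands in for the old single-threaded behaviour, and the
Sleep() in _L19 is modelled as waiting (with a timeout) outside the interpreter
mutex.

```python
import threading


class MethodPool:
    """Pool that grows on demand up to max_size; extra jobs wait in a list."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.lock = threading.Lock()
        self.active = 0
        self.pending = []
        self.threads = []

    def submit(self, fn):
        with self.lock:
            if self.active < self.max_size:
                self.active += 1
                thread = threading.Thread(target=self._worker, args=(fn,))
                self.threads.append(thread)
                thread.start()
            else:
                self.pending.append(fn)

    def _worker(self, fn):
        while True:
            fn()
            with self.lock:
                if not self.pending:
                    self.active -= 1
                    return
                fn = self.pending.pop(0)

    def join(self):
        index = 0
        while True:
            with self.lock:
                if index >= len(self.threads):
                    return
                thread = self.threads[index]
            thread.join()
            index += 1


def run_demo(max_size, sleep_timeout=1.0):
    log = []
    interpreter_mutex = threading.Lock()  # one thread interprets AML at a time
    notify_done = threading.Event()
    pool = MethodPool(max_size)

    def notify_handler():
        with interpreter_mutex:
            log.append("Notify handled")
        notify_done.set()

    def gpe_l19():
        with interpreter_mutex:
            log.append("_L19 queued Notify")
            pool.submit(notify_handler)
        # Sleep(): the mutex is released, so another pool thread may run.
        notify_done.wait(timeout=sleep_timeout)
        with interpreter_mutex:
            log.append("_L19 returned")

    pool.submit(gpe_l19)
    pool.join()
    return log
```

With run_demo(2) the Notify handler runs while _L19 is sleeping; with
run_demo(1) it can only run after _L19 has given up waiting and returned.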


Peter Wainwright (P.S. not the Apache/Perl expert).

(copied to linux-acpi list).
Comment 82 Peter Wainwright 2006-04-08 03:04:11 UTC
Created attachment 7812 [details]
Patch for multi-threaded execution of control methods

This patch is applied to vanilla kernel 2.6.16
Comment 83 Anonymous Emailer 2006-04-08 04:07:50 UTC
Reply-To: dagarlas@gmail.com

I recompiled 2.6.16 and it seems ok... (with nx6125, bios F.0D)

however i'll test better in this weekend... now I tried running
glxgears... the fan starts and stops correctly

Comment 84 Peter Wainwright 2006-04-08 10:31:02 UTC
I'd better make clear that you use my patch at your own risk - it could
hang your system, or at least make it unresponsive.  It probably won't eat
your data, but I can't give a warranty for that. It works for me in the
normal mode of operation; however:

This afternoon I resumed from Software Suspend 2 and tried to run glxgears
again. I got a storm of ACPI events which made the system very unresponsive.
I rebooted and tried again: then the thermal trip point did not trigger at all.
Only power off and cold boot restored the normal function.

I don't think the problem is necessarily in my patch, though. When I tried
suspending an unpatched system the ACPI subsystem stops responding to events
entirely. I guess there is some state in the ACPI thermal subsystem which is
not reset correctly by suspend. I will continue to test my patch this week
without using suspend, and report the results.
Comment 85 Anonymous Emailer 2006-04-08 11:06:28 UTC
Reply-To: dagarlas@gmail.com

don't worry I won't complain... I'm used to this notebook's instability
with linux, I nearly hate it...

However now the fan seems ok.. There's still a problem: sometimes it
stops updating batteries' level. I don't know if it's related to this
bug... sometimes a workaround is to plug the ac and remove it, but it
doesn't work every time...

Comment 86 Maciej Paszta 2006-04-08 13:20:10 UTC
Well Peter :) Big thanks you are my saviour, fan finally starts working... I'll
make some more tests and post some reports here :)
Comment 87 Anonymous Emailer 2006-04-08 14:29:18 UTC
Reply-To: dagarlas@gmail.com

ok, after some testing i experienced some problems...

I set poolsize=10 automatically at startup...

1 - I start X and kde, fan is off
2 - run glxgears until fan turns on
3 - as fan turns on I stop glxgears
4 - as soon as the fan turns off i re-run glxgears (return to 2)

after 2 loops... i get 100% cpu usage but I have no processes using such
resources.. and executing top i see no processes using 100% cpu...
System is so slowed down that i need to reboot

I tried setting poolsize = 0 but i get the same behaviour (except the
working fan :-D)

To restore the system I needed to recompile the kernel without the patch...

so, what is using all my cpu?


Comment 88 Peter Wainwright 2006-04-09 00:09:50 UTC
To dagarlas@gmail.com:

It would be helpful if you could compile with ACPI debug statements, and
echo 0x480 > /proc/acpi/debug_layer
echo 0x0f > /proc/acpi/debug_level
It would also be helpful if you could use the patch which allows you to
use a modified DSDT, and use the modified DSDT here:
http://www.ceiriog.eclipse.co.uk/acpi/DSDT.aml
(or compile the source http://www.ceiriog.eclipse.co.uk/acpi/DSDT.asl
using the iasl compiler).

Then you will get some quite verbose debugging information in /var/log/messages
and on the console. If you post it here that may be helpful (you may trim
the tail of the listing, since it sounds like your system has got stuck in
a loop).

Also, it would be helpful to see the output of
ps -ef | grep kacpid
if you can get it, and possibly (if you can get it) the output of
cat /proc/acpi/poolsize

just to see how many worker threads have been created (in normal operation there
should be 2, if things run away there can be up to 10).

By the way, did you really need to recompile the kernel without the patch?
I always keep my old kernel(s) around when I am experimenting with kernel stuff.
Use GRUB to boot the old kernel when the new kernel crashes!

Peter
Comment 89 Maciej Paszta 2006-04-09 13:36:09 UTC
Well I repeated glxgears test as Anonymous Emailer ;P suggested... each time fan
started working. Maybe it's about the patched DSDT ?
Comment 90 Alexey Starikovskiy 2006-04-12 05:15:35 UTC
Peter,
In your config preemption is unset, could you please check whether turning on preemption helps?
Comment 91 Peter Wainwright 2006-04-13 13:55:18 UTC
Alexey,

I tried CONFIG_PREEMPT_VOLUNTARY and CONFIG_PREEMPT. As I expected, neither had
any effect. Preemption only affects the transfer of control between threads.
The cause of this bug is the single-threaded nature of kacpid.
Preemption cannot convert a single thread (kacpid) into multiple threads.

Peter
Comment 92 Jure Repinc 2006-04-22 18:03:21 UTC
I think I have the same problem with Linux on my HP Compaq nx6125. It also
happens frequently that acpi just stops updating: for example battery level
isn't updated, fans don't turn on when processor gets hot and dynamic frequency
changing stops working. I'm not sure if this is related to this bug but if it is
I would be more than glad to help with fixing this bug. Just give me the
instructions on what to do. As detailed as possible because this is my first
time doing anything other than configuring and compiling the kernel. Thanks!
Comment 93 Matthew Garrett 2006-04-23 12:27:09 UTC
Ok, this patch appears to help significantly. As Peter suggests, there are
problems with suspend to disk - afterwards, I don't seem to get thermal events.
Suspend to RAM works fine. The issue with suspend to disk also seems to occur on
some other HP machines (such as the 6220) which don't suffer the original
problem, so that may well be a separate bug. I'll try to narrow it down on my
6220. With a bit of luck that should give us a clue. 
Comment 94 Peter Wainwright 2006-04-27 12:30:16 UTC
Anyway, here is the second version of my patch. It has been
tested pretty thoroughly against 2.6.16.4, and applies
cleanly to 2.6.16.11.

As promised, the thread pool entries are now destroyed
dynamically; if no ACPI events occur in 10 seconds, all the
extra threads will disappear.

I have added a kernel boot option "acpi_pool_size=". This is
necessary because a thermal trip may occur between booting
and running any init script which writes the
/proc/acpi/pool_size (particularly if the system was warm to
start with). Therefore, the thread pool needs to be enabled
at boot time.

I renamed /proc/acpi/poolsize to /proc/acpi/pool_size for
greater readability and compatibility with other filenames.

I downgraded the debugging messages to ACPI_DB_INFO level,
so they are not printed by default. The patch is very robust
for me now, so I don't think we should need them.

Peter
Comment 95 Peter Wainwright 2006-04-27 12:32:49 UTC
Created attachment 7969 [details]
Patch for multi-threaded execution of control methods (version 2)
Comment 96 Jure Repinc 2006-04-28 10:55:43 UTC
There is a new BIOS version out - F.0E, Did anyone try it and check out if it
fixes the problem?
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c00654176&jumpid=em_EL_Alerts/US/Apr06_ALL/Alerts
Comment 97 Arkadiusz J 2006-04-28 17:24:31 UTC
I've upgraded the BIOS to F.0E and it didn't change anything. The fan won't kick in by
itself. acpi -t helps, but I get strange readings from Thermal 1: first 58
C, then 65+ (after issuing the command a second time), and then it drops below 50 within
several seconds. Thermal 4 shows 50+ C after the first acpi command, then it drops to 0
C in a few seconds and stays that way.
Comment 98 Johan Brannlund 2006-05-02 13:02:45 UTC
Peter, thank you for getting to the bottom of this. I haven't had time to try
your patch, but I'll do so soon.

Just a question: is there any risk of the patch having any adverse effects on
other machines?

The reason I ask is that if you don't think so, I'll try to lobby to have the
patch included in the Ubuntu kernel, unless Matthew Garrett is already planning
on putting it in, of course :).

Comment 99 Peter Wainwright 2006-05-02 13:15:56 UTC
My pleasure to be of assistance, Johan.

The patch changes nothing (except at present it adds an ACPI_DEBUG_PRINT
statement) unless the new behaviour is explicitly enabled using
a boot option or by writing to a /proc file. So, it will not affect other
machines.

According to a report at
https://launchpad.net/distros/ubuntu/+source/linux-source-2.6.15/+bug/35455,
it looks like the first version of my patch is already in Ben Collins's git
tree at rsync.kernel.org:/pub/scm/linux/kernel/git/bcollins/ubuntu-2.6.git :-)
You don't let the grass grow under your feet, you Ubuntu folks...

Peter
Comment 100 hofrichter 2006-05-02 16:03:35 UTC
Hi,
it seems that the patch works. However, I noticed that after resuming from
suspend to RAM the thermal events don't get updated anymore, just as it was
without the patch. If I run glxgears the fans don't kick in. If I then execute
acpi -t they will kick in if the temperature is high enough, but then they will
stay on even if the temperature drops below 50.
Comment 101 hofrichter 2006-05-03 07:14:24 UTC
Sorry, I forgot the kernel parameter acpi_pool_size. Now the problem after
resuming seems to be gone. Only the dimming problem remains, but that is not a
real bug.

Christian
Comment 102 Richard Mace 2006-05-03 13:41:25 UTC
Peter, 

Many, many thanks for all the research into this problem and the patches. I've
just applied your version 2 to kernel 2.6.16.13 and it works very well, both on
battery and on AC. 

Is there some recommended value of poolsize? Or will the value of 10 that you
suggested in an earlier post work in most (all) instances? 

Richard Mace
Comment 103 Peter Wainwright 2006-05-03 23:40:03 UTC
The boot parameter acpi_pool_size=10, or echo 10 > /proc/acpi/pool_size, should
be sufficient. The minimum necessary pool_size is probably 2. However, this
number is just a limit on the number of threads which are created dynamically,
and so it should never be reached on a well functioning system.
There's no reason you shouldn't make it 99999, UNLESS you have
some other ACPI bug which prevents the control methods from terminating properly.
In that case a modest limit should prevent ACPI from using all your CPU time.

Peter
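
To illustrate the limit described above (in Python rather than kernel C), here
is a minimal sketch of on-demand thread creation with a cap. The class
BoundedSpawner and its methods are hypothetical names for illustration and are
not taken from the patch; the blocked tasks stand in for control methods that
fail to terminate.

```python
import threading


class BoundedSpawner:
    """Spawn one thread per task, but never more than pool_size at once."""

    def __init__(self, pool_size):
        self.pool_size = pool_size
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest number of simultaneously live workers

    def try_spawn(self, task):
        """Start task on a new thread; return None if the limit is reached."""
        with self._lock:
            if self._active >= self.pool_size:
                return None
            self._active += 1
            self.peak = max(self.peak, self._active)
        thread = threading.Thread(target=self._run, args=(task,))
        thread.start()
        return thread

    def _run(self, task):
        try:
            task()
        finally:
            with self._lock:
                self._active -= 1


def demo():
    spawner = BoundedSpawner(pool_size=2)
    release = threading.Event()
    # Three tasks that block until released, like methods that never finish.
    started = [spawner.try_spawn(release.wait) for _ in range(3)]
    release.set()
    for thread in started:
        if thread is not None:
            thread.join()
    return [thread is not None for thread in started], spawner.peak
```

With a limit of 2, the third request is refused while the first two are still
blocked, so a misbehaving system cannot create threads without bound.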
Comment 104 Alexey Starikovskiy 2006-05-04 08:58:03 UTC
Created attachment 8023 [details]
patch to execute Notify handlers on a new thread

Here is a simple patch to execute Notify handlers on a new thread immediately
instead of adding them to common workqueue.
Cures my nx6125, please check on yours.
Comment 105 Alexey Starikovskiy 2006-05-04 11:09:15 UTC
Created attachment 8025 [details]
Updated patch with feedback from Andy Kleen

Add do_exit() at the end of Notifier to exit cleanly from thread.
Comment 106 Richard Mace 2006-05-04 12:17:23 UTC
Alexey,

Tested your patch mentioned in #105 applied to 2.6.16.13. Works fine. I don't
have suspend working, so could not test that.

Richard
Comment 107 Alexey Starikovskiy 2006-05-05 03:23:08 UTC
Marking it resolved as it solves a problem with at least two machines.
Comment 108 Alexey Starikovskiy 2006-05-06 04:56:14 UTC
Created attachment 8029 [details]
Reworked patch with feedback from Robert Moore

One more try
Comment 109 Markus Walser 2006-05-06 07:45:21 UTC
Hi,
Peter's 2nd patch worked well for me too. In the proc fs I saw up to three entries.
Alexey's 2nd version didn't work for me on my 64bit/SuSE 10, the fan stopped
forever. However, after upgrading to 32bit SuSE10.1 Alexey's 2nd version
seemed to work as well.
But during suspend with 2.6.16.14 and Alexey's 2nd version I got a callstack:
(note: noapic set, otherwise suspend would hang):

May  6 16:25:11 turion kernel: ieee1394: Node removed: ID:BUS[0-00:1023] 
GUID[00023f992975320a]
May  6 16:26:30 turion kernel: Stopping tasks:
=============================================================================================|
May  6 16:26:30 turion kernel: Shrinking memory...  ^H-^H\^H|^H/^Hdone (55399
pages freed)
May  6 16:26:30 turion kernel: pnp: Device 00:02 disabled.
May  6 16:26:30 turion kernel:
.........................................................................................................................
May  6 16:26:30 turion kernel: Intel machine check architecture supported.
May  6 16:26:30 turion kernel: Intel machine check reporting enabled on CPU#0.
May  6 16:26:30 turion kernel: swsusp: Restoring Highmem
May  6 16:26:30 turion kernel: Debug: sleeping function called from invalid
context at mm/slab.c:2729
May  6 16:26:30 turion kernel: in_atomic():0, irqs_disabled():1
May  6 16:26:30 turion kernel:  [<c0146dfc>] kmem_cache_alloc+0x1b/0x4f
May  6 16:26:30 turion kernel:  [<c01bea3b>] acpi_os_acquire_object+0xb/0x36
May  6 16:26:30 turion kernel:  [<c01d3b59>]
acpi_ut_allocate_object_desc_dbg+0x10/0x3e
May  6 16:26:30 turion kernel:  [<c01d3b9c>]
acpi_ut_create_internal_object_dbg+0x15/0x68
May  6 16:26:30 turion kernel:  [<c01cfeb1>] acpi_rs_set_srs_method_data+0x3d/0xb7
May  6 16:26:30 turion kernel:  [<c01d6ed7>] acpi_pci_link_set+0xf5/0x169
May  6 16:26:30 turion kernel:  [<c01d6f7f>] irqrouter_resume+0x34/0x52
May  6 16:26:31 turion kernel:  [<c01f7e43>] __sysdev_resume+0x11/0x53
May  6 16:26:31 turion kernel:  [<c01f7f83>] sysdev_resume+0x16/0x47
May  6 16:26:31 turion kernel:  [<c01fbb0e>] device_power_up+0x5/0xa
May  6 16:26:30 turion ifplugd(eth0)[2730]: Link beat lost.
May  6 16:26:31 turion kernel:  [<c012bffd>] swsusp_suspend+0x6b/0x85
May  6 16:26:32 turion ifplugd(eth0)[2730]: Link beat detected.
May  6 16:26:32 turion kernel:  [<c012ce18>] pm_suspend_disk+0x44/0xd3
May  6 16:26:32 turion kernel:  [<c012b604>] enter_state+0x50/0x16c
May  6 16:26:32 turion kernel:  [<c012b7a8>] state_store+0x88/0x95
May  6 16:26:33 turion kernel:  [<c012b720>] state_store+0x0/0x95
May  6 16:26:33 turion kernel:  [<c01796d6>] subsys_attr_store+0x1e/0x22
May  6 16:26:33 turion kernel:  [<c0179915>] sysfs_write_file+0x98/0xbe
May  6 16:26:33 turion kernel:  [<c017987d>] sysfs_write_file+0x0/0xbe
May  6 16:26:33 turion kernel:  [<c0149def>] vfs_write+0xa1/0x146
May  6 16:26:33 turion kernel:  [<c014a302>] sys_write+0x3c/0x63
May  6 16:26:33 turion kernel:  [<c0102a29>] syscall_call+0x7/0xb
May  6 16:26:33 turion kernel: unexpected IRQ trap at vector 89
May  6 16:26:33 turion kernel: unexpected IRQ trap at vector 89
May  6 16:26:33 turion kernel: unexpected IRQ trap at vector 89
May  6 16:26:33 turion kernel: unexpected IRQ trap at vector 89
May  6 16:26:33 turion kernel: unexpected IRQ trap at vector 89

Unfortunately the fan stopped after resuming, but the rest of the system seems 
to be working.
I haven't tested Alexey's 3rd version so far. I haven't gotten it to compile yet.

Regards, Markus
Comment 110 Konstantin Karasyov 2006-05-06 09:49:33 UTC
The 3rd patch from Alexey doesn't work either.
Comment 111 Peter Wainwright 2006-05-08 01:07:26 UTC
Alexey's 2nd patch worked until I tried to suspend (using Software Suspend 2):
then it hung.

Alexey's 3rd patch applies to my 2.6.16.11 kernel with one change (ec->burst
to ec->intr) but then causes strange problems in places which seem to have
little to do with ACPI: maybe memory corruption?

I don't know why these patches fail, but you should check you don't call
a function which can sleep from an invalid context (e.g. from an interrupt
handler).  This is a no-no according to Linux Device Driver Development
and other kernel documentation.

This is why my patch takes a safety-first approach and spawns new
threads from the normal kacpid thread only.  This may take a little longer
(it is a 2-stage dispatch process) but at least I can be pretty certain
that it does not sleep in an interrupt.
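
The two-stage dispatch described above can be sketched roughly as follows, in
Python rather than kernel C. All names here (interrupt_handler, dispatcher,
handle) are hypothetical stand-ins: the first stage only records the event
without blocking, and the second stage, running as an ordinary thread, spawns
the workers that are allowed to sleep.

```python
import queue
import threading

pending = queue.Queue()          # stage 1 output: events waiting for dispatch
results = []
results_lock = threading.Lock()


def interrupt_handler(event):
    """Stage 1: must not block, so it only records the event."""
    pending.put_nowait(event)


def handle(event):
    """The real work; free to sleep because it runs on its own thread."""
    with results_lock:
        results.append("handled " + event)


def dispatcher():
    """Stage 2: runs in normal thread context and spawns the workers."""
    workers = []
    while True:
        event = pending.get()
        if event is None:        # sentinel: stop dispatching
            break
        worker = threading.Thread(target=handle, args=(event,))
        worker.start()
        workers.append(worker)
    for worker in workers:
        worker.join()


def run_demo():
    thread = threading.Thread(target=dispatcher)
    thread.start()
    for name in ("thermal", "battery"):
        interrupt_handler(name)
    pending.put_nowait(None)
    thread.join()
    return sorted(results)
```

The extra hop through the queue is the cost of the approach; the benefit is
that nothing which may sleep ever runs in the first stage.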
Comment 112 Konstantin Karasyov 2006-05-10 04:13:34 UTC
Created attachment 8085 [details]
Hopefully fixed Alexey's 3rd patch

Straight comparisons were used. It helped on my box.
Possibly, compiler issue?
Comment 113 Robert Moore 2006-05-12 14:27:05 UTC
New AcpiOsExecute interface integrated and released in ACPICA version 20060512
Comment 114 Darius Povilaitis 2006-05-14 01:04:42 UTC
Hi,
Peter's 2nd patch worked for me too, but it seems that the patch does not work
after a reboot. For example, if AC is on and the fan setting in the BIOS is
"always on when on AC", then everything is OK when you first boot the PC. After
a soft reboot the fan stops, even if AC is on, and it seems that the patch is
not working. If you issue acpi -t, the fan starts, but even if the temperature
rises above e.g. 65 degrees, the status of Thermal 1 is ok. After acpi -t, the
state changes to active. The situation is the same as described by Christian
(hofrichter@freenet.de). My firmware is F.0E, kernel 2.6.16.14, and
acpi_pool_size is set.

Comment 115 Len Brown 2006-05-15 00:45:26 UTC
applied patch in comment #112 to acpi-test 
should appear in the -mm tree shortly. 
Comment 116 hofrichter 2006-05-22 14:59:32 UTC
I tried Alexey Starikovskiy's patch. It seems that suspend to RAM does not work
anymore.

Best regards Christian
Comment 117 hofrichter 2006-05-22 16:12:07 UTC
With Alexey Starikovskiy's version my laptop does not wake up from suspend.
However, this works with the unpatched kernel and with Peter Wainwright's patch.
On the other hand, I cannot resume from suspend to disk (suspend2) with Peter
Wainwright's patch because I get the message "ACPI Exception AE_NO_MEMORY ...
unable to queue handler for GPE ... Event disabled ". This message is printed a
few hundred times, and all I can do is press the power button. Is there a
possibility to get rid of this message? Some kind of kernel option? I am using
kernel 2.6.17-rc4 at the moment. As far as I can remember, it is possible with
Alexey's version to resume from suspend to disk without this message.
By the way I am getting hundreds of these :
[acpid]: notifying client 3147[0:0]
May 22 16:59:48 [acpid]: notifying client 3177[0:0]
May 22 16:59:48 [acpid]: notifying client 3374[0:0]
May 22 16:59:48 [acpid]: completed event "processor C000 00000080 000000
00"
... in my acpid log file.
This is repeated about 30-40 times in a second (but not every second).
What does it mean and how can I turn this off? I suppose it has something to
do with the dynamic frequency change during workload.
If it is important: I am currently using SUSE 10.1, and the kernel option is
'noapic' for suspend.

Hope this information helps
Regards Christian
Comment 118 Jim Needham 2006-05-27 02:10:44 UTC
Running SUSE10.1 default kernel (2.6.16.13-4-default) with the patch from
comment #112, seems to fix the problem with thermal events, but now I see a
problem with enabling fans.
 
After a few transitions into active[3] state I get the following message:

21:12:12 kernel: ACPI Warning (acpi_power-0445): Transitioning device [C273] to
D0 [20060127]
21:12:12 kernel: ACPI Warning (acpi_bus-0267): Transitioning device [C273] to D0
[20060127]
21:12:12 kernel: ACPI Warning (acpi_thermal-0644): Unable to turn cooling device
[ffff810037fa56c0] 'on' [20060127]

After this the fan is no longer enabled when in state active[3], even when
reading the thermal zone shows it as active.

At this point if there is a transition to active[2] the fan is still enabled,
but after a few such transitions I see:
 
21:21:35 kernel: ACPI Warning (acpi_power-0445): Transitioning device [C272] to
D0 [20060127]
21:21:35 kernel: ACPI Warning (acpi_bus-0267): Transitioning device [C272] to D0
[20060127]
21:21:35 kernel: ACPI Warning (acpi_thermal-0644): Unable to turn cooling device
[ffff810037fa57c0] 'on' [20060127]

and again the fan stops working for state active[2].
After getting the above messages, the fans will not turn on even after moving
out of the active state and back. Only a reboot seems to fix this.
Comment 119 Jim Needham 2006-05-27 14:49:29 UTC
Oops, forgot the DSDT patch to fix fan errors. Bios version is F.0E, by the way.
Comment 120 Jim Needham 2006-05-27 18:25:32 UTC
Made changes to _ON method in DSDT from comment #57 (but didn't add polling),
now fans and thermal events working OK.
Comment 121 hofrichter 2006-05-28 03:35:59 UTC
Is there a DSDT patch for the F.0E BIOS? I could only find patches for the older
F.0D version. Is it necessary to patch the BIOS with a new DSDT, or is it enough
to apply the ACPI patch to the kernel?
Did anyone succeed in using suspend to RAM with the ACPI patch?
Comment 122 Carsten Tschense 2006-05-28 05:26:33 UTC
Yes, I'm still using the second patch from Peter, without patching the DSDT and
suspend works like a charm, no problems with the fan...
Comment 123 Jim Needham 2006-05-28 07:48:33 UTC
Created attachment 8221 [details]
DSDT for F.0E nx6125 bios with _ON method patch

There isn't a patch at acpi4linux for the F.0E bios yet. I found some
instructions for extracting the current DSDT and made the changes to the _ON
method from comment #57. I also changed line 6550 to fix an error (iasl version
20060512), by swapping in a value from the F.0D custom DSDT. In SUSE you can
add the DSDT to initrd, so you don't need to recompile the kernel.

This appears to fix a problem enabling the fan, which I see even with an
unpatched kernel while running a script to poll thermal events. If you don't
see a problem with turning on the fan, then I guess you don't need the patched
DSDT.

I haven't tried suspend, so it may or may not work, and the patch does not
include other fixes from the F.0D custom version. As soon as a patch is
available for F.0E at acpi4linux I would suggest using that, but here's the
DSDT I'm using if anyone wants to give it a try.
Comment 124 Jim Needham 2006-05-28 09:02:14 UTC
Please ignore the patched DSDT in my previous comment. I've double-checked my
BIOS version and it's F.0D, not F.0E. Apologies for the silly mistake and any
inconvenience this may have caused.
The changes in comment #57 do appear to fix the fan problem though, and the
checks in the _ON method are still there in F.0E. If you see acpi warnings about
being unable to turn on the fan with F.0E, it's worth trying the changes.
Comment 125 Richard Mace 2006-05-28 11:39:34 UTC
Thought I'd add my two cents on the patch mentioned in comment #112. Been using
it for a week or so now and have found it to be rock solid. My fans cycle
perfectly, turning on and off as expected, with the correct hysteresis. This is
highly reproducible and independent of earlier usage. I don't use suspend, so I
cannot comment on that.

BIOS F.0D; Distro: Debian; Kernel: 2.6.16.13

My DSDT (F.0D) is unpatched.
Comment 126 hofrichter 2006-05-29 13:32:08 UTC
Richard Mace is right,
the fan and temperature checking work correctly now with the patch in comment
#112. However, I cannot get suspend to work correctly. This, on the other hand,
works with Peter Wainwright's patch. I am asking myself now what the
difference between the two versions is. As far as I know, Alexey's patch
executes the ACPI events immediately.
I think it will be hard to provide any log messages, as everything gets
unmounted correctly when going to suspend to RAM. Then on reboot the screen
stays black. I can hear the fan, so I conclude that the kernel is not resuming
correctly, because otherwise the fan would spin down immediately when the
system wakes up.
Comment 127 Jim Needham 2006-05-30 01:35:40 UTC
With a SUSE10.1 default kernel (2.6.16.13), firmware (F.0D) and the comment #112
patch I still see the following log message after a while:

21:21:35 kernel: ACPI Warning (acpi_thermal-0644): Unable to turn cooling device
[ffff810037fa57c0] 'on' [20060127]

and then the fan stops working.

I also get the message after a while with an unpatched kernel if I poll thermal
events with a script.

I had assumed that this was because I did not patch the DSDT, but now I am not
sure if the #112 patch should be fixing this. Does anyone else see this
behaviour after applying the patch?
Comment 128 Richard Mace 2006-05-30 02:35:01 UTC
In response to comment #127:

Jim, I think that the patches found here apply to the official linux kernel
(found at www.kernel.org). They may apply to other (modified) kernels, but then
they aren't guaranteed to work as intended. At least that is my understanding...
Comment 129 Carsten Tschense 2006-05-30 06:44:56 UTC
I've been using the latest patch with 2.6.17-rc5 (vanilla) for 3 days without problems.
Comment 130 hofrichter 2006-05-30 07:55:40 UTC
Forget about the suspend problem. I have reinstalled my distribution and now it
works.
I also see no ACPI warning message that the fan cannot be turned on. I am using
kernel 2.6.17-rc5, so at least with this version I think it should work.
Comment 131 hofrichter 2006-05-31 07:27:31 UTC
How can I turn off or decrease the debug level of acpid? The log file
gets really huge. I am getting something like this every second:

notifying client 3073[0:0]
[acpid]: notifying client 3144[0:0]
[acpid]: notifying client 3320[0:0]
[acpid]: completed event "processor C000 00000080 000000
00"

which are the process numbers for hald-addon-acpi, powersaved and X

Christian
Comment 132 Thomas Renninger 2006-06-01 03:18:29 UTC
You can disable the acpid messages somehow here (latest SUSE):
/etc/syslog-ng/syslog-ng.conf:

filter f_acpid      { match('^\[acpid\]:'); };
I am not familiar with this config... I hope this is enough.
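
Before changing the syslog configuration, it may help to check which messages
that pattern would actually select. The sketch below applies the same regular
expression in Python to a few sample messages copied from this report; the
function name split_acpid is made up for illustration, and how the filter is
wired into the rest of syslog-ng.conf is not shown here.

```python
import re

# Same pattern as in the syslog-ng filter above.
ACPID_PATTERN = re.compile(r'^\[acpid\]:')


def split_acpid(messages):
    """Separate messages that the pattern selects from all the others."""
    matched = [m for m in messages if ACPID_PATTERN.search(m)]
    others = [m for m in messages if not ACPID_PATTERN.search(m)]
    return matched, others


sample = [
    "[acpid]: notifying client 3144[0:0]",
    "[acpid]: notifying client 3320[0:0]",
    "kernel: swsusp: Restoring Highmem",
]
```

Calling split_acpid(sample) puts the two acpid lines in the first list and
leaves the kernel line in the second.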
Comment 133 Peter Wainwright 2006-06-02 09:38:33 UTC
I've been using the patch of comment #112 for some weeks now,
and it works well without suspend.  However, this evening after
resuming from suspend-to-disk (the mainline kernel version)
the machine overheated again.  acpi -t repeatedly
showed the wrong temperature
58C (which is the first trip point) although the machine was
a lot hotter than that.  The last thing in /var/log/messages:

Jun  2 17:29:59 ceiriog kernel:      osl-0925 [19719] os_wait_semaphore     :
Failed to acquire semaphore[ffff810017879e00|1|0], AE_TIME

Comment 134 hofrichter 2006-06-04 11:01:10 UTC
Sometimes it happens that the fan stays on although "acpi -t" says that all fans
should be inactive as the cpu temperature stays far below 50
Comment 135 Johan Brannlund 2006-06-19 23:00:04 UTC
I'm running Ubuntu kernel 2.6.15-23-amd64-k8, 2.6.15-23.39 to be exact. This
kernel incorporates some version of the multithreaded ACPI patch, although I'm
not sure which one.

I see exactly the same behaviour that Peter reported in comment #133:
everything works except that after a suspend/resume cycle, the fan won't turn on
and the temperature reported is a constant 58 degrees.
Comment 136 Peter Wainwright 2006-06-19 23:41:19 UTC
Since at least 2 of us have the same problem after suspend,
I'd like to REOPEN this bug.
Comment 137 Alexey Starikovskiy 2006-06-21 06:18:07 UTC
Created attachment 8361 [details]
Attempt to fix suspend-resume

Please try suspend-resume with this patch. Only one string is different.
Also patches to bugs #6455 and #6687 may help as well.
Comment 138 Len Brown 2006-06-25 21:47:34 UTC
patch in comment #112 shipped in 2.6.17-git9 
so we need an incremental patch for the resume issue 
for testers using the latest kernel. 
  
Comment 139 Johan Brannlund 2006-06-26 20:34:35 UTC
I patched my Ubuntu kernel with Alexey's patch from comment #133 and also with
the patch from http://bugzilla.kernel.org/show_bug.cgi?id=6455 and I can happily
report that I now have a working fan after suspend/resume. Thank you!



Comment 140 Johan Brannlund 2006-06-26 20:49:31 UTC
Sorry, that should of course be comment #137.


Comment 141 Johan Brannlund 2006-07-08 18:25:04 UTC
Also, it looks like I may have spoken too soon. It still happens sometimes that
the fan refuses to start after resuming. Conversely, sometimes the fan starts
immediately after resume and won't turn off.

Unfortunately I couldn't find anything in the logs that looks relevant.
Comment 142 Konstantin Karasyov 2006-07-10 02:44:53 UTC
Johan,

You could try the last patch for bug #5000 - it adds suspend/resume support 
for fan/thermal subsystems.
Comment 143 Linus Torvalds 2006-07-12 16:20:24 UTC
On Wed, 12 Jul 2006, Linus Torvalds wrote:
> 
> Any reason to not just revert it? The fundamental problems that it 
> introduces are obviously much worse than the fix.

Ok, that commit b8d35192c55fb055792ff0641408eaaec7c88988 is definitely 
horribly horribly broken.

I'm going to revert it, because the "fix" is much worse than the problem 
it fixes. Instead of a fan not coming on, I now have ten thousand threads 
killing the machine instead - and the fan _still_ doesn't come on..

The thread approach doesn't even fix the fundamental problem itself. It 
doesn't help to start a new thread, when the AML interpreter holds a 
semaphore over the sleep, causing the events to be serialized, and the 
thermal events to be delayed _anyway_.

The only thing the threading causes is that it guarantees that the machine 
ends up being totally overwhelmed by the thousands of threads, all blocked 
on the same semaphore.

I don't know what the solution should be, but in the meantime, the "fix" 
is definitely unacceptable.

			Linus
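
The serialization argument above can be demonstrated with a small user-space
sketch in Python. It is only an analogy under stated assumptions:
interpreter_lock is a hypothetical stand-in for the interpreter semaphore, and
the handlers do no real work. However many threads are started, at most one is
ever inside the locked region.

```python
import threading

interpreter_lock = threading.Lock()   # stand-in for the interpreter semaphore


def run_events(count):
    """Start one thread per event; return peak concurrency inside the lock."""
    inside = 0
    peak = 0
    counter_lock = threading.Lock()

    def handler():
        nonlocal inside, peak
        with interpreter_lock:            # every handler needs the same lock
            with counter_lock:
                inside += 1
                peak = max(peak, inside)
            with counter_lock:
                inside -= 1

    threads = [threading.Thread(target=handler) for _ in range(count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return peak
```

run_events(50) creates fifty threads, yet the measured peak is 1: the extra
threads only wait on the lock, which is the overhead without the benefit.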

Comment 144 Denny Vriesman 2006-07-18 07:16:36 UTC
FYI:
This bug is also occurring on HP nx6115. I am using Fedora Core 5.

Comment 145 Robert Moore 2006-07-18 08:02:47 UTC
I'm starting to think that the notify handler should be executed synchronously 
in the same thread executing the _GPE method in order to prevent a flood of 
new GPEs. This will require additional investigation. ACPICA was modified in 
early 2001 to move execution of the notify handler to a new thread.
Comment 146 Alexey Starikovskiy 2006-07-22 10:18:44 UTC
Created attachment 8601 [details]
Limit number of concurrent threads

Added limit on number of threads spawned by Notifies. Should make Linus' system
at least debuggable.
Comment 147 Johan Brannlund 2006-07-25 00:20:07 UTC
I just tried 2.6.18-rc2. This kernel seemed to already have most of the ACPI
patches - the only one I applied was the thread-limiting patch.

Unfortunately I didn't have any luck. The temperature reading is still stuck at
58 degrees when resuming, so the fan never turns off.

Comment 148 Phuah Yee Keat 2006-08-03 23:23:47 UTC
FYI, this also happens to HP Compaq NX6120. I am using OpenSUSE 10.1.

The fan does not turn on automatically until I do an acpi -t.
Comment 149 Peter Wainwright 2006-08-06 09:22:53 UTC
Still fails for me after a resume (both in-kernel swsusp and Software Suspend 2).
I am now using kernel 2.6.18-rc2 with the multi-thread patch from here.

I patched my DSDT with the following, which basically prints a message
each time the loop in _L19 is executed, each time a _TMP method is called, and
shows the value of the flag C176, which seems to control the resetting of the
trip points.

The result was, after resume, when the machine got hot, the reported temperature
stuck at 58 and the following was repeatedly sent to syslog:

Aug  6 17:27:39 ceiriog kernel: [ACPI Debug]  String: [0x19] "DSDT _L19
Local0=00000010"
Aug  6 17:27:39 ceiriog kernel: [ACPI Debug]  String: [0x20] "DSDT
get_temp(00000000)=00000042"
Aug  6 17:27:39 ceiriog kernel: [ACPI Debug]  String: [0x1D] "DSDT
need_reset_trip=00000001"
Aug  6 17:27:39 ceiriog kernel: [ACPI Debug]  String: [0x19] "DSDT _L19
Local0=00000010"
Aug  6 17:27:39 ceiriog kernel: [ACPI Debug]  String: [0x20] "DSDT
get_temp(00000000)=00000042"
Aug  6 17:27:39 ceiriog kernel: [ACPI Debug]  String: [0x1D] "DSDT
need_reset_trip=00000001"

So, we are still stuck in the damn _L19 loop: the NOTIFY events are being processed
and the _TMP methods called, but this does not seem to switch on the fan or
reset the trip points.


Comment 150 Peter Wainwright 2006-08-06 09:25:22 UTC
Created attachment 8720 [details]
Debugging patch for _L19 and _TMP in DSDT
Comment 151 Thomas Renninger 2006-08-11 04:07:03 UTC
You might want to have a look at:
https://bugzilla.novell.com/show_bug.cgi?id=179702 - comment #47.

Those HP BIOSes (including nx6125) have wrong OperationRegion declarations.
Because of that the interpreter might read/write to random/wrong memory
addresses. Maybe solving that one will also help here?
Comment 152 Dan Dahlberg 2006-08-11 06:45:19 UTC
Judging by the previous comment, wouldn't that mean that the patch in:

http://bugzilla.kernel.org/show_bug.cgi?id=6455

Is not resolving the core issue as well?
Comment 153 Alexey Starikovskiy 2006-08-12 13:53:38 UTC
Please try to do "echo platform > /sys/power/disk" before doing suspend.
GPE block state is not preserved across shutdown, so thermal events become
disabled after resume if "cat /sys/power/disk" shows shutdown.
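
To check which mode is currently selected before suspending, something like the
following Python sketch could be used. It makes two assumptions that are not
confirmed in this report: the file either holds a single mode name, or lists
several modes with the active one in square brackets. The function names are
illustrative only.

```python
def selected_disk_mode(contents):
    """Return the active mode from a /sys/power/disk style string.

    Assumption: either a single word, or several words with the active
    one wrapped in square brackets.
    """
    words = contents.split()
    for word in words:
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return words[0] if len(words) == 1 else None


def read_disk_mode(path="/sys/power/disk"):
    """Read the file and report the selected mode (needs a real system)."""
    with open(path) as handle:
        return selected_disk_mode(handle.read())
```

If read_disk_mode() reports shutdown, the suggestion above is to switch to
platform first.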

Thomas, this is a different issue.
Comment 154 hofrichter 2006-08-20 05:46:06 UTC
Hi,
I have the same problem. Suspending causes the fan to stay on all the time or
even worse the fan will not turn on at all. This happens after resuming from
suspend to disk or suspend to ram. Using "platform" instead of "shutdown" did
not solve the problem.
Kernel version is 2.6.18-rc4 with noapic at boot 

Alexey's patch for kernel 2.6.17 worked well. The newer kernel versions
2.6.18-rcX seem to have most of the patches applied already as the fan
management works when booting. However there seems still to be the problem with
resuming.
Comment 155 hofrichter 2006-08-20 16:12:32 UTC
Installed the thread-limiting patch, and thermal management after suspend to
RAM seems to work with noapic. The funny thing is that suspend to disk does not
work, although it should be less problematic than suspend to RAM.
However if I am going to suspend to disk I get something like "ACPI:
Transitioning device [C258] to D3" right at the moment when the kernel enters
the critical section and tries to copy to disk. The fan is blowing like hell now
at the highest speed although the cpu is pretty cool and the system hangs. I
don't know if this problem also persists with suspend2 as there is no patch yet
for rc4.
Comment 156 Rafael J. Wysocki 2006-09-05 14:42:28 UTC
Hi,

I have the same issue on HPC nx6325 with the 2.6.18-rc6 kernel (64-bit) and the
Peter's patch from Comment #95 seems to fix it.

I have observed additional symptoms that without the patch kacpid constantly
generates about 4% of CPU load (on one core) and it's impossible to unload any
ACPI modules (the rmmod process gets stuck in TASK_UNINTERRUPTIBLE).
Comment 157 Rafael J. Wysocki 2006-09-05 14:47:46 UTC
Of course I mean the same as described in the report and Comment #1.
Comment 158 Alexey Starikovskiy 2006-09-06 06:52:53 UTC
Created attachment 8950 [details]
don't defer release of global lock

This one removes one user of the deferred queue, so it should become less busy
and less deadlock-prone.
Comment 159 Alexey Starikovskiy 2006-09-06 06:54:53 UTC
Created attachment 8951 [details]
don't defer release of global lock

This one removes one user of the deferred queue, so it should become less busy
and less deadlock-prone.
Comment 160 Alexey Starikovskiy 2006-09-06 07:05:19 UTC
Created attachment 8952 [details]
create another workqueue for notify() execution

2.6.18-rc6 with this and global lock patches is able to do all the tricks
including fan control before and after suspend to disk or memory. Tried with
'echo platform > /sys/power/disk' and "noapic" on command line on nx6125.
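
The idea of a separate queue for notify execution can be illustrated with a
user-space sketch in Python. This is an analogy only, not the kernel
workqueue API; the queue names and the worker function are hypothetical. A job
that stalls the general deferred queue does not delay a notify handler that
has its own queue and worker.

```python
import queue
import threading


def worker(work_queue):
    """Run callables from work_queue until a None sentinel arrives."""
    while True:
        job = work_queue.get()
        if job is None:
            break
        job()


def demo():
    deferred = queue.Queue()   # general deferred work
    notify = queue.Queue()     # notify handlers only
    unblock = threading.Event()
    notify_done = threading.Event()

    threads = [threading.Thread(target=worker, args=(q,))
               for q in (deferred, notify)]
    for thread in threads:
        thread.start()

    deferred.put(unblock.wait)      # a deferred job that stalls its queue
    notify.put(notify_done.set)     # a thermal notify arriving meanwhile

    # The notify handler completes even though the deferred queue is stuck.
    handled_while_blocked = notify_done.wait(timeout=5)

    unblock.set()
    deferred.put(None)
    notify.put(None)
    for thread in threads:
        thread.join()
    return handled_while_blocked
```

With a single shared queue, the notify job would sit behind the stalled one
until it was released.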
Comment 161 Rafael J. Wysocki 2006-09-07 00:00:51 UTC
The patches from Comment #159 and Comment #160 fix the issue for me.

Thanks a lot!
Comment 162 Rafael J. Wysocki 2006-09-07 14:11:36 UTC
On the 2.6.18-rc5-mm1 kernel with the patches from Comment #159 and Comment #160
I'm seeing symptoms similar to those described in Comment #141.  Unfortunately
setting the suspend mode to 'platform' doesn't help and I cannot boot with
'noapic', because it's an SMP system.

However, it seems that after a resume (from disk) the temperatures are read
accurately, but for some reason the fans are out of control.  It seems that the
states of the fans are not read correctly too (eg. if two fans are on, the
system reports only one etc.).  IOW, this seems to be a separate problem.
Comment 163 Rafael J. Wysocki 2006-09-07 14:48:47 UTC
It looks like on my box the fan(s) resume issue may be resolved by not loading
the fan module from the initrd, so that the "resuming" kernel does not attempt
to control the fans before the "restored" kernel takes over.

I've tried it with 'echo reboot > /sys/power/disk', so it should also work with
'shutdown'.  Interestingly, the trick doesn't work with 'platform'.
Comment 164 Alexey Starikovskiy 2006-09-07 14:52:08 UTC
This is interesting indeed... I don't use initrd at all, maybe this is why I 
could not see the problem...
Comment 165 Rafael J. Wysocki 2006-09-08 03:07:31 UTC
Following Comment #163:

Unfortunately I spoke too soon.  It sometimes works with 'reboot', but it
doesn't work with 'shutdown'.

Still the information that the suspend to RAM works but the suspend to disk
doesn't (Comment #155) suggests that the ACPI suspend actually suspends the
hardware in the prethaw phase (ie. suspend in the "resuming" kernel) which it
shouldn't do.

Or there is some state that we should save during the suspend to disk and we
don't.  For example, AFAICT, the ACPI spec says we should save the ACPI NVS
regions, but we don't.

Anyway the issue seems to be separate from the one this Bug has been opened for,
 so I think I'll open a new Bug for it.
Comment 166 hofrichter 2006-09-08 03:57:21 UTC
I tried the two patches. 
- Suspend to RAM works without problems
- Suspend to disk shuts down and resumes, but thermal management behaves
strangely. The fans only kick in at 82
Comment 167 Rafael J. Wysocki 2006-09-08 04:11:26 UTC
I have created a new bugzilla entry for the thermal issue after resume from disk
(Bug #7122).
Comment 168 Rafael J. Wysocki 2006-09-08 16:15:57 UTC
Apparently, vanilla 2.6.18-rc6-mm1 works on my box like 2.6.18-rc5-mm1 with the
patches from Comment #159 and Comment #160.

Also the thermal management problems after a resume from disk seem to be the same.
Comment 169 Len Brown 2006-09-26 13:40:55 UTC
Patches from Comment #159 and Comment #160 applied to acpi-test. 
Comment 170 Len Brown 2006-10-15 22:51:03 UTC
Patches from Comment #159 and Comment #160  
were pulled upstream immediately after 2.6.19-rc2. 
 
Comment 171 Alexey Starikovskiy 2006-11-27 08:32:00 UTC
Created attachment 9631 [details]
Fix patch to work with Linus' Compaq n620c

Add yield before any execution of deferred functions in order to give Notify()
workqueue a chance to run.
Comment 172 Len Brown 2006-11-27 11:11:26 UTC
patch from comment 171 applied to acpi-test  
Comment 173 Mihai Donțu 2006-12-02 11:28:19 UTC
I wanted to file another bug, but this one looks very close to my problem: a 
simple "while true; do true; done" will overheat my CPU and result in a 
machine shutdown (hardware protection or something) because the fans are not 
started. But if I run "acpi -t", the fans kick in and don't stop, not even after 
I kill the script. I have to run another "acpi -t" to shut them down.
I tried a "watch 'acpi -t'" to get my fans on and off when they have to be, 
but it does not work like this. It appears I have to run the command only when 
needed.

An old 2.6.15 ubuntu kernel had this fixed, but none of the vanilla kernels (or 
at least the ones I tried: 2.6.18 and 2.6.19).

Comment 174 Alexey Starikovskiy 2006-12-02 12:05:56 UTC
So could you check that the latest patch helps?
Comment 175 Mihai Donțu 2006-12-02 12:38:14 UTC
Not quite. "acpi -t" still makes the rules (influences when the fans 
start/stop), however at some moment in time, when the temperature reaches 83 
the fans do start, cool the CPU down to 73 and then stop. It's not 
quite "laptop behaviour", that's why I'll keep my CPU at 800MHz until a proper 
fix is made.

It's quite interesting how a read from /proc/acpi/thermal_zone/TZ1/temperature  
gets things moving.
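
For anyone who wants to script such a read, here is a minimal Python sketch.
It assumes the file contains a line of the form "temperature:   58 C"; that
format is an assumption, since the file contents are not reproduced in this
comment. The function names are illustrative, and whether periodic reads are a
usable workaround is not established here.

```python
import re


def parse_temperature(text):
    """Extract the integer from a line such as 'temperature:   58 C'.

    Returns None if the text does not match the assumed format.
    """
    match = re.search(r"temperature:\s*(-?\d+)\s*C", text)
    return int(match.group(1)) if match else None


def read_zone(path="/proc/acpi/thermal_zone/TZ1/temperature"):
    """Read one thermal zone file and return its temperature in degrees C."""
    with open(path) as handle:
        return parse_temperature(handle.read())
```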
Comment 176 Alexey Starikovskiy 2006-12-02 12:40:54 UTC
please open a new bug then, and append output of dmesg and acpidump.
Comment 177 Mihai Donțu 2006-12-02 12:57:24 UTC
I think some good news is welcome, given the length of this "thread": it 
appears the patch from #171 fixed my "APIC error on CPU0: 40(40)" bug/feature. 
So there, something good did come out of it :)

Ok, I'll open a new bug.
Comment 178 Mihai Donțu 2006-12-04 04:54:04 UTC
I didn't want to rush into things, so I made a lil' test: I powered off my
laptop for an hour or so, let things cool and the hardware come to its
senses. When I booted back and did an "emerge --update --deep world" (which,
needless to say, is as hardcore as it looks), my fans started to work as
expected (slow at 60+ degrees, full power at 80+ degrees).
Don't know what to say. I guess the patch does work.
I don't see it in 2.6.19 so I hope it will reach mainline eventually.

Thanks for all the help,
Mike.
Comment 179 Alexey Starikovskiy 2006-12-04 09:20:55 UTC
Created attachment 9733 [details]
Big comment and removal of void * casts 

Andrew Morton asked for a more descriptive description and for all void *
casts to be dropped.
Comment 180 Alexey Starikovskiy 2006-12-06 10:32:31 UTC
Created attachment 9746 [details]
Change from sys_sched_yield() to cond_resched()

Switching from sys_sched_yield() to cond_resched(). Please test.
Comment 181 Emfox Zhou 2006-12-07 20:42:25 UTC
The patch from #180 with the F.04 BIOS works fine; thermal status is now updated
immediately. I hope it makes it into 2.6.20. Thanks for your good work.
Comment 182 Arkadiusz J 2006-12-09 03:48:27 UTC
The patch from comment 180 fixes the problem for me (F.0E BIOS).

There is a small 'but', however.
On the patched kernel 2.6.19 and on Ubuntu kernel 2.6.17 (with one of the earlier
patches applied) there is no hysteresis in fan behavior. The fans are turned on
and off at 58 degrees, which makes them spin up and down very often.

Appending the kernel options noapic and nolapic (as suggested on one forum) fixes
the problem and the fans behave as expected, turning on at 58 and off at 50.
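
To spell out what hysteresis means here, a minimal Python sketch using the 58/50 thresholds above. This is not the kernel's or the BIOS's actual logic; the function and parameter names are hypothetical.

```python
def fan_should_run(temperature, fan_on, on_at=58, off_at=50):
    """Decide the next fan state with separate switch-on and switch-off thresholds."""
    if temperature >= on_at:
        return True
    if temperature <= off_at:
        return False
    # Between the two thresholds: keep whatever the fan was doing.
    return fan_on


# Walk a temperature trace and print the fan state after each reading.
state = False
for reading in [49, 58, 55, 51, 50, 57]:
    state = fan_should_run(reading, state)
    print(reading, "on" if state else "off")
```

With on_at equal to off_at (both 58) the middle band disappears, which is the rapid on/off behaviour described above.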

This problem could be related. Does anyone have a similar problem, or is it just
my kernel config (and Ubuntu)?
Comment 183 Aleksander Trofimowicz 2006-12-10 20:44:19 UTC
The patch from comment 180 tested on a nx6325 box (BIOS F.02). Works fine. There
is even a small drop in temperature at TZ1 (circa 2-3 C) when compared to a
previous patch set. This might suggest a lower CPU workload.

Thermal hysteresis functions as well.
Comment 184 Marc Brockschmidt 2006-12-25 04:30:42 UTC
I've tested the patch #180 (with a vanilla 2.6.19.1) on a brand new nx6325 (BIOS
version F.04) and have something new: my fans don't start at all. I can see the
temperature rising, but nothing happens. Manually trying to switch them on returns
no error (and no notice at all in the kernel log), but they stay off. Any ideas?
Comment 185 Alexey Starikovskiy 2006-12-25 06:00:51 UTC
Sounds like you did not apply the patch.
Also, take a look at 7122, it is related.
Comment 186 Alexey Starikovskiy 2006-12-25 06:43:10 UTC
Created attachment 9947 [details]
Update patch for 2.6.20
Comment 187 Rafael J. Wysocki 2007-02-05 10:06:53 UTC
I've used the patch from Comment #179 for quite a long time with SUSE 10.1, but
after I've upgraded to OpenSUSE 10.2, it's no longer needed.  It's even harmful,
since with this patch applied acpid permanently takes about 20% of CPU time on
my box (HPC nx6325).
Comment 188 Alexey Starikovskiy 2007-02-05 10:09:27 UTC
Is "polling" mode enabled in 10.2? What is the value?
Comment 189 Rafael J. Wysocki 2007-02-05 10:14:04 UTC
First, I really meant the patch from Comment #180 (attachment #9746).  Sorry
for the confusion.

Secondly, I've found this in /etc/sysconfig/powersave/common

## Path:                System/Powermanagement/Powersave/General
## Type:                integer(1:10000)
## Default:             "333"
#
# The powersave daemon watches the usage of your CPU
# and other hardware concerning power consumption
# Please set the time in milliseconds for what the
# daemon should sleep before checking your system again
# Good values are about 200-1000(milliseconds)
#
POLLING_INTERVAL=""

so I guess the answer is 'yes', sort of.
Comment 190 Alexey Starikovskiy 2007-02-05 10:19:43 UTC
Cool, so the system polls all the thermal zones every tick and it consumes less
than 20% of CPU? Great results, I should say...
Regarding your 179/180 typo -- do you see a difference between the two?
Comment 191 Rafael J. Wysocki 2007-02-05 10:31:49 UTC
> Cool, so system every tick poll all the thermal zones and it consuming less 
> than 20% of CPU? Great results I should say...

It's really below 1% on the average and it's consistent with what gkrellm is
showing.  I don't know what powersaved is polling though, because it doesn't
even show up in the top's output.

> Regarding you 179/180 typo -- do you see the difference between two?

Not quite.  Still, I switched to the Comment #180 version soon after it had
appeared, so I can't really say.
Comment 192 Alexey Starikovskiy 2007-02-05 10:37:10 UTC
What happens if you kill powersaved or disable polling?
Comment 193 Rafael J. Wysocki 2007-02-05 11:11:31 UTC
Without the patch nothing really happens after I stop powersaved.

To test it with the patch applied I'd have to recompile the kernel.  I can do
this tomorrow.
Comment 194 Thomas Renninger 2007-02-06 05:45:53 UTC
Re comment #189: this polling interval is how often powersaved checks the CPU
load if the userspace governor is activated; the default is the ondemand governor.

What you are looking for is:
THERMAL_POLLING_FREQUENCY=""
in /etc/sysconfig/powersave/thermal
The default value is too low ("2") on 10.2, setting it to 10 or 20 should be ok.
I still wonder why you get higher loads on acpid/powersaved/acpid_notify 
processes with the patch applied...

comment #187:
> I've used the patch from Comment #179 for quite a long time with SUSE 10.1,
> but after I've upgraded to OpenSUSE 10.2, it's no longer needed.  It's even
> harmful, since with this patch applied acpid takes about 20% of CPU time on
> my box (HPC nx6325) permanently).

Do not make me nervous..., latest 10.2 has the patch applied (from comment 
#180), I just rechecked and I do not see any issues.
Comment 195 Johan Brannlund 2007-02-06 10:50:51 UTC
I have an nx6125 with the latest BIOS F.11. I've tried two different kernels:

1. The latest Ubuntu kernel 2.6.20-6.11, where the only patch related to this
problem is the patch from comment #186.

2. 2.6.20-rc5 with all the patches from
http://www.sisk.pl/kernel/patches/2.6.20-rc5/

With both of them, I get the same behaviour:

Usually (but not always) the fan starts running at a low speed when booting and
never turns off. If I run something cpu-intensive that brings the temperature up
above the first trip point (57 degrees) and let it cool down again, the fan
*does* turn off and operates normally afterwards.

Thermal polling is disabled by default. Enabling thermal polling does not affect
this problem.

If I echo the value 3 into /proc/acpi/fan/*/state, the fan does turn off but
after that, it is no longer automatically controlled and I can only start it
again by echoing the value 0 into one of the /proc/acpi/fan/*/state files.

I'm not alone with this problem, see
https://launchpad.net/ubuntu/+source/linux-source-2.6.20/+bug/75398/comments/13

I'm attaching my DSDT.
Comment 196 Rafael J. Wysocki 2007-02-06 10:55:43 UTC
Sorry, my statement in Comment #187 was actually premature.  The patch from
Comment #180 _is_ needed to make thermal management work after a fresh boot.

I was confused by side-effects of some initialization problems (apparently, my
box sometimes is not initialized properly, e.g. the keyboard doesn't work after a
power on) and there are memory effects that "survive" reboots and powering off.
I guess the only way to get rid of them is to remove the battery.

Referring to Comment #194:

> The default value is too low ("2") on 10.2, setting it to 10 or 20 should be ok.

I'll try.

> I still wonder why you get higher loads on acpid/powersaved/acpid_notify
> processes with the patch applied...

I've seen them a couple of times without the patch too.  It seems to depend
on how the box was initialized and what survived from the previous run.  Sigh.
Comment 197 Johan Brannlund 2007-02-06 10:56:30 UTC
Created attachment 10320 [details]
Gzipped DSDT for nx6125 BIOS F.11
Comment 198 Thomas Renninger 2007-02-08 06:59:43 UTC
I got a report that on an nx6125 the machine hangs for some seconds (keyboard,
mouse, nothing responds) after running for some minutes.
This is with SLE10-SP1 (a 2.6.16-based kernel with the patch from comment #180
applied). It still could be something else, maybe it's not even ACPI, but it
sounds related (possibly when battery and thermal info are accessed at the same
time, or similar?). I'll try to have a look at that machine tomorrow. I wonder
whether anyone else has experienced similar problems.
Comment 199 Markus Walser 2007-02-08 07:29:03 UTC
I've been running your patched kernel from the SUSE ftp server (2.6.18.?-35, I
think) for several days on an nx6125 without any trouble. I didn't test suspend,
but at least the fan is working correctly, except when booting with the power
supply connected. Then it is usually necessary to generate one ACPI event by
unplugging and plugging the power in again. Afterwards the fan slows down and is
regulated as expected.

Markus

Comment 200 Markus Walser 2007-02-08 07:32:50 UTC
Sorry for polluting this bugzilla with html :-( Mea culpa!
Comment 201 Thomas Renninger 2007-02-09 02:45:12 UTC
No problem. Markus Walser:
I've running your patched kernel from the suse ftp server (2.6.18.?-35 I think) 
since several days on a nx6125 without having troubles.
I didn't test suspend but at least fan is working correctly except when booting 
with connected power supply. Then it is usually required to generate one acpi 
event by unplugging and plugging power again. Afterwards fan slows down and is 
regulated as expected.

Such reports really help a lot! I'll try to reproduce this; I expect one of
Rafael's patches could help here. I'll give it a try as soon as I have the
time... Any similar reports are very welcome, especially experiences with one
(ideally a single one) or several of the patches from Rafael or bug #7122 or
others, and what effects they show.

The hangs reported in comment #198 seem to come from a recent C-state patch: I
get C2 unsupported, but C3 as a valid state. This is a SUSE-only SLE10-SP1 Beta
problem.
Summary:
The patch from comment #180 works fine for *a lot* of people and has received
huge testing.
Comment 202 Jure Repinc 2007-02-10 08:42:22 UTC
Should kernel 2.6.20 already have all the patches? I'm asking because I just
upgraded to that final version of kernel and it still happens that things like
battery status stop getting updated from time to time on my HP Compaq nx6125
until I run "cat /proc/acpi/thermal_zone/TZ1/temperature" at which point kacpid
starts using 100% of CPU and when it finishes the status is updated again.
Comment 203 Rafael J. Wysocki 2007-02-10 08:49:58 UTC
No, they are not present in 2.6.20
Comment 204 Alexey Starikovskiy 2007-02-15 09:11:41 UTC
Created attachment 10429 [details]
synchronous execution of notify

requires modifications to mutex, in the next patch
Comment 205 Alexey Starikovskiy 2007-02-15 09:14:45 UTC
Created attachment 10430 [details]
fix mutex reentrancy for method

Creating a second thread for Notify execution is considered a no-no, so here is
an attempt to execute notify in place. Please test.
Comment 206 Len Brown 2007-02-15 13:19:09 UTC
patches in comment #205 and comment #204 applied to acpi-test.
Comment 207 Hans du Plooy 2007-02-20 12:47:21 UTC
Hi guys,

When I try to apply the latest acpi-test (20070126-2.6.20), I get errors reporting
that drivers/acpi/bay.c and drivers/misc/asus-laptop.c are missing.  Sure
enough, they're not there.  If I apply the rc6 and/or rc7 patch first, it's
fine, those files are created, but I get another failure on
drivers/usb/misc/appledisplay.c
Is there a special order to these? I get the same errors on 2.6.20.1 sources.

Anyway, just to give some more input on the nx6125.  I have the Turion ML-34
with 2GB memory and F11 bios.  I'm running Debian Etch i386 (20070214 snapshot)
with the k7 kernel.

With this kernel, the fourth trip point (80 centigrade) seems to make the first
trip point (50 centigrade) fan active:

theluggage:~# acpi -t
     Battery 1: charged, 96%
     Thermal 1: active[3], 82.0 degrees C
     Thermal 2: ok, 64.0 degrees C
     Thermal 3: ok, 32.0 degrees C
     Thermal 4: ok, 50.0 degrees C

and:

critical (S5):           95 C
passive:                 88 C: tc1=1 tc2=2 tsp=100 devices=0xc1fde338
active[0]:               80 C: devices=0xc1fe46f8
active[1]:               75 C: devices=0xc1fe46a8
active[2]:               65 C: devices=0xc1fe4658
active[3]:               50 C: devices=0xc1fe4608

Hope this helps - let me know if I need to post any more info.
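
As a sanity check on the two listings above, here is a short Python sketch. It assumes the trip_points text has exactly the layout pasted above, and that the lowest-numbered active[] entry whose threshold has been reached is the one that should be reported (active[0] being the hottest). The function names are hypothetical.

```python
import re

# Trip points copied from the listing above.
TRIP_POINTS = """\
critical (S5):           95 C
passive:                 88 C: tc1=1 tc2=2 tsp=100 devices=0xc1fde338
active[0]:               80 C: devices=0xc1fe46f8
active[1]:               75 C: devices=0xc1fe46a8
active[2]:               65 C: devices=0xc1fe4658
active[3]:               50 C: devices=0xc1fe4608
"""


def parse_active_trips(text):
    """Map each active[] index to its threshold in degrees C."""
    trips = {}
    for match in re.finditer(r"active\[(\d+)\]:\s*(\d+)\s*C", text):
        trips[int(match.group(1))] = int(match.group(2))
    return trips


def expected_active_index(trips, temperature):
    """Return the lowest index whose threshold is reached, or None if none is."""
    reached = [index for index, limit in trips.items() if temperature >= limit]
    return min(reached) if reached else None


trips = parse_active_trips(TRIP_POINTS)
print(expected_active_index(trips, 82))  # 0, while "acpi -t" above reports active[3]
```

Under that assumption, 82 C should select active[0], which is the mismatch described in this comment.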
Comment 208 Mircea Bardac 2007-02-20 13:21:23 UTC
I think (not sure) that the 2 patches above are in kernel's git repository:
http://www.kernel.org/pub/linux/kernel/v2.6/snapshots/patch-2.6.20-git15.log

Commits:
5f7748cf91558a5026ded5be93c5bf6c1ac34edf
c0d127b56937c3e72c2b1819161d2f6718eee877

The comments of these commits mention this bug.

In order to fix this bug, besides these 2 patches, for nx6325, is there anything
else needed?
Comment 209 Alexey Starikovskiy 2007-02-20 13:54:30 UTC
should be enough
Comment 210 Len Brown 2007-02-20 22:29:52 UTC
patches in comment #205 and comment #204 shipped in linux-2.6.20-rc1
Closed.
Comment 211 Len Brown 2007-02-20 22:38:10 UTC
er, linux-2.6.21-rc1, that is.
