Bug 7122

Summary: Thermal management problems - HPC nx6325
Product: ACPI Reporter: Rafael J. Wysocki (rjwysocki)
Component: Power-Fan Assignee: Konstantin Karasyov (konstantin.karasyov)
Status: CLOSED CODE_FIX    
Severity: high CC: acpi-bugzilla, bas, bug-report, chris.todorov, chris, jlp.bugs, meijer.o, mjg59-kernel, pavel, terragonjohn, tommi.kyntola, trenn, tuukka.tolvanen
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.18-rc5-mm1, 2.6.18-rc6-mm1, 2.6.18-mm2 Subsystem:
Regression: --- Bisected commit-id:
Attachments: swsusp debug patch
Proof-of-concept and debug patch
dmesg output from full boot-suspend-resume cycle with the patch
acpidump output from HPC nx6325
Fan and thermal debug patch
dmesg output with the fan and debug patch applied
Patch (on top of the previous one) that fixes the issue
Part of dmesg output with the fix patch applied
Proposed fix patch
power suspend/resume patch
ACPI suspend resume patch
Modified power suspend/resume patch
power resource references implemented as a list
same as 9335, but GFP_ATOMIC is used
more safe use of GFP_ATOMIC
the same patch for kernel 2.6.20-rc3

Description Rafael J. Wysocki 2006-09-08 04:05:26 UTC
Most recent kernel where this bug did not occur: Unknown
Distribution: SUSE 10.1
Hardware Environment: HPC nx6325 (AMD Turion 64 X2)
Software Environment: x86_64, SMP, non-preemptible kernel
Problem Description:

Fans don't work properly after a resume from disk.

Namely, according to ACPI there are four fans in this box, denoted by 34F, 350,
351, and 352.  Usually, the  352 fan is always on and the 351 fan is turned on
when the temperature reported in TZ1 and TZ2 gets too high (above ~50 C). 
However, after a resume from disk either all fans are reported to be 'off' and
they are never turned on, even if the temperature in TZ1 gets above 70 C, or the
352 fan is reported to be 'on' and the other ones are reported to be 'off', but
the 351 fan is actually always on, even if the temperature in TZ1 and TZ2 is
well below 45 C.

If the kernel is compiled with CONFIG_ACPI_DEBUG=y, the following statements
appear in dmesg during the final resume (ie. the resume in the "restored" kernel):

acpi acpi: resuming
ACPI: Transitioning device [C34F] to D3
ACPI: Transitioning device [C34F] to D3
ACPI: Transitioning device [C350] to D3
ACPI: Transitioning device [C350] to D3
ACPI: Transitioning device [C351] to D3
ACPI: Transitioning device [C351] to D3
ACPI Exception (evregion-0424): AE_TIME, Returned by Handler for [EmbeddedControl] [20060707]
ACPI Exception (dswexec-0458): AE_TIME, While resolving operands for [Store] [20060707]
ACPI Error (psparse-0537): Method parse/execution failed [\_TZ_.C349] (Node ffff810037a3f490), AE_TIME
ACPI Error (psparse-0537): Method parse/execution failed [\_TZ_.TZ3_._TMP] (Node ffff810037a40458), AE_TIME

Steps to reproduce:
Suspend to disk and resume.
Comment 1 Rafael J. Wysocki 2006-09-08 04:15:43 UTC
It actually tries to turn the 352 fan on but it can't.  I've got the following
messages in dmesg:

ACPI: Transitioning device [C352] to D0
ACPI: Transitioning device [C352] to D0
ACPI: Unable to turn cooling device [ffff810037a3fb90] 'on'
APIC error on CPU0: 00(40)
Comment 2 Rafael J. Wysocki 2006-09-08 05:15:31 UTC
If I reboot the machine after a resume from disk, the thermal management still
doesn't work properly and the symptoms are like right after the resume (eg.
problems with turning fans on, overheating).
Comment 3 Pavel Machek 2006-09-08 05:29:36 UTC
> If I reboot the machine after a resume from disk, the thermal management still
> doesn't work properly and the symptoms are like right after the resume (eg.
> problems with turning fans on, overheating).

If reboot does not fix it... then it looks like a hardware problem to me. We
may be able to work around it somehow, but reboot should definitely
fix it.

Comment 4 Rafael J. Wysocki 2006-09-08 06:43:40 UTC
Hm, can ACPI NVS survive across reboots?
Comment 5 Johan Brannlund 2006-09-08 12:55:03 UTC
I've seen similar ACPI problems after reboots on my nx6125, so I don't think
it's a hardware problem.
Comment 6 Rafael J. Wysocki 2006-09-08 15:54:31 UTC
Well, the box seems to have thermal management problems independent of the
suspend.  For example, I often get messages like these:

ACPI: Transitioning device [C351] to D0
ACPI: Transitioning device [C351] to D0
ACPI: Unable to turn cooling device [ffff810037a32570] 'on'

after fresh boots (on 2.6.18-rc6-mm1 as well as on -rc5-mm1).
Comment 7 Rafael J. Wysocki 2006-09-08 16:10:16 UTC
I'm starting to see a pattern.

If the last two fans (352 and 351) are on right before the suspend, the thermal
management seems to work (sort of) after the resume.  However, if only the 352
fan is on before the suspend, then usually after the resume two fans are on, but
only one of them (352) is reported to be 'on' by the kernel.  As a result, if
the temperature falls below the next tripping point and the kernel turns one of
the fans off, it thinks that is the last one (352), but actually it is the 351.
 Then, the kernel reports all of the fans 'off', but one of them (352) is
spinning, so when the kernel wants to turn it on (after the temperature rises),
it can't and the operation fails.

Also it looks like the kernel sometimes thinks it should turn on the fan 351
when in fact the fan 350 should be switched.
Comment 8 Rafael J. Wysocki 2006-09-09 06:32:08 UTC
Comment #2 is apparently wrong, because I confused the symptoms related to
Comment #6 with the swsusp-related problem.

The swsusp-related problem appears to be that if the actual configuration of the
fans during resume doesn't match the one from before the suspend, the thermal
management after the resume is busted.
Comment 9 Rafael J. Wysocki 2006-09-09 15:21:34 UTC
Created attachment 8978 [details]
swsusp debug patch

I can break the thermal management by using the attached patch and doing
'echo test > /sys/power/disk; echo disk > /sys/power/state'.

After the test the system behaves exactly like after a resume from disk (ie.
the fans seem to be out of control).
Comment 10 Pavel Machek 2006-09-10 12:11:15 UTC
Intel people, please help...

Rafael, what happens if you rmmod thermal, then insmod it after resume?
Comment 11 Rafael J. Wysocki 2006-09-11 12:33:08 UTC
Well, nothing interesting.  The symptoms are a bit different, but it's still broken.

Anyway, I think I have figured out what to do to make it work. :-)

Namely, the problem is not with the fans themselves, but with the corresponding
power resources.

There are four power resources C34B - C34E that correspond to the fans C34F -
C352, respectively.  For each fan to work, its corresponding power resource
needs to be switched to 'on'.  However, the power resources usually become 'on'
after several attempts to switch them.  Moreover, each time

acpi_evaluate_object(resource->device->handle, "_ON", NULL, NULL)

is successful, even if the resource "refuses" to change its state.  Gently
speaking, the code in drivers/acpi/power.c is not prepared to cope with that.

For example, to switch fan C351 on, we need to turn on the power resource C34D,
which is done via acpi_power_on().  This function increases resource->references
to reflect the fact that the resource is needed for one more device.  However,
if the attempt to turn the resource 'on' actually fails, resource->references is
not decreased.  Moreover, if resource->references is greater than one, the
function doesn't even try to actually switch the resource, because it assumes it
has been turned 'on' already.  This way, after two unsuccessful attempts to turn
the resource 'on' we're toast (ie. resource->references is greater than one and
there won't be more attempts to turn it 'on' whatsoever).  As a result, the fan
is reported to be 'on', which is plain wrong and dangerous.

The solution is to make acpi_power_on() and acpi_power_off_device() manage
resource->references so that _unsuccessful_ operations are not counted.  Still,
although this also seems to help solve the suspend-resume related problem, the
initialization, suspend and resume of thermal zones and fans seems to be
logically wrong (I'll get back to this later).
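
The accounting problem described above can be illustrated with a small
userspace model.  This is only a sketch, not the code from
drivers/acpi/power.c: the struct layout, the ignored_on counter (standing in
for firmware that silently ignores an _ON request) and all function names are
made up for the illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a power resource; not the kernel structure. */
struct power_resource {
	int references;   /* devices believed to need the resource */
	bool hw_on;       /* real hardware state */
	int ignored_on;   /* _ON requests the firmware will silently ignore */
};

/* Stand-in for evaluating _ON: never reports an error, may do nothing. */
void evaluate_on(struct power_resource *r)
{
	if (r->ignored_on > 0)
		r->ignored_on--;
	else
		r->hw_on = true;
}

/* Counts the reference first and never takes it back on failure. */
int power_on_buggy(struct power_resource *r)
{
	r->references++;
	if (r->references > 1)
		return 0;               /* assumed to be on already */
	evaluate_on(r);
	return r->hw_on ? 0 : -1;
}

/* Counts the reference only after the resource is verified to be on. */
int power_on_counted(struct power_resource *r)
{
	assert(r->references >= 0);
	if (!r->hw_on)
		evaluate_on(r);
	if (!r->hw_on)
		return -1;              /* failed: references left untouched */
	r->references++;
	return 0;
}
```

With one ignored request, the first call to power_on_buggy() fails and the
second one returns 0 although the resource is still off.  power_on_counted()
fails once, leaves references at 0, and succeeds when the caller retries.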
Comment 12 Rafael J. Wysocki 2006-09-11 12:39:13 UTC
Created attachment 8995 [details]
Proof-of-concept and debug patch

This patch makes acpi_power_on() and acpi_power_off_device() manage
resource->references in such a way that unsuccessful operations are not
counted.

It also makes the kernel printk() quite a lot of debugging information and
makes the initialization, resume and suspend of fans behave in a bit more
friendly way.
Comment 13 Rafael J. Wysocki 2006-09-11 12:47:03 UTC
Created attachment 8996 [details]
dmesg output from full boot-suspend-resume cycle with the patch

This is a dmesg output from a full boot-suspend-resume cycle of a kernel with
Attachment 8995 applied.

It clearly shows that sometimes several attempts are necessary to turn on a
power resource corresponding to a fan.
Comment 14 Rafael J. Wysocki 2006-09-11 14:20:06 UTC
The thermal zones and fans initialization code seems to be incorrect, because it
may cause the same powered resource to be acquired twice in a row for the same
purpose.

For example on this box the thermal zones initialization is carried out before
the initialization of fans.  However, during the thermal zones initialization
some fans may be turned on due to the thermal requirements.  On this box this
generally leads to some power resources being turned on.  For example, if the
thermal code decides it should turn fan C352 on, it will attempt to do so and
power resource C34E will be turned on.  Next, the initialization of fans will
attempt to unconditionally turn fan C352 on again and that will lead to
resource->references in C34E being increased for the second time for the same
purpose (ie. keeping fan C352 on).  From the logical point of view it's as
though fan C352 had two references on the power resource C34E, which is
incorrect, because it suggests there are two devices that need this power
resource to stay on, but in fact there's only one.

My understanding is that we may want to make device->power_state reflect the
actual state of the device (eg. fan), but IMHO for this purpose we should use
a simplified version of acpi_power_transition() that calls a function which
doesn't change resource->references instead of acpi_power_on(), if
device->flags.force_power_state is set in acpi_bus_set_power().
Comment 15 Rafael J. Wysocki 2006-09-11 14:34:26 UTC
The suspend/resume of fans and the resume of thermal zones also seem to be
incorrect, because, for example, on this box the resume stage of swsusp kicks in
_before_ the initialization of thermal zones and fans.  For this reason, it
seems, acpi_fan_resume() and acpi_thermal_resume() should make sure the power
resources are reset and then do more-or-less what acpi_fan_add() and
acpi_thermal_add() do, with one exception: acpi_fan_resume() is called _before_
acpi_thermal_resume().

Also I don't think acpi_fan_suspend() is really needed (doesn't work anyway on
this box due to the problems with turning on the power resources from the first
kick).
Comment 16 Len Brown 2006-09-11 14:35:13 UTC
Rafael, 
is it possible to reproduce this failure without a suspend/resume 
cycle involved?  That may complicate this issue, and it is best 
if we can remove any complications. 
 
>        if ((resource->references > 1) 
>            || (resource->state == ACPI_POWER_RESOURCE_STATE_ON)) { 
>-               ACPI_DEBUG_PRINT((ACPI_DB_INFO, "Resource [%s] already on\n", 
>-                                 resource->name)); 
>+               printk(PREFIX "Resource [%s] already on\n", resource->name); 
>                return 0; 
>        } 
> 
>+       printk(PREFIX "Trying to turn resource [%s] on\n", resource->name); 
>+ 
>        status = acpi_evaluate_object(resource->device->handle, "_ON", NULL, NULL);
>-       if (ACPI_FAILURE(status))
>+       if (ACPI_FAILURE(status)) {
>+               printk(KERN_WARNING PREFIX "Unable to turn resource [%s] on\n",
>+                       resource->name);
>+               resource->references--; 
>                return -ENODEV; 
>+       } 
 
Unfortunately this doesn't mean that _ON failed 
(_ON doesn't return any status) 
it means that acpi_evaluate_object() failed. 
Please dump out the bad status value. 
One way would be to build with CONFIG_ACPI_DEBUG=y and before the call 
to acpi_evaluate_object, enable debug, something like below, 
which may show us where the failure is. 
 
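/* Save the current ACPI debug masks so they can be restored afterwards. */
u32 old_layer = acpi_dbg_layer;
u32 old_level = acpi_dbg_level;

/* Enable all debug layers and levels around the _ON evaluation only. */
acpi_dbg_layer = 0xFFFFFFFF;
acpi_dbg_level = 0xFFFFFFFF;

status = acpi_evaluate_object(resource->device->handle, "_ON", NULL, NULL);

acpi_dbg_layer = old_layer;
acpi_dbg_level = old_level;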
 
Also, please attach the output from acpidump -- I'd like to see 
if there is anything special happening in the _ON and _OFF methods.
 
re: the logic changes in your patch. 
There are all kinds of errors in the dmesg -- it isn't clear 
where things first go bad.  I'd like to see a dmesg showing the failure 
with the additional debug output, but no changes in logic 
or error handling from the original kernel. 
 
Comment 17 Rafael J. Wysocki 2006-09-11 14:46:04 UTC
> Rafael, 
> is it possible to reproduce this failure without a suspend/resume 
> cycle involved?  That may complicate this issue, and it is best 
> if we can remove any complications. 

Yes, it is.  I'll do that tomorrow.

> Unfortunately this doesn't mean that _ON failed 
> (_ON doesn't return any status) 
> it means that acpi_evaluate_object() failed. 
> Please dump out the bad status value. 

I will.

> Also, please attach the output from acpidump -- I'd like to see 
> if there any anything special happening in the _ON and _OFF methods. 

Will go in the next comment.

> re: the logic changes in your patch. 
> There are all kinds of errors in the dmesg -- it isn't clear 
> where things first go bad.  I'd like to see a dmesg showing the failure 
> with the additional debug output, but no changes in logic 
> or error handling from the original kernel. 

I'll do my best. :-)
Comment 18 Rafael J. Wysocki 2006-09-11 14:47:44 UTC
Created attachment 8998 [details]
acpidump output from HPC nx6325
Comment 19 Rafael J. Wysocki 2006-09-12 14:44:57 UTC
Created attachment 9004 [details]
Fan and thermal debug patch

This patch adds some debug printk()s and makes some of the ACPI debug messages
more verbose.
Comment 20 Rafael J. Wysocki 2006-09-12 15:15:19 UTC
Created attachment 9005 [details]
dmesg output with the fan and debug patch applied

This dmesg output shows what the problem really is.

The first interesting part of it starts right after the ACPI processor module
configuration and reflects the thermal and fan initialization (lines 387 -
425).  It shows that the initialization code is actually correct and my Comment
#14 is wrong.  Sorry for that.

The second interesting part starts after the hda-intel message towards the end
(lines 694 - 737) and it illustrates the problem quite well.  Namely, it
corresponds to the following operations:
1) Successful transition of fan [C351] to D3 (lines 694 - 700)
2) Successful transition of fan [C351] from D3 to D0 (lines 701 - 704)
3) Successful transition of fan [C351] from D0 to D3 (705 - 708)
4) Unsuccessful transition of fan [C351] from D3 to D0 (lines 709 - 714), which
has failed because the power resource [C34D] has not actually been switched
on, although acpi_evaluate_object returned 0.  Nonetheless,
resource->references for [C34D] has been increased, so the power resource now
appears to be 'on', which is evidently wrong.
5) Transition of fan [C351] from D3 to D0 that appears to be successful, but in
fact is not (lines 715 - 718), because [C34D] only appears to be 'on' due to
resource->references increased by the previous operation, but in fact it is
'off'.  As a result, fan [C351] is now considered to be 'on' by the thermal
code, which is not true.
6) Successful transition of fan [C350] from D3 to D0 (lines 719 - 722)
7) Successful transition of fan [C350] from D0 to D3 (lines 723 - 726) [I have
no idea about what might have caused the APIC error.]
8) Unsuccessful transition of fan [C350] from D3 to D0 (lines 728 - 733), which
has failed because the power resource [C34C] has not actually been switched
on, although acpi_evaluate_object returned 0.  Again, resource->references for
[C34C] has been increased, so the power resource now appears to be 'on', which
is evidently wrong.
9) Transition of fan [C350] from D3 to D0 that appears to be successful, but in
fact is not (lines 734 - 737), because [C34C] is 'off', although it appears to
be 'on' due to resource->references increased by the previous operation.  [Now
the thermal code thinks two fans, [C351] and [C350], are 'on', but neither of
them actually is and the system goes above 70 C easily.]
Comment 21 Rafael J. Wysocki 2006-09-12 15:20:44 UTC
Created attachment 9006 [details]
Patch (on top of the previous one) that fixes the issue

This patch fixes the issue for me (to be applied on top of the debug patch from
Attachment #9004).
Comment 22 Robert Moore 2006-09-12 15:25:05 UTC
BTW, there is a potentially nasty little bug in the DSDT:

If (Local1)
{
    If (And (C174, 0x40))
    {
        Add (Not (Local1), 0x01, Local1)
        And (Local1, 0xFFFF)
    }
}

dsdt.dsl  2826:                                         And (Local1, 0xFFFF)
Warning  1104 -        Result is not used, operator has no effect ^
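
The effect of the discarded And result can be shown with a C analogue of the
ASL fragment.  This is a sketch under two assumptions: 32-bit ACPI integers
(the width actually depends on the DSDT revision), and that the intended form
was And (Local1, 0xFFFF, Local1), which only the firmware author could
confirm.  Both function names are invented for the example.

```c
#include <stdint.h>

/* What the DSDT fragment computes: the And result is thrown away. */
uint32_t negate_as_written(uint32_t local1)
{
	local1 = ~local1 + 1;       /* Add (Not (Local1), 0x01, Local1) */
	(void)(local1 & 0xFFFF);    /* And (Local1, 0xFFFF): no target operand */
	return local1;
}

/* Presumed intent (an assumption): keep only the low 16 bits. */
uint32_t negate_presumed_intent(uint32_t local1)
{
	local1 = ~local1 + 1;       /* Add (Not (Local1), 0x01, Local1) */
	local1 &= 0xFFFF;           /* And (Local1, 0xFFFF, Local1) */
	return local1;
}
```

For an input of 1 the first function yields 0xFFFFFFFF, while the second
yields 0xFFFF, so any later comparison against a 16-bit value behaves
differently in the two cases.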
Comment 23 Rafael J. Wysocki 2006-09-12 15:30:22 UTC
Created attachment 9007 [details]
Part of dmesg output with the fix patch applied

This is the relevant part of dmesg output with the patch from Attachment #9006
applied.

It shows that now an unsuccessful transition from D3 to D0 causes the operation
to be repeated instead of resulting in a fake "transition" to D0 on the next
attempt.
Comment 24 Len Brown 2006-09-12 19:33:17 UTC
> APIC error on CPU0: 00(40) 
 
This is probably unrelated to the issue at hand, 
but an APIC bus error doesn't give one a lot of confidence 
in the hardware, as it may indicate noise on the APIC bus. 
I expect that booting with "noapic" will make this go away 
and have no effect on the fan issue. 
Comment 25 Matthew Garrett 2006-09-12 19:44:01 UTC
APIC errors are common on IXP200 systems. There's a lot of reasons not to buy
IXP200 systems.
Comment 26 Rafael J. Wysocki 2006-09-13 03:26:45 UTC
Referring to Comment #24:

Hm, it's an SMP system.  I think it needs IO-APIC.
Comment 27 Rafael J. Wysocki 2006-09-13 04:23:52 UTC
Created attachment 9011 [details]
Proposed fix patch

This patch fixes the issue for me on 2.6.18-rc6-mm2.  It also makes the thermal
management work after a resume from disk.

Moreover, if my understanding is correct, resource->references in
acpi_power_on() is a number of devices that use given power resource.  Thus if
the function is to return an error, resource->references should not be
increased, because in that case the device for which it's been called will not
be considered as using the power resource.
Comment 28 Rafael J. Wysocki 2006-10-01 06:25:00 UTC
I still need the patch from Comment #27 to make the thermal management on the
HPC nx6325 work with 2.6.18-mm2.

Also, yesterday I had to use this patch to make fans work correctly on an HPC
nx6125.
Comment 29 Johan Brannlund 2006-10-09 09:11:19 UTC
2.6.18 with the patch from comment 27 does not work well on my nx6125. The fans
do not turn on when the temperature goes above 57 degrees (one of the trip
points), unless I do "acpi -t", so this is actually a regression.

My machine has a pretty early BIOS, version F.05, so I guess that could be the
problem.
Comment 30 Rafael J. Wysocki 2006-10-09 11:50:29 UTC
This is yet another issue.  You _additionally_ need to apply final patches from 
Bug #5534 to fix it.
Comment 31 Johan Brannlund 2006-10-09 22:33:54 UTC
Thanks, but even with the last two patches from bug 5534, things are still not
working perfectly. For instance, after the last resume I got a temperature
reading stuck at 57 degrees which left the fan running constantly. Trying to
rmmod the thermal module hung the rmmod process in a D state.
Comment 32 Rafael J. Wysocki 2006-10-12 14:53:54 UTC
There seems to be some black magic happening during the resume on the newer HPs 
related to ACPI vs the psmouse module.

Please try to remove psmouse before the suspend and see if that helps.
Comment 33 Johan Brannlund 2006-10-13 15:37:05 UTC
I just tried rmmod'ing psmouse before suspend and modprobing it again on resume.
Unfortunately there are still problems - this time the temperature reading got
stuck at 50 degrees after resuming, so the fans never come on.
Comment 34 Rafael J. Wysocki 2006-10-14 03:58:55 UTC
Well, could you please verify if the resume-related problem is present in 
2.6.18 with the final patches from Bug #5534 but without the patch from Comment 
#27?
Comment 35 Johan Brannlund 2006-10-15 22:12:30 UTC
I tried removing the patch from comment 27 and it didn't help, I still get stuck
temperature readings. It's possible that the fault lies with Ubuntu's suspend
scripts, though, they play some games with acpi. I tried removing the acpi
tricks, and that didn't help either, but I'm not sure I got everything.

Which method are you using for suspending?
Comment 36 Konstantin Karasyov 2006-10-16 01:49:44 UTC
Johan,

Try 'echo platform > /sys/power/disk' before suspend and then
'echo disk > /sys/power/state'

It is possible that GPEs are being blocked during suspend/resume.
Comment 37 Konstantin Karasyov 2006-10-16 04:37:44 UTC
Created attachment 9254 [details]
power suspend/resume patch

I observed another problem on my nx6125. When the system is suspended with some
fans on and resumed right after that, the fans remain on and are never switched
off again.
The reason is that the _WAK method sets the state of the power resources
associated with the fan devices to off, so when we then turn the fans on during
resume, the number of references to each power resource is increased.

Here are two patches to solve this problem.
The first one resets the number of resource references on resume and makes the
power on/off routines more strict and robust.
The other one makes the ACPI suspend handlers run before the _PTS/_GTS methods
and the ACPI resume handlers run after the _WAK method.
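
The stale reference count can be sketched with a small userspace model.  It is
an illustration only, not the attached patch: the struct, the function names,
and the simplification that _WAK just flips the hardware state to off are all
assumptions made for the example.

```c
#include <stdbool.h>

/* Hypothetical model of a power resource; not the kernel structure. */
struct power_resource {
	int references;   /* devices believed to need the resource */
	bool hw_on;       /* real hardware state */
};

/* Take a reference and switch the hardware on if it is off. */
void power_on(struct power_resource *r)
{
	r->references++;
	if (!r->hw_on)
		r->hw_on = true;
}

/* Drop a reference; switch the hardware off only when nobody is left. */
void power_off(struct power_resource *r)
{
	if (r->references > 0)
		r->references--;
	if (r->references > 0)
		return;             /* still considered in use */
	r->hw_on = false;
}

/* Stand-in for _WAK: the firmware turns the resource off behind our back. */
void firmware_wake(struct power_resource *r)
{
	r->hw_on = false;
}

/* Resume handler in the spirit of the first patch: forget old references. */
void power_resume(struct power_resource *r)
{
	r->references = 0;
}
```

Without power_resume(), the sequence power_on(), firmware_wake(), power_on(),
power_off() leaves one reference behind and the resource stays on.  With
power_resume() called after firmware_wake(), the same sequence ends with no
references and the resource off.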
Comment 38 Konstantin Karasyov 2006-10-16 04:39:00 UTC
Created attachment 9255 [details]
ACPI suspend resume patch
Comment 39 Konstantin Karasyov 2006-10-16 05:59:39 UTC
These patches are against 2.6.18-rc6
Comment 40 Rafael J. Wysocki 2006-10-16 07:19:41 UTC
Referring to Comment #35:

I use "shutdown", but my box is different.

Could you please check if the patches from Comment #37 and Comment #38 fix the 
problem for you?  If they don't, could you please open a separate bugzilla 
entry for that?

Referring to Comment #36:

This need not work, because swsusp is currently missing a call to 
pm_ops->prepare which needs fixing.

Referring to Comment #37:

The patch looks good to me and it seems the patch from Comment #27 will no 
longer be necessary if this one is applied.  I'll test it and report back.
Comment 41 Rafael J. Wysocki 2006-10-16 08:33:51 UTC
Created attachment 9263 [details]
Modified power suspend/resume patch

This is a modified version of the patch from Comment #37 that applies to
2.6.19-rc1-mm1 (and compiles).

With this patch and the patch from Comment #38 applied the thermal management
works fine on HPC 6325 after a fresh boot as well as after a resume from disk.
Comment 42 Johan Brannlund 2006-10-17 10:07:48 UTC
I applied the patches from comment 41 (with some changes by hand to make it
apply) and comment 37 to 2.6.18 and have now suspended to disk successfully
several times, with working fans.

I forgot to apply the patches from bug 5534, so I still have to do "acpi -t" by
hand to update the temperature readings, but hopefully those patches won't
interfere with the patches from this bug, I'll test this today or tomorrow.
Thanks for the help!
Comment 43 Johan Brannlund 2006-10-17 20:28:23 UTC
I've now compiled 2.6.18 with the patches from comment 38 (not 37 as I stated
before) and comment 41, along with the two patches from bug 5534.

I'm happy to report that the fans still work after resume, along with everything
else (DRI, wireless, usb, network, ...). Thanks again!

Is there any chance that the patches that fix this bug will go into 2.6.19?
Comment 44 Johan Brannlund 2006-10-18 11:15:52 UTC
Unfortunately I managed to make the temperature readings hang again. After
resume, I started a script that periodically does "acpi -t". I use the ondemand
cpufreq governor, so I started doing some processor-intensive things, cpufreq
switched to maximum speed and the temperature went: 50->51->52->58 and stayed at
58 degrees (one degree above the fan trip point), where it's now stuck and the
fans are not running.

Echoing things into /proc/acpi/fan/*/state to start the fans doesn't work either
now, so ACPI doesn't seem to be in a good state at all. As requested, I'll file
a separate bug for this.
Comment 45 Konstantin Karasyov 2006-10-20 10:01:50 UTC
*** Bug 7259 has been marked as a duplicate of this bug. ***
Comment 46 Konstantin Karasyov 2006-10-20 10:02:41 UTC
*** Bug 6978 has been marked as a duplicate of this bug. ***
Comment 47 Konstantin Karasyov 2006-10-20 10:03:27 UTC
*** Bug 7227 has been marked as a duplicate of this bug. ***
Comment 48 Konstantin Karasyov 2006-10-23 12:07:26 UTC
Created attachment 9335 [details]
power resource references implemented as a list

Here is a patch for ACPI power resources. It implements power resource
references as a list, so if two devices are using the same power resource, it
cannot be disabled by two subsequent calls from a single device.
It worked on my nx6125, but could anybody try it on another nx* system?
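
The idea of keeping references as a list can be sketched in userspace C as
follows.  This is not the attached patch: the integer device IDs, the struct
names and the use of malloc() in place of kernel allocation are assumptions
chosen to keep the example self-contained.

```c
#include <stdbool.h>
#include <stdlib.h>

/* One entry per device that currently needs the resource (hypothetical). */
struct reference {
	int device_id;
	struct reference *next;
};

struct power_resource {
	struct reference *users;   /* list of referencing devices */
	bool on;
};

/* Return the link pointing at the device's entry, or at the list's NULL end. */
static struct reference **find_slot(struct power_resource *r, int device_id)
{
	struct reference **p = &r->users;

	while (*p && (*p)->device_id != device_id)
		p = &(*p)->next;
	return p;
}

/* Add the device to the list once; repeated calls do not add more entries. */
int power_acquire(struct power_resource *r, int device_id)
{
	struct reference **slot = find_slot(r, device_id);
	struct reference *ref;

	if (*slot)
		return 0;           /* this device already holds a reference */
	ref = malloc(sizeof(*ref));
	if (!ref)
		return -1;
	ref->device_id = device_id;
	ref->next = NULL;
	*slot = ref;
	r->on = true;
	return 0;
}

/* Remove the device's entry; switch off only when the list becomes empty. */
void power_release(struct power_resource *r, int device_id)
{
	struct reference **slot = find_slot(r, device_id);
	struct reference *ref = *slot;

	if (!ref)
		return;             /* device holds no reference: nothing to drop */
	*slot = ref->next;
	free(ref);
	if (!r->users)
		r->on = false;
}
```

If devices 1 and 2 both acquire the resource and device 1 then releases it
twice, the second release finds no entry for device 1 and does nothing, so the
resource stays on until device 2 releases it as well.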
Comment 49 Rafael J. Wysocki 2006-10-23 12:16:28 UTC
Is it a replacement for one or more of the other patches or should it be 
applied on top of them?
Comment 50 Rafael J. Wysocki 2006-10-23 13:00:56 UTC
Okay, it looks like a replacement for the patch from Comment #37 and Comment 
#41.

I have one comment: It generally is not a good idea to use GFP_KERNEL in the 
_resume() routines because it may deadlock.  Please use GFP_ATOMIC instead.


Comment 51 Konstantin Karasyov 2006-10-24 00:59:25 UTC
Created attachment 9336 [details]
same as 9335, but GFP_ATOMIC is used

Here is an updated version of the patch.
Comment 52 Konstantin Karasyov 2006-10-24 07:22:49 UTC
Created attachment 9337 [details]
more safe use of GFP_ATOMIC

The irqs_disabled() function is used to check whether GFP_ATOMIC should be used.
Comment 53 Rafael J. Wysocki 2006-10-24 14:03:54 UTC
Well, the problem of GFP_KERNEL vs _resume() is that swsusp calls _resume() 
when normal memory management mechanisms are not fully functional.  
Consequently, if it happens to trigger swapping out, for example, the system 
will probably crash.

However, all of this is done with IRQs enabled, so the irqs_disabled() test is 
insufficient.  GFP_KERNEL in _resume() is unconditionally unsafe, so to speak.
Comment 54 Konstantin Karasyov 2006-10-24 23:12:09 UTC
The patch from comment #38 changes ACPI resume handlers to occur in pm_finish, 
after invocation of the _WAK method. At this point memory management should
already be OK, shouldn't it?
Comment 55 Rafael J. Wysocki 2006-10-25 00:36:41 UTC
Yes.  I didn't realize that, sorry.  In which case the patch from Comment #50 
is correct.  Sorry again.
Comment 56 Johan Brannlund 2006-10-30 19:07:50 UTC
I've now spent a few days running 2.6.18 with the patch from comment 52 and with
the two patches from bug 5534 and despite my best attempts, I have not been able
to get this kernel to freeze the temperature readings after resume. It seems
like this fix is good.

I did notice that with this kernel I cannot resume from suspend to ram, which
works well on the Ubuntu kernels, so it seems they have some patches for this
that have not yet been fed upstream. Anyway, that's a completely different story...
Comment 57 Jure Repinc 2006-10-31 04:28:52 UTC
Does Linux Kernel 2.6.19-rc4 have all these patches applied or should I apply
any specific patch before testing?
Comment 58 Johan Brannlund 2006-11-01 23:57:37 UTC
It should have the patches from bug 5534, but I don't think it has the patch
from comment 52.
Comment 59 Aleksander Trofimowicz 2006-11-10 19:34:04 UTC
I've been testing the patch from comment #52 along with the patches from
bug 5534 against the 2.6.18.1 kernel for several weeks. Generally speaking they
work fine. However, there are a few oddities worth noting:

1) After resuming from suspend-to-ram, the temperature at TZ1 rises slightly
under the same system workload compared to the pre-suspend state. On my notebook
this can be easily observed when the system is idle. The increase in temperature
leads to a change of the TZ1 state from active[3] to active[2] for a minute or
so, until it has cooled enough to return to the previous state. That in turn
once again causes a rise in temperature and the cycle repeats. The frequency
of the oscillations is constant. Such behaviour might suggest that the patch
from comment #52, or a combination of all those mentioned earlier, causes
additional runtime overhead after resuming from suspend-to-ram. I'm not sure
whether that was the intention of the patch creator.

2) When I try to shut down the system after a resume from suspend-to-ram, or try
to suspend to disk, I experience strange behaviour of my LCD matrix. After
issuing the shutdown command or receiving an ACPI event from the power button,
the system seems to lose control over the display. Instead I can see something
that I would call spontaneous polarisation of pixels. Possibly I misinterpreted
something and this is irrelevant to the patch from comment #52, so I will
welcome any feedback.

Tests were performed on HP nx6325 (Fedora Core 6).
Comment 60 Rafael J. Wysocki 2006-11-11 00:56:41 UTC
I use these patches on a regular basis on nx6325 and I haven't observed any 
strange symptoms related to them, but I don't use suspend to RAM.  Also, I 
suspend to disk using the "shutdown" mode, because the "platform" mode causes 
the states of fans and thermal zones to be incorrect after the resume.
Comment 61 Aleksander Trofimowicz 2006-11-13 20:05:01 UTC
I abandoned the 2.6.18.1 kernel in favor of the 2.6.18.2 one since I've got
unstable behaviour of SATA driver during suspend phase. I also solved LCD
problem - it was DPMS issue. 

Now things look as follows:

First, 'platform' mode. I observed the very same temperature oscillation at TZ1,
caused by - as I now know - the fact that fan 352 doesn't work at all after
resume. Interestingly, this situation won't happen, if I decide to suspend to
disk when TZ1 is in other than active[3] state, or in other words at least one
of the fans 351, 350 and 34f is active. The kernel prints a couple of error
logs, but beside that everything seems stable:

ACPI: [Power Resource - C34B] resume failed: -8
ACPI: [Power Resource - C34C] resume failed: -8
ACPI: [Power Resource - C34D] resume failed: -8
ACPI: Transitioning device [C34F] to D3
ACPI: Transitioning device [C34F] to D3
ACPI: Transitioning device [C350] to D3
ACPI: Transitioning device [C350] to D3
ACPI: Transitioning device [C351] to D3
ACPI: Transitioning device [C351] to D3
ACPI: Transitioning device [C34F] to D0
ACPI: Transitioning device [C34F] to D0
ACPI: Unable to turn cooling device [ffff810037d61cf0] 'on'
ACPI: Transitioning device [C350] to D0
ACPI: Transitioning device [C350] to D0
ACPI: Unable to turn cooling device [ffff810037d61c50] 'on'
ACPI: Transitioning device [C351] to D0
ACPI: Transitioning device [C351] to D0
ACPI: Unable to turn cooling device [ffff810037d61bd0] 'on'
ACPI: Transitioning device [C350] to D0
ACPI: Transitioning device [C350] to D0
ACPI: Unable to turn cooling device [ffff810037d61c50] 'on'
ACPI: Transitioning device [C351] to D0
ACPI: Transitioning device [C351] to D0
ACPI: Unable to turn cooling device [ffff810037d61bd0] 'on'
ACPI: Transitioning device [C351] to D0
ACPI: Transitioning device [C351] to D0
ACPI: Unable to turn cooling device [ffff810037d61bd0] 'on'



Now 'shutdown' mode. Although the kernel ring buffer doesn't contain any error
messages, this is the less stable option. I cannot suspend to disk twice in a
row: it hangs just before suspending on the second run (but after the
preparation scripts have executed). In the active[3] state the files in
/proc/acpi/fan/* indicate that only fan 352 is turned on, but judging by the
noise the physical fan generates, I would say that fans 352 and 351 are both
on. Does the fact that all those ACPI fans drive one physical device matter here?
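
The per-fan view referred to above can be collected in one pass. A minimal
sketch, assuming each fan directory under /proc/acpi/fan exposes a `state`
file; the function name `fan_states` and its optional root argument are only
illustrative and not part of any existing tool:

```shell
# Print "<fan>: <contents of its state file>" for every ACPI fan.
# The optional first argument overrides the directory to scan
# (assumption: the default layout is /proc/acpi/fan/<name>/state).
fan_states() {
    root=${1:-/proc/acpi/fan}
    for f in "$root"/*/state; do
        # skip the unexpanded pattern when no fan directory exists
        [ -r "$f" ] || continue
        name=${f%/state}      # strip the trailing "/state"
        name=${name##*/}      # keep only the fan directory name
        # squeeze the padding spaces so the line is easier to read
        printf '%s: %s\n' "$name" "$(tr -s ' ' < "$f")"
    done
}

fan_states
```

Running it before the suspend and again after the resume makes it easier to
compare what ACPI reports with what the physical fan is audibly doing.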

All those remarks apply to the 2.6.18.2 kernel with patches from comment #38,
comment #50, bug #5534 and a few swswap patches from mm tree bundled with 
Fedora Core 6 2.6.18-1.2849 kernel source package.  
Comment 62 Len Brown 2006-11-13 23:27:46 UTC
patch in comment #52 applied to acpi-test  
Comment 63 Rafael J. Wysocki 2006-11-14 10:55:57 UTC
And what about the patch from Comment #38?  AFAICT it's also needed ...
Comment 64 Konstantin Karasyov 2006-11-20 02:33:31 UTC
Len, please, pull the patch from comment #38. It is also required.
Comment 65 Alistair Strachan 2006-12-04 14:57:43 UTC
Hi, I have the same fan problem after S3 suspend on my HP NC6000 notebook, on 
2.6.19. I have no DSDT errors and no errors are printed to dmesg. I cannot 
transition fans after resuming.

However, if I apply the patch from comment #38, everything magically starts 
working again (like 2.6.17).

Please apply.
Comment 66 otto meijer 2006-12-06 12:58:20 UTC
Dear Linux user or developer,

I do not know any coding and I am not an expert at all. I tried kernel 
2.6.19; as with 2.6.18, the fan doesn't work anymore after a suspend to RAM. 
After a resume from suspend to disk the fan works fine.

I found a simple solution to the problem:

Before compiling, comment out these 4 lines of the fan module:

/*static int acpi_fan_suspend(struct acpi_device *device, int state);
* static int acpi_fan_resume(struct acpi_device *device, int state);
*/

and:

/*		.suspend = acpi_fan_suspend,
*		.resume = acpi_fan_resume,
*/

Things will work properly then.

I hope someone will resolve this bug. I am prepared to test a possible 
solution, but again, I am a Linux user and not an expert.

Kind regards,

Otto Meijer

Comment 67 Konstantin Karasyov 2006-12-08 09:28:07 UTC
The patch that should resolve the fan_resume problem is available from here 
(Bug #7570):
http://bugzilla.kernel.org/attachment.cgi?id=9757&action=view

It fixed the issue on my nx6125; could anybody try it on other boxes?
Comment 68 Edgar Villanueva 2006-12-13 20:51:35 UTC
Tried on nx6325 and it seems to work for me.
Comment 69 Marian Klein 2007-01-07 13:18:52 UTC
Created attachment 10022 [details]
the same patch for kernel  2.6.20-rc3 

Original patch in comment #52 is against 2.6.18 kernel, 
this one cleanly patches 2.6.19 up to 2.6.20-rc3 kernels.
(Tested on 2.6.20-rc3)
Comment 70 Johan Brannlund 2007-01-17 20:15:05 UTC
Unfortunately the patch from comment 67 doesn't seem to work for me. I patched
2.6.20-rc3 with that patch and with the patch from comment 186 in bug 5534. The
patch from comment 67 didn't apply cleanly for some reason so I had to fix it by
hand.

Before suspend to disk, the fans work as they should but after resume the
temperature readings are stuck and the fans do not come on. This is with an
nx6125 with BIOS F.11.
Comment 71 Manuel P 2007-01-18 15:12:08 UTC
I patched 2.6.18.5 with the patches from comments 38 and 52.
Notebook: nx6325
At first, ACPI seemed to work. But after some time the fan started blowing harder and I wondered why,
so I typed acpi -V to get more information.
I saw 106
Comment 72 Rafael J. Wysocki 2007-01-18 15:22:33 UTC
Actually, I use these two patches with the 2.6.20-rc5 kernel on nx6325 and they 
work for me just fine, but _additionally_ I need to use one patch from
Bug #5534.

All of the patches that I use are available from 
http://www.sisk.pl/kernel/patches/2.6.20-rc5/
Comment 73 Rafael J. Wysocki 2007-01-19 08:00:43 UTC
Referring to Comment #72

Ah, I tend to forget about one important thing.  I have to remove the psmouse
module before each suspend to disk (to prevent some strange things happening)
and I use the "shutdown" suspend mode rather than "platform".
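
The routine described here can be written down as a short script. A sketch
only, assuming the kernel exposes /sys/power/disk and /sys/power/state and
that psmouse is built as a module. `hibernate_steps` is an illustrative name,
reloading psmouse after the resume is an addition not mentioned above, and
the function merely prints the commands so they can be reviewed first:

```shell
# Print the commands for a suspend to disk in "shutdown" mode.
# Nothing is executed here; the steps are only listed in order.
hibernate_steps() {
    # 1. unload psmouse so it cannot interfere with the suspend
    echo 'modprobe -r psmouse'
    # 2. select the "shutdown" suspend mode rather than "platform"
    echo 'echo shutdown > /sys/power/disk'
    # 3. start the suspend to disk
    echo 'echo disk > /sys/power/state'
    # 4. (assumption) load psmouse again once the system has resumed
    echo 'modprobe psmouse'
}

hibernate_steps
```

To execute the steps instead of printing them, the output can be piped into a
root shell, for example `hibernate_steps | sh -e`.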
Comment 74 Johan Brannlund 2007-01-19 18:22:30 UTC
If I patch 2.6.20-rc5 with all the patches Rafael had collected, I once again
have working fans both before and after resume. Thanks, Rafael!
Comment 75 Edgar Villanueva 2007-01-22 18:10:55 UTC
Tried 2.6.20-rc5 with Rafael's patches and can say that hibernate through KDE
works.  Suspend still has issues, and like comment #71 I can confirm that
the machine thinks it's overheating.  I'm pretty sure that the machine wasn't
hot; I think the temperature reading is incorrect. Is it possible that
the temperature is read incorrectly when trying to suspend?
Comment 76 Rafael J. Wysocki 2007-01-23 03:46:07 UTC
Referring to Comment #75:

Yes, there is.

Do you unload psmouse before the suspend?

BTW, in my previous posts "suspend" means "hibernate".  Frankly, I haven't used 
the STR on my box yet.
Comment 77 Johan Brannlund 2007-01-23 08:43:09 UTC
I noticed another effect with 2.6.20-rc5 with Rafael's patches: often (but not
always) the fan keeps running at a low speed and never turns off. If this
happens, the only way I've found to turn it off is to run something very
cpu-intensive that brings the temperature up above 57 degrees (the first trip
point). If I do that, the fan keeps running until the computer cools down to 50
degrees and then the fan turns off.

This is on an nx6125 running the latest BIOS F.11. I do *not* have the "Fan
always on when on AC" BIOS option turned on.
Comment 78 Edgar Villanueva 2007-01-24 08:22:05 UTC
In reply to comment 76.

No I didn't unload the mouse. I will give that a shot soon and post results.
For reference.
nx6325
Using Fedora Core 6 x86_64 2.6.20-rc5 with Rafael's patches.
PSMouse as module
Comment 79 Edgar Villanueva 2007-01-27 12:46:40 UTC
Augmenting comment #78
After removing psmouse, I'm pretty happy with the results. Both suspend and
resume  seem to be working fairly well.
Question: Which bios should we be using?  I'm using F.02 without DSDT update and
don't *seem* to be having any problems.
Rafael, your link to the patches is invaluable; that's what finally did it for me.
user.c didn't patch cleanly, but that was easy enough to fix.

Specifications for Reference:
nx6325 Turion X2 60 1GB 
BIOS F.02 not patched with DSDT.
kernel 2.6.20-rc5 with Rafael's patches above.

Notes:
Suspend doesn't seem to work with Binary ATI driver when using Dual Monitors.
Nothing to do with this bug.
Comment 80 Manuel P 2007-01-31 21:16:08 UTC
Thanks to Rafael and Comment 72, I now have a working notebook.
ACPI seems to work very well (I have been using the 2.6.20-rc5 kernel for two
weeks now). Hibernation doesn't work, with or without the psmouse module loaded.

Same problems as described above with the psmouse module (slow boot at the BIOS
and slow Linux boot when I did not remove the module).
Comment 81 Rafael J. Wysocki 2007-02-05 10:33:04 UTC
After upgrading from SuSE 10.1 to OpenSUSE 10.2, I've noticed that the patch
from Bug #5534 that I used
(http://bugzilla.kernel.org/attachment.cgi?id=9746&action=view, or
ACPI-notify-revised.patch in "my" series) is no longer needed.

The new series that I use against 2.6.20 is available at
sftp://ogre.sisk.pl:31337/home/rafael/kernel/patches/2.6.20
 
Comment 82 Edgar Villanueva 2007-02-05 19:15:54 UTC
Rafael,

Is there something special in suse 10.2 outside of the kernel that makes things
work? I'm on fedora and trying to figure out how to get there.
Comment 83 Thomas Renninger 2007-02-06 09:58:08 UTC
In the (latest! I doubt this has already hit the update process) SUSE 10.2 kernel the 
psmouse things are fixed (there psmouse is built in, so serio drivers get 
unloaded on shutdown).
And the patch from #5534 comment #180, which Rafael was referring to in his last 
comment, is in.
Comment 84 Rafael J. Wysocki 2007-02-06 11:20:32 UTC
First, I'd like to apologize to everyone for the premature conclusion in Comment
#81.  The patch from
http://bugzilla.kernel.org/attachment.cgi?id=9746&action=view actually _is_
needed  to make thermal management work regardless of the suspend-related
issues.  Also the link to my patch series in Comment #81 was wrong, sorry.

I have uploaded my current series of patches against 2.6.20 into
http://www.sisk.pl/kernel/patches/2.6.20/ (please note that two new patches from
Bug #7887 are included in it).


Referring to Comment #82:

> Is there something special in suse 10.2 outside of the kernel that makes
> things work?

No, I don't think so.  The final patch from Bug #5534 should be sufficient to
make thermal management work, but for the suspend some other patches are
necessary.  Please try to apply the series of patches from
http://www.sisk.pl/kernel/patches/2.6.20/ and make sure you unload the psmouse
module before the suspend.

I haven't tried to suspend to RAM yet, but the suspend to disk should work with
Fedora too, at least I don't see any fundamental obstacles.  If you have any
problems, please contact me directly (rjw@sisk.pl).
Comment 85 Edgar Villanueva 2007-02-11 18:40:22 UTC
Status Update
2.6.20 with Rafael's patches seems to be working well. I have been testing them
with suspend to RAM using swsup, Fedora Core 6 and the ATI binary driver.
I'm currently using kpowersave and the integrated power save scripts,
modified with tidbits here and there.  Roughly 3 days of testing so far.

I have some acpi suspend scripts as well as a kpowersave configuration if anyone
is interested.
Comment 86 Len Brown 2007-02-15 22:50:08 UTC
refreshed comment #52 patch in comment #69 applied to acpi-test
Comment 87 Len Brown 2007-02-15 22:52:37 UTC
*** Bug 7194 has been marked as a duplicate of this bug. ***
Comment 88 Mircea Bardac 2007-02-19 17:56:58 UTC
Hi everybody,

I will be the (unfortunate) owner of one nx6325 (don't think I can change this).
Imagine me, a newcomer to the Linux-on-notebook arena, going through everything that
has been written on every forum/wiki, be it HP, Gentoo Wiki, Fedora related,
SuSE related etc. The information is scattered everywhere and it doesn't seem to
reflect the latest changes in the BIOS/kernel versions.

Could somebody here point me to some webpage or do a wrap-up of what is
currently (not) working and how?

I am interested in some information useful with
* the latest BIOS version (currently F.06), nothing else
* the latest kernel (currently 2.6.20), nothing else
* nothing distribution specific

on the following problems:
1. thermal runaway
2. thermal information not updating
3. 'bad state' problem
4. anything else I might have missed related to ACPI, kernel etc. (but not
related to sound, wifi, bluetooth, modem, video)

I am interested to know the latest & best solutions for each problem, because I
think I've seen some problems solved in too many ways.

I would have done this wrap-up myself if I could have participated in the
process of fixing things up. Unfortunately, I am faced with all the facts after
they had happened and I have to guess which things I need to do to get
everything working (not getting damaged).

Any help is appreciated.

Lots of thanks,
Mircea
Comment 89 Alexey Starikovskiy 2007-02-19 22:40:09 UTC
Most problems are fixed by patches in #5534.
Problems with suspend/resume are fixed here (you need 5534 fixes already 
applied).
Comment 90 Mircea Bardac 2007-02-20 05:20:29 UTC
Thanks for your rapid response. I'm a bit puzzled by the number of patches in
both bug reports but I just remembered the git repository.

http://www.kernel.org/pub/linux/kernel/v2.6/snapshots/patch-2.6.20-git15.log
mentions that some patches around here about nx6325 have been committed. I
assume that all the patches you were referring to in the previous comment are
already committed - please correct me if I'm wrong.

You said "most of the problems". Is there anything else not fixed I should look for?

Lots of thanks for your work.
Comment 91 Thomas Renninger 2007-02-20 06:30:08 UTC
You should also include these:
http://bugzilla.kernel.org/show_bug.cgi?id=7689
and make sure the thermal polling value is not too high, e.g. set it to 15 secs with:
echo 15 >/proc/acpi/thermal_zone/*/polling_frequency
This should fix the temperature getting stuck sometimes.
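
With more than one thermal zone (this box has at least TZ1 and TZ2) the
wildcard in the redirect target matches several files, which bash rejects as
an ambiguous redirect, so the value has to be written zone by zone. A minimal
sketch, assuming the /proc/acpi/thermal_zone layout used above; `set_polling`
and its optional root argument are illustrative names:

```shell
# Write the same polling value (in seconds) to every thermal zone.
# The optional second argument overrides the directory to scan.
set_polling() {
    interval=$1
    root=${2:-/proc/acpi/thermal_zone}
    for f in "$root"/*/polling_frequency; do
        # skip the unexpanded pattern when no thermal zone exists
        [ -e "$f" ] || continue
        echo "$interval" > "$f"
    done
}

# Example (as root): poll every thermal zone every 15 seconds
# set_polling 15
```

The loop gives each zone its own redirect, so it behaves the same with one
zone or several.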
Comment 92 Mircea Bardac 2007-02-20 07:43:26 UTC
According to the above mentioned kernel git log, patches from #7689 are also
merged in.

Commits:
a1cec06177386ecc320af643de11cfa77e8945bd
82dd9eff4bf3b17f5f511ae931a1f350c36ca9eb

Not sure if patch in comment #65 is in yet.

--
Also, should I understand from the above comment that there are still problems
with thermal polling*? I mean, does a normal laptop user have to adjust the
polling frequency manually? (I have to mention that I haven't worked with Linux
on laptops before)

* is there another bug on this matter I should follow?

Many thanks.
Comment 93 Len Brown 2007-02-20 22:34:06 UTC
refreshed comment #52 patch in comment #69 shipped in Linux-2.6.21-rc1.
Closed
Comment 94 Tommi Kyntola 2007-02-28 12:56:16 UTC
Guys, am I missing something or shouldn't the fan set the power state to ACPI_D3
rather than ACPI_D0 in suspend?

Changing that as I described in http://lkml.org/lkml/2007/2/28/71 fixed my fan
on HP nw8000 after S3 suspends.

You're obviously experiencing some other issues here as well, but I hope this is
worth pointing out here, too (because at least this is a bug thread that people
with fan problems after suspend are likely to google, and for some this just
might be the fix they're looking for).
Comment 95 Konstantin Karasyov 2007-03-01 04:18:09 UTC
> Guys, am I missing something or shouldn't the fan set the power state to ACPI_D3
> rather than ACPI_D0 in suspend?

It is safer to set the fan to D0 on suspend because in this case we can be sure 
that the system will not overheat.
Comment 96 Tommi Kyntola 2007-03-01 05:19:06 UTC
Konstantin, ok.

Is the fan really supposed to stay functional while suspended?
It doesn't do that on my machine. Once it goes down the fan's dead, with _D3 or
_D0. The only difference is that having set it to D3 when suspending allows it
to wake up when resuming.

But I do get your point, that if some laptops allow it to be left on it's
probably better that way in case for example the battery should start to overheat.

However, on my laptop that leaves the fan totally dead after a suspend-to-memory
cycle, but I guess there's something else wrong then and I'll try to figure out
another way to fix it.

And I should probably try that patch of yours that just crept into the mainline
and see whether that has any effect; I initially assumed that it was already in
2.6.20.

I also accidentally posted about this to linux-acpi (a dupe, no less), so sorry
about that if the change is deemed bad.
Comment 97 Rafael J. Wysocki 2007-03-01 06:43:00 UTC
Please try 2.6.21-rc2.  It should contain all of the required fixes AFAICT.
Comment 98 Tommi Kyntola 2007-03-01 08:01:54 UTC
Does not compute. It won't wake up from suspend at all. I'll bisect that
tomorrow to see what broke it. In the mean time if there's more information
you'd like (lspci, .config, whatever) don't hesitate to ask.
Comment 99 Konstantin Karasyov 2007-03-04 12:17:43 UTC
> But I do get your point, that if some laptops allow it to be left on it's
> probably better that way in case for example the battery should start to overheat.

Actually, I meant the situation where the system could go down on a thermal 
(possibly h/w) shutdown during suspend. If the fans stay on during suspend, 
when thermal control may already have stopped working, a thermal shutdown is less 
likely to occur.