Bug 7122
Description
Rafael J. Wysocki
2006-09-08 04:05:26 UTC
It actually tries to turn the 352 fan on but it can't. I've got the following messages in dmesg:
ACPI: Transitioning device [C352] to D0
ACPI: Transitioning device [C352] to D0
ACPI: Unable to turn cooling device [ffff810037a3fb90] 'on'
APIC error on CPU0: 00(40)
If I reboot the machine after a resume from disk, the thermal management still doesn't work properly and the symptoms are like right after the resume (eg. problems with turning fans on, overheating).
> If I reboot the machine after a resume from disk, the thermal management still
> doesn't work properly and the symptoms are like right after the resume (eg.
> problems with turning fans on, overheating).
If reboot does not fix it... then it looks like a hardware problem to me. We
may be able to work around it somehow, but reboot should definitely
fix it.
Hm, can ACPI NVS survive across reboots? I've seen similar ACPI problems after reboots on my nx6125, so I don't think it's a hardware problem.

Well, the box seems to have thermal management problems independent of the suspend. For example, I often get messages like these:
ACPI: Transitioning device [C351] to D0
ACPI: Transitioning device [C351] to D0
ACPI: Unable to turn cooling device [ffff810037a32570] 'on'
after fresh boots (on 2.6.18-rc6-mm1 as well as on -rc5-mm1).

I'm starting to see a pattern. If the last two fans (352 and 351) are on right before the suspend, the thermal management seems to work (sort of) after the resume. However, if only the 352 fan is on before the suspend, then usually after the resume two fans are on, but only one of them (352) is reported to be 'on' by the kernel. As a result, if the temperature falls below the next tripping point and the kernel turns one of the fans off, it thinks that is the last one (352), but actually it is the 351. Then, the kernel reports all of the fans 'off', but one of them (352) is spinning, so when the kernel wants to turn it on (after the temperature rises), it can't and the operation fails.

Also it looks like the kernel sometimes thinks it should turn on the fan 351 when in fact the fan 350 should be switched.

Comment #2 is apparently wrong, because I confused the symptoms related to Comment #6 with the swsusp-related problem. The swsusp-related problem appears to be that if the actual configuration of the fans during resume doesn't match the one from before the suspend, the thermal management after the resume is busted.

Created attachment 8978 [details]
swsusp debug patch
I can break the thermal management by using the attached patch and doing
'echo test > /sys/power/disk; echo disk > /sys/power/state'.
After the test the system behaves exactly like after a resume from disk (ie.
the fans seem to be out of control).
Intel people, please help...

Rafael, what happens if you rmmod thermal, then insmod it after resume?

Well, nothing interesting. The symptoms are a bit different, but it's still broken.

Anyway, I think I have figured out what to do to make it work. :-)

Namely, the problem is not with the fans themselves, but with the corresponding power resources. There are four power resources C34B - C34E that correspond to the fans C34F - C352, respectively. For each fan to work, its corresponding power resource needs to be switched to 'on'. However, the power resources usually become 'on' only after several attempts to switch them. Moreover, each time acpi_evaluate_object(resource->device->handle, "_ON", NULL, NULL) is successful, even if the resource "refuses" to change its state.

Gently speaking, the code in drivers/acpi/power.c is not prepared to cope with that. For example, to switch fan C351 on, we need to turn on the power resource C34D, which is done via acpi_power_on(). This function increases resource->references to reflect the fact that the resource is needed for one more device. However, if the attempt to turn the resource 'on' actually fails, resource->references is not decreased. Moreover, if resource->references is greater than one, the function doesn't even try to actually switch the resource, because it assumes it has been turned 'on' already. This way, after two unsuccessful attempts to turn the resource 'on' we're toast (ie. resource->references is greater than one and there won't be more attempts to turn it 'on' whatsoever). As a result, the fan is reported to be 'on', which is plain wrong and dangerous.

The solution is to make acpi_power_on() and acpi_power_off_device() manage resource->references so that _unsuccessful_ operations are not counted.

Still, although this also seems to help solve the suspend-resume related problem, the initialization, suspend and resume of thermal zones and fans seems to be logically wrong (I'll get back to this later).
Created attachment 8995 [details]
Proof-of-concept and debug patch
This patch makes acpi_power_on() and acpi_power_off_device() manage
resource->references in such a way that unsuccessful operations are not
counted.
It also makes the kernel printk() quite a lot of debugging information and
makes the initialization, resume and suspend of fans behave in a bit more
friendly way.
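The counting problem described above can be illustrated outside the kernel. What follows is only a user-space sketch, not the code from the attachment: struct model_resource, model_eval_on(), power_on_buggy() and power_on_fixed() are hypothetical names invented for this illustration, and the "firmware" is modelled as simply ignoring a given number of _ON requests while reporting no error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; these are NOT kernel types or functions. */
struct model_resource {
	int references;  /* devices believed to need the resource */
	bool hw_on;      /* what the hardware is really doing */
	int ignored_on;  /* how many _ON requests will be silently ignored */
};

/* Stand-in for evaluating _ON: reports no error, but may change nothing. */
static void model_eval_on(struct model_resource *r)
{
	assert(r != NULL);
	if (r->ignored_on > 0)
		r->ignored_on--;
	else
		r->hw_on = true;
}

/* Counting as described in the thread: the reference is taken up front. */
static int power_on_buggy(struct model_resource *r)
{
	r->references++;
	if (r->references > 1 || r->hw_on)
		return 0;          /* assumed to be on already */
	model_eval_on(r);
	return r->hw_on ? 0 : -1;  /* on failure the reference leaks */
}

/* Proposed behaviour: unsuccessful operations are not counted. */
static int power_on_fixed(struct model_resource *r)
{
	if (!r->hw_on) {
		model_eval_on(r);
		if (!r->hw_on)
			return -1; /* nothing was counted */
	}
	r->references++;
	return 0;
}
```

In this model, a resource that ignores its first _ON request makes the second call to power_on_buggy() return success while the hardware is still off, whereas power_on_fixed() simply retries the switch on the second call.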
Created attachment 8996 [details]
dmesg output from full boot-suspend-resume cycle with the patch
This is a dmesg output from a full boot-suspend-resume cycle of a kernel with
Attachment 8995 applied.
It clearly shows that sometimes several attempts are necessary to turn on a
power resource corresponding to a fan.
The thermal zones and fans initialization code seems to be incorrect, because it may cause the same power resource to be acquired twice in a row for the same purpose.

For example on this box the thermal zones initialization is carried out before the initialization of fans. However, during the thermal zones initialization some fans may be turned on due to the thermal requirements. On this box this generally leads to some power resources being turned on. For example, if the thermal code decides it should turn fan C352 on, it will attempt to do so and power resource C34E will be turned on. Next, the initialization of fans will attempt to unconditionally turn fan C352 on again and that will lead to resource->references in C34E being increased for the second time for the same purpose (ie. keeping fan C352 on). From the logical point of view it's as though fan C352 had two references on the power resource C34E, which is incorrect, because it suggests there are two devices that need this power resource to stay on, but in fact there's only one.

My understanding is that we may want to make device->power_state reflect the actual state of the device (eg. fan), but IMHO for this purpose we should use a simplified version of acpi_power_transition() that calls a function which doesn't change resource->references instead of acpi_power_on(), if device->flags.force_power_state is set in acpi_bus_set_power().

The suspend/resume of fans and the resume of thermal zones also seem to be incorrect, because, for example, on this box the resume stage of swsusp kicks in _before_ the initialization of thermal zones and fans. For this reason, it seems, acpi_fan_resume() and acpi_thermal_resume() should make sure the power resources are reset and then do more-or-less what acpi_fan_add() and acpi_thermal_add() do, with one exception: acpi_fan_resume() is called _before_ acpi_thermal_resume().
Also I don't think acpi_fan_suspend() is really needed (doesn't work anyway on this box due to the problems with turning on the power resources from the first kick).

Rafael,
is it possible to reproduce this failure without a suspend/resume
cycle involved? That may complicate this issue, and it is best
if we can remove any complications.

>  if ((resource->references > 1)
>      || (resource->state == ACPI_POWER_RESOURCE_STATE_ON)) {
> -        ACPI_DEBUG_PRINT((ACPI_DB_INFO, "Resource [%s] already on\n",
> -                          resource->name));
> +        printk(PREFIX "Resource [%s] already on\n", resource->name);
>          return 0;
>  }
>
> + printk(PREFIX "Trying to turn resource [%s] on\n", resource->name);
> +
>  status = acpi_evaluate_object(resource->device->handle, "_ON", NULL, NULL);
> - if (ACPI_FAILURE(status))
> + if (ACPI_FAILURE(status)) {
> +         printk(KERN_WARNING PREFIX "Unable to turn resource [%s] on\n",
> +                resource->name);
> +         resource->references--;
>          return -ENODEV;
> + }

Unfortunately this doesn't mean that _ON failed
(_ON doesn't return any status)
it means that acpi_evaluate_object() failed.
Please dump out the bad status value.

One way would be to build with CONFIG_ACPI_DEBUG=y
and before the call to acpi_evaluate_object, enable debug,
something like below, which may show us where the failure is.

int old_layer = acpi_dbg_layer;
int old_level = acpi_dbg_level;
acpi_dbg_layer = 0xFFFFFFFF;
acpi_dbg_level = 0xFFFFFFFF;
acpi_evaluate_object()...
acpi_dbg_layer = old_layer;
acpi_dbg_level = old_level;

Also, please attach the output from acpidump -- I'd like to see
if there is anything special happening in the _ON and _OFF methods.

re: the logic changes in your patch.
There are all kinds of errors in the dmesg -- it isn't clear
where things first go bad. I'd like to see a dmesg showing the failure
with the additional debug output, but no changes in logic
or error handling from the original kernel.

> Rafael,
> is it possible to reproduce this failure without a suspend/resume
> cycle involved? That may complicate this issue, and it is best
> if we can remove any complications.
Yes, it is. I'll do that tomorrow.
> Unfortunately this doesn't mean that _ON failed
> (_ON doesn't return any status)
> it means that acpi_evaluate_object() failed.
> Please dump out the bad status value.
I will.
> Also, please attach the output from acpidump -- I'd like to see
> if there is anything special happening in the _ON and _OFF methods.
Will go in the next comment.
> re: the logic changes in your patch.
> There are all kinds of errors in the dmesg -- it isn't clear
> where things first go bad. I'd like to see a dmesg showing the failure
> with the additional debug output, but no changes in logic
> or error handling from the original kernel.
I'll do my best. :-)

Created attachment 8998 [details]
acpidump output from HPC nx6325
Created attachment 9004 [details]
Fan and thermal debug patch
This patch adds some debug printk()s and makes some of the ACPI debug messages
more verbose.
Created attachment 9005 [details]
dmesg output with the fan and debug patch applied
This dmesg output shows what the problem really is.
The first interesting part of it starts right after the ACPI processor module configuration and reflects the thermal and fan initialization (lines 387 - 425). It shows that the initialization code is actually correct and my Comment #14 is wrong. Sorry for that.
The second interesting part starts after the hda-intel message towards the end (lines 694 - 737) and it illustrates the problem quite well. Namely, it corresponds to the following operations:
1) Successful transition of fan [C351] to D3 (lines 694 - 700)
2) Successful transition of fan [C351] from D3 to D0 (lines 701 - 704)
3) Successful transition of fan [C351] from D0 to D3 (lines 705 - 708)
4) Unsuccessful transition of fan [C351] from D3 to D0 (lines 709 - 714), which has failed because the power resource [C34D] has not actually been switched on, although acpi_evaluate_object returned 0. Nonetheless, resource->references for [C34D] has been increased, so the power resource now appears to be 'on', which is evidently wrong.
5) Transition of fan [C351] from D3 to D0 that appears to be successful, but in fact is not (lines 715 - 718), because [C34D] only appears to be 'on' due to resource->references increased by the previous operation, but in fact it is 'off'. As a result, fan [C351] is now considered to be 'on' by the thermal code, which is not true.
6) Successful transition of fan [C350] from D3 to D0 (lines 719 - 722)
7) Successful transition of fan [C350] from D0 to D3 (lines 723 - 726)
[I have no idea about what might have caused the APIC error.]
8) Unsuccessful transition of fan [C350] from D3 to D0 (lines 728 - 733), which has failed because the power resource [C34C] has not actually been switched on, although acpi_evaluate_object returned 0. Again, resource->references for [C34C] has been increased, so the power resource now appears to be 'on', which is evidently wrong.
9) Transition of fan [C350] from D3 to D0 that appears to be successful, but in fact is not (lines 734 - 737), because [C34C] is 'off', although it appears to be 'on' due to resource->references increased by the previous operation.
[Now the thermal code thinks two fans, [C351] and [C350], are 'on', but none of them actually is and the system goes above 70 C easily.]

Created attachment 9006 [details]
Patch (on top of the previous one) that fixes the issue
This patch fixes the issue for me (to be applied on top of the debug patch from
Attachment #9004).
BTW, there is a potentially nasty little bug in the DSDT:

If (Local1)
{
    If (And (C174, 0x40))
    {
        Add (Not (Local1), 0x01, Local1)
        And (Local1, 0xFFFF)
    }
}

dsdt.dsl 2826: And (Local1, 0xFFFF)
Warning 1104 - Result is not used, operator has no effect ^

Created attachment 9007 [details]
Part of dmesg output with the fix patch applied
This is the relevant part of dmesg output with the patch from Attachment #9006 [details] applied. It shows that now an unsuccessful transition from D3 to D0 causes the operation to be repeated instead of resulting in a fake "transition" to D0 on the next attempt.

> APIC error on CPU0: 00(40)
This is probably unrelated to the issue at hand,
but an APIC bus error doesn't give one a lot of confidence
in the hardware, as it may indicate noise on the APIC bus.
I expect that booting with "noapic" will make this go away
and have no effect on the fan issue.
APIC errors are common on IXP200 systems. There's a lot of reasons not to buy IXP200 systems.

Referring to Comment #24: Hm, it's an SMP system. I think it needs IO-APIC.

Created attachment 9011 [details]
Proposed fix patch
This patch fixes the issue for me on 2.6.18-rc6-mm2. It also makes the thermal
management work after a resume from disk.
Moreover, if my understanding is correct, resource->references in
acpi_power_on() is a number of devices that use given power resource. Thus if
the function is to return an error, resource->references should not be
increased, because in that case the device for which it's been called will not
be considered as using the power resource.
I still need the patch from Comment #27 to make the thermal management on the HPC nx6325 work with 2.6.18-mm2. Also, yesterday I had to use this patch to make fans work correctly on an HPC nx6125.

2.6.18 with the patch from comment 27 does not work well on my nx6125. The fans do not turn on when the temperature goes above 57 degrees (one of the trip points), unless I do "acpi -t", so this is actually a regression. My machine has a pretty early BIOS, version F.05, so I guess that could be the problem.

This is yet another issue. You _additionally_ need to apply the final patches from Bug #5534 to fix it.

Thanks, but even with the last two patches from bug 5534, things are still not working perfectly. For instance, after the last resume I got a temperature reading stuck at 57 degrees which left the fan running constantly. Trying to rmmod the thermal module hung the rmmod process in a D state.

There seems to be some black magic happening during the resume on the newer HPs related to ACPI vs the psmouse module. Please try to remove psmouse before the suspend and see if that helps.

I just tried rmmod'ing psmouse before suspend and modprobing it again on resume. Unfortunately there are still problems - this time the temperature reading got stuck at 50 degrees after resuming, so the fans never come on.

Well, could you please verify if the resume-related problem is present in 2.6.18 with the final patches from Bug #5534 but without the patch from Comment #27?

I tried removing the patch from comment 27 and it didn't help, I still get stuck temperature readings. It's possible that the fault lies with Ubuntu's suspend scripts, though, they play some games with acpi. I tried removing the acpi tricks, and that didn't help either, but I'm not sure I got everything.

Which method are you using for suspending?

Johan,
Try 'echo platform > /sys/power/disk' before suspend and then 'echo disk > /sys/power/state'
It is possible that GPEs are being blocked during suspend/resume.
Created attachment 9254 [details]
power suspend/resume patch
I observed another problem on my nx6125. When the system is suspended with
some fans on and then resumed right after that, the fans remain on and are
never switched off afterwards.
The reason is that the _WAK method sets the state of the power resources
associated with the fan devices to off, so on resume we turn the fans on
again, increasing the number of references to the power resource.
Here are two patches to solve this problem.
The first one resets the number of resource references on resume and makes the
power on/off routines more strict and robust.
The other one makes the ACPI suspend handlers run before the _PTS/_GTS methods
and the ACPI resume handlers run after the _WAK method.
Created attachment 9255 [details]
ACPI suspend resume patch
These patches are against 2.6.18-rc6

Referring to Comment #35: I use "shutdown", but my box is different. Could you please check if the patches from Comment #37 and Comment #38 fix the problem for you? If they don't, could you please open a separate bugzilla entry for that?
Referring to Comment #36: This need not work, because swsusp is currently missing a call to pm_ops->prepare which needs fixing.
Referring to Comment #37: The patch looks good to me and it seems the patch from Comment #27 will no longer be necessary if this one is applied. I'll test it and report back.

Created attachment 9263 [details]
Modified power suspend/resume patch
This is a modified version of the patch from Comment #37 that applies to 2.6.19-rc1-mm1 (and compiles). With this patch and the patch from Comment #38 applied the thermal management works fine on HPC 6325 after a fresh boot as well as after a resume from disk.

I applied the patches from comment 41 (with some changes by hand to make it apply) and comment 37 to 2.6.18 and have now suspended to disk successfully several times, with working fans. I forgot to apply the patches from bug 5534, so I still have to do "acpi -t" by hand to update the temperature readings, but hopefully those patches won't interfere with the patches from this bug, I'll test this today or tomorrow. Thanks for the help!

I've now compiled 2.6.18 with the patches from comment 38 (not 37 as I stated before) and comment 41, along with the two patches from bug 5534. I'm happy to report that the fans still work after resume, along with everything else (DRI, wireless, usb, network, ...). Thanks again!

Is there any chance that the patches that fix this bug will go into 2.6.19?

Unfortunately I managed to make the temperature readings hang again. After resume, I started a script that periodically does "acpi -t". I use the ondemand cpufreq governor, so I started doing some processor-intensive things, cpufreq switched to maximum speed and the temperature went: 50->51->52->58 and stayed at 58 degrees (one degree above the fan trip point), where it's now stuck and the fans are not running. Echoing things into /proc/acpi/fan/*/state to start the fans doesn't work either now, so ACPI doesn't seem to be in a good state at all. As requested, I'll file a separate bug for this.

*** Bug 7259 has been marked as a duplicate of this bug. ***
*** Bug 6978 has been marked as a duplicate of this bug. ***
*** Bug 7227 has been marked as a duplicate of this bug. ***

Created attachment 9335 [details]
power resource references implemented as a list
Here is the patch for ACPI power resources. It implements power resource
references as a list, so if two devices are using the same power resource, it
cannot be disabled by two subsequent calls from a single device.
It worked on my nx6125, but could anybody try it on another nx* system?
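The idea of keeping references as a list rather than a bare counter can be sketched in user space as follows. This is an illustration under stated assumptions, not the code from attachment 9335: struct ref_node, struct list_resource, resource_acquire() and resource_release() are hypothetical names, and devices are identified by plain integers for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model: one list node per device that needs the resource. */
struct ref_node {
	int device_id;
	struct ref_node *next;
};

struct list_resource {
	struct ref_node *refs;  /* devices currently holding a reference */
	bool on;                /* modelled power state */
};

static bool has_ref(const struct list_resource *r, int device_id)
{
	const struct ref_node *n;

	for (n = r->refs; n != NULL; n = n->next)
		if (n->device_id == device_id)
			return true;
	return false;
}

/* A device is recorded at most once, however often it asks. */
static int resource_acquire(struct list_resource *r, int device_id)
{
	assert(r != NULL);
	if (!has_ref(r, device_id)) {
		struct ref_node *n = malloc(sizeof(*n));

		if (n == NULL)
			return -1;
		n->device_id = device_id;
		n->next = r->refs;
		r->refs = n;
	}
	r->on = true;
	return 0;
}

/* Only the caller's own entry is removed; off happens on an empty list. */
static void resource_release(struct list_resource *r, int device_id)
{
	struct ref_node **pp = &r->refs;

	while (*pp != NULL) {
		if ((*pp)->device_id == device_id) {
			struct ref_node *dead = *pp;

			*pp = dead->next;
			free(dead);
			break;
		}
		pp = &(*pp)->next;
	}
	if (r->refs == NULL)
		r->on = false;
}
```

With a plain counter, two release calls from one device would drop the count to zero even though a second device still needs the resource. In this model the second release from the same device finds no entry to remove, so the resource stays on until the other device releases it as well.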
Is it a replacement for one or more of the other patches or should it be applied on top of them?

Okay, it looks like a replacement for the patch from Comment #37 and Comment #41.
I have one comment: It generally is not a good idea to use GFP_KERNEL in the _resume() routines because it may deadlock. Please use GFP_ATOMIC instead.

Created attachment 9336 [details]
same as 9335, but GFP_ATOMIC is used
Here is an updated version of the patch.
Created attachment 9337 [details]
more safe use of GFP_ATOMIC
irqs_disabled() function is used to check if GFP_ATOMIC should be used.
Well, the problem of GFP_KERNEL vs _resume() is that swsusp calls _resume() when normal memory management mechanisms are not fully functional. Consequently, if it happens to trigger swapping out, for example, the system will probably crash. However, all of this is done with IRQs enabled, so the irqs_disabled() test is insufficient. GFP_KERNEL in _resume() is unconditionally unsafe, so to speak.

The patch from comment #38 changes ACPI resume handlers to occur in pm_finish, after invocation of the _WAK method. At this point memory management should already be OK, shouldn't it?

Yes. I didn't realize that, sorry. In which case the patch from Comment #50 is correct. Sorry again.

I've now spent a few days running 2.6.18 with the patch from comment 52 and with the two patches from bug 5534 and despite my best attempts, I have not been able to get this kernel to freeze the temperature readings after resume. It seems like this fix is good. I did notice that with this kernel I cannot resume from suspend to ram, which works well on the Ubuntu kernels, so it seems they have some patches for this that have not yet been fed upstream. Anyway, that's a completely different story...

Does Linux Kernel 2.6.19-rc4 have all these patches applied or should I apply any specific patch before testing?

It should have the patches from bug 5534, but I don't think it has the patch from comment 52.

I've been testing the patch from comment #52 along with the patches from bug 5534 against the 2.6.18.1 kernel for several weeks. Generally speaking they work fine. However there are a few oddities worth noticing:
1) After resuming from suspend-to-ram the temperature at TZ1 rises slightly under the same system workload, when compared to the pre-suspend state. On my notebook this can be easily observed when the system is idle.
The increase in temperature leads to a change of the TZ1 state from active[3] to active[2] for a minute or so until it gets cooled enough to return to the previous state. That in turn once again causes a rise in temperature and the cycle is closed. The frequency of oscillations is constant. Such a behaviour might suggest that the patch from comment #52 or a combination of all mentioned earlier causes additional runtime overhead after resuming from suspend-to-ram. I'm not sure whether that was the intention of the patch creator.
2) When I try to shut down the system after a resume from suspend-to-ram, or try to suspend to disk, I experience strange behaviour of my LCD matrix. After issuing the shutdown command or receiving an ACPI event from the power button the system seems to lose control over the display. Instead I can see something that I would call spontaneous polarisation of pixels.
Possibly I misinterpreted something and this is irrelevant to the patch from comment #52 so I will welcome any feedback. Tests were performed on HP nx6325 (Fedora Core 6).

I use these patches on a regular basis on nx6325 and I haven't observed any strange symptoms related to them, but I don't use suspend to RAM. Also, I suspend to disk using the "shutdown" mode, because the "platform" mode causes the states of fans and thermal zones to be incorrect after the resume.

I abandoned the 2.6.18.1 kernel in favor of the 2.6.18.2 one since I got unstable behaviour of the SATA driver during the suspend phase. I also solved the LCD problem - it was a DPMS issue. Now things look as follows:
First 'platform' mode. I observed the very same temperature oscillation at TZ1 caused by - what I know now - the fact that the fan 352 doesn't work at all after resume. Interestingly, this situation won't happen if I decide to suspend to disk when TZ1 is in a state other than active[3], or in other words when at least one of the fans 351, 350 and 34f is active.
The kernel prints a couple of error logs, but beside that everything seems stable:
ACPI: [Power Resource - C34B] resume failed: -8
ACPI: [Power Resource - C34C] resume failed: -8
ACPI: [Power Resource - C34D] resume failed: -8
ACPI: Transitioning device [C34F] to D3
ACPI: Transitioning device [C34F] to D3
ACPI: Transitioning device [C350] to D3
ACPI: Transitioning device [C350] to D3
ACPI: Transitioning device [C351] to D3
ACPI: Transitioning device [C351] to D3
ACPI: Transitioning device [C34F] to D0
ACPI: Transitioning device [C34F] to D0
ACPI: Unable to turn cooling device [ffff810037d61cf0] 'on'
ACPI: Transitioning device [C350] to D0
ACPI: Transitioning device [C350] to D0
ACPI: Unable to turn cooling device [ffff810037d61c50] 'on'
ACPI: Transitioning device [C351] to D0
ACPI: Transitioning device [C351] to D0
ACPI: Unable to turn cooling device [ffff810037d61bd0] 'on'
ACPI: Transitioning device [C350] to D0
ACPI: Transitioning device [C350] to D0
ACPI: Unable to turn cooling device [ffff810037d61c50] 'on'
ACPI: Transitioning device [C351] to D0
ACPI: Transitioning device [C351] to D0
ACPI: Unable to turn cooling device [ffff810037d61bd0] 'on'
ACPI: Transitioning device [C351] to D0
ACPI: Transitioning device [C351] to D0
ACPI: Unable to turn cooling device [ffff810037d61bd0] 'on'
Now 'shutdown' mode. Although the kernel ring buffer doesn't contain any error logs, this is a less stable option. I cannot suspend to disk twice in a row, it hangs just before suspending in a second run (but after execution of preparation scripts). At active[3] state files at /proc/acpi/fan/* indicate the fan 352 is turned on exclusively, but having listened to noise the physical fan generate, I would say that fans 352 and 351 are both on. Does the fact that all those ACPI fans drive one physical device, have any importance here?
All those remarks apply to the 2.6.18.2 kernel with patches from comment #38, comment #50, bug #5534 and a few swswap patches from mm tree bundled with Fedora Core 6 2.6.18-1.2849 kernel source package.

patch in comment #52 applied to acpi-test

And what about the patch from Comment #38? AFAICT it's also needed ...

Len, please, pull the patch from comment #38. It is also required.

Hi, I have the same fan problem after S3 suspend on my HP NC6000 notebook, on 2.6.19. I have no DSDT errors and no errors are printed to dmesg. I cannot transition fans after resuming.
However, if I apply the patch from comment #38, everything magically starts working again (like 2.6.17).
Please apply.

Dear Linux user or developer,
I do not know any coding and I am not an expert at all. I tried kernel 2.6.19; still, after a suspend to ram the fan doesn't work anymore, like 2.6.18. After a resume from suspend to disk the fan works fine. I found a simple solution to the problem: comment out these 4 lines of the fan module before compiling:
/*static int acpi_fan_suspend(struct acpi_device *device, int state);
 * static int acpi_fan_resume(struct acpi_device *device, int state); */
and:
/* .suspend = acpi_fan_suspend,
 * .resume = acpi_fan_resume, */
Things will work properly then. I hope someone will resolve this bug, I am prepared to test a possible solution, but again, I am a Linux user and not an expert.
Kind regards, Otto Meijer
>From: bugme-daemon@bugzilla.kernel.org
>To: meijer.o@hotmail.com
>Subject: [Bug 7122] Thermal management problems - HPC nx6325
>Date: Mon, 4 Dec 2006 15:01:11 -0800
>
>http://bugzilla.kernel.org/show_bug.cgi?id=7122
>
>------- Additional Comments From alistair@devzero.co.uk 2006-12-04 14:57 -------
>Hi, I have the same fan problem after S3 suspend on my HP NC6000 notebook, on
>2.6.19. I have no DSDT errors and no errors are printed to dmesg. I cannot
>transition fans after resuming.
>
>However, if I apply the patch from comment #38, everything magically starts
>working again (like 2.6.17).
>
>Please apply.
>
>------- You are receiving this mail because: -------
>You are on the CC list for the bug, or are watching someone who is.

The patch that should resolve the fan_resume problem is available from here (Bug #7570): http://bugzilla.kernel.org/attachment.cgi?id=9757&action=view
It fixed the issue for my nx6125, could anybody try it on other boxes?

Tried on nx6325 and it seems to work for me.

Created attachment 10022 [details]
the same patch for kernel 2.6.20-rc3
Original patch in comment #52 is against 2.6.18 kernel, this one cleanly patches 2.6.19 up to 2.6.20-rc3 kernels. (Tested on 2.6.20-rc3)

Unfortunately the patch from comment 67 doesn't seem to work for me. I patched 2.6.20-rc3 with that patch and with the patch from comment 186 in bug 5534. The patch from comment 67 didn't apply cleanly for some reason so I had to fix it by hand. Before suspend to disk, the fans work as they should but after resume the temperature readings are stuck and the fans do not come on. This is with an nx6125 with BIOS F.11.

I patched 2.6.18.5 with patches from comment 38 and 52. Notebook: nx6325
First, acpi seemed to work. But after some time, the fan blows harder and I wonder why. So I type acpi -V to get more information. I saw 106

Actually, I use these two patches with the 2.6.20-rc5 kernel on nx6325 and they work for me just fine, but _additionally_ I need to use one patch from Bug #5534. All of the patches that I use are available from http://www.sisk.pl/kernel/patches/2.6.20-rc5/

Referring to Comment #72
Ah, I tend to forget about one important thing.
I have to remove the psmouse module before each suspend to disk (to prevent some strange things happening) and I use the "shutdown" suspend mode rather than "platform".

If I patch 2.6.20-rc5 with all the patches Rafael had collected, I once again have working fans both before and after resume. Thanks, Rafael!

Tried 2.6.20-rc5 with Rafael's patches and can say that hibernate through KDE works. Suspend still has issues and I can confirm the same as comment #71 that the machine thinks it's overheating. I'm pretty sure that the machine wasn't hot. I think the temperature reading is incorrect. Is there a possibility that when trying suspend the temperature is being read incorrectly?

Referring to Comment #75: Yes, there is. Do you unload psmouse before the suspend?
BTW, in my previous posts "suspend" means "hibernate". Frankly, I haven't used the STR on my box yet.

I noticed another effect with 2.6.20-rc5 with Rafael's patches: often (but not always) the fan keeps running at a low speed and never turns off. If this happens, the only way I've found to turn it off is to run something very cpu-intensive that brings the temperature up above 57 degrees (the first trip point). If I do that, the fan keeps running until the computer cools down to 50 degrees and then the fan turns off. This is on an nx6125 running the latest BIOS F.11. I do *not* have the "Fan always on when on AC" BIOS option turned on.

In reply to comment 76. No I didn't unload the mouse. I will give that a shot soon and post results.
For reference: nx6325, using Fedora core 6 x86_64, 2.6.20-rc5 with Rafael's patches, PSMouse as module.

Augmenting comment #78
After removing psmouse, I'm pretty happy with the results. Both suspend and resume seem to be working fairly well.
Question: Which bios should we be using? I'm using F.02 without DSDT update and don't *seem* to be having any problems.
Rafael, your link to the patches is invaluable, that's what finally did it for me.
user.c didn't patch cleanly, but it was easy enough to fix. Specifications for reference: nx6325, Turion X2 60, 1GB, BIOS F.02 not patched with DSDT, kernel 2.6.20-rc5 with Rafael's patches above. Notes: Suspend doesn't seem to work with the binary ATI driver when using dual monitors. Nothing to do with this bug.

Thanks to Rafael and Comment 72, I now have a working notebook. ACPI seems to work very well (I have been using the 2.6.20-rc5 kernel for two weeks now). Hibernation doesn't work, with or without the psmouse module loaded. Same problems as described above with the psmouse module (slow boot at BIOS and slow Linux booting when I did not remove the module).

After upgrading from SuSE 10.1 to OpenSUSE 10.2, I've noticed that the patch from Bug #5534 that I used (http://bugzilla.kernel.org/attachment.cgi?id=9746&action=view, or ACPI-notify-revised.patch in "my" series) is no longer needed. The new series that I use against 2.6.20 is available at sftp://ogre.sisk.pl:31337/home/rafael/kernel/patches/2.6.20

Rafael, is there something special in SUSE 10.2 outside of the kernel that makes things work? I'm on Fedora and trying to figure out how to get there.

In the (latest! I doubt this has already hit the update process) SUSE 10.2 kernel the psmouse things are fixed (there psmouse is built in, so serio drivers get unloaded on shutdown). And the patch from #5534 comment #180 that Rafael was referring to in his last comment is in.

First, I'd like to apologize to everyone for the premature conclusion in Comment #81. The patch from http://bugzilla.kernel.org/attachment.cgi?id=9746&action=view actually _is_ needed to make thermal management work regardless of the suspend-related issues. Also the link to my patch series in Comment #81 was wrong, sorry. I have uploaded my current series of patches against 2.6.20 to http://www.sisk.pl/kernel/patches/2.6.20/ (please note that two new patches from Bug #7887 are included in it).
Referring to Comment #82:
> Is there something special in suse 10.2 outside of the kernel that makes things work?
No, I don't think so. The final patch from Bug #5534 should be sufficient to make thermal management work, but for the suspend some other patches are necessary. Please try to apply the series of patches from http://www.sisk.pl/kernel/patches/2.6.20/ and make sure you unload the psmouse module before the suspend. I haven't tried to suspend to RAM yet, but the suspend to disk should work with Fedora too; at least I don't see any fundamental obstacles. If you have any problems, please contact me directly (rjw@sisk.pl).

Status update: 2.6.20 with Rafael's patches seems to be working well. I have been testing them with suspend to RAM using swsusp, Fedora Core 6 and the ATI binary driver. I'm currently using kpowersave and the integrated power save scripts, modified with tidbits here and there. Roughly 3 days of testing. I have some ACPI suspend scripts as well as a kpowersave configuration if anyone is interested.

refreshed comment #52 patch in comment #69 applied to acpi-test

*** Bug 7194 has been marked as a duplicate of this bug. ***

Hi everybody, I will be the (unfortunate) owner of one nx6325 (don't think I can change this). Imagine me, a newcomer to the Linux-on-notebook arena, going through everything which has been written on every forum/wiki, be it HP, Gentoo Wiki, Fedora related, SuSE related etc. The information is scattered everywhere and it doesn't seem to reflect the latest changes in the BIOS/kernel versions. Could somebody here point me to some webpage or do a wrap-up of what is currently (not) working and how? I am interested in some information useful with
* the latest BIOS version (currently F.06), nothing else
* the latest kernel (currently 2.6.20), nothing else
* nothing distribution specific
on the following problems:
1. thermal runaway
2. thermal information not updating
3. 'bad state' problem
4.
anything else I might have missed related to ACPI, kernel etc. (but not related to sound, wifi, bluetooth, modem, video)
I am interested to know the latest & best solutions for each problem, because I think I've seen some problems solved in too many ways. I would have done this wrap-up myself if I could have participated in the process of fixing things up. Unfortunately, I am faced with all the facts after they had happened and I have to guess which things I need to do to get everything working (without anything getting damaged). Any help is appreciated. Lots of thanks, Mircea

Most problems are fixed by patches in #5534. Problems with suspend/resume are fixed here (you need the 5534 fixes already applied).

Thanks for your rapid response. I'm a bit puzzled by the number of patches in both bug reports but I just remembered the git repository. http://www.kernel.org/pub/linux/kernel/v2.6/snapshots/patch-2.6.20-git15.log mentions that some patches around here about nx6325 have been committed. I assume that all the patches you were referring to in the previous comment are already committed - please correct me if I'm wrong. You said "most of the problems". Is there anything else not fixed I should look for? Lots of thanks for your work.

You should also include these: http://bugzilla.kernel.org/show_bug.cgi?id=7689 and be sure thermal polling is not set that high, e.g. set it to 15 secs with:
    echo 15 >/proc/acpi/thermal_zone/*/polling_frequency
This should fix the temperature sometimes getting stuck.

According to the above-mentioned kernel git log, patches from #7689 are also merged in. Commits:
a1cec06177386ecc320af643de11cfa77e8945bd
82dd9eff4bf3b17f5f511ae931a1f350c36ca9eb
Not sure if the patch in comment #65 is in yet.
--
Also, should I understand from the above comment that there are still problems with thermal polling*? I mean, does a normal laptop user have to adjust the polling frequency manually?
(I have to mention that I haven't worked with Linux on laptops before)
* Is there another bug on this matter I should follow?
Many thanks.

refreshed comment #52 patch in comment #69 shipped in Linux-2.6.21-rc1. Closed

Guys, am I missing something or shouldn't the fan set the power state to ACPI_D3 rather than ACPI_D0 in suspend? Changing that as I described in http://lkml.org/lkml/2007/2/28/71 fixed my fan on HP nw8000 after S3 suspends. You're obviously experiencing some other issues here as well, but I hope this is worth pointing out here, too (because at least this is a bug thread that people with fan problems after suspend are likely to google, and for some this just might be the fix they're looking for).

> Guys, am I missing something or shouldn't the fan set the power state to ACPI_D3
> rather than ACPI_D0 in suspend?
It is safer to set the fan to D0 on suspend because in this case we can be sure that the system will not overheat.

Konstantin, ok. Is the fan really supposed to stay functional while suspended? It doesn't do that on my machine. Once it goes down the fan's dead, with _D3 or _D0. The only difference is that having set it to D3 when suspending allows it to wake up when resuming. But I do get your point, that if some laptops allow it to be left on it's probably better that way in case for example the battery should start to overheat. However, on my laptop that leaves the fan totally dead after a suspend-to-memory cycle, but I guess there's something else wrong then and I'll try to figure out another way to fix it. And I should probably try that patch of yours that just crept into the main line and see whether that has any effect; I initially assumed that it was already in 2.6.20. Just posted accidentally (a dupe, no less) about this to linux-acpi also, sorry about that then if the change is deemed bad.

Please try 2.6.21-rc2. It should contain all of the required fixes AFAICT.

Does not compute. It won't wake up from suspend at all.
I'll bisect that tomorrow to see what broke it. In the meantime, if there's more information you'd like (lspci, .config, whatever), don't hesitate to ask.

> But I do get your point, that if some laptops allow it to be left on it's
> probably better that way in case for example the battery should start to overheat.
Actually, I meant the situation where the system could go down on a thermal
(possibly h/w) shutdown during suspend. If the fans stay on during suspend,
when thermal control may already not be working, a thermal shutdown is less
likely to occur.
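
For anyone applying the polling tip from earlier in this thread, here is a minimal shell sketch, assuming the procfs layout quoted there (one polling_frequency file per directory under /proc/acpi/thermal_zone). In bash, a redirection target whose glob matches more than one file is rejected as an "ambiguous redirect", so on a box with several thermal zones the one-line echo may fail; looping over the zones avoids that. The function name set_polling_frequency and its optional second argument (an alternative root directory, handy for a dry run) are made up for this example and are not part of any existing tool.

```shell
#!/bin/sh
# Sketch only: write a polling interval (in seconds) to every thermal zone.
# The default root is the procfs path quoted in the thread; the function
# name and the optional second argument are assumptions for illustration.
set_polling_frequency() {
    seconds=$1
    zone_root=${2:-/proc/acpi/thermal_zone}
    status=1    # stays non-zero unless at least one zone was updated

    for f in "$zone_root"/*/polling_frequency; do
        # An unmatched glob is left as a literal pattern, so skip it.
        [ -e "$f" ] || continue
        if echo "$seconds" > "$f"; then
            status=0
        else
            echo "could not write $f" >&2
        fi
    done
    return "$status"
}
```

Run as root this would be called as `set_polling_frequency 15`; the non-zero return value when no zone could be updated makes it easy to notice a kernel without that procfs interface.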