Bug 13497

Summary: CPU fan at full speed after resume
Product: ACPI
Reporter: Steve Hill (steve)
Component: Power-Sleep-Wake
Assignee: acpi_power-sleep-wake
Status: REJECTED WILL_NOT_FIX
Severity: normal
CC: lenb, rui.zhang, yakui.zhao
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.29.4-167.fc11.x86_64
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
    dmesg just before system was suspended
    dmesg just after system was resumed
    acpidump output
    DSDT patch

Description Steve Hill 2009-06-10 11:24:23 UTC
On an Acer TravelMate 6413 notebook, if I suspend to RAM and leave it suspended for over about 3 minutes, when it resumes the CPU fan comes on at full speed and stays on indefinitely.  Hibernating to disk and then resuming returns the fan to its correct behaviour.

Suspending for very short periods of time doesn't cause this problem, leading me to believe that there is a timeout involved (wild conjecture: maybe the kernel needs to keep poking the fan, and if it hasn't done so for some time the fan just goes to full speed to protect the hardware?).

I have tried to kick the fan by doing:
    echo -n 3 >/proc/acpi/fan/FAN0/state
which produces an error:
    -bash: echo: write error: Exec format error
and causes dmesg to log:
    ACPI: Transitioning device [FAN0] to D3

Also, doing:
    echo -n 0 >/proc/acpi/fan/FAN0/state
produces no error, nor any dmesg log and has no effect on the fan.

Might be unconnected, but gnome's power manager seems to think the battery is no longer present after suspending (unplugging the power cable and plugging it back in makes it reappear) - there is nothing obviously wrong in /proc/acpi/battery/BAT0/ though.

The machine is running the latest BIOS (version 3.04).

This has also been reported in the Fedora Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=243008
Comment 1 Steve Hill 2009-06-10 11:30:02 UTC
Created attachment 21837
dmesg just before system was suspended
Comment 2 Steve Hill 2009-06-10 11:30:37 UTC
Created attachment 21838
dmesg just after system was resumed
Comment 3 Zhang Rui 2009-06-11 01:41:00 UTC
please attach the acpidump output
Comment 4 Steve Hill 2009-06-12 07:56:46 UTC
Created attachment 21866
acpidump output
Comment 5 Steve Hill 2009-06-12 08:05:46 UTC
It seems that if the CPU temperature gets to about 49°C the fan slows down and if you then let the temperature drop the fan will stop.

The critical trip point, /sys/devices/virtual/thermal/thermal_zone0/trip_point_0_temp, is set to 107000 but the "active0" trip point, /sys/devices/virtual/thermal/thermal_zone0/trip_point_1_temp is set to 0.  I get a permission denied error if I try to echo a new value into trip_point_1_temp (I'm not sure if I'm supposed to be able to do this?)
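
As an illustration of how these values can be inspected, here is a minimal Python sketch that lists the trip points of a thermal zone. It assumes the usual sysfs layout, where trip_point_N_temp holds millidegrees Celsius (so 107000 means 107 °C) and trip_point_N_type holds the trip type. The function name read_trip_points is made up for this example.

```python
from pathlib import Path


def read_trip_points(zone_dir):
    """Return [(index, type, temp_celsius)] for each trip point in a zone.

    Assumes sysfs thermal files named trip_point_N_temp (millidegrees
    Celsius) and trip_point_N_type, numbered consecutively from 0.
    """
    zone = Path(zone_dir)
    trips = []
    index = 0
    while (zone / f"trip_point_{index}_temp").exists():
        millideg = int((zone / f"trip_point_{index}_temp").read_text().strip())
        type_file = zone / f"trip_point_{index}_type"
        kind = type_file.read_text().strip() if type_file.exists() else "unknown"
        trips.append((index, kind, millideg / 1000.0))
        index += 1
    return trips


if __name__ == "__main__":
    zone0 = "/sys/devices/virtual/thermal/thermal_zone0"
    for index, kind, temp_c in read_trip_points(zone0):
        print(f"trip {index} ({kind}): {temp_c:.1f} C")
```

On this machine it would be expected to show the critical trip point at 107 °C and the active0 trip point at 0 °C, matching the raw values quoted above.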
Comment 6 ykzhao 2009-06-12 08:10:57 UTC
Thanks for the info.
From the acpidump, this issue appears to be caused by a broken BIOS.
The ACPI fan device is supposed to be turned on and off through its power resource, but unfortunately that power resource object is bogus: its _ON and _OFF methods are empty.
    PowerResource (FN00, 0x00, 0x0000)
    {
        Method (_ON, 0, Serialized)
        {
        }

        Method (_OFF, 0, Serialized)
        {
        }
    }

At the same time, it seems that the fan is turned on by the BIOS in the course of suspend/resume, and we can do nothing about that.

IMO this is a BIOS bug, and it would best be fixed by a BIOS upgrade.
Thanks.
Comment 7 Steve Hill 2009-06-12 08:59:31 UTC
Thanks for your help - I'll talk to Acer.

Am I right in thinking that the fan isn't under the control of the OS at all?  (i.e. under normal operation, the BIOS will be controlling the fan speed according to the CPU's reported temperature rather than the OS)
Comment 8 Zhang Rui 2009-06-15 02:57:42 UTC
(In reply to comment #7)
> Thanks for your help - I'll talk to Acer.
> 
> Am I right in thinking that the fan isn't under the control of the OS at all?

well, at least the fan is not controlled via ACPI.
BIOS can control the fan, but sometimes it can be controlled via some platform specific methods, which is also beyond my scope...

it would be great if you can verify if the problem still exists in Windows...
Comment 9 Steve Hill 2009-06-15 09:12:16 UTC
Unfortunately I don't have access to Windows, although some googling around suggests that it might be a problem under Windows too (similar but not identical model laptop: http://forum.soft32.com/windows/Fan-runs-high-speed-resuming-standby-ftopict350097.html).  I've contacted Acer about this but don't really have any hope of them even caring - my experience of Acer customer support is rather poor.

I have found a work-around though:
The DSDT seems to show that various bits of integrated hardware are controlled by storing an 8 bit value into SMIF and then storing zero into TRP0 - for example, to read the CPU temperature, we have:
    If (LEqual (RTMP, One))
    {
        Store (0x87, SMIF)
        Store (Zero, TRP0)
    }
    Store (One, RTMP)
    If (LGreaterEqual (DTS1, DTS2))
    {
        Return (Add (0x0AAC, Multiply (DTS1, 0x0A)))
    }
    Return (Add (0x0AAC, Multiply (DTS2, 0x0A)))

From this, it looks like SMIF is used to pass a command (presumably to an embedded controller?) and sending zero to TRP0 is used to give the embedded controller an interrupt.  From experimentation, in the above code, setting SMIF to 0x87 and then sending zero to TRP0 causes DTS1 and DTS2 to be updated with the current temperatures.
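
The arithmetic in the two Return statements above can be sketched in Python. This assumes the method is the thermal zone's _TMP, which ACPI defines as returning tenths of a kelvin, and that DTS1 and DTS2 hold whole degrees Celsius. Both are assumptions drawn from the shape of the formula, not from any documentation; the function names are invented for the example.

```python
KELVIN_OFFSET_DECI = 0x0AAC  # 2732, i.e. 273.2 K expressed in tenths of a kelvin


def tmp_decikelvin(dts1, dts2):
    """Mirror the ASL: pick the hotter sensor, multiply by 0x0A, add 0x0AAC."""
    return KELVIN_OFFSET_DECI + max(dts1, dts2) * 0x0A


def decikelvin_to_celsius(value):
    """Convert a tenths-of-kelvin reading back to degrees Celsius."""
    return (value - KELVIN_OFFSET_DECI) / 10.0


if __name__ == "__main__":
    raw = tmp_decikelvin(49, 45)
    print(raw, decikelvin_to_celsius(raw))
```

Under those assumptions, a hotter-sensor reading of 49 gives 2732 + 490 = 3222, which converts back to 49 °C.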

On my system, SMIF is at memory address 0x7f66bdbe (offset 2 of GNVS [0x7f66bdbc]). This is at the top end of the 2GB of RAM installed, so presumably it will move if more RAM is installed. TRP0 is I/O port 0x0808 (offset 8 of IO_T [0x0800]).

After some trial and error, it seems that storing 0x86 in SMIF and 0 in TRP0 causes the fan speed to be re-set based on the current temperature.  So the upshot of this is that the problem can be worked around by writing 0x86 to memory address 0x7f66bdbe and then 0 to port 0x0808 after the system is resumed from standby.

An important note: I haven't found any kind of documentation - this was discovered by making wild assumptions based on parts of the DSDT and a lot of trial and error.  I haven't seen any side effects from using this work-around, but that doesn't mean they don't exist.
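
For illustration only, the two writes could be sketched from userland in Python as below. This is not necessarily how the workaround was actually carried out. It assumes root access, that /dev/mem lets the GNVS page be mapped (this depends on kernel configuration such as CONFIG_STRICT_DEVMEM and on the memory map), and that /dev/port is available. The addresses are the ones from this particular machine and will differ elsewhere. The names write_mem_byte, write_port_byte and kick_fan are invented for the example.

```python
import mmap
import os

GNVS_BASE = 0x7F66BDBC       # machine-specific, taken from this report
SMIF_ADDR = GNVS_BASE + 2    # 0x7f66bdbe
IO_T_BASE = 0x0800
TRP0_PORT = IO_T_BASE + 8    # 0x0808
FAN_RESET_COMMAND = 0x86     # found by trial and error, undocumented


def write_mem_byte(path, address, value):
    """Map the page containing `address` and store one byte there."""
    page = mmap.ALLOCATIONGRANULARITY
    base = address - (address % page)
    fd = os.open(path, os.O_RDWR | os.O_SYNC)
    try:
        with mmap.mmap(fd, page, mmap.MAP_SHARED,
                       mmap.PROT_READ | mmap.PROT_WRITE, offset=base) as m:
            m[address - base] = value
    finally:
        os.close(fd)


def write_port_byte(path, port, value):
    """Seek to `port` in a /dev/port-style file and write one byte."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.lseek(fd, port, os.SEEK_SET)
        os.write(fd, bytes([value]))
    finally:
        os.close(fd)


def kick_fan(mem_path="/dev/mem", port_path="/dev/port",
             smif_addr=SMIF_ADDR, trp0_port=TRP0_PORT):
    """Store the command in SMIF first, then trigger it by writing 0 to TRP0."""
    write_mem_byte(mem_path, smif_addr, FAN_RESET_COMMAND)
    write_port_byte(port_path, trp0_port, 0)
```

Calling kick_fan() from a resume hook would issue the same sequence as the DSDT stores. The order matters: the command byte has to be in SMIF before TRP0 is written. Poking physical memory and I/O ports on a machine with a different layout could do anything, so the addresses must be re-derived from that machine's own DSDT.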
Comment 10 ykzhao 2009-06-16 00:48:06 UTC
Very cool finding.
    Maybe what you have done is right. 
    But it seems that this is an obvious BIOS bug. And we can do nothing about it. 
    So IMO this bug can be rejected.
    Thanks.
Comment 11 Zhang Rui 2009-06-17 03:54:33 UTC
Great job, Steve.
If we add this piece of code to the _WAK method, which is invoked during resume, the problem can be fixed.
But all of this suggests that this is a BIOS problem that we cannot fix in the Linux kernel.
So it would be great if you could check whether there is a BIOS update for your laptop.
If not, you can submit your valuable findings to Acer as well.
Comment 12 Zhang Rui 2009-06-17 03:56:11 UTC
IMO, the workaround available in Linux is to use a custom DSDT...
Closing this bug as we cannot fix it.
Comment 13 Steve Hill 2009-06-17 09:26:00 UTC
There is no BIOS more recent than the one I'm using.

I'm not sure what the current kernel policy is on working around broken hardware/firmware?  Replacing the DSDT for non-debugging purposes has fallen out of favour these days, and whilst it _is_ a bug which the vendor should be fixing, the fact remains that the vendor don't seem to care so I don't expect them to fix it.

So in these cases where there's a hardware or firmware bug that will almost certainly never be fixed, what is the policy for workarounds?  Either we implement a work around so that Linux "Just Works" on the device (and put up with the code bloat that this adds to the kernel), or we accept the fact that Linux will never work properly on the device without each and every end-user implementing their own work around (for example, patching the DSDT, which involves recompiling the kernel after every update on many distros).
Comment 14 Zhang Rui 2009-06-18 01:54:11 UTC
we need Len to answer this question. :)
Comment 15 Steve Hill 2009-06-18 07:19:58 UTC
Created attachment 21979
DSDT patch

Adds the workaround code to the _WAK method.
Comment 16 Steve Hill 2010-07-28 15:45:08 UTC
This can be worked around without recompiling the kernel now - recent kernels allow runtime patching of single DSDT methods from userland, so I've implemented the workaround here:
http://subversion.nexusuk.org/projects/acer_dsdt/trunk/

The question of what the kernel policy is regarding workarounds for firmware bugs that will never be fixed by the vendor has not been explained - can anyone shed some light on this?