Bug 10686

Summary: critical thermal shutdown regression 2.6.26-rc1 - HP Pavilion dv6700
Product: ACPI Reporter: Len Brown (lenb)
Component: BIOSAssignee: Zhang Rui (rui.zhang)
Status: CLOSED CODE_FIX    
Severity: normal CC: acpi-bugzilla, arjan, bunk, hyc, nadavkav, Robert.Moore
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.26-rc1 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 10492    
Attachments: acpidump.out
implicit return patch included in 2.6.26-rc1
bad patch in 2.6.26-rc1
patch: ignore invalid crit trip point
patch: introduce AE_STRICT
patch: thermal driver check _CRT strictly
patch vs 2.6.27... in acpi-test tree

Description Len Brown 2008-05-12 20:04:34 UTC
Latest working kernel version: 2.6.25
Earliest failing kernel version: 2.6.26-rc1

In 2.6.25 we got this:

ACPI Exception (thermal-0365): AE_BAD_DATA, No critical threshold [20070126]

and /proc/acpi/thermal_zone was empty

In 2.6.26-rc1 we got this:
/proc/acpi/thermal_zone/THR1/cooling_mode:<setting not supported>
/proc/acpi/thermal_zone/THR1/polling_frequency:<polling disabled>
/proc/acpi/thermal_zone/THR1/state:state:                   critical 
/proc/acpi/thermal_zone/THR1/temperature:temperature:             50 C
/proc/acpi/thermal_zone/THR1/trip_points:critical (S5):           -73 C <disabled>
/proc/acpi/thermal_zone/THR1/trip_points:hot (S4):                120 C <disabled>
/proc/acpi/thermal_zone/THR1/trip_points:passive:                 92 C: tc1=2 tc2=3 tsp=30 devices=CPU0 CPU1 

though the system was booted with thermal.nocrt=1
in order to disable the critical trip point --
else the system would reboot...

The AML for _CRT is clearly broken.
Interestingly, for non Windows 2006, it appears that _HOT is broken too:

           Store (0x07D0, OSYS)
            If (CondRefOf (_OSI, Local0))
            {
                If (_OSI ("Linux"))
                {
                    Store (One, LINX)
                }

                If (_OSI ("Windows 2001"))
                {
                    Store (0x07D1, OSYS)
                }

                If (_OSI ("Windows 2001 SP1"))
                {
                    Store (0x07D1, OSYS)
                }

                If (_OSI ("Windows 2001 SP2"))
                {
                    Store (0x07D2, OSYS)
                }

                If (_OSI ("Windows 2006"))
                {
                    Store (0x07D6, OSYS)
                }
            }

...
            Method (_HOT, 0, Serialized)
            {
                If (LEqual (OSYS, 0x07D6))
                {
                    If (LEqual (\_SB.TJ85, Zero))
                    {
                        Return (Add (0x0AAC, Multiply (TPC, 0x0A)))
                    }
                    Else
                    {
                        Return (Add (0x0AAC, Multiply (TP85, 0x0A)))
                    }
                }
            }

            Method (_CRT, 0, Serialized)
            {
                If (LLess (OSYS, 0x07D6))
                {
                    If (LEqual (\_SB.TJ85, Zero))
                    {
                        Return (Add (0x0AAC, Multiply (TPC, 0x0A)))
                    }
                    Else
                    {
                        Return (Add (0x0AAC, Multiply (TP85, 0x0A)))
                    }
                }
	    }

So it appears that in the interest of being "bug compatible"
and conjuring up an "implicit return" here, we conjured up
a value that tubed the system when we should have disabled
_CRT (if not the entire thermal zone).
Comment 1 Len Brown 2008-05-12 20:05:47 UTC
Created attachment 16121 [details]
acpidump.out
Comment 2 Arjan van de Ven 2008-05-12 20:39:47 UTC
the function implicit-returns now "2006", due to 0x07D6 (2006 decimal)... which results in -73C.
Comment 3 Len Brown 2008-05-13 08:43:33 UTC
thermal.nocrt=1 works around the problem, of course -- causing
the -73 to show up in /proc/acpi/thermal_zone/THR1/trip_points
but preventing the actual shutdown.

Booting with "acpi=strict" disables implicit return and the malformed
thermal zone is then completely rejected by Linux with...

ACPI Exception (thermal-0370): AE_BAD_DATA, No critical threshold [20080321]
Comment 4 Len Brown 2008-05-13 08:58:07 UTC
Created attachment 16124 [details]
implicit return patch included in 2.6.26-rc1

This patch appears to be the only one between 2.6.25 and 2.6.26-rc1
that touches the implicit return workaround.  However,
for the record, reverting this patch does not change this failure.
Comment 5 Len Brown 2008-05-13 11:38:47 UTC
Created attachment 16127 [details]
bad patch in 2.6.26-rc1

git bisect shows that the attached patch is the first bad commit
where this problem begins.  Unfortunately, this patch doesn't
revert cleanly from the top of tree as subsequent patches
change the same files.
Comment 6 Robert Moore 2008-05-13 12:45:51 UTC
Did the patch change the value of the implicit return?
Comment 7 Rafael J. Wysocki 2008-05-13 15:06:47 UTC
This entry is being used for tracking a regression from 2.6.25.  Please don't
close it until the problem is fixed in the mainline.
Comment 8 Rafael J. Wysocki 2008-05-13 15:07:42 UTC
Regressions list annotation:
Handled-By : Robert Moore <Robert.Moore@intel.com>
Comment 9 Robert Moore 2008-05-14 09:56:53 UTC
This patch in question was released in version 20070320.

Executing with slack mode enabled:

1) Before 20070320, the _CRT above returns AE_AML_NO_RETURN_VALUE.
2) After 20070320, _CRT above returns 0x7D6.
3) With the latest ACPICA, _CRT returns 0.
Comment 10 Zhang Rui 2008-06-04 00:14:27 UTC
Created attachment 16386 [details]
patch: ignore invalid crit trip point

Workaround patch available
Comment 11 Adrian Bunk 2008-06-12 01:36:36 UTC
fix is now as commit a39a2d7c72b358c6253a2ec28e17b023b7f6f41c in Linus' tree
Comment 12 Zhang Rui 2008-08-27 20:00:08 UTC
*** Bug 11421 has been marked as a duplicate of this bug. ***
Comment 13 Zhang Rui 2008-08-27 20:49:01 UTC
During my test, both windows XP and Vista still enable the thermal zone even if the critical threshold is below 0C.
e.g. if _crt returns 2000, and _tmp returns 1900, we don't see anything unusual.
but if _crt return 2000, and _tmp returns 2010, windows will shutdown.

So I'm afraid commit a39a2d7c72b358c6253a2ec28e17b023b7f6f41c is not the right solution for this problem.

IMO, in order to fix this problem,
1) evaluating _CRT should fail, rather than return an arbitrary value in the current ACPICA code.
2) thermal driver should either ignore the critical trip point or ignore the whole thermal zone.
This seems to be what we've done in the past, before ACPICA 20070320 released, but I think it's the right way to go.
Comment 14 Howard Chu 2008-08-27 22:03:01 UTC
(In reply to comment #13)
> IMO, in order to fix this problem,
> 1) evaluating _CRT should fail, rather than return an arbitrary value in the
> current ACPICA code.

> 2) thermal driver should either ignore the critical trip point or ignore the
> whole thermal zone.
> This seems to be what we've done in the past, before ACPICA 20070320
> released,
> but I think it's the right way to go.

I may be stating the obvious, but I vote for keeping the rest of the thermal zone working, because I like being able to see the temperature reading. Ignoring just the invalid trip point is probably OK.
Comment 15 Zhang Rui 2008-08-27 22:32:14 UTC
Created attachment 17494 [details]
patch: introduce AE_STRICT
Comment 16 Zhang Rui 2008-08-27 22:43:31 UTC
Created attachment 17495 [details]
patch: thermal driver check _CRT strictly

Bob,
when there is no return object for the methods that requires a return value, the current ACPICA code returns AE_OK and an arbitrary value.
This is not right IMO.
At least there should be some way that OS can check if the return value is bogus.
And the first patch introduces a new error code AE_STRICT, which is returned in this case. It equels AE_OK unless ACPI_SUCCESS_STRICT/ACPI_FAILURE_STRICT is used.
Comment 17 Zhang Rui 2008-09-09 19:29:56 UTC
Bob, ming, len,

any comments about the patch above?
Comment 18 Robert Moore 2008-09-10 10:09:31 UTC
Looks like there are a few things going on here.

Apparently, in actuality, there is no support for an "implicit return" for top-level methods called from the acpi_evaluate_object interface. I don't remember why this is, but I suspect that it has something to do with preventing memory leaks caused by adding the implicit return feature and/or that it wasn't needed at the time we added this feature.

The original test I used was this:

    Method (MAIN, 2, NotSerialized)
    {
        Store (_CRT (), Local0)
        Store (Local0, Debug)
    }

Which calls the original _CRT code in the bug description above. This test was able to use the implicit return feature, so we got the incorrect result.

Therefore, I now think that this problem has nothing to do with the implicit return feature.

In the ACPICA acpi_evaluate_object interface, if there is a buffer provided for a return value but there is no actual return value, the buffer length is set to zero:

    /*
     * If we are expecting a return value, and all went well above,
     * copy the return value to an external object.
     */
    if (ReturnBuffer)
    {
        if (!Info->ReturnObject)
        {
            ReturnBuffer->Length = 0;
        }
        else

What I think is happening is that the Linux thermal.c driver does not check for a ReturnBuffer->Length of zero and the failure occurs because it is essentially looking at random data in the ReturnBuffer->Pointer (data).

ACPICA returns an AE_OK exception in this case because it has no way of knowing whether a method is "supposed" to return a value or not. Just because a user buffer was passed in is not sufficient evidence of this condition. It is in fact the ReturnBuffer->Length field that is used as an output parameter to indicate the amount of buffer actually used by the return value (in this case, zero).

I would suggest that the immediate fix to the problem is to update the acpi_evaluate_integer interface in the Linux file utils.c to examine the length field. This bug is probably in other drivers and utility functions also. (acpi_evaluate_reference is one example.)

	buffer.length = sizeof(union acpi_object);
	buffer.pointer = element;
	status = acpi_evaluate_object(handle, pathname, arguments, &buffer);
-	if (ACPI_FAILURE(status)) {
+	if (ACPI_FAILURE(status) || !buffer.length) {
		acpi_util_eval_error(handle, pathname, status);
		kfree(element);
		return status;
	}

I realize that this problem is somewhat of a direct result of overloading the buffer.length field to be both an input and output parameter (with resulting confusion), and may have been an unfortunate decision -- but the immediate bug fix is above.



In the near future, however, the "predefined method validation" code will be released which will in fact detect the condition where a method such as _CRT does not return a value. This doesn't change the fact that in general, the acpi device drivers should be able to handle a return buffer length of zero.

The new code will produce the message below, and the only debate at this point is to continue to return AE_OK or go ahead and return AE_AML_NO_RETURN_VALUE. Since this makes the driver *even more* complex, it could be argued to just leave the upper ACPICA code as-is, returning AE_OK and a null buffer length.

- ex _CRT
Executing \_CRT
ACPI Warning (nspredef-0251): \_CRT: Missing expected return value [20080910]

BTW, the warning only occurs on the first call to _CRT, ignored later. This means dmesg will have valuable data but not too much of it.

Bob
Comment 19 Robert Moore 2008-09-10 10:28:38 UTC
FYI, here is the relevant section from the ACPICA programmer reference:

4.2.6.2	Output Buffer
An output buffer is defined to be a buffer that is filled with data by an ACPI interface before it is returned to the caller. When the ACPI_BUFFER structure is used as an output buffer the caller must always initialize the structure by either 
1.	Placing a value in the Length field that indicates the maximum size of the buffer that is pointed to by the Pointer field. The length is used by the ACPI interface to ensure that there is sufficient user provided space for the return value. 
2.	Initializing the Length field to ACPI_ALLOCATE_BUFFER to cause the ACPI subsystem to allocate a buffer.
If a buffer that was passed in by the caller is too small, the ACPI interfaces that require output buffers will indicate the failure by returning the error code AE_BUFFER_OVERFLOW. The interfaces will never attempt to put more data into the caller’s buffer than is specified by the Length field of the ACPI_BUFFER structure (unless ACPI_ALLOCATE_BUFFER is used). The caller may recover from this failure by examining the Length field of the ACPI_BUFFER structure. The interface will place the required length in this field in the event that the buffer was too small.
During normal operation, the ACPI interface will copy data into the buffer. It will indicate to the caller the length of data in the buffer by setting the Length field of the ACPI_BUFFER to the actual number of bytes placed in the buffer.
Therefore, the Length field is both an input and output parameter. On input, it indicates either the size of the buffer or an indication to the ACPI subsystem to allocate a return buffer on behalf of the caller. On output, it either indicates the actual amount of data that was placed in the buffer (if the buffer was large enough), or it indicates the buffer size that is required (if the buffer was too small) and the exception is set to AE_BUFFER_OVERFLOW.
Comment 20 Nadav Kavalerchik 2008-10-04 08:04:25 UTC
i have the same issue on HP Pavilion tx2520ej
empty thermal_zone and fan folders although they are loaded into kernel

ACPI Exception (thermal-0377): AE_OK, No or invalid critical threshold

my bios's dsdt.dsl (line 8697) shows :
Method (_HOT, 0, Serialized)
            {
                If (LEqual (TPOS, 0x40))
                {
                    Return (Add (0x0AAC, Multiply (TPC, 0x0A)))
                }
                Else
                {
                    Return (Zero)
                }
            }

            Method (_CRT, 0, Serialized)
            {
                If (LLess (TPOS, 0x40))
                {
                    Return (Add (0x0AAC, Multiply (TPC, 0x0A)))
                }
                Else
                {
                    Return (Zero)
                }
            }

i will try to fix it and recompile (my first time)
hope it will fix the bios :-)
Comment 21 Nadav Kavalerchik 2008-10-05 02:30:57 UTC
btw,
i use debian sid with kernels 2.6.26-1-i686 / amd64 
and kernels 2.6.27-rc7-i686 and amd64
i have this behavior on all of them.

the tablet used to shutdown on boot before i used "acpi=off"
and then added "options thermal nocrt=1"
Comment 22 Shaohua 2008-10-14 22:32:21 UTC
Rui, what's status of your patch and this bug?
Comment 23 Len Brown 2008-10-22 14:39:24 UTC
Created attachment 18403 [details]
patch vs 2.6.27... in acpi-test tree

this bug is also here:
http://www.acpica.org/bugzilla/show_bug.cgi?id=349

the patch is in the acpi test tree:

commit 0a152f4c675ceb7a9ef4e641c35e71371f764b63
Author: Lin Ming <ming.m.lin@intel.com>
Date:   Sat Sep 27 12:01:12 2008 +0800

    ACPICA: Fix for implicit return compatibility
    
    Predicate can be used for an implicit return value.
    This change improves the implicit return mechanism to be more
    compatible with the MS interpreter.
    
    http://www.acpica.org/bugzilla/show_bug.cgi?id=349
    
    Below AML code from http://bugzilla.kernel.org/show_bug.cgi?id=10686
...
Comment 24 Len Brown 2008-10-24 22:52:53 UTC
shipped in linux-2.6.28-rc1
closed

commit b9d1312ad4246e467f333dfe2ac4dc7a79608d59
Author: Lin Ming <ming.m.lin@intel.com>
Date:   Sat Sep 27 12:01:12 2008 +0800

    ACPICA: Fix for implicit return compatibility

    Predicate can be used for an implicit return value.
    This change improves the implicit return mechanism to be more
    compatible with the MS interpreter.