When acpi_ns_repair_CID() is called for a _CID which returns a package
of strings, it calls acpi_ns_repair_HID() for each of the package
elements. acpi_ns_repair_HID() calls acpi_ut_remove_reference() on the
original object, but acpi_ns_repair_CID() calls it again on return,
leading to a double free.
This problem was seen on a Acer TravelMate P449-G2-MG.
Thanks to Daniel Drake for helping investigating this problem.
Cross linking to ACPICA bug:
I have a fix now.
Is there anyone reachable can still help to validate?
This was already fixed in Linux commit ed7f8bc91a8df96662b21accc9a5abe2c292014a
I don't think it's fully fixed by the commit, the commit only makes it rare triggered.
1. in AcpiDsBuildInternalPackageObj(), package element reference count is still copied when reference count > 1;
2. repair code still contains reference count copy code in AcpiNsRepair_CID() , AcpiNsRepairNullElement() and AcpiNsSimpleRepair().
Is the reference count copy case a normal use case of reference counting?
These pieces of code still risk races when an element of a package is referenced by an AML execution flow and the flow unlocks interpreter/namespace locks. Unfortunately, there are many such unlock cases because we need to implement AML control method preemption rule defined by spec 5.5.2 Control Method Execution.
So the bug is still there, and you can trigger it using high parallel cases.
There is a complication for the root cause of this issue.
The root cause is actually the design of AcpiUtUpdateObjectReference().
Normally, there are several key design aspects for object deletion due to zeroed reference count.
1. Locking - software either marks a "dying state" of the object and never allows reference count to bump up/down when it is dying or locks around the entire deletion process. The latter is not achievable in OS kernels as the deletion process cannot be in an atomic context. Then have you seen previous one implemented in ACPICA - the answer is no. For many kinds of objects, ACPICA never knows whether they are about to be destroyed. And unfortunately, package operand is one of such objects.
2. Recursion - an object deletion process shouldn't consume stack too much and actually it won't if the cross reference layers of the objects are controlled to an acceptable level. What ACPICA is offering now is a different approach which converts recursive deletion's stack hazard into object deletion list's heap hazard. In order to delete an object, ACPICA actually needs to allocate additional heap memory to hold "next objects chain". And in order to make AcpiUtUpdateRefCount() to be the only interfaces that need to be atomic, all children reference counts are bumped up/down along with its parents' as a result of being referenced by the object deletion list mechanism. This mechanism in return reduces the occurance ratio of reference count bump up/down hazard mentioned in the above locking paragraph.
The mechanism that "the children objects' references are bumped up/down along with their parent objects' references" brings the needs of "reference count copy" and brings the trouble to the operand objects reference counting in a parallel execution environment.
These wrong designs entangles together to make it very difficult to solve one simple problem using just one single/simple change.
GET/PUT is actually a useful mechanism to track reference count bump up/down issue. Because object reference is only decremented to 0 when all pointer references are dropped. But ACPICA also doesn't contain get/put mechanism for operand objects, and AcpiUtAddReference() sometimes invoked after pointer assignment, and no caller checks if it returns errors as it never returns errors.
We first saw this issue on Acer TravelMate P449-G2-MG. The touchpad was unusable on every boot. We found memory corruption and then tracked it down to the double-free.
We no longer have that laptop so can't do further tests. Even so, the issue appeared to have been fixed with the commit I mentioned above, so I'm not sure quite how useful test feedback would be here.
what's the status of this bug?
I believe that we are still looking at this, but we have a fix.
is the fix in ACPICA release?
if not, do we have an ACPICA bug report for this issue?
Fixed in this commit:
Author: Robert Moore <Robert.Moore@intel.com>
Date: Wed Feb 22 08:58:00 2017 -0800
Update for automatic repair code for objects returned by evaluate_object
This change fixes two instances where the repair code made an incorrect
assumption about how reference counts are assigned to package objects.
Resolves issues where a warning was issued about a "large reference
count" -- which usually indicates an attempt to delete an object
that has previously been poisoned and released into the object cache.
Released in ACPICA version 20170303