Bug 60561 - Notebook P15SM: Regression of "ACPI: Rework acpi_get_child() to be more efficient"
Notebook P15SM: Regression of "ACPI: Rework acpi_get_child() to be more effic...
Status: CLOSED CODE_FIX
Product: ACPI
Classification: Unclassified
Component: Power-Video
All Linux
: P1 normal
Assigned To: Lan Tianyu
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2013-07-16 12:40 UTC by Peter Wu
Modified: 2013-08-10 09:14 UTC (History)
6 users (show)

See Also:
Kernel Version: 3.8.5, all versions with 33f767d7
Tree: Mainline
Regression: Yes


Attachments
acpidump for Notebook P15SM (359.65 KB, text/plain)
2013-07-16 12:40 UTC, Peter Wu
Details
debug.patch (516 bytes, patch)
2013-07-17 02:06 UTC, Lan Tianyu
Details | Diff
debug.patch (597 bytes, patch)
2013-07-17 13:43 UTC, Lan Tianyu
Details | Diff
ACPI / glue: Try to resolve _ADR collisions for bridges (4.80 KB, patch)
2013-07-27 22:06 UTC, Rafael J. Wysocki
Details | Diff
pci_is_bridge.patch (826 bytes, patch)
2013-08-06 08:52 UTC, Lan Tianyu
Details | Diff
ACPI / glue: Try to resolve _ADR collisions for bridges (v2) (5.09 KB, patch)
2013-08-06 14:39 UTC, Rafael J. Wysocki
Details | Diff

Description Peter Wu 2013-07-16 12:40:59 UTC
Created attachment 106894 [details]
acpidump for Notebook P15SM

Bug https://bugzilla.kernel.org/show_bug.cgi?id=42696 fixes incorrect enumeration of an ACPI handle for a PCI device:

    commit 33f767d767e9a684e9cd60704d4c049a2014c8d5
    Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Date:   Thu Jan 10 13:13:49 2013 +0100

        ACPI: Rework acpi_get_child() to be more efficient

This resulted in breakage on another machine as reported here[1]. The video card can again not be found. This seems to be caused by a Device which does not have anything useful as shown in [2]. When the above commit is reverted (on top of 3.9.9 and 3.10.0), the bug is gone for these users.

Details from DMI:
system-manufacturer   : Notebook                        
system-product-name   : P15SM                           
system-version        : Not Applicable                  
baseboard-manufacturer: Notebook                        
baseboard-product-name: P15SM                           
baseboard-version     : Not Applicable                  
bios-vendor           : American Megatrends Inc.
bios-version          : 4.6.5
bios-release-date     : 04/09/2013

For some debugging attempts, see [1]. The bug occurs because the wrong ACPI handle is attached to the PCIe Root Port. The correct handle is \_SB.PCI0.PEG0 (which has a child PEGP video device), but after the above commit it becomes \_SB.PCI0.P0P2 (which has no children).

I thought that the commit in 3.11 that tests _STA would solve this issue (as the P0P2 device does not have such a method), but the results were negative. The commit they tried to apply (without revert) was:

    commit c7d9ca90aa9497f0b6e301ec67c52dd4b57a7852
    Author: Jeff Wu <zlinuxkernel@gmail.com>
    Date:   Wed May 29 06:31:30 2013 +0000

        ACPI: add _STA evaluation at do_acpi_find_child()

 [1]: https://github.com/Bumblebee-Project/bbswitch/issues/65
 [2]: https://github.com/Bumblebee-Project/bbswitch/issues/65#issuecomment-20824596
Comment 1 Lan Tianyu 2013-07-17 02:05:02 UTC
Check the dsdt and ssdt5 table, both \_SB.PCI0.PEG0 and \_SB.PCI0.P0P2 have same addr(0x00010000) but they don't have _STA method.

Before commit 33f767, acpi_get_child() returned the last one when there were devices with the same _ADR. After the commit, it returned the first one.
This  may explain that the handle became \_SB.PCI0.P0P2 from \_SB.PCI0.PEG0.

commit c7d9ca should resolve the problem since it will return the last one if all devices don't have _STA(Here is the case.).

I notice commit 33f767 changed do_acpi_find_child() as post_order_visit callback 
for acpi_walk_namespace() from pre_order_visit. This may be the cause. I will produce a debug.patch. Please have a try.

Device (P0P2)
{
     Name (_ADR, 0x00010000)  // _ADR: Address
     Method (_PRT, 0, NotSerialized)  // _PRT: PCI Routing Table
     {
          If (PICM)
          {
               Return (AR02 ())
          }

          Return (PR02 ())
     }
}



Scope (\_SB.PCI0)
{
        Name (LTRS, Zero)
        Name (OBFS, Zero)
        Device (PEG0)
        {
            Name (_ADR, 0x00010000)  // _ADR: Address
            OperationRegion (PEGR, PCI_Config, 0xC0, 0x30)
            Field (PEGR, DWordAcc, NoLock, Preserve)
            {
                Offset (0x02),
                PSTS,   1,
                Offset (0x2C),
                GENG,   1,
                    ,   1,
                PMEG,   1
            }
...
Comment 2 Lan Tianyu 2013-07-17 02:06:29 UTC
Created attachment 106899 [details]
debug.patch
Comment 3 Peter Wu 2013-07-17 08:46:30 UTC
I assumed that PEG0 had a _STA method since it was missing at P0P2, but that apparantly does not hold. Also, after c7d9ca9, it should ignore disabled devices (as reported through _STA). That should fix bug 42696 too without regressing on others (such as this machine).

The Lenovo laptops from bug 42696 has devices returning _ADR = 0; PEGP._STA returns 0x0f and VGA._STA returns Zero if a certain GPIO is non-zero and 0x0f otherwise. Assuming that this GPIO is non-zero, the bug should be fixed. I will ask some owners of these Lenovo laptops to test your patch on top of c7d9ca9.

Radael, the commit message for 33f767d contains:

    "Moreover, acpi_get_child() doesn't need to loop any more once it has
    found a matching handle, so make it stop in that case.  **To prevent
    the results from changing**, make it use do_acpi_find_child() as
    a post-order callback."

In what case can the results change then?
Comment 4 Peter Wu 2013-07-17 09:44:40 UTC
It has been confirmed[1] that the patch from comment 2 does not cause a regression on Lenovo laptops from bug 42696.

 [1]: https://github.com/Bumblebee-Project/bbswitch/issues/2#issuecomment-21101083
Comment 5 Rafael J. Wysocki 2013-07-17 11:13:44 UTC
(In reply to Peter from comment #3)
> I assumed that PEG0 had a _STA method since it was missing at P0P2, but that
> apparantly does not hold. Also, after c7d9ca9, it should ignore disabled
> devices (as reported through _STA). That should fix bug 42696 too without
> regressing on others (such as this machine).
> 
> The Lenovo laptops from bug 42696 has devices returning _ADR = 0; PEGP._STA
> returns 0x0f and VGA._STA returns Zero if a certain GPIO is non-zero and
> 0x0f otherwise. Assuming that this GPIO is non-zero, the bug should be
> fixed. I will ask some owners of these Lenovo laptops to test your patch on
> top of c7d9ca9.
> 
> Radael, the commit message for 33f767d contains:
> 
>     "Moreover, acpi_get_child() doesn't need to loop any more once it has
>     found a matching handle, so make it stop in that case.  **To prevent
>     the results from changing**, make it use do_acpi_find_child() as
>     a post-order callback."
> 
> In what case can the results change then?

When the "right" device is the last one and the "wrong" device is the first one, for example.

I suppose we can apply the Tianyu's patch from comment #2, but then the original problem may reappear in some cases.
Comment 6 Lan Tianyu 2013-07-17 13:43:56 UTC
Created attachment 106909 [details]
debug.patch

Hi, sorry. I think I didn't catch point. I originally assumed the acpi_bus_get_status_handle() returned error when the _STA didn't exist. But it doesn't. If _STA was not founded, acpi_bus_get_status_handle() would return 0x0f as status which contains ENABLED. So name space scan still stops when the first device has been found because it is marked as ENABLED(commit c7d9ca90 introduce this). For this case, evaluating _STA directly maybe more sense and this is the purpose of the debug.patch. Please have a try.

acpi_status acpi_bus_get_status_handle(acpi_handle handle,
                                       unsigned long long *sta)
{
        acpi_status status;

        status = acpi_evaluate_integer(handle, "_STA", NULL, sta);
        if (ACPI_SUCCESS(status))
                return AE_OK;

        if (status == AE_NOT_FOUND) {
                *sta = ACPI_STA_DEVICE_PRESENT | ACPI_STA_DEVICE_ENABLED |
                       ACPI_STA_DEVICE_UI      | ACPI_STA_DEVICE_FUNCTIONING;
                return AE_OK;
        }
        return status;
}
Comment 7 Peter Wu 2013-07-17 14:14:37 UTC
@Ra(In reply to Rafael J. Wysocki from comment #5)
> > In what case can the results change then?
> 
> When the "right" device is the last one and the "wrong" device is the first
> one, for example.
> 
> I suppose we can apply the Tianyu's patch from comment #2, but then the
> original problem may reappear in some cases.

Oh, you mean that the result pointer is overwritten with a new value.

@Lan According to the ACPI spec on _STA:
> If a device object (including the processor object) does not have an
> _STA object, then OSPM assumes that all of the above bits are set
> (i.e., the device is present, enabled, shown in the UI, and
> functioning).

To me, your patch proposed in comment 6 seems incorrect. If _STA does not exist, the device must be assumed present, enabled, etc. Wouldn't it make more sense to stop at the first device that:

1. has a matching address (via _ADR); and
2. is enabled (_STA does not exist or _STA returns with Enabled bit set).

In bug 42696, (2) was not tested hence it would pick the first occurence even if the device was not available. As that was probably the reason why Rafael changed from pre- to post-walk, I suggest to revert to pre-walk again as is done in the patch from comment 2. That one has been tested on the affected Lenovo laptops.
Comment 8 Peter Wu 2013-07-17 14:23:41 UTC
... but that patch from comment 2 seems not to solve this bug[1]. Why not actually? As acpi_bus_get_status_handle is successful and is Enabled (because _STA does not exist), shouldn't it terminate on the first match?

Assuming the order to be \_SB.PCI0.PEG0 followed by \_SB.PCI0.P0P2 (since the current behavior is post-walk+terminate on P0P2), I would expect that PEG0 is selected when using pre-walk?

 [1]: https://github.com/Bumblebee-Project/bbswitch/issues/65#issuecomment-21104823
Comment 9 Lan Tianyu 2013-07-17 15:36:39 UTC
Hi Peter:
        The changlog of c7d9ca9 said it's dedicated to the cases that multi device share one address but only one device's status is enabled at the same time. Please see the detail discussion at the following link. 
http://marc.info/?l=linux-acpi&m=136919329004398&w=2

"At our platforms, DEV1 use for win8,DEV2 use for win7,DEV1 use for Linux.
If these devices are disabled when do init, they will always be
disabled ,so, work fine."

       For that case, the Bios must provide _STA for these devices since it's needed to identify which device should be used on different OS. So I thought evaluating _STA directly maybe more sense. But for this bug, both devices don't have _STA and both are marked as ENABLED by OS. The previous rule(only one device is enabled) is broken. So it's hard to identify which should be right one in the linux acpi. For this machine, the last one should be chosen.

commit c7d9ca90aa9497f0b6e301ec67c52dd4b57a7852
Author: Jeff Wu <zlinuxkernel@gmail.com>
Date:   Wed May 29 06:31:30 2013 +0000

    ACPI: add _STA evaluation at do_acpi_find_child()

    Once do_acpi_find_child() has found the first matching handle, it
    makes the acpi_get_child() loop stop and return that handle.  On some
    platforms, though, there are multiple devices with the same value of
    "_ADR" in the same namespace scope, and if one of them is enabled,
    the others will be disabled.  For example:

     Address : 0x1FFFF ; path : SB_PCI0.SATA.DEV0
     Address : 0x1FFFF ; path : SB_PCI0.SATA.DEV1
     Address : 0x1FFFF ; path : SB_PCI0.SATA.DEV2
Comment 10 Lan Tianyu 2013-07-19 07:17:08 UTC
The patch in the comment #6 is proved to work on both Notebook P15SM and Lenovo Y480.

https://github.com/Bumblebee-Project/bbswitch/issues/65
https://github.com/Bumblebee-Project/bbswitch/issues/2#issuecomment-21101083
Comment 11 Rafael J. Wysocki 2013-07-19 12:06:13 UTC
The patch from comment #6 looks OK, but please don't mark the bug as resolved until you've submitted the patch for inclusion.

And if you have submitted it, please add the patchwork link to the fix to this entry and then mark it as resolved.
Comment 12 Lan Tianyu 2013-07-22 05:18:03 UTC
(In reply to Rafael J. Wysocki from comment #11)
> The patch from comment #6 looks OK, but please don't mark the bug as
> resolved until you've submitted the patch for inclusion.
> 
> And if you have submitted it, please add the patchwork link to the fix to
> this entry and then mark it as resolved.
Ok. I get it. Thanks for reminder and I will keep in mind.
Comment 13 Lan Tianyu 2013-07-23 05:37:14 UTC
The fix patch has sent to ACPI maillist.

https://patchwork.kernel.org/patch/2831706/
Comment 14 Rafael J. Wysocki 2013-07-23 22:40:57 UTC
(In reply to Peter from comment #8)
> ... but that patch from comment 2 seems not to solve this bug[1]. Why not
> actually? As acpi_bus_get_status_handle is successful and is Enabled
> (because _STA does not exist), shouldn't it terminate on the first match?
> 
> Assuming the order to be \_SB.PCI0.PEG0 followed by \_SB.PCI0.P0P2 (since
> the current behavior is post-walk+terminate on P0P2), I would expect that
> PEG0 is selected when using pre-walk?
> 
>  [1]:
> https://github.com/Bumblebee-Project/bbswitch/issues/65#issuecomment-21104823

That's a good point.  I think the regression is not the change of the ordering, which actually *did* *not* happen (the ordering of the walk is different, but before commit 33f767d7 the last handle found was returned and now the first handle found is returned, which should be the same), but the fact that _ADR is now evaluated directly instead of using acpi_get_object_info().

That said, I think we can see how using the patch from comment #13 goes, but its changelog should be modified.
Comment 15 Rafael J. Wysocki 2013-07-23 23:42:22 UTC
However, I'm afraid that still wouldn't be robust enough.

Peter, what *exactly* do you mean by "anything useful" in the problem description?  Is there any specific method we should be looking for that is common to all cases like this?  _INI?  _DSM?  Both??
Comment 16 Peter Wu 2013-07-24 10:06:53 UTC
That \_SB.PCI0.P0P2 device has only an _PRT methods besides _ADR, this is the complete device, there are no other objects:

    Device (P0P2)
    {
        Name (_ADR, 0x00010000)  // _ADR: Address
        Method (_PRT, 0, NotSerialized)  // _PRT: PCI Routing Table
        {
            If (PICM)
            {
                Return (AR02 ())
            }

            Return (PR02 ())
        }
    }

PEG0 has additional methods:
\_SB_.PCI0.PEG0._DSM
\_SB_.PCI0.PEG0.HPME
\_SB_.PCI0.PEG0._INI
\_SB_.PCI0.PEG0._PRT
\_SB_.PCI0.PEG0._PRW
\_SB_.PCI0.PEG0._PSW
\_SB_.PCI0.PEG0._STA (returns 0x0F)
(+some non-standard methods like RBP0)

_INI writes to two memory locations. The name _ADR has value 0x00010000 as expected for PCI device 0000:00:01.0 and is not modified.

Doesn't seem usable, not all devices expose those methods, like the Lenovo_Ideapad_Y570:
\_SB_.PCI0.PEG0._PRT
\_SB_.PCI0.PEG0._STA (returns 0x0F)

If it matters, P0P2 is defined in DSDT while PEG0 is found in SSDT5 which has OEM (Table) ID "SaSsdt".
Comment 17 Rafael J. Wysocki 2013-07-27 14:53:52 UTC
Tianyu went through the acpi_ns_walk_namespace() code in the meantime and explained the mystery here.

Namely, for acpi_ns_walk_namespace() "post-order" means "after all of the children have been visited" rather than "on the way back", so in fact for devices that have no children or (like in this case) for walks of depth 1 "post-order" and "pre-order" are actually the same.

Thus I don't think that applying the patch from comment #13 would improve a lot and we need a more robust way of deciding which object to choose.

For bridges, which this bug is about, I think we can simply choose the object that has child devices.

So the algorithm would be:
1. If this is a bridge, is enabled and has child devices, return it.
2. If it is not a bridge and is enabled, return it.
3. Otherwise return the first matching object found.
Comment 18 Rafael J. Wysocki 2013-07-27 22:06:17 UTC
Created attachment 107027 [details]
ACPI / glue: Try to resolve _ADR collisions for bridges

This patch implements the algorithm described above.

Since I can't really test it beyond verifying that it doesn't break the simplest case, can you please check if it makes any difference on the affected systems?
Comment 19 Peter Wu 2013-07-28 19:03:43 UTC
No regression reported for bug 42696 [1], still awaiting feedback for affected users of this bug. The patch looks OK to me. 

 [1]: https://github.com/Bumblebee-Project/bbswitch/issues/2#issuecomment-21687350
Comment 20 Peter Wu 2013-07-29 10:01:38 UTC
And unfortunately, this bug does not appear to be fixed by the patch from comment 18[1], the wrong ACPI handle is still detected for the PCIe port.

 [1]: https://github.com/Bumblebee-Project/bbswitch/issues/65#issuecomment-21700439
Comment 21 Rafael J. Wysocki 2013-07-29 11:34:47 UTC
OK, I need some more debug info from those systems with that patch applied.

I'll send a debug patch for that later today.
Comment 22 Lan Tianyu 2013-08-06 08:52:38 UTC
Created attachment 107115 [details]
pci_is_bridge.patch

Hi Rafael:
       Do you have any update about this bug? After reviewing the patch in Comment 18, actually I don't find logical issue. But the pci_is_bridge() is suspicious. Because it depends on the pci->subordinate which will be populated when the bridge's child bus is allocated. When the bridge's pci dev is created and doing bind operation, the child bus is not allocated. So the pci_is_bridge() seems not work in this case. The attachment patch is to use hdr_type as condition to identify whether the pci dev is a bridge or not. This maybe help. Please have a look. Thanks.
Comment 23 Lan Tianyu 2013-08-06 13:31:46 UTC
With the patch in the comment 22, the patch in the comment 18 works.

https://github.com/Bumblebee-Project/bbswitch/issues/65#issuecomment-21700439
Comment 24 Rafael J. Wysocki 2013-08-06 13:57:53 UTC
Nice catch, thanks!

But please do not modify pci_is_bridge() this way here (we may do it separately later if people don't object).

I guess the best approach for fixing this issue would be to fold the change from the patch in comment #22 into the patch in comment #18.  I'll do that shortly.
Comment 25 Rafael J. Wysocki 2013-08-06 14:39:48 UTC
Created attachment 107119 [details]
ACPI / glue: Try to resolve _ADR collisions for bridges (v2)

This patch combines the patch from comment #18 with the patch from comment #22.

Please test it.
Comment 26 Vladimir Lalov 2013-08-06 16:52:28 UTC
I've just tried it on P177SM and it works well. Only the warning as some said in the bbswitch issue spoils the perfection:

ACPI Warning: \_SB_.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130517/nsarguments-95)
Comment 27 Rafael J. Wysocki 2013-08-06 21:16:56 UTC
Thanks for the confirmation!

The warning is a result of a recent ACPICA change unrelated to the issue tracked here.

The problem is that the 4th argument of the _DSM method is supposed to be a package and it looks like a buffer is passed there instead.  [ACPICA didn't warn about method argument type mismatches previously.]
Comment 28 Peter Wu 2013-08-06 21:37:03 UTC
@Vladimir
Many NVIDIA Optimus laptops expect a Buffer as fourth argument, this warning cannot be silenced without breaking those machines.

Related patches for the unrelated ACPI warning:
Use Buffer instead of Integer for nouveau:
http://lists.freedesktop.org/archives/dri-devel/2013-August/042730.html

Use Package instead of Buffer in i915 - bug 32602 (queued for -next)

Clarification why bbswitch uses Buffer:
https://github.com/Bumblebee-Project/bbswitch/commit/ee0591b
Comment 29 Vladimir Lalov 2013-08-07 06:01:40 UTC
Thank you for the explanation.
Comment 30 Robert Moore 2013-08-07 16:24:45 UTC
We can modify ACPICA to accept either a package or a buffer. Sometimes I wonder why we bother with an "ACPI specification", however.
Comment 31 Rafael J. Wysocki 2013-08-07 22:39:44 UTC
Well, that's a question to the people who wrote those BIOSes (and the the developers of whatever system they have been tested with ...).
Comment 32 Robert Moore 2013-08-08 21:40:11 UTC
/* Although the ACPI spec defines Arg3 as a Package, in practise
 * implementations expect a Buffer (CreateWordField and Index functions are
 * applied to it).
 */

Interestingly, if a package of discrete objects was supported by the _DSM, there would be no need to extract individual data items from a buffer object using CreateWordField, Index, etc. -- which, BTW, is far more error prone from both the driver and the BIOS perspectives.  :-(

Bob
Comment 33 Robert Moore 2013-08-09 16:56:09 UTC
I don't think we are quite finished with this.

I downloaded the acpidump and looked at the _DSM methods defined in the various tables (DSDT, SSDTs).

Looking at this particular method that causes a warning. It appears in SSDT5:

>ACPI Warning: \_SB_.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found
>[Buffer], ACPI requires [Package] (20130517/nsarguments-95)

    Scope (\_SB.PCI0)
    {
        Device (PEG0)
        {
            Method (_DSM, 4, Serialized)  // _DSM: Device-Specific Method
            {


This control method does not even reference Arg3, let alone create a buffer field on it.

As proof, here is an invocation of the method with the ACPI-defined arguments

    _SB_.PCI0.PEG0.PEGP._DSM (Buffer, Integer, Integer, Package)


- debug _SB_.PCI0.PEG0.PEGP._DSM (00 00 00 00) 1 2 [1]
Evaluating \_SB_.PCI0.PEG0.PEGP._DSM
[_DSM] @00000 #008C:  CreateByteField (Arg0, 0x03, GUID)

% args
Arguments for Method [_DSM]:  (4 arguments defined, max concurrency = 0)
    Arg0:   00314EB0 <Obj>             Buffer(4) 00 00 00 00
    Arg1:   00315630 <Obj>             Integer 0000000000000001
    Arg2:   00314D90 <Obj>             Integer 0000000000000002
    Arg3:   00315390 <Obj>             Package 00315390
    Arg4:   00000000 <Null Object>
    Arg5:   00000000 <Null Object>
    Arg6:   00000000 <Null Object>
% go
Evaluation of \_SB_.PCI0.PEG0.PEGP._DSM returned object 00273EE8, external buffer length 10
  [Integer] = 0000000080000001

It completes without error.

This makes me think that the original warning may in fact indicate a problem with the driver that is calling the method.

Bob
Comment 34 Rafael J. Wysocki 2013-08-09 23:48:02 UTC
OK, I suppose it's better to leave the warning as is for now and if necessary open a separate bug entry for tracking that issue.

I'm closing this entry, because the fix has just made it to the Linus' tree:

60f75b8 ACPI: Try harder to resolve _ADR collisions for bridges
Comment 35 Peter Wu 2013-08-10 09:14:51 UTC
Bob, the cause of the warning is the bbswitch kernel module, let's continue that discussion on https://github.com/Bumblebee-Project/bbswitch/commit/ee0591b

Note You need to log in before you can comment on or make changes to this bug.