Bug 194889

Summary: Bisected: BBSwitch issue - gets incorrect acpi handle returned for nvida card
Product: ACPI Reporter: mike
Component: Config-OtherAssignee: Rafael J. Wysocki (rjw)
Status: RESOLVED PATCH_ALREADY_AVAILABLE    
Severity: normal CC: jmatic, kernel, luis.zaldivar, lv.zheng, rjw, rui.zhang
Priority: P1    
Hardware: Intel   
OS: Linux   
Kernel Version: 4.10.1 4.10.2 4.10.3 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: dmesg for 4.10.2
dmesg for 4.9.14 - working
acpidump
lspci -vv
lspci -vvx
grep . /sys/bus/acpi/devices/*/modalias /sys/bus/acpi/devices/*/path
lspci -vvx
PATCH: Introduce new flag for devices with _HID/_CID
Debug patch against glue.c

Description mike 2017-03-15 10:56:30 UTC
Created attachment 255255 [details]
dmesg for 4.10.2

Hardware lenovo optimus laptop with 3D controller: NVIDIA Corporation GK208M [GeForce GT 730M]. Running Arch

On kernel 4.9.x and before BBSwitch returns

bbswitch: Found discrete VGA device 0000:09:00.0: \_SB_.PCI0.RP05.PEGP

on kernel 4.10.1/2 BBSwitch returns

bbswitch: Found discrete VGA device 0000:09:00.0: \_SB_.PCI0.RP05.PXSX

Using acpi_call on 4.10.x shows that \_SB_.PCI0.RP05.PEGP is correct.
Comment 1 mike 2017-03-15 10:57:14 UTC
Created attachment 255257 [details]
dmesg for 4.9.14 - working
Comment 2 mike 2017-03-15 10:57:53 UTC
Created attachment 255259 [details]
acpidump
Comment 3 mike 2017-03-15 10:58:44 UTC
Created attachment 255261 [details]
lspci -vv
Comment 4 mike 2017-03-15 11:03:27 UTC
I should say that obviously this breaks BBSwitch and it's ability to switch off the nvidia card.

Also, though possibly unrelated, Bumblebee does not function with 4.10.1/2. While it works fine in 4.9.14 and before.

There are a few other reports on ARCH forums and BUG Tracker with similar issues
Comment 5 Lv Zheng 2017-03-23 05:39:31 UTC
Marking this as duplicate due to:
[    6.093918] lenovoU430 kernel: bbswitch: failed to evaluate \_SB_.PCI0.RP05.PXSX._DSM {0xF8,0xD8,0x86,0xA4,0xDA,0x0B,0x1B,0x47,0xA7,0x2B,0x60,0x42,0xA6,0xB5,0xBE,0xE0} 0x100 0x0 {0x00,0x00,0x00,0x00}: AE_NOT_FOUND
[    6.093925] lenovoU430 kernel: bbswitch: failed to evaluate \_SB_.PCI0.RP05.PXSX._DSM {0xA0,0xA0,0x95,0x9D,0x60,0x00,0x48,0x4D,0xB3,0x4D,0x7E,0x5F,0xEA,0x12,0x9F,0xD4} 0x102 0x0 {0x00,0x00,0x00,0x00}: AE_NOT_FOUND
[    6.093935] lenovoU430 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160930/nsarguments-95)

*** This bug has been marked as a duplicate of bug 194431 ***
Comment 6 mike 2017-03-23 11:06:50 UTC
Not a duplicate as far as I can tell. 

My issue isn't the _DSM calls, it's that the wrong acpi handle gets returned to bbswitch prior to this. When the correct handle is returned and used, as with 4.9.x, bbswitch functions correctly. 

An error is correctly thrown for \_SB.PCI0.GFX0._DSM as that's the Intel card. It is also correct for \_SB_.PCI0.RP05.PXSX._DSM as this is an invalid handle. The PXSX should be PEGP.

If I manually make that call with acpi_call with the correct handle, I get no error and a good response.
Comment 7 Lv Zheng 2017-03-24 02:35:12 UTC
OK.
Reopening and checking the logs.
Comment 8 Matic 2017-03-28 15:51:30 UTC
I'm having similar problems on acer aspire v5. Possibly related to https://bugzilla.kernel.org/show_bug.cgi?id=60829 ?
Comment 9 Lv Zheng 2017-03-29 08:16:43 UTC
Is that possible to bisect the culprit out?
Comment 10 Lv Zheng 2017-03-29 08:27:46 UTC
I checked the decompiled tables.
PEGP is defined in ssdt5:
DefinitionBlock ("", "SSDT", 1, "LENOVO", "CB-01   ", 0x00000001)
{
    ...
    Scope (\_SB.PCI0.RP05)
    {
        Device (PEGP)
        {
            Name (_ADR, Zero)  // _ADR: Address
            Method (_PRW, 0, NotSerialized)  // _PRW: Power Resources for Wake
            {
                Return (GPRW) /* External reference */
                0x09
                0x04
            }
        }
    }
It seems in DSDT, every RPxx device contains a PXSX as follows:
            Device (PXSX)
            {
                Name (_ADR, Zero)  // _ADR: Address
                Method (_PRW, 0, NotSerialized)  // _PRW: Power Resources for Wake
                {
                    Return (GPRW (0x69, 0x04))
                }
            }

However I don't know whether BBSwitch chose PXSX rather than PEGP.
It looks it's BBSwitch specific logic:
[    6.093893] lenovoU430 kernel: bbswitch: version 0.8
[    6.093899] lenovoU430 kernel: bbswitch: Found integrated VGA device 0000:00:02.0: \_SB_.PCI0.GFX0
[    6.093906] lenovoU430 kernel: bbswitch: Found discrete VGA device 0000:09:00.0: \_SB_.PCI0.RP05.PXSX
[    6.093918] lenovoU430 kernel: bbswitch: failed to evaluate \_SB_.PCI0.RP05.PXSX._DSM {0xF8,0xD8,0x86,0xA4,0xDA,0x0B,0x1B,0x47,0xA7,0x2B,0x60,0x42,0xA6,0xB5,0xBE,0xE0} 0x100 0x0 {0x00,0x00,0x00,0x00}: AE_NOT_FOUND
[    6.093925] lenovoU430 kernel: bbswitch: failed to evaluate \_SB_.PCI0.RP05.PXSX._DSM {0xA0,0xA0,0x95,0x9D,0x60,0x00,0x48,0x4D,0xB3,0x4D,0x7E,0x5F,0xEA,0x12,0x9F,0xD4} 0x102 0x0 {0x00,0x00,0x00,0x00}: AE_NOT_FOUND
[    6.093935] lenovoU430 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160930/nsarguments-95)
[    6.093978] lenovoU430 kernel: bbswitch: No suitable _DSM call found.

We don't have BBSwitch in the Linux kernel upstream.
So why don't you ask support from BBSwitch community?

Thanks
Lv
Comment 11 Matic 2017-03-29 11:52:09 UTC
I think bbswitch just goes through pci devices with right pci class and uses ACPI_HANDLE to get the handle. I'm guessing ACPI_HANDLE returns the wrong handle.


https://github.com/Bumblebee-Project/bbswitch/blob/master/bbswitch.c
line 390
Comment 12 mike 2017-03-29 13:16:47 UTC
(In reply to Lv Zheng from comment #9)
> Is that possible to bisect the culprit out?

I'm going to have a look at doing this. Might be a little while to get through it on my laptop. 

(In reply to Matic from comment #8)
> I'm having similar problems on acer aspire v5. Possibly related to
> https://bugzilla.kernel.org/show_bug.cgi?id=60829 ?

Certainly looks like it is to me. Interestingly, by going through git commits makes me suspect commit "c2a6bbaf0c5f90463a7011a295bbdb7e33c80b51 ACPI / scan: Prefer devices without _HID/_CID for _ADR matching" which is only one that seems to directly affect the patch from your bug since 4.9.x.
Comment 13 mike 2017-03-29 18:22:26 UTC
Can confirm that commit "c2a6bbaf0c5f90463a7011a295bbdb7e33c80b51 ACPI / scan: Prefer devices without _HID/_CID for _ADR matching" is the cause of this.

Not fully bisected, but took apportity between builds to rebuild arch mainline kernel with this reverted by commenting out the relevant change in glue.c line 108/9.

return sta_present && list_empty(&adev->pnp.ids) ?
			FIND_CHILD_MAX_SCORE : FIND_CHILD_MIN_SCORE;

becomes 

return sta_present /* && list_empty(&adev->pnp.ids) */ ?
			FIND_CHILD_MAX_SCORE : FIND_CHILD_MIN_SCORE;

Then both bbswitch and bumblebee function correctly. The proper handle is returned.

For completeness I tried

return ( sta_present && list_empty(&adev->pnp.ids) ) ?
			FIND_CHILD_MAX_SCORE : FIND_CHILD_MIN_SCORE;

in case this was being evaluated incorrectly, but this resulted in the original error.
Comment 14 Zhang Rui 2017-03-30 01:15:10 UTC
At the first glance, I think the problem is probably related to the device enumeration order.

Say, there may be multiple devices (acpi handles) that match the logic in bbswitch_init(), and the first one matched will be returned. In this case, if kernel changes the device' enumeration order, it may result in different acpi handle got by bbswitch.

anyway, please attach the lspci -vvx output.
Comment 15 Zhang Rui 2017-03-30 01:50:37 UTC
    Scope (\_SB.PCI0.RP05)
    {
        Device (PEGP)
        {
            Name (_ADR, Zero)  // _ADR: Address
            Method (_PRW, 0, NotSerialized)  // _PRW: Power Resources for Wake
            {
                Return (GPRW) /* External reference */
                0x09
                0x04
            }
        }
    }

        Device (RP05)
        {
            ...
            Device (PXSX)
            {
                Name (_ADR, Zero)  // _ADR: Address
                Method (_PRW, 0, NotSerialized)  // _PRW: Power Resources for Wake
                {
                    Return (GPRW (0x69, 0x04))
                }
            }
            ...
        }

First of all, both PXSX and PEGP are under RP05, but they have the same _ADR, this is clearly a violation of ACPI spec.
Only \_SB.PCI0.RP05.PEGP has _DSM method, so PEGP is the right handle we should return, in order to make the bbswitch work.
Comment 16 Zhang Rui 2017-03-30 02:05:29 UTC
please attach the output of "grep . /sys/bus/acpi/devices/*/modalias /sys/bus/acpi/devices/*/path"
Comment 17 Lv Zheng 2017-03-30 02:48:47 UTC
> Both PXSX and PEGP are under RP05, but they have the same _ADR, this is
> clearly a violation of ACPI spec.

We need to think about a possiblity that the _ADR is not such critical in practice.
It looks this is not an isolated case related to _ADR.
Comment 18 mike 2017-03-30 08:14:20 UTC
Created attachment 255635 [details]
lspci -vvx

lspci -vvx 

Kernel Arch 4.10.6
bbswitch not loaded, so nvidia card in default and powered state
Comment 19 mike 2017-03-30 08:17:33 UTC
Created attachment 255637 [details]
grep . /sys/bus/acpi/devices/*/modalias /sys/bus/acpi/devices/*/path
Comment 20 mike 2017-03-30 09:32:01 UTC
Created attachment 255639 [details]
lspci -vvx

kernel 4.10.6 Arch mainline

bbswitch not working and nvidia/nouveau modules not loaded. Card powered.
Redone with sudo.
Comment 21 Zhang Rui 2017-03-30 14:27:03 UTC
/sys/bus/acpi/devices/LNXVIDEO:00/modalias:acpi:LNXVIDEO:
/sys/bus/acpi/devices/LNXVIDEO:00/path:\_SB_.PCI0.RP05.PEGP

This explains why the commit "c2a6bbaf0c5f90463a7011a295bbdb7e33c80b51 ACPI / scan: Prefer devices without _HID/_CID for _ADR matching" breaks bbswitch.

Before the commit, 
      return sta_present ? FIND_CHILD_MAX_SCORE : FIND_CHILD_MIN_SCORE;
FIND_CHILD_MAX_SCORE is returned for PEGP.
After the commit
      return sta_present && list_empty(&adev->pnp.ids) ?
			FIND_CHILD_MAX_SCORE : FIND_CHILD_MIN_SCORE;
FIND_CHILD_MIN_SCORE is returned for PEGP because, as a video bus device, a fake pnpid "LNXVIDEO" is assigned to PEGP.

So the problem is that we should not use list_empty(&adev->pnp.ids) to check the existence of _HID/_CID.

A debug patch will attached later.
Comment 22 Zhang Rui 2017-03-30 14:45:55 UTC
Created attachment 255641 [details]
PATCH: Introduce new flag for devices with _HID/_CID

Please apply this patch on top of the bogus commit and see if it helps.
Comment 23 mike 2017-03-30 15:54:48 UTC
Patch works for me. Using arch 4.10.6 kernel. Doesn't seem to have caused any other issues so far.
Comment 24 Rafael J. Wysocki 2017-03-30 19:11:41 UTC
(In reply to Zhang Rui from comment #22)
> Created attachment 255641 [details]
> PATCH: Introduce new flag for devices with _HID/_CID
> 
> Please apply this patch on top of the bogus commit and see if it helps.

It is not bogus, but overly coarse grained.

Anyway, thanks for looking into this.
Comment 25 Rafael J. Wysocki 2017-03-30 19:26:52 UTC
Created attachment 255645 [details]
Debug patch against glue.c

I'm wondering, though, if the new flag is really needed.

Something like this patch should suffice, but of course it needs to be tested on the system with the issue that the problematic commit attempted to address.
Comment 26 Zhang Rui 2017-03-31 04:31:52 UTC
(In reply to Rafael J. Wysocki from comment #25)
> Created attachment 255645 [details]
> Debug patch against glue.c
> 
> I'm wondering, though, if the new flag is really needed.

I don't know the background of c2a6bbaf0c5f ("ACPI / scan: Prefer devices without _HID/_CID for _ADR matching"), is there a particular problem that this fixes? If checking _HID is sufficient for that case, surely we can make use of platform_id flag.
Comment 27 Rafael J. Wysocki 2017-03-31 08:42:07 UTC
(In reply to Zhang Rui from comment #26)
> (In reply to Rafael J. Wysocki from comment #25)
> > Created attachment 255645 [details]
> > Debug patch against glue.c
> > 
> > I'm wondering, though, if the new flag is really needed.
> 
> I don't know the background of c2a6bbaf0c5f ("ACPI / scan: Prefer devices
> without _HID/_CID for _ADR matching"), is there a particular problem that
> this fixes?

Yes, it does.  Hence the Tested-by: tag. :-)
Comment 28 mike 2017-03-31 09:21:18 UTC
(In reply to Rafael J. Wysocki from comment #25)
> Created attachment 255645 [details]
> Debug patch against glue.c
> 
> I'm wondering, though, if the new flag is really needed.
> 
> Something like this patch should suffice, but of course it needs to be
> tested on the system with the issue that the problematic commit attempted to
> address.

This patch is working too.
Comment 29 Rafael J. Wysocki 2017-03-31 22:57:27 UTC
The patch from comment #25 has been submitted as https://patchwork.kernel.org/patch/9657331/