Bug 43228 - Commit 1cc0c998fdf2cb665d625fb565a0d6db5c81c639 breaks turning off nvidia optimus card
Summary: Commit 1cc0c998fdf2cb665d625fb565a0d6db5c81c639 breaks turning off nvidia opt...
Status: CLOSED CODE_FIX
Alias: None
Product: ACPI
Classification: Unclassified
Component: Power-Other (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: acpi_power-other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-05-11 01:31 UTC by rocko
Modified: 2012-06-05 03:34 UTC (History)
5 users (show)

See Also:
Kernel Version: 3.4-rc6
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
acpi: fix regression on D3hot detection (1.33 KB, patch)
2012-05-12 16:40 UTC, Peter Wu
Details | Diff
PCI / ACPI / PM: Restore the old behavior of acpi_pci_set_power_state() (537 bytes, patch)
2012-05-15 17:48 UTC, Rafael J. Wysocki
Details | Diff
another patch that fixes the issue (679 bytes, text/plain)
2012-05-16 14:09 UTC, rocko
Details
ACPI / PCI / PM: Fix device PM regression related to D3cold (2.96 KB, patch)
2012-05-16 19:48 UTC, Rafael J. Wysocki
Details | Diff
ACPI / PCI / PM: Fix device PM regression related to D3hot/D3cold (3.08 KB, patch)
2012-05-16 20:16 UTC, Rafael J. Wysocki
Details | Diff
Fix for calling _PS3 for D3cold (1.61 KB, patch)
2012-05-16 20:29 UTC, Peter Wu
Details | Diff
ACPI / PCI / PM: Fix device PM regression related to D3hot/D3cold (3.77 KB, patch)
2012-05-16 20:41 UTC, Rafael J. Wysocki
Details | Diff
ACPI / PCI / PM: Fix device PM regression related to D3hot/D3cold (4.92 KB, patch)
2012-05-17 16:04 UTC, Rafael J. Wysocki
Details | Diff
ACPI: Debug acpi_bus_set_power() (940 bytes, patch)
2012-05-17 18:53 UTC, Rafael J. Wysocki
Details | Diff
ACPI / PCI / PM: Fix device PM regression related to D3hot/D3cold (4.80 KB, patch)
2012-05-17 21:25 UTC, Rafael J. Wysocki
Details | Diff

Description rocko 2012-05-11 01:31:02 UTC
The bbswitch kernel module from https://github.com/Bumblebee-Project turns off hybrid optimus nvidia cards when they are not being used. 

However, after commit 1cc0c998fdf2cb665d625fb565a0d6db5c81c639 (ACPI: Fix D3hot v D3cold confusion) bbswitch can no longer turn off the NVIDIA 540M video card on my Dell XPS15 (L502x). Reverting the commit fixes the problem.

http://lists.opensuse.org/opensuse-kernel/2012-05/msg00039.html looks like the same issue on another PC.
Comment 1 Peter Wu 2012-05-11 15:37:17 UTC
There are three commits between rc1 and rc6.

Commit 37239978778806ecba54da60676abb46870acebb removes setting the
valid flag when _PSx is found (where x is a state between 0 and 3).

3ebc81b8937d2bc1d0d0064bed29434dfce490aa is not particularly interesting, it
updates the validation for a state (states greater than D3cold are nonsense)
and would add a check that is removed in the next commit.

Part of the next commit 1cc0c998fdf2cb665d625fb565a0d6db5c81c639 diff
(indentation removed for formatting)
- /* State is valid if we have some power control */
- if (ps->resources.count || ps->flags.explicit_set)
+     /*
+      * State is valid if there are means to put the device into it.
+      * D3hot is only valid if _PR3 present.
+      */
+     if (ps->resources.count ||
+         (ps->flags.explicit_set && i < ACPI_STATE_D3_HOT))

I think that the last line is wrong. Shouldn't it be
"i <= ACPI_STATE_D3_HOT"? (note the equality sign). With the current
code, D3cold is ignored (okay I guess), but D3hot is ignored too
for all devices without a power resource for the given state. Can you
try changing "<" to "<=" as described?
Comment 2 rocko 2012-05-12 00:10:52 UTC
Nicely spotted! Changing it to "i <= ACPI_STATE_D3_HOT" also fixes the problem.

I wonder if the ACPI devs see this bug mailing list? I think subsystems use bugzilla, but others don't.
Comment 3 Alan 2012-05-12 00:14:51 UTC
See

Documentation/SubmittingPatches

and send a correcting fix.
Comment 4 Peter Wu 2012-05-12 16:40:28 UTC
Created attachment 73262 [details]
acpi: fix regression on D3hot detection

Well uhh, I'm not sure how exactly the mail should look like, so I'll put the patch here if you don't mind.

@rocko, I've put your name and address in the patch, assuming you do not mind.
Comment 5 Rafael J. Wysocki 2012-05-15 17:48:23 UTC
Created attachment 73303 [details]
PCI / ACPI / PM: Restore the old behavior of acpi_pci_set_power_state()

Does the attached patch fix the problem too?
Comment 6 rocko 2012-05-16 10:33:35 UTC
The patch in comment #5 does *not* fix the problem (ie after I reverted the patch in #4).
Comment 7 Peter Wu 2012-05-16 12:07:45 UTC
Rafael, the nvidia card can be disabled through the _PS3 method and is done
with `pci_set_power_state(dis_dev, PCI_D3hot);` as you can see on
https://github.com/Bumblebee-Project/bbswitch/blob/d12ce73a0d042fce3738aea800f55f8a23874e8c/bbswitch.c#L240

switcheroo does something similar.

I think I have found the real issue.
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=drivers/acpi/scan.c;h=7417267e88fa9763617e0cc7e36fde6154ff4335;hb=1cc0c998fdf2cb665d625fb565a0d6db5c81c639#l906

905         /* Set defaults for D0 and D3 states (always valid) */
906         device->power.states[ACPI_STATE_D0].flags.valid = 1;
907         device->power.states[ACPI_STATE_D0].power = 100;
908         device->power.states[ACPI_STATE_D3].flags.valid = 1;
909         device->power.states[ACPI_STATE_D3].power = 0;

With the change, ACPI_STATE_D3 has become "4":
include/acpi/actypes.h:502:#define ACPI_STATE_D3_HOT               (u8) 3
include/acpi/actypes.h:503:#define ACPI_STATE_D3                   (u8) 4
include/acpi/actypes.h:504:#define ACPI_STATE_D3_COLD              ACPI_STATE_D3

@rocko, can you revert the patches and try ACPI_STATE_D3_HOT instead of ACPI_STATE_D3 for those lines?
Comment 8 rocko 2012-05-16 14:09:58 UTC
Created attachment 73308 [details]
another patch that fixes the issue

@Peter, yes changing to ACPI_STATE_D3_HOT as you suggested does fix the problem. I've attached the output of git diff just to be clear what I changed.
Comment 9 Rafael J. Wysocki 2012-05-16 19:33:35 UTC
Well, this is not a correct fix.  I would prefer it if you didn't guess things, but tried to understand what really happened instead. :-)

acpi_bus_get_power_flags() works as intended, ie. it only sets the "valid" flag for ACPI_STATE_D3_HOT if _PR3 is present, so the problem has to be elsewhere.

So, let's see why it still doesn't work with the patch from comment #5 applied.

 * pci_set_power_state(dis_dev, PCI_D3hot) is called.

 * __pci_complete_power_transition(dev, PCI_D3hot) is called.

 * pci_platform_power_transition(dev, PCI_D3hot) is called.

 * platform_pci_set_power_state(dev, PCI_D3hot) is called.  This calls
   pci_platform_pm->set_state() with the same arguments, which for ACPI
   points to acpi_pci_set_power_state().

 * acpi_pci_set_power_state(dev, PCI_D3hot) is called.  With the patch from
   comment #5 applied it choses ACPI_STATE_D3 as the argument for
   acpi_bus_set_power(), which calls
   __acpi_bus_set_power(device, ACPI_STATE_D3), where ACPI_STATE_D3 == 4
   as you have observed already.

 * Thus object_name[] contains the string "_PS4", which obviously doesn't
   represent anything meaningful, right?

So, in addition to the patch in comment #5 we need the following change in __acpi_bus_set_power():

Index: linux/drivers/acpi/bus.c
===================================================================
--- linux.orig/drivers/acpi/bus.c
+++ linux/drivers/acpi/bus.c
@@ -250,6 +250,10 @@ static int __acpi_bus_set_power(struct a
 		return -ENODEV;
 	}
 
+	/* _PS3 should be executed for transitions into D3cold. */
+	if (state == ACPI_STATE_D3_COLD)
+		object_name[3] = '3';
+
 	/*
 	 * Transition Power
 	 * ----------------

I'll attach an updated patch in a while.
Comment 10 Rafael J. Wysocki 2012-05-16 19:48:07 UTC
Created attachment 73314 [details]
ACPI / PCI / PM: Fix device PM regression related to D3cold

Updated patch from comment #5, please test.
Comment 11 Rafael J. Wysocki 2012-05-16 20:16:52 UTC
Created attachment 73315 [details]
ACPI / PCI / PM: Fix device PM regression related to D3hot/D3cold

Small modification: it only is necessary to check for state == D3cold if we're going into a deeper state.

Please test!
Comment 12 Peter Wu 2012-05-16 20:29:37 UTC
Created attachment 73316 [details]
Fix for calling _PS3 for D3cold

Rafael, that was indeed a guess. Before trying to cook another patch I started understanding what you were exactly trying to achieve with your patch by looking at the spec and the related subsystems.

Funny enough my idea for a fix was the same as yours, setting the state to '3' explicitly. Unfortunately, this did not work.

Looking further, this makes sense since there is also a check before the _PSx method is executed:

    if (device->power.states[state].flags.explicit_set) {
explicit_set is 1 if _PSx was detected in acpi_bus_get_power_flags. The attached patch is tested and fixes this (three lines to satisfy the kernel coding style).

A quick grep showed that PCI_D3hot is used where "D3" is meant, i.e. "do not care about D3hot or D3cold, just save power by putting the device off". So yes, for compatibility reasons, it's probably better to keep the D3hot state as it for now.
Comment 13 Rafael J. Wysocki 2012-05-16 20:41:15 UTC
Created attachment 73317 [details]
ACPI / PCI / PM: Fix device PM regression related to D3hot/D3cold

Right.  As it turns out, I had one more ACPI patch applied and it fixed that last problem.

I'd prefer to do it like in this attached patch.  Can you please test if it works for you still?
Comment 14 Peter Wu 2012-05-16 22:43:53 UTC
I wish I could say yes, but it fails with:

    ACPI: Device [PEGP] failed to transition to D4

Looking at your changes, it became clear that code in __acpi_bus_set_power is wrongly placed. It should be placed in the statement following the "state < device->power.state" condition, not the else-statement after. Otherwise, it causes the function to call _PS4 (fails).

Now this fixes it at boot, but on suspend/resume I get the error again if bbswitch tries to put the device back in D3 state. bbswitch:

 1. Before suspend, it turns the card back on: bbswitch_on().
 2. bbswitch_on() calls pci_set_power_state(dis_dev, PCI_D0); somewhere
 3. After resume, it turns the card back off if necessary: bbswitch_off().
 4. bbswitch_off() calls pci_set_power_state(dis_dev, PCI_D3hot);

Before the D3 changes, I had no issues with suspend/resume. Perhaps there are more places where ACPI_STATE_D3 (4) should become .._HOT (3)? Do you want be to test the suspend/resume process while setting object_name[3] = '3' earlier (i.e. right before the if/else as in my last patch)?
Comment 15 Peter Wu 2012-05-17 07:17:27 UTC
Hmm, have I been drunk or something? Putting it in the if is wrong as it should indeed be called only "if target state is higher than current state". Perhaps there is a missing validation somewhere, something that assumed only one D3 state (comparison with ACPI_STATE_D3?). I'll dig again, any ideas / patches are appreciated.
Comment 16 Peter Wu 2012-05-17 07:52:28 UTC
I see one possibly (unrelated) error:
acpi_bus_update_power first calls acpi_bus_get_power and then tries to acpi_bus_set_power with the state returned by the getter.
But acpi_bus_get_power calls _PSC which returns at most 3 for D3, it does not make a distinction between D3hot and D3cold.

On the 258 acpidumps I checked, there is no \_SB._OSC method that could advertise both D3hot and D3 support, not does it have any _PR3 method as stated in table 6-147 of the ACPI Spec v5:
> Field name: _PR3 support
> This bit is set if OSPM supports reading _PR3and using power
> resources to switch power. Note this handshake translates to an
> operating model that the platform and OSPM supports both the power
> model containing both D3hot and D3.

From sect 2.4 Sleeping State Definitions of the ACPI spec v5:
> The D3hot state differs from the D3 state in two distinct parameters; the
> main power rail is present and software can access a device in D3hot. For 
> devices that support both D3hot and D3 exposed to OSPM via _PR3, device
> software/drivers must always assume OSPM will target D3and must
> assume device context will be lost.
Can't we stick to one D3 state then and add a flag for support of D3hot?
Comment 17 Rafael J. Wysocki 2012-05-17 16:03:24 UTC
(In reply to comment #14)
> I wish I could say yes, but it fails with:
> 
>     ACPI: Device [PEGP] failed to transition to D4
> 

I suppose that the source of the error message is acpi_power_transition(), which does some wrong things for D3cold (state == 4).

I'll attach an updated patch in a little while.
Comment 18 Rafael J. Wysocki 2012-05-17 16:04:50 UTC
Created attachment 73320 [details]
ACPI / PCI / PM: Fix device PM regression related to D3hot/D3cold

Please test this one.
Comment 19 Rafael J. Wysocki 2012-05-17 16:09:29 UTC
(In reply to comment #16)
> I see one possibly (unrelated) error:
> acpi_bus_update_power first calls acpi_bus_get_power and then tries to
> acpi_bus_set_power with the state returned by the getter.
> But acpi_bus_get_power calls _PSC which returns at most 3 for D3, it does not
> make a distinction between D3hot and D3cold.

I don't think this is a big deal at the moment, but we need to fix this.  Thanks for spotting it!

> On the 258 acpidumps I checked, there is no \_SB._OSC method that could
> advertise both D3hot and D3 support, not does it have any _PR3 method as
> stated
> in table 6-147 of the ACPI Spec v5:
> > Field name: _PR3 support
> > This bit is set if OSPM supports reading _PR3and using power
> > resources to switch power. Note this handshake translates to an
> > operating model that the platform and OSPM supports both the power
> > model containing both D3hot and D3.
> 
> From sect 2.4 Sleeping State Definitions of the ACPI spec v5:
> > The D3hot state differs from the D3 state in two distinct parameters; the
> > main power rail is present and software can access a device in D3hot. For 
> > devices that support both D3hot and D3 exposed to OSPM via _PR3, device
> > software/drivers must always assume OSPM will target D3and must
> > assume device context will be lost.
> Can't we stick to one D3 state then and add a flag for support of D3hot?

We might, but for that we'd need to revert some commits that have dependencies on top.  Besides, D3hot and D3cold are supposed to be different _states_.

My goal for now is to make pci_set_power_state(PCI_D3hot) work as before and we can take care of the other things (like comments printing D4 etc.) later.
Comment 20 Peter Wu 2012-05-17 16:57:33 UTC
It still does not work, not on boot nor after suspend/resume. Still the same erorr.

In case you are interested, I have checked the property of the PCI device (a nvidia VGA controller) that I am trying to turn off (after trying to bring it in D3hot which is at the moment an alias to D3cold):

    struct pci_dev *pdev; /* assume that pdev contains VGA controller dev */
    pdev->current_state == 3
I can add some debugging messages arround, what would you like to know?
Comment 21 Rafael J. Wysocki 2012-05-17 18:53:23 UTC
Created attachment 73321 [details]
ACPI: Debug acpi_bus_set_power()

Please apply this patch on top of the previous one and grep dmesg for the messages added by it.
Comment 22 Peter Wu 2012-05-17 20:59:34 UTC
I added a bit more debugging, the "transitioning from D..." message comes right before the if condition "state < device->power.state", acpi.debug_layer=0x550000:

[   16.968478] ACPI: transitioning from D255 to D4
[   16.968485] ACPI: Executed _PS4 for device [PEGP], status: 5
[   16.968487] ACPI: Device [PEGP] failed to transition to D4
[   17.007702]    utils-0286 [00] evaluate_integer      : Return value [0]
[   17.123579]    utils-0286 [00] evaluate_integer      : Return value [1]
[   17.175566]    utils-0286 [00] evaluate_integer      : Return value [28]

The same message occurs after resume. It seems that initially the power state is (unsigned char)-1 and therefore it thinks that it's greater than the target state D3. Since machines with _PR3 are uncommon (I haven't seen one so far), this issue could be hidden for a long time (without power resources, the end result is the same).

Again, I'm happy to test any patches :)
Comment 23 Rafael J. Wysocki 2012-05-17 21:21:02 UTC
(In reply to comment #22)
> I added a bit more debugging, the "transitioning from D..." message comes
> right
> before the if condition "state < device->power.state",
> acpi.debug_layer=0x550000:
> 
> [   16.968478] ACPI: transitioning from D255 to D4

Ah.

> [   16.968485] ACPI: Executed _PS4 for device [PEGP], status: 5

So the case is that the device is being transitioned directly from ACPI_STATE_UNKNOWN to D3cold, which isn't going to work.

This is a result of an initialization failure.  Most likely, __acpi_bus_get_power() fails for your device for some reason and I'd like to figure out why.

In the meantime, I'll attach another patch for you to try.
Comment 24 Rafael J. Wysocki 2012-05-17 21:25:46 UTC
Created attachment 73322 [details]
ACPI / PCI / PM: Fix device PM regression related to D3hot/D3cold

In this patch the method name for D3cold is always set to _PS3, regardless of the direction of the transition.
Comment 25 Peter Wu 2012-05-17 22:24:46 UTC
Thanks, your new patch fixes the bugs. The bbswitch module works properly on boot/resume.

Compared to 3.3.6, I found one difference when grepping for err|warn, previously I had the next line:

    ACPI: Video Device [PEGP] (multi-head: no  rom: yes  post: no)
now it shows:

    video: probe of LNXVIDEO:00 failed with error 5

On the 3.3.6 kernel, /sys/devices/LNXSYSTM:00/device:00/PNP0A08:00/device:01/LNXVIDEO:00/uevent shows:

    DRIVER=video
    MODALIAS=acpi:LNXVIDEO:
on the new 3.4-rc7 kernel, this is missing. The ACPI handle as shown through the path property is \_SB_.PCI0.P0P2.PEGP. acpidump is available at https://bugs.launchpad.net/lpbugreporter/+bug/752542/comments/488

I am not sure if it is related to this bug, but do not notice any missing functionality as a result of this error either.
Comment 26 Rafael J. Wysocki 2012-05-17 22:28:20 UTC
(In reply to comment #25)
> Thanks, your new patch fixes the bugs. The bbswitch module works properly on
> boot/resume.

Good, thanks for your patience!

> Compared to 3.3.6, I found one difference when grepping for err|warn,
> previously I had the next line:
> 
>     ACPI: Video Device [PEGP] (multi-head: no  rom: yes  post: no)
> now it shows:
> 
>     video: probe of LNXVIDEO:00 failed with error 5
> 
> On the 3.3.6 kernel,
> /sys/devices/LNXSYSTM:00/device:00/PNP0A08:00/device:01/LNXVIDEO:00/uevent
> shows:
> 
>     DRIVER=video
>     MODALIAS=acpi:LNXVIDEO:
> on the new 3.4-rc7 kernel, this is missing. The ACPI handle as shown through
> the path property is \_SB_.PCI0.P0P2.PEGP. acpidump is available at
> https://bugs.launchpad.net/lpbugreporter/+bug/752542/comments/488
> 
> I am not sure if it is related to this bug, but do not notice any missing
> functionality as a result of this error either.

This is a different problem, but I think it's worth looking into.

Can you please create a separate bug entry for it and let me know the number?
Comment 27 rocko 2012-05-18 03:59:00 UTC
Wow, that made it into the kernel git tree quickly!

The patches fixes the issue on my laptop too, thanks.
Comment 28 Peter Wu 2012-05-18 08:44:14 UTC
@Rafael, seems that it's caused by:

commit ea9f8856bd6d4ed45885b06a338f7362cd6c60e5
Author: Igor Murzov <intergalactic.anonymous@gmail.com>
Date:   Fri Mar 30 21:32:08 2012 +0400

    ACPI video: Harden video bus adding.
    
    It is always better to check return values, so add some new checks and
    correct existing ones.
    
    v2: Be consistent and don't mix errors from -E* and AE_* namespaces.
    
    Signed-off-by: Igor Murzov <e-mail@date.by>
    Signed-off-by: Len Brown <len.brown@intel.com>

In particular:
@@ -1343,15 +1343,17 @@ static int
 acpi_video_bus_get_devices(struct acpi_video_bus *video,
                           struct acpi_device *device)
 {
-       int status = 0;
+       int status;
        struct acpi_device *dev;

-       acpi_video_device_enumerate(video);
+       status = acpi_video_device_enumerate(video);
+       if (status)
+               return status;

And indeed, previously I got an ACPI exception and now I still do, but with addition of the probe failure:
    ACPI Exception: AE_NOT_FOUND, Evaluating _DOD (20120320/video-1147)
    video: probe of LNXVIDEO:00 failed with error 5

So I think this is the expected behaviour?
Comment 29 Rafael J. Wysocki 2012-05-22 20:38:05 UTC
Yes, it is, but I still wonder why acpi_bus_init_power() doesn't get you the correct power state.
Comment 30 Peter Wu 2012-05-23 09:00:49 UTC
Where do you conclude that from? Note that the other files (lspci.txt) in that tarball may not be that accurate as devices may have been disabled manually.
Comment 31 Len Brown 2012-06-05 03:34:29 UTC
5c7dd710f691d1b44c39e32d2f05b4286ff51f99
(ACPI / PCI / PM: Fix device PM regression related to D3hot/D3cold)
shipped in linux-3.4

closed.

the video discussion should go to another forum or bug report.

Note You need to log in before you can comment on or make changes to this bug.