Bug 15100

Summary: s2ram is broken for (un-)docked T400
Product: Drivers Reporter: Toralf Förster (toralf.foerster)
Component: PNPAssignee: Rafael J. Wysocki (rjw)
Status: CLOSED CODE_FIX    
Severity: normal CC: acpi-bugzilla, bjorn.helgaas, chris, jbarnes, rjw, shaohua.li, yakui.zhao
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.32 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 7216, 14230    
Attachments: kernel log message
cat /sys/kernel/debug/kmemleak
.config
debug patch
/var/log/messages.log
kernel crash for 2.6.35
kernel crash for 2.6.37
debug patch2
dmesg
kernel crash for 2.6.37 + 2nd debug patch
kernel crash for 2.6.37 + 2nd debug patch + debug info compiled into
PNP: Do not trust dev->data in pnpacpi_resume()
kernel crash for 2.6.37 + 2nd ry
PNP: Do not trust dev->data in pnpacpi_resume() and don't ignore errors returned by it
PNP: Do not trust dev->data in pnpacpi_resume() and don't ignore errors returned by it
dmesg
PNP: Remove ACPI-based devices that are not present during resume
PNP: Use DEVICE_ACPI_HANDLE() for accessing companion ACPI devices
kmemleak

Description Toralf Förster 2010-01-21 08:56:31 UTC
If I suspend to RAM my ThinkPad T400 at home (docking station with 1680x1050 HDMI2 monitor) and switch to my office (internal LVDS1 with 1440x900) and resume my notebook by opening the lid, the screen is black. Ctrl+Alt+Backspace works however.
This wasn't the case with 2.6.31.X kernels. Starting with 2.6.32.X however now I'm facing this issue.
Comment 1 ykzhao 2010-01-25 15:10:58 UTC
Will you please try the latest kernel(2.6.33-rc5) and see whether the issue still exists?

Thanks.
Comment 2 Rafael J. Wysocki 2010-01-27 00:40:03 UTC
I think it's better to test 2.6.33-rc6 when it's out.
Comment 3 Toralf Förster 2010-01-30 09:05:37 UTC
With 2.6.33-rc6 I'm in addition not longer able to press Ctrl+Alt+Del after wakeup - btw the SysRQ keys don't work any more (compared to 2.6.32.7).
Comment 4 Rafael J. Wysocki 2010-01-30 18:13:15 UTC
Well, it looks like no one has any idea about what the problem is.  Would it be possible to bisect the kernels between 2.6.31 and 2.6.32?
Comment 5 Toralf Förster 2010-01-30 19:25:30 UTC
(In reply to comment #4)
> Would it be possible to bisect the kernels between 2.6.31 and 2.6.32?
It's not a technical question it is rather a question whether bisection could be limited to a given subdirectory or whether it is not worth b/c of the risk of having an issue somewhere else. I'm asking b/c unfortunately this bisect cannot be made by a simple script.
Comment 6 Rafael J. Wysocki 2010-01-30 19:40:36 UTC
There always is a risk like that, so you can only bisect the commits that modify i915 directly (I don't think the problem is anywhere outside).
Comment 7 Toralf Förster 2010-01-30 22:17:02 UTC
I bisected it to this :

tfoerste@n22 ~/devel/linux-2.6 $ git bisect good
There are only 'skip'ped commits left to test.
The first bad commit could be any of:
af729a26ccc3ff9ad834a5e96f455aab20f176cd
c1c7af60892070e4b82ad63bbfb95ae745056de0
We cannot bisect more!


I skipped 2.6.31-rc9-00107-gaf729a2 b/c suspend2ram didn't worked.
Comment 8 Toralf Förster 2010-01-30 22:19:08 UTC
forget to mention I used this :

git bisect start v2.6.32 v2.6.31 -- ./include/config/drm/i915 ./drivers/gpu/drm/i915
Comment 9 Rafael J. Wysocki 2010-01-30 22:23:36 UTC
I'd try to revert af729a26ccc3ff9ad834a5e96f455aab20f176cd and c1c7af60892070e4b82ad63bbfb95ae745056de0.  If that fixes the problem, you can revert just one of them and see what happens.  This should give you the answer, otherwise the bisection just went wrong.
Comment 10 Toralf Förster 2010-01-31 09:59:09 UTC
c1c7af6 is the culprit, the previous commit 7e12715 works fine.
Comment 11 Rafael J. Wysocki 2010-01-31 12:21:59 UTC
Thanks for verifying.

commit c1c7af60892070e4b82ad63bbfb95ae745056de0
Author: Jesse Barnes <jbarnes@virtuousgeek.org>
Date:   Thu Sep 10 15:28:03 2009 -0700

    drm/i915: force mode set at lid open time

    Some laptop platforms will disable pipes and/or planes at lid close time
    and not restore them when the lid is opened again.  So catch the lid
    event, and if the lid was opened, force a mode restore.

    Fixes fdo bug #21230.

    Acked-by: Matthew Garrett <mjg@redhat.com>
    Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
    Signed-off-by: Eric Anholt <eric@anholt.net>

First-Bad-Commit : c1c7af60892070e4b82ad63bbfb95ae745056de0
Comment 12 Jesse Barnes 2010-02-05 19:06:06 UTC
After suspending on the dock, then opening your lid undocked, does your ACPI lid file indicate your lid is open?  You can just cat /proc/acpi/button/lid/LID/state to check.
Comment 13 Toralf Förster 2010-02-05 20:00:54 UTC
yes, it is "open".
Comment 14 Jesse Barnes 2010-02-05 20:16:08 UTC
Presumably this patch gets your old behavior back?  I wonder why X isn't resetting its mode on resume...

diff --git a/drivers/gpu/drm/i915/intel_lvds.c b/drivers/gpu/drm/i915/intel_lvds
index aa74e59..5ceb43f 100644
--- a/drivers/gpu/drm/i915/intel_lvds.c
+++ b/drivers/gpu/drm/i915/intel_lvds.c
@@ -719,7 +719,7 @@ static int intel_lid_notify(struct notifier_block *nb, unsig
        dev_priv->modeset_on_lid = 0;
 
        mutex_lock(&dev->mode_config.mutex);
-       drm_helper_resume_force_mode(dev);
+       //      drm_helper_resume_force_mode(dev);
        mutex_unlock(&dev->mode_config.mutex);
 
        return NOTIFY_OK;
Comment 15 Toralf Förster 2010-02-05 21:03:17 UTC
(In reply to comment #14)
> Presumably this patch gets your old behavior back?  I wonder why X isn't
> resetting its mode on resume...
> 
No success (tested with 2.6.32.7).
Comment 16 Jesse Barnes 2010-02-05 21:13:06 UTC
So reverting c1c7af60892070e4b82ad63bbfb95ae745056de0 gets your suspend/resume behavior back, but commenting out the drm_helper_resume_force_mode in intel_lid_notify doesn't?  That would mean the lid notification itself was breaking things somehow...
Comment 17 Toralf Förster 2010-02-05 21:21:08 UTC
(In reply to comment #16)
> So reverting c1c7af60892070e4b82ad63bbfb95ae745056de0 gets your
> suspend/resume
> behavior back
yes (bisected in git kernel trunk)

> but commenting out the drm_helper_resume_force_mode in
> intel_lid_notify doesn't? 
yes (tested with vanilla 2.6.32.7)
Comment 18 Jesse Barnes 2010-02-05 21:30:27 UTC
What if you remove the call to acpi_lid_notifer_register in lvds_init?
Comment 19 Toralf Förster 2010-02-06 10:30:17 UTC
(In reply to comment #18)
> What if you remove the call to acpi_lid_notifer_register in lvds_init?

No effect in 2.6.32.7 and only small in .8-rc2 (when I putted the ThinkPad into the docking station, I could see the mouse pointer within a black screen, funny ebough when I pressed Alt+F1, I was on the conolsembut the mouse piinter was still visible).

However tweaking in the git tree the bad commit by this solved the issue:

tfoerste@n22 ~/devel/linux-2.6 $ git describe
v2.6.31-rc9-109-gc1c7af6

diff --git a/drivers/gpu/drm/i915/intel_lvds.c b/drivers/gpu/drm/i915/intel_lvds.c
index e190ff7..3f90491 100644
--- a/drivers/gpu/drm/i915/intel_lvds.c
+++ b/drivers/gpu/drm/i915/intel_lvds.c
@@ -641,8 +641,8 @@ static int intel_lid_notify(struct notifier_block *nb, unsigned long val,
        struct drm_device *dev = dev_priv->dev;

        if (acpi_lid_open())
-               drm_helper_resume_force_mode(dev);
-
+               //drm_helper_resume_force_mode(dev);
+               ;
        return NOTIFY_OK;
Comment 20 Rafael J. Wysocki 2010-02-15 20:43:59 UTC
On Monday 15 February 2010, Toralf Förster wrote:
> 
> Rafael J. Wysocki wrote at 00:52:50
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.31 and 2.6.32.
> > 
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.31 and 2.6.32.  Please verify if it still should
> > be listed and let the tracking team know (either way).
> > 
> > 
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=15100
> > Subject             : X11 is black after resume from s2ram if my T400 was
> previous in
> >  docking station before Submitter   : Toralf Förster
> <toralf.foerster@gmx.de>
> > Date                : 2010-01-21 08:56 (25 days old)
> > 
> Issue present in 2.6.33-rc8
Comment 21 Toralf Förster 2010-02-16 09:33:46 UTC
FWIW with kernel 2.6.31 xrandr tells me that I've DVI1 and DVI2 whereas with newer kernel xrandr named it HDMI1 and HDMI2. The docking station itself has a DVI port.
Comment 22 Jesse Barnes 2010-02-26 22:48:29 UTC
Does http://patchwork.kernel.org/patch/80474/ on top of 2.6.33 help?
Comment 23 Toralf Förster 2010-02-27 09:58:27 UTC
(In reply to comment #22)
> Does http://patchwork.kernel.org/patch/80474/ on top of 2.6.33 help?

No, neither on 2.6.33 nor on 2.6.32.9 :-(
Comment 24 Toralf Förster 2010-03-16 17:48:35 UTC
solved at least with 2.6.32.10
Comment 25 Toralf Förster 2010-03-18 21:23:24 UTC
KDE people says it is a kernel bug : https://bugs.kde.org/show_bug.cgi?id=231251(In reply to comment #24)
> solved at least with 2.6.32.10
No, b/c KDE people says it is a kernel bug : https://bugs.kde.org/show_bug.cgi?id=231251
Comment 26 Toralf Förster 2010-04-03 09:53:31 UTC
BTW this issue is still in 2.6.33.2 + xorg 1.7.6 + intel driver 2.11.
If I suspend the docked notebook, undock it and  wake it up, the screen is black, Ctrl-Alt-Del often doesn't work, howeeer SysRq key sequencies work.
Comment 27 Toralf Förster 2010-04-04 09:42:00 UTC
The issue disappeared between v2.6.34-rc3[or .2 ?] and v2.6.34-rc3-163-g5e1161.
Furthermore I update pixman to 0.18. Can't remember any other relevant changes made at this system.
Comment 28 Toralf Förster 2010-06-09 08:28:13 UTC
The issue is back in 2.6.34. I suspend the ThinkPad within the docking station, unplug it - and it will not wakeup - no keyboard action possible, even SysRq keys don't work.
Comment 29 Jesse Barnes 2010-07-23 19:48:56 UTC
does this still occur with 2.6.35-rc6 and with an xserver from git master of the X tree?  Several fixes landed there that are related.
Comment 30 Toralf Förster 2010-07-24 10:41:25 UTC
(In reply to comment #29)
> does this still occur with 2.6.35-rc6 and with an xserver from git master of
> the X tree?  Several fixes landed there that are related.

Yes, 2.6.35-rc6-19-g86c65a7 shows the same behaviour. Furthermore it is independend from X11. If I boot with "init 2" (nonetwork, no X, only few services up), run 
$>sync; sleep 2; sync; echo mem > /sys/power/state
and plug the notebook oout of the docking station - then the system is unresponsive after opening the LID.
Comment 31 Toralf Förster 2010-08-05 15:09:27 UTC
changed the title accordingly to the current behaviour
Comment 32 Toralf Förster 2010-09-03 07:36:54 UTC
Created attachment 28922 [details]
kernel log message

With the new BIOS and kernel 2.6.35.4 it works now, however the kernel seems to have some memory problems pls see the attached file. In addition I'll attach the kmemleak file too.
Comment 33 Toralf Förster 2010-09-03 07:37:35 UTC
Created attachment 28932 [details]
cat /sys/kernel/debug/kmemleak
Comment 34 Chris Wilson 2010-09-11 09:15:22 UTC
Fixed the leak [1dfd9754cd55e424f247d9a2e855ad384e3e90ef].

Can you confirm working s2ram on the current kernel? [I don't want to too hastily close this report without knowing why.]
Comment 35 Toralf Förster 2010-09-11 09:54:06 UTC
(In reply to comment #34)
> Can you confirm working s2ram on the current kernel? [I don't want to too
> hastily close this report without knowing why.]
Unfortunately not, I was to optimistic.
Comment 36 Chris Wilson 2010-09-11 09:59:39 UTC
Good to know, thanks.
Comment 37 Toralf Förster 2010-12-03 18:11:04 UTC
Well, one or more commits between 27th of November and today solved this issue, kernel v2.6.37-rc4-158-ga7ca0e0 s2ram's and wakes up fine.
Comment 38 Chris Wilson 2010-12-16 15:14:19 UTC
Nope, none the wiser. Nothing between the 11-27 and 12-03 in i915 look responsible for the fix.

Please do keep an eye on this and reopen if suspend fails once again.
Comment 39 Toralf Förster 2010-12-18 16:38:16 UTC
Created attachment 40752 [details]
.config

Well, the issue consists of 2 parts - both I bisected now within the last 3 days.

The first (smaller) issue was introduced with commit c1c7af6 (between 2.6.31 and 2.6.32)):

commit c1c7af60892070e4b82ad63bbfb95ae745056de0
Author: Jesse Barnes <jbarnes@virtuousgeek.org>
Date:   Thu Sep 10 15:28:03 2009 -0700

    drm/i915: force mode set at lid open time
    
    Some laptop platforms will disable pipes and/or planes at lid close time
    and not restore them when the lid is opened again.  So catch the lid
    event, and if the lid was opened, force a mode restore.
    
    Fixes fdo bug #21230.
    

The issue itself is, that if a T400 is docked, booted, s2ram'ed, unplugged from the docking station and waked up - in that order - the internal notebook display is black. If I put the T400 now back into the docking station, then the external monitor (DVI output of the docking station) works fine (fortunately).


The second (bigger) issue was introduced between 2.6.32 and 2.6.33 with commit c4da694:

commit c4da6940a7a41c72781ff2d62ebd4b99f3749f14
Author: Bjorn Helgaas <bjorn.helgaas@hp.com>
Date:   Tue Nov 17 17:05:14 2009 -0700

    PNPACPI: save struct acpi_device, not just acpi_handle
    
    Some drivers need to look at things in the acpi_device structure
    besides the handle.
    


This has the effect, that after the wakeup (see steps before) the system hangs - no syslog messages, nothing to see at the internal display - neither Ctrl-Alt-Del nor SysRq key sequences works. If I plug in that system into the docking station, I see a blinking cursor in the upper left corner - that's all. I've to power off the system.



I was able to revert commit c4da694 on the current git kernel v2.6.37-rc6-11-gc0708b1 (only one hunk had to be applied manually) - the bigger issue went away, and the smaller issue is still present.

But unfortunately I could not revert commit c1c7af6, because a "patch -p1 --dry-run -R" gave for 4 hunks a "FAILED" :-(.
Comment 40 Bjorn Helgaas 2010-12-21 18:21:42 UTC
I took a look at c4da6940a7 (PNPACPI: save struct acpi_device, not just
acpi_handle) again, and it's hard to see how reverting that would make
a difference.

I assume suspend/resume works fine if you are always docked, or if you
are always undocked?  Maybe we don't do the right thing with dock/undock
events that occur while suspended.

My guess is that when we resume, we should get a bus check notification
to tell us that the dock went away (Rafael, correct me if I'm assuming
too much).  Looks like dock.c registers for that, so we should call
dock_notify().  I wonder if there's some ordering issue between that and
some other pnpacpi_resume() path such that c4da6940a7 causes us to use
a stale acpi_handle or something.

Can you explore that by reverting c4da6940a7 and adding some printks
in those paths?  If so, please attach the complete dmesg after a
suspend/resume cycle.
Comment 41 Rafael J. Wysocki 2010-12-22 00:24:40 UTC
I'm not actually familiar with dock.c, but it seems the notification should
be triggered on resume.
Comment 42 Bjorn Helgaas 2011-01-07 00:20:17 UTC
Created attachment 42662 [details]
debug patch

This is for the PNPACPI issue.

Can you try this:

  - Revert c4da6940a7
  - Apply this patch
  - Dock, boot, s2ram, undock, wakeup (I assume external monitor works after wakeup)
  - Collect the dmesg log and attach it here

Now if you put c4da6940a7 back (also keeping this debug patch), I assume it hangs on wakeup (but maybe you could double-check whether the network is working).  If you boot with "no_console_suspend", can you see anything on either screen?
Comment 43 Toralf Förster 2011-01-07 10:14:00 UTC
Created attachment 42672 [details]
/var/log/messages.log

Hello, I choosed kernel 2.6.35 for the tests.
I
- reverted the c4 patch, applied your debug message patch (+ manually) correction
- booted the docked machine
- s2ram
- undocked it
- waked it up (works !)
- docked it
- applied the c4 patch (now I had a 2.6.35 kernel + your debug patches
- reboot
- s2ram
- undock
- waked it up
- failure - as expected
I'll test the no_console_suspend option later today
Comment 44 Toralf Förster 2011-01-07 10:39:16 UTC
Created attachment 42682 [details]
kernel crash for 2.6.35

Booting with "no_console_suspend" gives the attached info for kernel 2.6.35.
(BTW seen on both screens - I can plug in the hanging notebook into the docking station and the external screen works - although the ThinkPad itself doesn't react on any keys, not even on SysRq)
Comment 45 Toralf Förster 2011-01-07 10:40:03 UTC
Created attachment 42692 [details]
kernel crash for 2.6.37

similar for 2.6.37
Comment 46 Bjorn Helgaas 2011-01-07 17:30:04 UTC
Created attachment 42712 [details]
debug patch2

Comparing the screen shot with the messages.log, I think this might be a null pointer oops when we're resuming the 00:0a serial device.

Please:

  - Replace previous patch with this one (so you have 2.6.37 + this patch)
  - Boot docked with "vga=0x0f07 no_console_suspend" (to use a smaller font to get more info in the screen photo)
  - Save dmesg log
  - s2ram, undock, wakeup (I expect it will hang)
  - Attach dmesg log and screenshot when it hangs
Comment 47 Rafael J. Wysocki 2011-01-07 21:57:03 UTC
I think that /sys/power/pm_trace may also help debug this.

Apart from this, Toralf, please use gdb to determine what code corresponds
to acpi_ns_validate_handle+0x16 as indicated by your trace (looking at
acpi_ns_validate_handle() I'm not where it can possibly crash).
Comment 48 Rafael J. Wysocki 2011-01-07 21:58:32 UTC
Well, perhaps handle is not NULL, but it is not a valid pointer too.
Comment 49 Toralf Förster 2011-01-08 10:17:36 UTC
Created attachment 42812 [details]
dmesg

(In reply to comment #46)
Here's the dmesg
Comment 50 Toralf Förster 2011-01-08 10:18:42 UTC
Created attachment 42822 [details]
kernel crash for 2.6.37 + 2nd debug patch

And the snap shot
Comment 51 Toralf Förster 2011-01-08 11:15:00 UTC
(In reply to comment #47)
> Apart from this, Toralf, please use gdb to determine what code corresponds
> to acpi_ns_validate_handle+0x16 as indicated by your trace (looking at
> acpi_ns_validate_handle() I'm not where it can possibly crash).

(gdb) l *acpi_ns_validate_handle+0x16
0xc1176815 is in acpi_ns_validate_handle (drivers/acpi/acpica/nsutils.c:607).
602                     return (acpi_gbl_root_node);
603             }
604
605             /* We can at least attempt to verify the handle */
606
607             if (ACPI_GET_DESCRIPTOR_TYPE(handle) != ACPI_DESC_TYPE_NAMED) {
608                     return (NULL);
609             }
610
611             return (ACPI_CAST_PTR(struct acpi_namespace_node, handle));
(gdb) quit
Comment 52 Rafael J. Wysocki 2011-01-08 11:17:29 UTC
So, as I suspected, handle is not a valid pointer any more during resume.
Comment 53 Rafael J. Wysocki 2011-01-08 11:33:08 UTC
The problem seems to be that 'handle' points to nowhere in pnpacpi_resume(),
so the acpi_bus_set_power(handle, ACPI_STATE_D0) explodes.

I guess there's missing synchronization between an undock event that happens
somewhere during resume and pnpacpi_resume().  I'll have a closer look at that
later today.
Comment 54 Toralf Förster 2011-01-08 14:17:37 UTC
Created attachment 42842 [details]
kernel crash for 2.6.37 + 2nd debug patch + debug info compiled into

This is - for completeness - the info for the same kernel however w/ this changed .config : 

tfoerste@n22 ~/devel/linux-2.6 $ diff .config*
4c4
< # Sat Jan  8 11:59:07 2011
---
> # Wed Jan  5 09:28:49 2011
2271,2272c2271
< CONFIG_DEBUG_INFO=y
< CONFIG_DEBUG_INFO_REDUCED=y
---
> # CONFIG_DEBUG_INFO is not set


(gdb) l *acpi_ns_validate_handle+0x16/0x1f
0xc11767ff is in acpi_ns_validate_handle (drivers/acpi/acpica/nsutils.c:595).
590      *       due to a table unload.
591      *
592      ******************************************************************************/
593
594     struct acpi_namespace_node *acpi_ns_validate_handle(acpi_handle handle)
595     {
596
597             ACPI_FUNCTION_ENTRY();
598
599             /* Parameter validation */

(gdb) l *acpi_ns_validate_handle+0x16
0xc1176815 is in acpi_ns_validate_handle (drivers/acpi/acpica/nsutils.c:607).
602                     return (acpi_gbl_root_node);
603             }
604
605             /* We can at least attempt to verify the handle */
606
607             if (ACPI_GET_DESCRIPTOR_TYPE(handle) != ACPI_DESC_TYPE_NAMED) {
608                     return (NULL);
609             }
610
611             return (ACPI_CAST_PTR(struct acpi_namespace_node, handle));
Comment 55 Rafael J. Wysocki 2011-01-08 21:07:53 UTC
Created attachment 42912 [details]
PNP: Do not trust dev->data in pnpacpi_resume()

I wonder if this patch makes a difference?
Comment 56 Toralf Förster 2011-01-08 21:55:33 UTC
Created attachment 42922 [details]
kernel crash for 2.6.37 + 2nd ry

Well, at least the issue isn't solved :

(gdb) l *acpi_ns_lookup+0x53/0x2d7
0xc1175370 is in acpi_ns_lookup (drivers/acpi/acpica/nsaccess.c:287).
282                    acpi_object_type type,
283                    acpi_interpreter_mode interpreter_mode,
284                    u32 flags,
285                    struct acpi_walk_state *walk_state,
286                    struct acpi_namespace_node **return_node)
287     {
288             acpi_status status;
289             char *path = pathname;
290             struct acpi_namespace_node *prefix_node;
291             struct acpi_namespace_node *current_node = NULL;
(gdb) l *acpi_ns_lookup+0x53
0xc11753c3 is in acpi_ns_lookup (drivers/acpi/acpica/nsaccess.c:325).
320                                       acpi_gbl_root_node));
321
322                     prefix_node = acpi_gbl_root_node;
323             } else {
324                     prefix_node = scope_info->scope.node;
325                     if (ACPI_GET_DESCRIPTOR_TYPE(prefix_node) !=
326                         ACPI_DESC_TYPE_NAMED) {
327                             ACPI_ERROR((AE_INFO, "%p is not a namespace node [%s]",
328                                         prefix_node,
329                                         acpi_ut_get_descriptor_name(prefix_node)));
Comment 57 Rafael J. Wysocki 2011-01-08 22:18:55 UTC
No, it's not, but also it doesn't crash in pnpacpi_resume() any more.
Comment 58 Rafael J. Wysocki 2011-01-08 22:20:21 UTC
Created attachment 42932 [details]
PNP: Do not trust dev->data in pnpacpi_resume() and don't ignore errors returned by it

Please try this patch instead of the previous one.
Comment 59 Rafael J. Wysocki 2011-01-08 22:39:19 UTC
Created attachment 42942 [details]
PNP: Do not trust dev->data in pnpacpi_resume() and don't ignore errors returned by it

Sorry, not that one, please try this one. :-)
Comment 60 Toralf Förster 2011-01-08 23:21:31 UTC
Created attachment 42962 [details]
dmesg

(In reply to comment #59)
> Sorry, not that one, please try this one. :-)
Much better, the crash went away, unfortunately now only s2ram is only possible one time, all subsequent tries are rejected with : 

...
serial 00:0a: disable failed
legacy_suspend(): pnp_bus_suspend+0x0/0x80 returns -5
PM: Device 00:0a failed to suspend: error -5
PM: Some devices failed to suspend
...
Comment 61 Rafael J. Wysocki 2011-01-08 23:30:49 UTC
That's because the device should be removed during resume if it is not
detected any more.  I have a brute force patch attempting to do that, but I'm
not sure if it's correct.

Unfortunately I don't own a docking station so I can't really test these
things myself.
Comment 62 Rafael J. Wysocki 2011-01-08 23:33:42 UTC
Created attachment 42972 [details]
PNP: Remove ACPI-based devices that are not present during resume

This is a modification of the previous patch removing devices that aren't
present any more during resume.

Please try it.
Comment 63 Toralf Förster 2011-01-09 09:23:50 UTC
(In reply to comment #62)
> Created an attachment (id=42972) [details]
> PNP: Remove ACPI-based devices that are not present during resume
> 
> This is a modification of the previous patch removing devices that aren't
> present any more during resume.
> 
> Please try it.
That is a step back - after wake up I've a black screen with a blinking cursor in the upper left corner.
Comment 64 Rafael J. Wysocki 2011-01-09 12:07:51 UTC
Well, it's a little heavy-handed.
Comment 65 Rafael J. Wysocki 2011-01-09 12:28:13 UTC
Created attachment 42982 [details]
PNP: Use DEVICE_ACPI_HANDLE() for accessing companion ACPI devices

I think this patch will avoid the "second suspend fails" issue. but I suspect
that the device in question will not function correctly after the "undocked"
resume even if you re-dock the machine.
Comment 66 Toralf Förster 2011-01-09 15:35:55 UTC
(In reply to comment #65)
That patch works :-)
I s2ram'ed few times in a row + while docked/undocked and mixed - no problems so far till now :-)
Comment 67 Rafael J. Wysocki 2011-01-09 21:38:51 UTC
Good, thanks for testing.

I'll try to push the patch upstream.
Comment 68 Toralf Förster 2011-01-10 09:01:19 UTC
(In reply to comment #67)
> Good, thanks for testing.
:-)

> I'll try to push the patch upstream.

FWIW there's no such string :

n22 ~ # grep "ACPI device not found in" /var/log/messages
n22 ~ # zgrep "ACPI device not found in" /var/log/messages*
n22 ~ # bzgrep "ACPI device not found in" /var/log/messages*
Comment 69 Bjorn Helgaas 2011-01-10 18:01:36 UTC
I'm not sure I understand the rationale for the patch in comment 65.  It looks like we're just exchanging the cached "pnp_dev->data->handle" for "lookup DEVICE_ACPI_HANDLE(pnp_dev->dev) and handle failure gracefully."  It sounds like it fixes this problem, but it feels like a pnp_dev lifetime issue that could bite us again somewhere else.

Is this the scenario:

  - boot docked (dock contains a UART)
  - make a PNP serial device
  - suspend to RAM
  - undock
  - resume (now the UART is gone)
  - resume the PNP serial device
  - dock.c notices UART removal
  - remove the PNP serial device

It seems sort of backwards to resume all the devices, then check to see whether any of them were removed.

I know we would need a lot of ACPI rework to deal with the Bus Check notification before resuming devices, but I wonder if that's the cleaner long-term approach.
Comment 70 Rafael J. Wysocki 2011-01-10 18:41:53 UTC
It is.  Still, I'd like to fix the _crash_ at hand first and applying the patch
won't prevent us from doing the right thing in the long run, right?

The problem is that without the patch users who never use the device in
question will see a resume crash and/or second suspend failure.
Comment 71 Rafael J. Wysocki 2011-01-10 18:51:31 UTC
BTW, the scenario is:

  - boot docked (dock contains a UART)
  - make a PNP serial device
  - suspend to RAM
  - undock
  - dock.c notices UART removal
  - remove the ACPI companion device of the PNP serial device
  - resume the PNP serial device
  - oops (unless we check that ACPI companion is gone, which is what
    DEVICE_ACPI_HANDLE) does.  [Yes, we should remove the PNP device
    at this point, but it isn't safe with the current code in pnpacpi.]

DEVICE_ACPI_HANDLE() should be used in there anyway, because it returns the
_right_ cached ACPI handle instead of the "copy" from pnp_dev->data->handle
which is completely redundant.
Comment 72 Rafael J. Wysocki 2011-01-10 19:13:06 UTC
(In reply to comment #68)
> (In reply to comment #67)
> > Good, thanks for testing.
> :-)
> 
> > I'll try to push the patch upstream.
> 
> FWIW there's no such string :
> 
> n22 ~ # grep "ACPI device not found in" /var/log/messages
> n22 ~ # zgrep "ACPI device not found in" /var/log/messages*
> n22 ~ # bzgrep "ACPI device not found in" /var/log/messages*

That's probably because it's printed with dev_dbg().
Comment 73 Toralf Förster 2011-01-13 08:49:49 UTC
Created attachment 43412 [details]
kmemleak

Just for the record I'd like to update this report with the kmemleaks which might be related to this use case too.
Comment 74 Rafael J. Wysocki 2011-01-16 22:54:59 UTC
The kmemleaks seem to be from USB, so they are most likely unrelated.

The patch has been merged as
cc8e7a3 PNP / ACPI: Use DEVICE_ACPI_HANDLE() for device ACPI handle access
so closing.