Bug 15100
Description
Toralf Förster
2010-01-21 08:56:31 UTC
Will you please try the latest kernel(2.6.33-rc5) and see whether the issue still exists? Thanks. I think it's better to test 2.6.33-rc6 when it's out. With 2.6.33-rc6 I'm in addition not longer able to press Ctrl+Alt+Del after wakeup - btw the SysRQ keys don't work any more (compared to 2.6.32.7). Well, it looks like no one has any idea about what the problem is. Would it be possible to bisect the kernels between 2.6.31 and 2.6.32? (In reply to comment #4) > Would it be possible to bisect the kernels between 2.6.31 and 2.6.32? It's not a technical question it is rather a question whether bisection could be limited to a given subdirectory or whether it is not worth b/c of the risk of having an issue somewhere else. I'm asking b/c unfortunately this bisect cannot be made by a simple script. There always is a risk like that, so you can only bisect the commits that modify i915 directly (I don't think the problem is anywhere outside). I bisected it to this : tfoerste@n22 ~/devel/linux-2.6 $ git bisect good There are only 'skip'ped commits left to test. The first bad commit could be any of: af729a26ccc3ff9ad834a5e96f455aab20f176cd c1c7af60892070e4b82ad63bbfb95ae745056de0 We cannot bisect more! I skipped 2.6.31-rc9-00107-gaf729a2 b/c suspend2ram didn't worked. forget to mention I used this : git bisect start v2.6.32 v2.6.31 -- ./include/config/drm/i915 ./drivers/gpu/drm/i915 I'd try to revert af729a26ccc3ff9ad834a5e96f455aab20f176cd and c1c7af60892070e4b82ad63bbfb95ae745056de0. If that fixes the problem, you can revert just one of them and see what happens. This should give you the answer, otherwise the bisection just went wrong. c1c7af6 is the culprit, the previous commit 7e12715 works fine. Thanks for verifying. commit c1c7af60892070e4b82ad63bbfb95ae745056de0 Author: Jesse Barnes <jbarnes@virtuousgeek.org> Date: Thu Sep 10 15:28:03 2009 -0700 drm/i915: force mode set at lid open time Some laptop platforms will disable pipes and/or planes at lid close time and not restore them when the lid is opened again. So catch the lid event, and if the lid was opened, force a mode restore. Fixes fdo bug #21230. Acked-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Eric Anholt <eric@anholt.net> First-Bad-Commit : c1c7af60892070e4b82ad63bbfb95ae745056de0 After suspending on the dock, then opening your lid undocked, does your ACPI lid file indicate your lid is open? You can just cat /proc/acpi/button/lid/LID/state to check. yes, it is "open". Presumably this patch gets your old behavior back? I wonder why X isn't resetting its mode on resume... diff --git a/drivers/gpu/drm/i915/intel_lvds.c b/drivers/gpu/drm/i915/intel_lvds index aa74e59..5ceb43f 100644 --- a/drivers/gpu/drm/i915/intel_lvds.c +++ b/drivers/gpu/drm/i915/intel_lvds.c @@ -719,7 +719,7 @@ static int intel_lid_notify(struct notifier_block *nb, unsig dev_priv->modeset_on_lid = 0; mutex_lock(&dev->mode_config.mutex); - drm_helper_resume_force_mode(dev); + // drm_helper_resume_force_mode(dev); mutex_unlock(&dev->mode_config.mutex); return NOTIFY_OK; (In reply to comment #14) > Presumably this patch gets your old behavior back? I wonder why X isn't > resetting its mode on resume... > No success (tested with 2.6.32.7). So reverting c1c7af60892070e4b82ad63bbfb95ae745056de0 gets your suspend/resume behavior back, but commenting out the drm_helper_resume_force_mode in intel_lid_notify doesn't? That would mean the lid notification itself was breaking things somehow... (In reply to comment #16) > So reverting c1c7af60892070e4b82ad63bbfb95ae745056de0 gets your > suspend/resume > behavior back yes (bisected in git kernel trunk) > but commenting out the drm_helper_resume_force_mode in > intel_lid_notify doesn't? yes (tested with vanilla 2.6.32.7) What if you remove the call to acpi_lid_notifer_register in lvds_init? (In reply to comment #18) > What if you remove the call to acpi_lid_notifer_register in lvds_init? No effect in 2.6.32.7 and only small in .8-rc2 (when I putted the ThinkPad into the docking station, I could see the mouse pointer within a black screen, funny ebough when I pressed Alt+F1, I was on the conolsembut the mouse piinter was still visible). However tweaking in the git tree the bad commit by this solved the issue: tfoerste@n22 ~/devel/linux-2.6 $ git describe v2.6.31-rc9-109-gc1c7af6 diff --git a/drivers/gpu/drm/i915/intel_lvds.c b/drivers/gpu/drm/i915/intel_lvds.c index e190ff7..3f90491 100644 --- a/drivers/gpu/drm/i915/intel_lvds.c +++ b/drivers/gpu/drm/i915/intel_lvds.c @@ -641,8 +641,8 @@ static int intel_lid_notify(struct notifier_block *nb, unsigned long val, struct drm_device *dev = dev_priv->dev; if (acpi_lid_open()) - drm_helper_resume_force_mode(dev); - + //drm_helper_resume_force_mode(dev); + ; return NOTIFY_OK; On Monday 15 February 2010, Toralf Förster wrote:
>
> Rafael J. Wysocki wrote at 00:52:50
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.31 and 2.6.32.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.31 and 2.6.32. Please verify if it still should
> > be listed and let the tracking team know (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15100
> > Subject : X11 is black after resume from s2ram if my T400 was
> previous in
> > docking station before Submitter : Toralf Förster
> <toralf.foerster@gmx.de>
> > Date : 2010-01-21 08:56 (25 days old)
> >
> Issue present in 2.6.33-rc8
FWIW with kernel 2.6.31 xrandr tells me that I've DVI1 and DVI2 whereas with newer kernel xrandr named it HDMI1 and HDMI2. The docking station itself has a DVI port. Does http://patchwork.kernel.org/patch/80474/ on top of 2.6.33 help? (In reply to comment #22) > Does http://patchwork.kernel.org/patch/80474/ on top of 2.6.33 help? No, neither on 2.6.33 nor on 2.6.32.9 :-( solved at least with 2.6.32.10 KDE people says it is a kernel bug : https://bugs.kde.org/show_bug.cgi?id=231251(In reply to comment #24) > solved at least with 2.6.32.10 No, b/c KDE people says it is a kernel bug : https://bugs.kde.org/show_bug.cgi?id=231251 BTW this issue is still in 2.6.33.2 + xorg 1.7.6 + intel driver 2.11. If I suspend the docked notebook, undock it and wake it up, the screen is black, Ctrl-Alt-Del often doesn't work, howeeer SysRq key sequencies work. The issue disappeared between v2.6.34-rc3[or .2 ?] and v2.6.34-rc3-163-g5e1161. Furthermore I update pixman to 0.18. Can't remember any other relevant changes made at this system. The issue is back in 2.6.34. I suspend the ThinkPad within the docking station, unplug it - and it will not wakeup - no keyboard action possible, even SysRq keys don't work. does this still occur with 2.6.35-rc6 and with an xserver from git master of the X tree? Several fixes landed there that are related. (In reply to comment #29) > does this still occur with 2.6.35-rc6 and with an xserver from git master of > the X tree? Several fixes landed there that are related. Yes, 2.6.35-rc6-19-g86c65a7 shows the same behaviour. Furthermore it is independend from X11. If I boot with "init 2" (nonetwork, no X, only few services up), run $>sync; sleep 2; sync; echo mem > /sys/power/state and plug the notebook oout of the docking station - then the system is unresponsive after opening the LID. changed the title accordingly to the current behaviour Created attachment 28922 [details]
kernel log message
With the new BIOS and kernel 2.6.35.4 it works now, however the kernel seems to have some memory problems pls see the attached file. In addition I'll attach the kmemleak file too.
Created attachment 28932 [details]
cat /sys/kernel/debug/kmemleak
Fixed the leak [1dfd9754cd55e424f247d9a2e855ad384e3e90ef]. Can you confirm working s2ram on the current kernel? [I don't want to too hastily close this report without knowing why.] (In reply to comment #34) > Can you confirm working s2ram on the current kernel? [I don't want to too > hastily close this report without knowing why.] Unfortunately not, I was to optimistic. Good to know, thanks. Well, one or more commits between 27th of November and today solved this issue, kernel v2.6.37-rc4-158-ga7ca0e0 s2ram's and wakes up fine. Nope, none the wiser. Nothing between the 11-27 and 12-03 in i915 look responsible for the fix. Please do keep an eye on this and reopen if suspend fails once again. Created attachment 40752 [details] .config Well, the issue consists of 2 parts - both I bisected now within the last 3 days. The first (smaller) issue was introduced with commit c1c7af6 (between 2.6.31 and 2.6.32)): commit c1c7af60892070e4b82ad63bbfb95ae745056de0 Author: Jesse Barnes <jbarnes@virtuousgeek.org> Date: Thu Sep 10 15:28:03 2009 -0700 drm/i915: force mode set at lid open time Some laptop platforms will disable pipes and/or planes at lid close time and not restore them when the lid is opened again. So catch the lid event, and if the lid was opened, force a mode restore. Fixes fdo bug #21230. The issue itself is, that if a T400 is docked, booted, s2ram'ed, unplugged from the docking station and waked up - in that order - the internal notebook display is black. If I put the T400 now back into the docking station, then the external monitor (DVI output of the docking station) works fine (fortunately). The second (bigger) issue was introduced between 2.6.32 and 2.6.33 with commit c4da694: commit c4da6940a7a41c72781ff2d62ebd4b99f3749f14 Author: Bjorn Helgaas <bjorn.helgaas@hp.com> Date: Tue Nov 17 17:05:14 2009 -0700 PNPACPI: save struct acpi_device, not just acpi_handle Some drivers need to look at things in the acpi_device structure besides the handle. This has the effect, that after the wakeup (see steps before) the system hangs - no syslog messages, nothing to see at the internal display - neither Ctrl-Alt-Del nor SysRq key sequences works. If I plug in that system into the docking station, I see a blinking cursor in the upper left corner - that's all. I've to power off the system. I was able to revert commit c4da694 on the current git kernel v2.6.37-rc6-11-gc0708b1 (only one hunk had to be applied manually) - the bigger issue went away, and the smaller issue is still present. But unfortunately I could not revert commit c1c7af6, because a "patch -p1 --dry-run -R" gave for 4 hunks a "FAILED" :-(. I took a look at c4da6940a7 (PNPACPI: save struct acpi_device, not just acpi_handle) again, and it's hard to see how reverting that would make a difference. I assume suspend/resume works fine if you are always docked, or if you are always undocked? Maybe we don't do the right thing with dock/undock events that occur while suspended. My guess is that when we resume, we should get a bus check notification to tell us that the dock went away (Rafael, correct me if I'm assuming too much). Looks like dock.c registers for that, so we should call dock_notify(). I wonder if there's some ordering issue between that and some other pnpacpi_resume() path such that c4da6940a7 causes us to use a stale acpi_handle or something. Can you explore that by reverting c4da6940a7 and adding some printks in those paths? If so, please attach the complete dmesg after a suspend/resume cycle. I'm not actually familiar with dock.c, but it seems the notification should be triggered on resume. Created attachment 42662 [details]
debug patch
This is for the PNPACPI issue.
Can you try this:
- Revert c4da6940a7
- Apply this patch
- Dock, boot, s2ram, undock, wakeup (I assume external monitor works after wakeup)
- Collect the dmesg log and attach it here
Now if you put c4da6940a7 back (also keeping this debug patch), I assume it hangs on wakeup (but maybe you could double-check whether the network is working). If you boot with "no_console_suspend", can you see anything on either screen?
Created attachment 42672 [details]
/var/log/messages.log
Hello, I choosed kernel 2.6.35 for the tests.
I
- reverted the c4 patch, applied your debug message patch (+ manually) correction
- booted the docked machine
- s2ram
- undocked it
- waked it up (works !)
- docked it
- applied the c4 patch (now I had a 2.6.35 kernel + your debug patches
- reboot
- s2ram
- undock
- waked it up
- failure - as expected
I'll test the no_console_suspend option later today
Created attachment 42682 [details]
kernel crash for 2.6.35
Booting with "no_console_suspend" gives the attached info for kernel 2.6.35.
(BTW seen on both screens - I can plug in the hanging notebook into the docking station and the external screen works - although the ThinkPad itself doesn't react on any keys, not even on SysRq)
Created attachment 42692 [details]
kernel crash for 2.6.37
similar for 2.6.37
Created attachment 42712 [details]
debug patch2
Comparing the screen shot with the messages.log, I think this might be a null pointer oops when we're resuming the 00:0a serial device.
Please:
- Replace previous patch with this one (so you have 2.6.37 + this patch)
- Boot docked with "vga=0x0f07 no_console_suspend" (to use a smaller font to get more info in the screen photo)
- Save dmesg log
- s2ram, undock, wakeup (I expect it will hang)
- Attach dmesg log and screenshot when it hangs
I think that /sys/power/pm_trace may also help debug this. Apart from this, Toralf, please use gdb to determine what code corresponds to acpi_ns_validate_handle+0x16 as indicated by your trace (looking at acpi_ns_validate_handle() I'm not where it can possibly crash). Well, perhaps handle is not NULL, but it is not a valid pointer too. Created attachment 42812 [details] dmesg (In reply to comment #46) Here's the dmesg Created attachment 42822 [details]
kernel crash for 2.6.37 + 2nd debug patch
And the snap shot
(In reply to comment #47) > Apart from this, Toralf, please use gdb to determine what code corresponds > to acpi_ns_validate_handle+0x16 as indicated by your trace (looking at > acpi_ns_validate_handle() I'm not where it can possibly crash). (gdb) l *acpi_ns_validate_handle+0x16 0xc1176815 is in acpi_ns_validate_handle (drivers/acpi/acpica/nsutils.c:607). 602 return (acpi_gbl_root_node); 603 } 604 605 /* We can at least attempt to verify the handle */ 606 607 if (ACPI_GET_DESCRIPTOR_TYPE(handle) != ACPI_DESC_TYPE_NAMED) { 608 return (NULL); 609 } 610 611 return (ACPI_CAST_PTR(struct acpi_namespace_node, handle)); (gdb) quit So, as I suspected, handle is not a valid pointer any more during resume. The problem seems to be that 'handle' points to nowhere in pnpacpi_resume(), so the acpi_bus_set_power(handle, ACPI_STATE_D0) explodes. I guess there's missing synchronization between an undock event that happens somewhere during resume and pnpacpi_resume(). I'll have a closer look at that later today. Created attachment 42842 [details] kernel crash for 2.6.37 + 2nd debug patch + debug info compiled into This is - for completeness - the info for the same kernel however w/ this changed .config : tfoerste@n22 ~/devel/linux-2.6 $ diff .config* 4c4 < # Sat Jan 8 11:59:07 2011 --- > # Wed Jan 5 09:28:49 2011 2271,2272c2271 < CONFIG_DEBUG_INFO=y < CONFIG_DEBUG_INFO_REDUCED=y --- > # CONFIG_DEBUG_INFO is not set (gdb) l *acpi_ns_validate_handle+0x16/0x1f 0xc11767ff is in acpi_ns_validate_handle (drivers/acpi/acpica/nsutils.c:595). 590 * due to a table unload. 591 * 592 ******************************************************************************/ 593 594 struct acpi_namespace_node *acpi_ns_validate_handle(acpi_handle handle) 595 { 596 597 ACPI_FUNCTION_ENTRY(); 598 599 /* Parameter validation */ (gdb) l *acpi_ns_validate_handle+0x16 0xc1176815 is in acpi_ns_validate_handle (drivers/acpi/acpica/nsutils.c:607). 602 return (acpi_gbl_root_node); 603 } 604 605 /* We can at least attempt to verify the handle */ 606 607 if (ACPI_GET_DESCRIPTOR_TYPE(handle) != ACPI_DESC_TYPE_NAMED) { 608 return (NULL); 609 } 610 611 return (ACPI_CAST_PTR(struct acpi_namespace_node, handle)); Created attachment 42912 [details]
PNP: Do not trust dev->data in pnpacpi_resume()
I wonder if this patch makes a difference?
Created attachment 42922 [details]
kernel crash for 2.6.37 + 2nd ry
Well, at least the issue isn't solved :
(gdb) l *acpi_ns_lookup+0x53/0x2d7
0xc1175370 is in acpi_ns_lookup (drivers/acpi/acpica/nsaccess.c:287).
282 acpi_object_type type,
283 acpi_interpreter_mode interpreter_mode,
284 u32 flags,
285 struct acpi_walk_state *walk_state,
286 struct acpi_namespace_node **return_node)
287 {
288 acpi_status status;
289 char *path = pathname;
290 struct acpi_namespace_node *prefix_node;
291 struct acpi_namespace_node *current_node = NULL;
(gdb) l *acpi_ns_lookup+0x53
0xc11753c3 is in acpi_ns_lookup (drivers/acpi/acpica/nsaccess.c:325).
320 acpi_gbl_root_node));
321
322 prefix_node = acpi_gbl_root_node;
323 } else {
324 prefix_node = scope_info->scope.node;
325 if (ACPI_GET_DESCRIPTOR_TYPE(prefix_node) !=
326 ACPI_DESC_TYPE_NAMED) {
327 ACPI_ERROR((AE_INFO, "%p is not a namespace node [%s]",
328 prefix_node,
329 acpi_ut_get_descriptor_name(prefix_node)));
No, it's not, but also it doesn't crash in pnpacpi_resume() any more. Created attachment 42932 [details]
PNP: Do not trust dev->data in pnpacpi_resume() and don't ignore errors returned by it
Please try this patch instead of the previous one.
Created attachment 42942 [details]
PNP: Do not trust dev->data in pnpacpi_resume() and don't ignore errors returned by it
Sorry, not that one, please try this one. :-)
Created attachment 42962 [details] dmesg (In reply to comment #59) > Sorry, not that one, please try this one. :-) Much better, the crash went away, unfortunately now only s2ram is only possible one time, all subsequent tries are rejected with : ... serial 00:0a: disable failed legacy_suspend(): pnp_bus_suspend+0x0/0x80 returns -5 PM: Device 00:0a failed to suspend: error -5 PM: Some devices failed to suspend ... That's because the device should be removed during resume if it is not detected any more. I have a brute force patch attempting to do that, but I'm not sure if it's correct. Unfortunately I don't own a docking station so I can't really test these things myself. Created attachment 42972 [details]
PNP: Remove ACPI-based devices that are not present during resume
This is a modification of the previous patch removing devices that aren't
present any more during resume.
Please try it.
(In reply to comment #62) > Created an attachment (id=42972) [details] > PNP: Remove ACPI-based devices that are not present during resume > > This is a modification of the previous patch removing devices that aren't > present any more during resume. > > Please try it. That is a step back - after wake up I've a black screen with a blinking cursor in the upper left corner. Well, it's a little heavy-handed. Created attachment 42982 [details]
PNP: Use DEVICE_ACPI_HANDLE() for accessing companion ACPI devices
I think this patch will avoid the "second suspend fails" issue. but I suspect
that the device in question will not function correctly after the "undocked"
resume even if you re-dock the machine.
(In reply to comment #65) That patch works :-) I s2ram'ed few times in a row + while docked/undocked and mixed - no problems so far till now :-) Good, thanks for testing. I'll try to push the patch upstream. (In reply to comment #67) > Good, thanks for testing. :-) > I'll try to push the patch upstream. FWIW there's no such string : n22 ~ # grep "ACPI device not found in" /var/log/messages n22 ~ # zgrep "ACPI device not found in" /var/log/messages* n22 ~ # bzgrep "ACPI device not found in" /var/log/messages* I'm not sure I understand the rationale for the patch in comment 65. It looks like we're just exchanging the cached "pnp_dev->data->handle" for "lookup DEVICE_ACPI_HANDLE(pnp_dev->dev) and handle failure gracefully." It sounds like it fixes this problem, but it feels like a pnp_dev lifetime issue that could bite us again somewhere else. Is this the scenario: - boot docked (dock contains a UART) - make a PNP serial device - suspend to RAM - undock - resume (now the UART is gone) - resume the PNP serial device - dock.c notices UART removal - remove the PNP serial device It seems sort of backwards to resume all the devices, then check to see whether any of them were removed. I know we would need a lot of ACPI rework to deal with the Bus Check notification before resuming devices, but I wonder if that's the cleaner long-term approach. It is. Still, I'd like to fix the _crash_ at hand first and applying the patch won't prevent us from doing the right thing in the long run, right? The problem is that without the patch users who never use the device in question will see a resume crash and/or second suspend failure. BTW, the scenario is: - boot docked (dock contains a UART) - make a PNP serial device - suspend to RAM - undock - dock.c notices UART removal - remove the ACPI companion device of the PNP serial device - resume the PNP serial device - oops (unless we check that ACPI companion is gone, which is what DEVICE_ACPI_HANDLE) does. [Yes, we should remove the PNP device at this point, but it isn't safe with the current code in pnpacpi.] DEVICE_ACPI_HANDLE() should be used in there anyway, because it returns the _right_ cached ACPI handle instead of the "copy" from pnp_dev->data->handle which is completely redundant. (In reply to comment #68) > (In reply to comment #67) > > Good, thanks for testing. > :-) > > > I'll try to push the patch upstream. > > FWIW there's no such string : > > n22 ~ # grep "ACPI device not found in" /var/log/messages > n22 ~ # zgrep "ACPI device not found in" /var/log/messages* > n22 ~ # bzgrep "ACPI device not found in" /var/log/messages* That's probably because it's printed with dev_dbg(). Created attachment 43412 [details]
kmemleak
Just for the record I'd like to update this report with the kmemleaks which might be related to this use case too.
The kmemleaks seem to be from USB, so they are most likely unrelated. The patch has been merged as cc8e7a3 PNP / ACPI: Use DEVICE_ACPI_HANDLE() for device ACPI handle access so closing. |