Bug 14484

Summary: no video output after suspend
Product: Drivers Reporter: Rafael J. Wysocki (rjw)
Component: Video(DRI - Intel)Assignee: Jesse Barnes (jbarnes)
Status: RESOLVED CODE_FIX    
Severity: normal CC: adi, axet, jbarnes, mat, riccardo.magliocchetti, santi
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.32-rc5 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 7216, 13615    
Attachments: extract from /var/log/messages of boot, failed suspend, suspend and resume with blank screen
X logs
2.6.32-rc7 messages
dmesg-2.6.32-rc8-cvp20091022-00011-ga8a8a66
serialconsole-2.6.32-rc8-cvp20091022-00011-ga8a8a66
Boot and suspend / resume cycle logs with kernel 2.6.33-rc4
xorg logs, don't know if they are useful

Description Rafael J. Wysocki 2009-10-26 18:12:47 UTC
Subject    : no video output after suspend after "drm/i915: force mode set at lid open time"
Submitter  : Riccardo Magliocchetti <riccardo.magliocchetti@gmail.com>
Date       : 2009-10-25 20:57
References : http://marc.info/?l=linux-kernel&m=125650430123713&w=4
Handled-By : Jesse Barnes <jbarnes@virtuousgeek.org>

This entry is being used for tracking a regression from 2.6.31.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 Rafael J. Wysocki 2009-10-26 19:57:48 UTC
Patch : http://patchwork.kernel.org/patch/53783/
Comment 2 Rafael J. Wysocki 2009-10-27 22:10:19 UTC
Ignore-Patch : http://patchwork.kernel.org/patch/53783/
Comment 3 Linus Torvalds 2009-10-28 23:31:37 UTC
Jesse et al,

  I haven't bisected this 100% yet, but I've got it down to a series of 20 
commits or so, all of which came through my merge of 'drm-intel-next' on 
Sept 24.

I've bisected this thing almost all day today, because it wasn't trivial 
(there was another unrelated bug in resume that had already been fixed, 
but that interacted with the bisect, so I've done this bisection over 
about ten thousand commits three times now, first with confusion, second 
to find the unrelated bug, and now finally to find this i915 resume bug).

It triggers almost every time on one of my laptops (keyword being 
"almost": I have to suspend/resume about 10 times to be absolutely sure), 
and it _only_ triggers if I suspend/resume the laptop by physically 
closing and opening the lid.

When the bug triggers, you see crazy shit like

        Restarting tasks ... done.
        [drm] LVDS-8: set mode 1366x768 c
        [drm] LVDS-8: set mode _NET_SYSTEM_TRAY_OPCODE 15

where that "_NET_SYSTEM_TRAY_OPCODE" can be any random string that 
obviously doesn't have anything to do with a mode. Sometimes it includes 
random UTF-8 characters like

	[drm] LVDS-8: set mode [0001] 15
	[drm] LVDS-8: set mode [0001] 1a

adn sometimes it's just an empty string:

	[drm] LVDS-8: set mode  15

but the end result is always the same: no panel output at resume.

I can do a "killall Xorg" from the network, in which case the panel comes 
back (the machine is working fine other than the bogus mode), and I see a 
single

	[drm] LVDS-8: set mode 1366x768 c

in the dmesg when that happens. So the mode list is correct in the kernel, 
but it looks like some nasty race condition that causes resume to 
sometimes try to do it twice at the same time, and the second case races 
and uses a bogus mode.

Right now I have it narrowed down to:

 - good: cecc6b63a5de547a345c491bb4c18c01a15984a4 ("drm/radeon/r600: use 
   fence->timeout directly")

 - bad: 8dd81a381e8886129c0923f1fe22ff5ca36ae8da ("drm/i915: Fix LVDS 
   panel fitting on Arrandale")

which leaves 20 possible bad commits. Looking at the list, I suspect 
c1c7af60892070e4b82ad63bbfb95ae745056de0 ("drm/i915: force mode set at lid 
open time") which already had one fix to it, but that fix doesn't help my 
case even if it apparently helped somebody else.

[ The fix that does _not_ fix it is 06891e27a9b5: "drm/i915: fix 
  suspend/resume breakage in lid notifier" ]

I'll continue the bisect, but I'm already sure you guys are responsible 
(and I'm pretty certain it will end up at that "force mode set at lid open 
time" one), so here's an early heads-up.

I bet this is related to 

	 http://bugzilla.kernel.org/show_bug.cgi?id=14484

which bisected to the same thing (and apparently without my problems, 
probably because his bisection wasn't over the whole range from 2.6.31 to 
now). So I added bugzilla to the cc.

			Linus
Comment 4 Jesse Barnes 2009-10-28 23:56:36 UTC
On Wed, 28 Oct 2009 16:31:32 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> Jesse et al,
> 
>   I haven't bisected this 100% yet, but I've got it down to a series
> of 20 commits or so, all of which came through my merge of
> 'drm-intel-next' on Sept 24.
> 
> I've bisected this thing almost all day today, because it wasn't
> trivial (there was another unrelated bug in resume that had already
> been fixed, but that interacted with the bisect, so I've done this
> bisection over about ten thousand commits three times now, first with
> confusion, second to find the unrelated bug, and now finally to find
> this i915 resume bug).
> 
> It triggers almost every time on one of my laptops (keyword being 
> "almost": I have to suspend/resume about 10 times to be absolutely
> sure), and it _only_ triggers if I suspend/resume the laptop by
> physically closing and opening the lid.
> 
> When the bug triggers, you see crazy shit like
> 
>         Restarting tasks ... done.
>         [drm] LVDS-8: set mode 1366x768 c
>         [drm] LVDS-8: set mode _NET_SYSTEM_TRAY_OPCODE 15

Wow, definitely something bad going on there...  I'm not sure how we'd
clobber the mode string, but if it was corrupted we could definitely
end up trying to program a bad mode.  Could be busted locking ends up
making us call into the mode set stuff a bunch of times at resume (on
resume some machines end up sending multiple lid events, so we'll try
to force a mode set several times if we're not careful).

I'll look at those commits and see if I can figure out what's going on.
Comment 5 Linus Torvalds 2009-10-29 00:10:43 UTC
On Wed, 28 Oct 2009, Jesse Barnes wrote:
> 
> Wow, definitely something bad going on there...  I'm not sure how we'd
> clobber the mode string, but if it was corrupted we could definitely
> end up trying to program a bad mode.

Yeah. I've bisected a couple of times more, and it's now in the

	d0cbde93cc..c1c7af608

range. With that "force mode set at lid open time" still there, and still 
as the obvious contender.

> Could be busted locking ends up making us call into the mode set stuff a 
> bunch of times at resume (on resume some machines end up sending 
> multiple lid events, so we'll try to force a mode set several times if 
> we're not careful).

I'm not seeing any locking at all, apart from that

                mutex_lock(&dev->mode_config.mutex);
                drm_helper_resume_force_mode(dev);
                mutex_unlock(&dev->mode_config.mutex);

code in intel_lid_notify(). That mutex locking was added later, and 
doesn't help my case. But it's not helping against any "multiply called" 
issues, and the test for "dev_priv->suspended" is outside the lock.

In general, the "locking" seems totally random and utterly bogus. There's 
code like this:

	list_for_each_entry(scan, &connector->probed_modes, head) {
		mutex_lock(&dev->mode_config.mutex);
		if (scan->type & DRM_MODE_TYPE_PREFERRED) {

ie the lock is _outside_ the loop, but nothing seems to protect the actual 
'probed_modes' list itself. That's just totally crazy. The _only_ point of 
getting the lock there would be if you'd protect the loop and the list, 
but that is explicitly _not_ what it protects.

I don't know what it actually tries to protect, and maybe there is some 
reason for it all, but it looks like some five-year-old with a crayon was 
playing with the code.

		Linus
Comment 6 Jesse Barnes 2009-10-29 01:15:49 UTC
On Wed, 28 Oct 2009 17:10:40 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> I'm not seeing any locking at all, apart from that
> 
>                 mutex_lock(&dev->mode_config.mutex);
>                 drm_helper_resume_force_mode(dev);
>                 mutex_unlock(&dev->mode_config.mutex);
> 
> code in intel_lid_notify(). That mutex locking was added later, and 
> doesn't help my case. But it's not helping against any "multiply
> called" issues, and the test for "dev_priv->suspended" is outside the
> lock.

Yeah, it could be locked, but the suspend/resume paths that set and
clear that flag should already be atomic.  The main thing ->suspended
tries to avoid is having the LID notifier get called before we've run
our resume routine.  Any subsequent calls should be ok (even if multiple
calls occur).  Though if something is corrupted when we get called all
bets are off of course.

> In general, the "locking" seems totally random and utterly bogus.
> There's code like this:
> 
>       list_for_each_entry(scan, &connector->probed_modes, head) {
>               mutex_lock(&dev->mode_config.mutex);
>               if (scan->type & DRM_MODE_TYPE_PREFERRED) {

Yuck.  This particular bug has been there since it was first merged.
It's in the init code so the exposure shouldn't be too great (we only
call that once at load time), but still bad.

Looking at some other users of probed_modes, the locking rules have
definitely changed.  My comment about the caller of
drm_connector_cleanup clearly isn't true anymore since that function
takes the mode config lock itself.

Unfortunately I don't think either of these bugs explain your issue.

> ie the lock is _outside_ the loop, but nothing seems to protect the
> actual 'probed_modes' list itself. That's just totally crazy. The
> _only_ point of getting the lock there would be if you'd protect the
> loop and the list, but that is explicitly _not_ what it protects.
> 
> I don't know what it actually tries to protect, and maybe there is
> some reason for it all, but it looks like some five-year-old with a
> crayon was playing with the code.

Yeah looks like the locking in at least some of our mode list handling
is busted.  At one point it was consistent but I think that was long
before it went upstream and things were ripped apart and put back
together a few times.

In your particular case, it looks like the crtc->mode is broken at
resume time, which could mean either we've corrupted the master crtc
list (unlikely) or some previous mode set has set crtc->mode to
garbage.  The only place we write that is drm_crtc_helper_set_mode, and
I don't see where that would totally corrupt things.

So somehow the multiple calls to drm_helper_resume_force_mode(dev); are
hosing things.  You'll get one call from the core driver resume, then
probably one or more from the lid notifier.  AFAIK though multiple
calls *should* be fine.  Dave?

Jesse
Comment 7 Linus Torvalds 2009-10-29 01:33:49 UTC
On Wed, 28 Oct 2009, Jesse Barnes wrote:
> 
> In your particular case, it looks like the crtc->mode is broken at
> resume time, which could mean either we've corrupted the master crtc
> list (unlikely) or some previous mode set has set crtc->mode to
> garbage.  The only place we write that is drm_crtc_helper_set_mode, and
> I don't see where that would totally corrupt things.
> 
> So somehow the multiple calls to drm_helper_resume_force_mode(dev); are
> hosing things.  You'll get one call from the core driver resume, then
> probably one or more from the lid notifier.  AFAIK though multiple
> calls *should* be fine.  Dave?

Well, the resume code does _not_ try to get the lock, so I could well 
imagine that the two calls to drm_helper_resume_force_mode() would race 
against each other. One from i915_resume() (without the lock), and one 
from intel_lid_notify() (which was changed to hold the lock, but with the 
other caller not doing so, the lock does nothing).

I'll try adding the lock to the i915_resume() path, and see if that helps 
any. But right now I have to go fetch the kids from gymnastics..

		Linus
Comment 8 Linus Torvalds 2009-10-29 03:22:37 UTC
On Wed, 28 Oct 2009, Linus Torvalds wrote:
> 
> I'll try adding the lock to the i915_resume() path, and see if that helps 
> any. But right now I have to go fetch the kids from gymnastics..

That doesn't help.

But I added some more debugging, and there are actually _three_ cases. 

The first one is from the resume itself:

 [<ffffffff81365a47>] drm_crtc_helper_set_mode+0x27a/0x355
 [<ffffffff81365b7c>] drm_helper_resume_force_mode+0x5a/0xb5
 [<ffffffff8137d6a8>] i915_resume+0xde/0x111
 [<ffffffff8137d70b>] i915_pci_resume+0x30/0x46
 [<ffffffff812b5081>] pci_legacy_resume+0x4b/0x70
 [<ffffffff812b5251>] pci_pm_resume+0x63/0xad
 [<ffffffff813c015b>] pm_op+0x81/0x116
 [<ffffffff813c0e76>] dpm_resume_end+0xf4/0x44b
 [<ffffffff810b1d07>] suspend_devices_and_enter+0x190/0x1d8
 [<ffffffff810b1e3d>] enter_state+0xee/0x158
 [<ffffffff810b1335>] state_store+0xc5/0xf9
 [<ffffffff8129a451>] kobj_attr_store+0x2a/0x40
 [<ffffffff811c5ba1>] sysfs_write_file+0x106/0x156
 [<ffffffff81154baf>] vfs_write+0xbd/0x12e
 [<ffffffff81154d12>] sys_write+0x59/0x91
 [<ffffffff81012032>] system_call_fastpath+0x16/0x1b

(looks like the "lid event" one got suppressed), the second one is from 
fbcon:

 [<ffffffff81365a47>] drm_crtc_helper_set_mode+0x27a/0x355
 [<ffffffff81366506>] drm_crtc_helper_set_config+0x5ae/0x79f
 [<ffffffff81363e3a>] drm_fb_helper_pan_display+0xb1/0x11d
 [<ffffffff812d1b9e>] fb_pan_display+0xcf/0x126
 [<ffffffff812e1014>] bit_update_start+0x33/0x6c
 [<ffffffff812de10f>] fbcon_switch+0x41f/0x43b
 [<ffffffff8134eb77>] redraw_screen+0xe1/0x197
 [<ffffffff812e077f>] fbcon_event_notify+0xbb/0x4fc
 [<ffffffff8155f9a3>] notifier_call_chain+0x44/0x86
 [<ffffffff81096361>] __blocking_notifier_call_chain+0x5d/0x88
 [<ffffffff810963b3>] blocking_notifier_call_chain+0x27/0x3d
 [<ffffffff812d16f2>] fb_notifier_call_chain+0x2e/0x44
 [<ffffffff812d20b1>] fb_set_suspend+0x65/0x8a
 [<ffffffff812d7034>] store_fbstate+0x57/0x7e
 [<ffffffff813b60d5>] dev_attr_store+0x33/0x49
 [<ffffffff811c5ba1>] sysfs_write_file+0x106/0x156
 [<ffffffff81154baf>] vfs_write+0xbd/0x12e
 [<ffffffff81154d12>] sys_write+0x59/0x91
 [<ffffffff81012032>] system_call_fastpath+0x16/0x1b

and the third one is from X doing it.

 [<ffffffff81365a47>] drm_crtc_helper_set_mode+0x27a/0x355
 [<ffffffff81366506>] drm_crtc_helper_set_config+0x5ae/0x79f
 [<ffffffff813788b6>] drm_mode_setcrtc+0x2f9/0x33f
 [<ffffffff8136b7bb>] drm_ioctl+0x237/0x2f4
 [<ffffffff81167aac>] vfs_ioctl+0x7e/0xaa
 [<ffffffff8116800d>] do_vfs_ioctl+0x496/0x4f2
 [<ffffffff811680ce>] sys_ioctl+0x65/0x9c
 [<ffffffff81012032>] system_call_fastpath+0x16/0x1b

I don't know why the 'lid' thing seems to matter, but I think it's the 
last one that always screws up the mode for some reason.

			Linus
Comment 9 Linus Torvalds 2009-10-29 04:13:14 UTC
On Wed, 28 Oct 2009, Linus Torvalds wrote:
> 
> and the third one is from X doing it.
> 
>  [<ffffffff81365a47>] drm_crtc_helper_set_mode+0x27a/0x355
>  [<ffffffff81366506>] drm_crtc_helper_set_config+0x5ae/0x79f
>  [<ffffffff813788b6>] drm_mode_setcrtc+0x2f9/0x33f
>  [<ffffffff8136b7bb>] drm_ioctl+0x237/0x2f4
>  [<ffffffff81167aac>] vfs_ioctl+0x7e/0xaa
>  [<ffffffff8116800d>] do_vfs_ioctl+0x496/0x4f2
>  [<ffffffff811680ce>] sys_ioctl+0x65/0x9c
>  [<ffffffff81012032>] system_call_fastpath+0x16/0x1b
> 
> I don't know why the 'lid' thing seems to matter, but I think it's the 
> last one that always screws up the mode for some reason.

Ok, with drm.debug=4, I see this around the first one (early resume):

	[drm:drm_mode_debug_printmodeline], Modeline 12:"1366x768" 60 71080 1366 1414 1446 1494 768 770 775 793 0x48 0x9
	[drm] LVDS-8: set mode 1366x768 c

but then later on I get a lot more:

	Restarting tasks ... done.
	[drm:drm_crtc_helper_set_config], 
	[drm:drm_crtc_helper_set_config], crtc: ffff880037ac8800 4 fb: ffff880037aabf00 connectors: ffff880076438fc0 num_connectors: 1 (x, y) (0, 0)
	[drm:drm_crtc_helper_set_config], crtc has no fb, full mode set
	[drm:drm_crtc_helper_set_config], encoder changed, full mode switch
	[drm:drm_crtc_helper_set_config], crtc changed, full mode switch
	[drm:drm_crtc_helper_set_config], setting connector 7 crtc to ffff880037ac8800
	[drm:drm_crtc_helper_set_config], attempting to set mode from userspace
	[drm:drm_mode_debug_printmodeline], Modeline 12:"1366x768" 60 71080 1366 1414 1446 1494 768 770 775 793 0x48 0x9
	[drm:drm_mode_debug_printmodeline], Modeline 12:"1366x768" 60 71080 1366 1414 1446 1494 768 770 775 793 0x48 0x9
	[drm] LVDS-8: set mode 1366x768 c
	[drm:drm_crtc_helper_set_config], 
	[drm:drm_crtc_helper_set_config], crtc: ffff880037ac8800 4 fb: ffff8800762ee900 connectors: ffff880076503d40 num_connectors: 1 (x, y) (0, 0)
	[drm:drm_crtc_helper_set_config], crtc has no fb, full mode set
	[drm:drm_crtc_helper_set_config], encoder changed, full mode switch
	[drm:drm_crtc_helper_set_config], crtc changed, full mode switch
	[drm:drm_crtc_helper_set_config], setting connector 7 crtc to ffff880037ac8800
	[drm:drm_crtc_helper_set_config], attempting to set mode from userspace
	[drm:drm_mode_debug_printmodeline], Modeline 21:"_NET_SYSTEM_TRAY_OPCODE" 0 71080 1366 1414 1446 1494 768 770 775 793 0x0 0x9
	[drm:drm_mode_debug_printmodeline], Modeline 21:"_NET_SYSTEM_TRAY_OPCODE" 0 71080 1366 1414 1446 1494 768 770 775 793 0x0 0x9
	[drm] LVDS-8: set mode _NET_SYSTEM_TRAY_OPCODE 15
	[drm:drm_mode_getconnector], connector id 7:
	[drm:drm_helper_probe_single_connector_modes], LVDS-1
	[drm:drm_helper_probe_single_connector_modes], Probed modes for LVDS-1
	[drm:drm_mode_debug_printmodeline], Modeline 27:"1366x768" 60 71080 1366 1414 1446 1494 768 770 775 793 0x48 0x9
	[drm:drm_mode_getconnector], connector id 7:
	[drm:drm_mode_getconnector], connector id 5:
	[drm:drm_helper_probe_single_connector_modes], VGA-1
	[drm:drm_helper_probe_single_connector_modes], VGA-1 is disconnected
	[drm:drm_mode_getconnector], connector id 5:
	[drm:drm_helper_probe_single_connector_modes], VGA-1
	[drm:drm_helper_probe_single_connector_modes], VGA-1 is disconnected
	[drm:drm_mode_getconnector], connector id 14:
	[drm:drm_helper_probe_single_connector_modes], HDMI Type A-1
	[drm:drm_helper_probe_single_connector_modes], HDMI Type A-1 is disconnected
	[drm:drm_mode_getconnector], connector id 14:
	[drm:drm_helper_probe_single_connector_modes], HDMI Type A-1
	[drm:drm_helper_probe_single_connector_modes], HDMI Type A-1 is disconnected
	[drm:drm_mode_getconnector], connector id 16:
	[drm:drm_helper_probe_single_connector_modes], DisplayPort-1
	[drm:drm_helper_probe_single_connector_modes], DisplayPort-1 is disconnected
	[drm:drm_mode_getconnector], connector id 16:
	[drm:drm_helper_probe_single_connector_modes], DisplayPort-1
	[drm:drm_helper_probe_single_connector_modes], DisplayPort-1 is disconnected
	[drm:drm_mode_getconnector], connector id 18:
	[drm:drm_helper_probe_single_connector_modes], DisplayPort-2
	[drm:drm_helper_probe_single_connector_modes], DisplayPort-2 is disconnected
	[drm:drm_mode_getconnector], connector id 18:
	[drm:drm_helper_probe_single_connector_modes], DisplayPort-2
	[drm:drm_helper_probe_single_connector_modes], DisplayPort-2 is disconnected
	[drm:drm_mode_getconnector], connector id 7:
	[drm:drm_helper_probe_single_connector_modes], LVDS-1
	[drm:drm_helper_probe_single_connector_modes], Probed modes for LVDS-1
	[drm:drm_mode_debug_printmodeline], Modeline 27:"1366x768" 60 71080 1366 1414 1446 1494 768 770 775 793 0x48 0x9
	[drm:drm_mode_getconnector], connector id 7:
	[drm:drm_mode_getconnector], connector id 5:
	[drm:drm_helper_probe_single_connector_modes], VGA-1
	[drm:drm_helper_probe_single_connector_modes], VGA-1 is disconnected
	[drm:drm_mode_getconnector], connector id 5:
	[drm:drm_helper_probe_single_connector_modes], VGA-1
	[drm:drm_helper_probe_single_connector_modes], VGA-1 is disconnected
	[drm:drm_mode_getconnector], connector id 14:
	[drm:drm_helper_probe_single_connector_modes], HDMI Type A-1
	[drm:drm_helper_probe_single_connector_modes], HDMI Type A-1 is disconnected
	[drm:drm_mode_getconnector], connector id 14:
	[drm:drm_helper_probe_single_connector_modes], HDMI Type A-1
	[drm:drm_helper_probe_single_connector_modes], HDMI Type A-1 is disconnected
	[drm:drm_mode_getconnector], connector id 16:
	[drm:drm_helper_probe_single_connector_modes], DisplayPort-1
	[drm:drm_helper_probe_single_connector_modes], DisplayPort-1 is disconnected
	[drm:drm_mode_getconnector], connector id 16:
	[drm:drm_helper_probe_single_connector_modes], DisplayPort-1
	[drm:drm_helper_probe_single_connector_modes], DisplayPort-1 is disconnected
	[drm:drm_mode_getconnector], connector id 18:
	[drm:drm_helper_probe_single_connector_modes], DisplayPort-2
	[drm:drm_helper_probe_single_connector_modes], DisplayPort-2 is disconnected
	[drm:drm_mode_getconnector], connector id 18:
	[drm:drm_helper_probe_single_connector_modes], DisplayPort-2
	[drm:drm_helper_probe_single_connector_modes], DisplayPort-2 is disconnected

very odd. And I have _no_ idea why this started happening with that whole 
lid code. An dI have no idea why the screen stays black - that odd 
user-mode mode looks like it has roughly the right timings, even if the 
name is apparently corrupted (probably just some old stale pointer). 

			Linus
Comment 10 Linus Torvalds 2009-10-29 15:41:50 UTC
On Wed, 28 Oct 2009, Linus Torvalds wrote:

> 
> 
> On Wed, 28 Oct 2009, Linus Torvalds wrote:
> > 
> > I'll try adding the lock to the i915_resume() path, and see if that helps 
> > any. But right now I have to go fetch the kids from gymnastics..
> 
> That doesn't help.

So in all my traces, I don't actually see _why_ that lid code matters. But 
just to verify that my bisect was ok, I tried a fairly heavy-handed 
revert, and applied it on top of -git as of yesterday on the problematic 
laptop.

And I can confirm that this "just rip out the lid event" patch makes that 
laptop suspend and resume many times in a row with no problems. Without 
this patch, current -git will usually result in a black screen on the 
first resume (although I think it may sometimes make it to the second or 
third one).

But I don't really claim to know/understand exactly why, and what the 
exact problem is. My traces just confuse me.

			Linus

---
 drivers/gpu/drm/i915/i915_drv.c      |    4 ----
 drivers/gpu/drm/i915/i915_drv.h      |    3 ---
 drivers/gpu/drm/i915/intel_display.c |    2 --
 drivers/gpu/drm/i915/intel_lvds.c    |   27 ---------------------------
 4 files changed, 0 insertions(+), 36 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index b93814c..ef4d2e8 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -89,8 +89,6 @@ static int i915_suspend(struct drm_device *dev, pm_message_t state)
 		pci_set_power_state(dev->pdev, PCI_D3hot);
 	}
 
-	dev_priv->suspended = 1;
-
 	return 0;
 }
 
@@ -124,8 +122,6 @@ static int i915_resume(struct drm_device *dev)
 		drm_helper_resume_force_mode(dev);
 	}
 
-	dev_priv->suspended = 0;
-
 	return ret;
 }
 
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 6035d3d..65d1991 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -256,8 +256,6 @@ typedef struct drm_i915_private {
 	unsigned int edp_support:1;
 	int lvds_ssc_freq;
 
-	struct notifier_block lid_notifier;
-
 	int crt_ddc_bus; /* -1 = unknown, else GPIO to use for CRT DDC */
 	struct drm_i915_fence_reg fence_regs[16]; /* assume 965 */
 	int fence_reg_start; /* 4 if userland hasn't ioctl'd us yet */
@@ -274,7 +272,6 @@ typedef struct drm_i915_private {
 	struct drm_i915_display_funcs display;
 
 	/* Register state */
-	bool suspended;
 	u8 saveLBB;
 	u32 saveDSPACNTR;
 	u32 saveDSPBCNTR;
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 3c14240..58b3698 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -24,8 +24,6 @@
  *	Eric Anholt <eric@anholt.net>
  */
 
-#include <linux/module.h>
-#include <linux/input.h>
 #include <linux/i2c.h>
 #include <linux/kernel.h>
 #include "drmP.h"
diff --git a/drivers/gpu/drm/i915/intel_lvds.c b/drivers/gpu/drm/i915/intel_lvds.c
index 98ae3d7..a6d026d 100644
--- a/drivers/gpu/drm/i915/intel_lvds.c
+++ b/drivers/gpu/drm/i915/intel_lvds.c
@@ -656,24 +656,6 @@ static int intel_lvds_get_modes(struct drm_connector *connector)
 	return 0;
 }
 
-static int intel_lid_notify(struct notifier_block *nb, unsigned long val,
-			    void *unused)
-{
-	struct drm_i915_private *dev_priv =
-		container_of(nb, struct drm_i915_private, lid_notifier);
-	struct drm_device *dev = dev_priv->dev;
-
-	if (acpi_lid_open() && !dev_priv->suspended) {
-		mutex_lock(&dev->mode_config.mutex);
-		drm_helper_resume_force_mode(dev);
-		mutex_unlock(&dev->mode_config.mutex);
-	}
-
-	drm_sysfs_hotplug_event(dev_priv->dev);
-
-	return NOTIFY_OK;
-}
-
 /**
  * intel_lvds_destroy - unregister and free LVDS structures
  * @connector: connector to free
@@ -683,14 +665,10 @@ static int intel_lid_notify(struct notifier_block *nb, unsigned long val,
  */
 static void intel_lvds_destroy(struct drm_connector *connector)
 {
-	struct drm_device *dev = connector->dev;
 	struct intel_output *intel_output = to_intel_output(connector);
-	struct drm_i915_private *dev_priv = dev->dev_private;
 
 	if (intel_output->ddc_bus)
 		intel_i2c_destroy(intel_output->ddc_bus);
-	if (dev_priv->lid_notifier.notifier_call)
-		acpi_lid_notifier_unregister(&dev_priv->lid_notifier);
 	drm_sysfs_connector_remove(connector);
 	drm_connector_cleanup(connector);
 	kfree(connector);
@@ -1057,11 +1035,6 @@ out:
 		pwm |= PWM_PCH_ENABLE;
 		I915_WRITE(BLC_PWM_PCH_CTL1, pwm);
 	}
-	dev_priv->lid_notifier.notifier_call = intel_lid_notify;
-	if (acpi_lid_notifier_register(&dev_priv->lid_notifier)) {
-		DRM_DEBUG("lid notifier registration failed\n");
-		dev_priv->lid_notifier.notifier_call = NULL;
-	}
 	drm_sysfs_connector_add(connector);
 	return;
Comment 11 Jesse Barnes 2009-10-29 16:37:14 UTC
On Thu, 29 Oct 2009 08:41:47 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Wed, 28 Oct 2009, Linus Torvalds wrote:
> 
> > 
> > 
> > On Wed, 28 Oct 2009, Linus Torvalds wrote:
> > > 
> > > I'll try adding the lock to the i915_resume() path, and see if
> > > that helps any. But right now I have to go fetch the kids from
> > > gymnastics..
> > 
> > That doesn't help.
> 
> So in all my traces, I don't actually see _why_ that lid code
> matters. But just to verify that my bisect was ok, I tried a fairly
> heavy-handed revert, and applied it on top of -git as of yesterday on
> the problematic laptop.
> 
> And I can confirm that this "just rip out the lid event" patch makes
> that laptop suspend and resume many times in a row with no problems.
> Without this patch, current -git will usually result in a black
> screen on the first resume (although I think it may sometimes make it
> to the second or third one).
> 
> But I don't really claim to know/understand exactly why, and what the 
> exact problem is. My traces just confuse me.

Yeah this looks really weird.  Something in resume is corrupting the
mode that userspace ends up using??  If you want to back out the lid
stuff for 2.6.32 that's ok, even if we're just working around the
problem.

I don't see anything in the lid notifier that might corrupt memory, but
I'll check again.  I'll also go through the userspace interfaces to see
if I can figure out where we'd end up using a corrupted mode.

I'll fix up the locking failure in lvds_init and drm_mode_cleanup too,
and Dave noticed a failure in drm_mode_remove as well; it should be
using drm_mode_destroy (again probably unrelated to your problem, but
it's good to audit this stuff again).
Comment 12 Eric Anholt 2009-10-29 16:53:35 UTC
On Thu, 2009-10-29 at 09:37 -0700, Jesse Barnes wrote:
> On Thu, 29 Oct 2009 08:41:47 -0700 (PDT)
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > 
> > 
> > On Wed, 28 Oct 2009, Linus Torvalds wrote:
> > 
> > > 
> > > 
> > > On Wed, 28 Oct 2009, Linus Torvalds wrote:
> > > > 
> > > > I'll try adding the lock to the i915_resume() path, and see if
> > > > that helps any. But right now I have to go fetch the kids from
> > > > gymnastics..
> > > 
> > > That doesn't help.
> > 
> > So in all my traces, I don't actually see _why_ that lid code
> > matters. But just to verify that my bisect was ok, I tried a fairly
> > heavy-handed revert, and applied it on top of -git as of yesterday on
> > the problematic laptop.
> > 
> > And I can confirm that this "just rip out the lid event" patch makes
> > that laptop suspend and resume many times in a row with no problems.
> > Without this patch, current -git will usually result in a black
> > screen on the first resume (although I think it may sometimes make it
> > to the second or third one).
> > 
> > But I don't really claim to know/understand exactly why, and what the 
> > exact problem is. My traces just confuse me.
> 
> Yeah this looks really weird.  Something in resume is corrupting the
> mode that userspace ends up using??  If you want to back out the lid
> stuff for 2.6.32 that's ok, even if we're just working around the
> problem.
> 
> I don't see anything in the lid notifier that might corrupt memory, but
> I'll check again.  I'll also go through the userspace interfaces to see
> if I can figure out where we'd end up using a corrupted mode.
> 
> I'll fix up the locking failure in lvds_init and drm_mode_cleanup too,
> and Dave noticed a failure in drm_mode_remove as well; it should be
> using drm_mode_destroy (again probably unrelated to your problem, but
> it's good to audit this stuff again).

The mode name corruption may be a red herring, since we're setting
crtc->mode = *mode on modeset and requesting a modeset of &crtc->mode at
force_resume, instead of doing refcounting.  The name might get freed
out from under us if the mode list changes for some reason. (right?)

-- 
Eric Anholt
eric@anholt.net                         eric.anholt@intel.com
Comment 13 Linus Torvalds 2009-11-02 16:50:31 UTC
On Thu, 29 Oct 2009, Jesse Barnes wrote:
> 
> Yeah this looks really weird.  Something in resume is corrupting the
> mode that userspace ends up using??  If you want to back out the lid
> stuff for 2.6.32 that's ok, even if we're just working around the
> problem.

I think I know (at least somewhat) what's up now.

It's actually _two_ independent bugs in the same area (well, they're 
independent, but they're all about the same thing).

What I think happens is:

 - commit c1c7af60892070e4b82ad63bbfb95ae745056de0 was buggy, and causes 
   both the lid open and the resume to set the mode at the same time, and 
   the end result is a black screen.

 - that bug was fixed in commit 06891e27a9b5dba5268bb80e41a283f51335afe7

 - .. but in the meantime, in between the bug and the fix, the same bug 
   crept in as commit 06324194eee97a51b5f172270df49ec39192d6cc! Now we 
   generate a KMS uevent, and so now X will try to set the mode with an 
   ioctl as a result (both on close _and_ open).

 - part of the problem with commit 06324194ee was at least partially that 
   we generate the event for the lid _close_ too, and for all I know X may 
   do odd things.  I say "may" because I don't actually know or care what 
   X does, but I _do_ know that my traces show X using ioctl's to set the 
   crazy modes, and I know it doesn't happen if I don't send the lid close 
   event.

 - end result: it doesn't matter which commit I test: anything after that 
   original c1c7af608 is broken in the same way (== no screen), but I 
   think the breakage may be due to different reasons.

Now, I won't guarantee the above is correct, but it seems to fit the data. 
And based on it, I can cook up a patch that works for me. It is kind of 
based on the same concept as commit 06891e27 (that was supposed to fix the 
original breakage), but instead of trying to have a '->suspended' flag to 
keep lid events separate from the suspend path events, it has a 
'->modeset_on_lid' flag that says "should I care about lid events".

Suspend/resume routines both set ->modeset_on_lid to false, because it is 
no longer a flag for "am I suspended", but now it's a flag for "should I 
restore the video mode on lid open". And for the suspend/resume paths, the 
answer is "no, the video mode will be/was restored by resume".

But equally importantly, it also gets rid of the uevents for the close 
lid case. Otherwise I get this when closing the lid, just before suspend:

	[drm:drm_mode_getconnector], connector id 7:
	[drm:drm_helper_probe_single_connector_modes], LVDS-1
	[drm:drm_helper_probe_single_connector_modes], LVDS-1 is disconnected
	[drm:drm_mode_debug_printmodeline], Modeline 21:"1366x768" 60 71080 1366 1414 1446 1494 768 770 775 793 0x48 0x9
	[drm:drm_mode_prune_invalid], Not using 1366x768 mode -3
	[drm:drm_mode_getconnector], connector id 7:
	[drm:drm_helper_probe_single_connector_modes], LVDS-1
	[drm:drm_helper_probe_single_connector_modes], LVDS-1 is disconnected
	[drm:drm_crtc_helper_set_config], 
	[drm:drm_crtc_helper_set_config], crtc: ffff880037ab8000 4 fb: ffff880037a8fc00 connectors: ffff88007643ddc0 num_connectors: 1 (x, y) (0, 0)
	[drm:drm_crtc_helper_set_config], setting connector 7 crtc to ffff880037ab8000
	[drm:drm_crtc_helper_set_config], 
	[drm:drm_crtc_helper_set_config], crtc: ffff880037ab8000 4 fb: ffff880037a8fc00 connectors: ffff88007643ddc0 num_connectors: 1 (x, y) (0, 0)
	[drm:drm_crtc_helper_set_config], setting connector 7 crtc to ffff880037ab8000

and that thing _before_ the suspend seems to be implicated in the machine 
not resuming with any graphics. Possibly because X "remembers" that 
lid-down state, so when X resumes, it will resume with the disconnected 
thing.

I also got rid of the 'lid open' uevent, because while that one doesn't 
seem as catastrophic, it does seem to sometimes cause X to not switch back 
to the right VT. Don't ask me why.

Quite frankly, I don't know what the point of those uevents are, but this 
does seem to matter on my laptop. Maybe it's just timing, but I think it 
does make X do something. I don't see the insane modes from X any more 
when I drop the uevent notifications.

I dunno. This may not be entirely the right thing (what if the machine was 
started or resumed with the lid closed?), but the appended patch does seem 
to make my laptop work again for the case I care about (namely "suspend 
and resume by closing/opening the lid").

Oh, and I realize very well that this may be Xorg version dependent. This 
laptop is running Fedora-12 beta, maybe that whole uevent handling is 
something that differs between different X versions?

Comments?

But right now the alternative, as far as I'm concerned, would seem to be 
to just revert _all_ the lid logic entirely as per the previous "rip it 
all out" patch. This one only rips out the KMS uevent cases, and massages 
the lid open logic a bit - but leaves the basic 'restore mode on lid open' 
case.

Does anybody have any other suggestions?

		Linus

---
 drivers/gpu/drm/i915/i915_drv.c   |    5 +++--
 drivers/gpu/drm/i915/i915_drv.h   |    2 +-
 drivers/gpu/drm/i915/intel_lvds.c |   25 ++++++++++++++++++++-----
 3 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index b93814c..7f436ec 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -89,7 +89,8 @@ static int i915_suspend(struct drm_device *dev, pm_message_t state)
 		pci_set_power_state(dev->pdev, PCI_D3hot);
 	}
 
-	dev_priv->suspended = 1;
+	/* Modeset on resume, not lid events */
+	dev_priv->modeset_on_lid = 0;
 
 	return 0;
 }
@@ -124,7 +125,7 @@ static int i915_resume(struct drm_device *dev)
 		drm_helper_resume_force_mode(dev);
 	}
 
-	dev_priv->suspended = 0;
+	dev_priv->modeset_on_lid = 0;
 
 	return ret;
 }
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 6035d3d..c5df223 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -274,7 +274,7 @@ typedef struct drm_i915_private {
 	struct drm_i915_display_funcs display;
 
 	/* Register state */
-	bool suspended;
+	bool modeset_on_lid;
 	u8 saveLBB;
 	u32 saveDSPACNTR;
 	u32 saveDSPBCNTR;
diff --git a/drivers/gpu/drm/i915/intel_lvds.c b/drivers/gpu/drm/i915/intel_lvds.c
index 98ae3d7..808bbe4 100644
--- a/drivers/gpu/drm/i915/intel_lvds.c
+++ b/drivers/gpu/drm/i915/intel_lvds.c
@@ -656,6 +656,15 @@ static int intel_lvds_get_modes(struct drm_connector *connector)
 	return 0;
 }
 
+/*
+ * Lid events. Note the use of 'modeset_on_lid':
+ *  - we set it on lid close, and reset it on open
+ *  - we use it as a "only once" bit (ie we ignore
+ *    duplicate events where it was already properly
+ *    set/reset)
+ *  - the suspend/resume paths will also set it to
+ *    zero, since they restore the mode ("lid open").
+ */
 static int intel_lid_notify(struct notifier_block *nb, unsigned long val,
 			    void *unused)
 {
@@ -663,13 +672,19 @@ static int intel_lid_notify(struct notifier_block *nb, unsigned long val,
 		container_of(nb, struct drm_i915_private, lid_notifier);
 	struct drm_device *dev = dev_priv->dev;
 
-	if (acpi_lid_open() && !dev_priv->suspended) {
-		mutex_lock(&dev->mode_config.mutex);
-		drm_helper_resume_force_mode(dev);
-		mutex_unlock(&dev->mode_config.mutex);
+	if (!acpi_lid_open()) {
+		dev_priv->modeset_on_lid = 1;
+		return NOTIFY_OK;
 	}
 
-	drm_sysfs_hotplug_event(dev_priv->dev);
+	if (!dev_priv->modeset_on_lid)
+		return NOTIFY_OK;
+
+	dev_priv->modeset_on_lid = 0;
+
+	mutex_lock(&dev->mode_config.mutex);
+	drm_helper_resume_force_mode(dev);
+	mutex_unlock(&dev->mode_config.mutex);
 
 	return NOTIFY_OK;
 }
Comment 14 Jesse Barnes 2009-11-02 17:14:48 UTC
On Mon, 2 Nov 2009 08:49:55 -0800 (PST)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Thu, 29 Oct 2009, Jesse Barnes wrote:
> > 
> > Yeah this looks really weird.  Something in resume is corrupting the
> > mode that userspace ends up using??  If you want to back out the lid
> > stuff for 2.6.32 that's ok, even if we're just working around the
> > problem.
> 
> I think I know (at least somewhat) what's up now.
> 
> It's actually _two_ independent bugs in the same area (well, they're 
> independent, but they're all about the same thing).
> 
> What I think happens is:
> 
>  - commit c1c7af60892070e4b82ad63bbfb95ae745056de0 was buggy, and
> causes both the lid open and the resume to set the mode at the same
> time, and the end result is a black screen.
> 
>  - that bug was fixed in commit
> 06891e27a9b5dba5268bb80e41a283f51335afe7
> 
>  - .. but in the meantime, in between the bug and the fix, the same
> bug crept in as commit 06324194eee97a51b5f172270df49ec39192d6cc! Now
> we generate a KMS uevent, and so now X will try to set the mode with
> an ioctl as a result (both on close _and_ open).
> 
>  - part of the problem with commit 06324194ee was at least partially
> that we generate the event for the lid _close_ too, and for all I
> know X may do odd things.  I say "may" because I don't actually know
> or care what X does, but I _do_ know that my traces show X using
> ioctl's to set the crazy modes, and I know it doesn't happen if I
> don't send the lid close event.
> 
>  - end result: it doesn't matter which commit I test: anything after
> that original c1c7af608 is broken in the same way (== no screen), but
> I think the breakage may be due to different reasons.

Ok, that analysis makes some sense.  It's only in recent versions of
Fedora that X will listen to hotplug events, but I'm still not sure why
the mode it uses gets corrupted.  Nonetheless, avoiding a bunch of mode
setting noise is probably a good thing in itself.

> Now, I won't guarantee the above is correct, but it seems to fit the
> data. And based on it, I can cook up a patch that works for me. It is
> kind of based on the same concept as commit 06891e27 (that was
> supposed to fix the original breakage), but instead of trying to have
> a '->suspended' flag to keep lid events separate from the suspend
> path events, it has a '->modeset_on_lid' flag that says "should I
> care about lid events".
> 
> Suspend/resume routines both set ->modeset_on_lid to false, because
> it is no longer a flag for "am I suspended", but now it's a flag for
> "should I restore the video mode on lid open". And for the
> suspend/resume paths, the answer is "no, the video mode will be/was
> restored by resume".
> 
> But equally importantly, it also gets rid of the uevents for the
> close lid case. Otherwise I get this when closing the lid, just
> before suspend:
> 
>       [drm:drm_mode_getconnector], connector id 7:
>       [drm:drm_helper_probe_single_connector_modes], LVDS-1
>       [drm:drm_helper_probe_single_connector_modes], LVDS-1 is
> disconnected [drm:drm_mode_debug_printmodeline], Modeline
> 21:"1366x768" 60 71080 1366 1414 1446 1494 768 770 775 793 0x48 0x9
> [drm:drm_mode_prune_invalid], Not using 1366x768 mode -3
> [drm:drm_mode_getconnector], connector id 7:
> [drm:drm_helper_probe_single_connector_modes], LVDS-1
> [drm:drm_helper_probe_single_connector_modes], LVDS-1 is disconnected
> [drm:drm_crtc_helper_set_config], [drm:drm_crtc_helper_set_config],
> crtc: ffff880037ab8000 4 fb: ffff880037a8fc00 connectors:
> ffff88007643ddc0 num_connectors: 1 (x, y) (0, 0)
> [drm:drm_crtc_helper_set_config], setting connector 7 crtc to
> ffff880037ab8000 [drm:drm_crtc_helper_set_config],
> [drm:drm_crtc_helper_set_config], crtc: ffff880037ab8000 4 fb:
> ffff880037a8fc00 connectors: ffff88007643ddc0 num_connectors: 1 (x,
> y) (0, 0) [drm:drm_crtc_helper_set_config], setting connector 7 crtc
> to ffff880037ab8000
> 
> and that thing _before_ the suspend seems to be implicated in the
> machine not resuming with any graphics. Possibly because X
> "remembers" that lid-down state, so when X resumes, it will resume
> with the disconnected thing.
> 
> I also got rid of the 'lid open' uevent, because while that one
> doesn't seem as catastrophic, it does seem to sometimes cause X to
> not switch back to the right VT. Don't ask me why.
> 
> Quite frankly, I don't know what the point of those uevents are, but
> this does seem to matter on my laptop. Maybe it's just timing, but I
> think it does make X do something. I don't see the insane modes from
> X any more when I drop the uevent notifications.
> 
> I dunno. This may not be entirely the right thing (what if the
> machine was started or resumed with the lid closed?), but the
> appended patch does seem to make my laptop work again for the case I
> care about (namely "suspend and resume by closing/opening the lid").

Yeah, the main point of the lid event stuff was to handle cases like
that.  If you start up with the lid closed but then open it later
presumably you want X to reconfigure itself onto your newly opened
display.  Likewise if you plug in an external monitor and close the lid
(but don't suspend).  Apparently X is having trouble with it though.

> Oh, and I realize very well that this may be Xorg version dependent.
> This laptop is running Fedora-12 beta, maybe that whole uevent
> handling is something that differs between different X versions?
> 
> Comments?
> 
> But right now the alternative, as far as I'm concerned, would seem to
> be to just revert _all_ the lid logic entirely as per the previous
> "rip it all out" patch. This one only rips out the KMS uevent cases,
> and massages the lid open logic a bit - but leaves the basic 'restore
> mode on lid open' case.
> 
> Does anybody have any other suggestions?

Yeah, the patch looks fine; removing the uevents altogether is probably
the safest path until X sorts itself out.

Thanks,
Jesse
Comment 15 Riccardo Magliocchetti 2009-11-03 21:33:34 UTC
Tried 2.6.32-rc6 with no joy, now when i try to resume from suspend it fails and goes back to suspension again, then i retry again, it resumes but the screen stays black.

Speaking of userspace i don't have the latest and greatest stuff: X server is 1.6.5, intel driver is 2.9.0 and drm 2.4.14 from debian sid.
Comment 16 Riccardo Magliocchetti 2009-11-04 17:59:09 UTC
Created attachment 23649 [details]
extract from /var/log/messages of boot, failed suspend, suspend and resume with blank screen
Comment 17 Riccardo Magliocchetti 2009-11-04 18:00:04 UTC
Created attachment 23650 [details]
X logs
Comment 18 Riccardo Magliocchetti 2009-11-04 18:01:41 UTC
I've attached messages and X logs as requested by Linus. Will setup netconsole or serial cable if needed.
Comment 19 Linus Torvalds 2009-11-06 03:54:28 UTC
[ Added Rafael to the cc due to the suspend/resume thing, and Krzysztof 
  due to him having touched fbcon locking - you touched it last, so 
  "tag, you're it!"

  Basically, I have a laptop with some odd resume problems with intel 
  graphics, and while it now works for me, it doesn't work for some 
  others, so it's likely timing-related or something ]

On Wed, 28 Oct 2009, Linus Torvalds wrote:
> 
> But I added some more debugging, and there are actually _three_ cases. 

 [ of touching the graphics mode ]

> The first one is from the resume itself:

 [ This one is very much expected and required ]

> the second one is from fbcon:
> 
>  [<ffffffff81365a47>] drm_crtc_helper_set_mode+0x27a/0x355
>  [<ffffffff81366506>] drm_crtc_helper_set_config+0x5ae/0x79f
>  [<ffffffff81363e3a>] drm_fb_helper_pan_display+0xb1/0x11d
>  [<ffffffff812d1b9e>] fb_pan_display+0xcf/0x126
>  [<ffffffff812e1014>] bit_update_start+0x33/0x6c
>  [<ffffffff812de10f>] fbcon_switch+0x41f/0x43b
>  [<ffffffff8134eb77>] redraw_screen+0xe1/0x197
>  [<ffffffff812e077f>] fbcon_event_notify+0xbb/0x4fc
>  [<ffffffff8155f9a3>] notifier_call_chain+0x44/0x86
>  [<ffffffff81096361>] __blocking_notifier_call_chain+0x5d/0x88
>  [<ffffffff810963b3>] blocking_notifier_call_chain+0x27/0x3d
>  [<ffffffff812d16f2>] fb_notifier_call_chain+0x2e/0x44
>  [<ffffffff812d20b1>] fb_set_suspend+0x65/0x8a
>  [<ffffffff812d7034>] store_fbstate+0x57/0x7e
>  [<ffffffff813b60d5>] dev_attr_store+0x33/0x49
>  [<ffffffff811c5ba1>] sysfs_write_file+0x106/0x156
>  [<ffffffff81154baf>] vfs_write+0xbd/0x12e
>  [<ffffffff81154d12>] sys_write+0x59/0x91
>  [<ffffffff81012032>] system_call_fastpath+0x16/0x1b
> 
> and the third one is from X doing it.
> 
>  [<ffffffff81365a47>] drm_crtc_helper_set_mode+0x27a/0x355
>  [<ffffffff81366506>] drm_crtc_helper_set_config+0x5ae/0x79f
>  [<ffffffff813788b6>] drm_mode_setcrtc+0x2f9/0x33f
>  [<ffffffff8136b7bb>] drm_ioctl+0x237/0x2f4
>  [<ffffffff81167aac>] vfs_ioctl+0x7e/0xaa
>  [<ffffffff8116800d>] do_vfs_ioctl+0x496/0x4f2
>  [<ffffffff811680ce>] sys_ioctl+0x65/0x9c
>  [<ffffffff81012032>] system_call_fastpath+0x16/0x1b
> 
> I don't know why the 'lid' thing seems to matter, but I think it's the 
> last one that always screws up the mode for some reason.

I'm wondering if it's the fbcon one that screws up.

I'm looking at the fbcon resume path, and it looks like fbcon does a 
FB_EVENT_RESUME callback when coming back. Which results in 
fbcon_resumed() being called, which in turn does a unconditional 
"update_screen(vc);" call.

And there doesn't seem to be any protection for the screen being in 
graphics mode and owned by the X server etc. So maybe fbcon stomps on some 
X server settings when resuming?

That FB_EVENT_RESUME callback happens late, too. Long after we've unfrozen 
user processes, and after we've already released the console semaphore 
from the suspend path. So I could easily see X owning the screen at that 
point.

				Linus
Comment 20 Rafael J. Wysocki 2009-11-06 18:28:54 UTC
On Friday 06 November 2009, Linus Torvalds wrote:
> 
> [ Added Rafael to the cc due to the suspend/resume thing, and Krzysztof 
>   due to him having touched fbcon locking - you touched it last, so 
>   "tag, you're it!"
> 
>   Basically, I have a laptop with some odd resume problems with intel 
>   graphics, and while it now works for me, it doesn't work for some 
>   others, so it's likely timing-related or something ]

Actually, I'm currently fighting with a strange resume-from-hibernation issue
which causes my box (Toshiba Portege R500) to hang hard, apparently during
the final switch from a VT to X after devices have been resumed already.

This definitely is a regression for me, most probably a regression from 2.6.31,
and since I'm sort of able to reproduce it, I'm trying to identify the commit
that exposed it.

> On Wed, 28 Oct 2009, Linus Torvalds wrote:
> > 
> > But I added some more debugging, and there are actually _three_ cases. 
> 
>  [ of touching the graphics mode ]
> 
> > The first one is from the resume itself:
> 
>  [ This one is very much expected and required ]
> 
> > the second one is from fbcon:
> > 
> >  [<ffffffff81365a47>] drm_crtc_helper_set_mode+0x27a/0x355
> >  [<ffffffff81366506>] drm_crtc_helper_set_config+0x5ae/0x79f
> >  [<ffffffff81363e3a>] drm_fb_helper_pan_display+0xb1/0x11d
> >  [<ffffffff812d1b9e>] fb_pan_display+0xcf/0x126
> >  [<ffffffff812e1014>] bit_update_start+0x33/0x6c
> >  [<ffffffff812de10f>] fbcon_switch+0x41f/0x43b
> >  [<ffffffff8134eb77>] redraw_screen+0xe1/0x197
> >  [<ffffffff812e077f>] fbcon_event_notify+0xbb/0x4fc
> >  [<ffffffff8155f9a3>] notifier_call_chain+0x44/0x86
> >  [<ffffffff81096361>] __blocking_notifier_call_chain+0x5d/0x88
> >  [<ffffffff810963b3>] blocking_notifier_call_chain+0x27/0x3d
> >  [<ffffffff812d16f2>] fb_notifier_call_chain+0x2e/0x44
> >  [<ffffffff812d20b1>] fb_set_suspend+0x65/0x8a
> >  [<ffffffff812d7034>] store_fbstate+0x57/0x7e
> >  [<ffffffff813b60d5>] dev_attr_store+0x33/0x49
> >  [<ffffffff811c5ba1>] sysfs_write_file+0x106/0x156
> >  [<ffffffff81154baf>] vfs_write+0xbd/0x12e
> >  [<ffffffff81154d12>] sys_write+0x59/0x91
> >  [<ffffffff81012032>] system_call_fastpath+0x16/0x1b
> > 
> > and the third one is from X doing it.
> > 
> >  [<ffffffff81365a47>] drm_crtc_helper_set_mode+0x27a/0x355
> >  [<ffffffff81366506>] drm_crtc_helper_set_config+0x5ae/0x79f
> >  [<ffffffff813788b6>] drm_mode_setcrtc+0x2f9/0x33f
> >  [<ffffffff8136b7bb>] drm_ioctl+0x237/0x2f4
> >  [<ffffffff81167aac>] vfs_ioctl+0x7e/0xaa
> >  [<ffffffff8116800d>] do_vfs_ioctl+0x496/0x4f2
> >  [<ffffffff811680ce>] sys_ioctl+0x65/0x9c
> >  [<ffffffff81012032>] system_call_fastpath+0x16/0x1b
> > 
> > I don't know why the 'lid' thing seems to matter, but I think it's the 
> > last one that always screws up the mode for some reason.
> 
> I'm wondering if it's the fbcon one that screws up.
> 
> I'm looking at the fbcon resume path, and it looks like fbcon does a 
> FB_EVENT_RESUME callback when coming back. Which results in 
> fbcon_resumed() being called, which in turn does a unconditional 
> "update_screen(vc);" call.
> 
> And there doesn't seem to be any protection for the screen being in 
> graphics mode and owned by the X server etc. So maybe fbcon stomps on some 
> X server settings when resuming?
> 
> That FB_EVENT_RESUME callback happens late, too. Long after we've unfrozen 
> user processes, and after we've already released the console semaphore 
> from the suspend path. So I could easily see X owning the screen at that 
> point.

That indeed looks bogus.

Whoever can reproduce the issue, please compile fbcon out and see if that
helps.
Comment 21 Linus Torvalds 2009-11-06 19:01:52 UTC
On Fri, 6 Nov 2009, Rafael J. Wysocki wrote:
> > 
> > That FB_EVENT_RESUME callback happens late, too. Long after we've unfrozen 
> > user processes, and after we've already released the console semaphore 
> > from the suspend path. So I could easily see X owning the screen at that 
> > point.
> 
> That indeed looks bogus.

Sadly, it doesn't seem to matter.

I've tried the appended patch, and afaik, it changes nothing.

Even with just two modesets - one by the resume code, and one by X, I'm 
coming back with a black screen. But with current -git, it only happens if 
compiz is enabled (and mostly - but not always - it's accompanied by the 
GPU being frozen, so X kills itself, and then X restarts and everything is 
fine).

And for me, it seems to only happen if I use the lid to suspend/resume. If 
I resume by mousing down System -> Shutdown -> Suspend, it all works fine. 
That's true even if I then close the lid _while_ the machine is suspended, 
and then open it to resume the machine.

So it seems to be the lid close itself that triggers something for me. I 
wonder if the lid close event does something bad to the GPU state (ACPI, 
SMM, whatever crazy crap), and then when we suspend (because the lid close 
_also_ results in a suspend event, of course), we save that bad state for 
resume.

I assume that's not the behavior you see (or do you have some "hibernate 
on lid close" mode?), so these problems are probably independent things.

> Whoever can reproduce the issue, please compile fbcon out and see if that
> helps.

You can't do KMS without fbcon.

			Linus

--- harmless patch that looks ok, and doesn't matter ---
 drivers/video/console/fbcon.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/video/console/fbcon.c b/drivers/video/console/fbcon.c
index 5a686ce..0cc4e83 100644
--- a/drivers/video/console/fbcon.c
+++ b/drivers/video/console/fbcon.c
@@ -2841,7 +2841,8 @@ static void fbcon_resumed(struct fb_info *info)
 		return;
 	vc = vc_cons[ops->currcon].d;
 
-	update_screen(vc);
+	if (CON_IS_VISIBLE(vc) && vc->vc_mode == KD_TEXT)
+		update_screen(vc);
 }
 
 static void fbcon_modechanged(struct fb_info *info)
Comment 22 Anonymous Emailer 2009-11-06 20:40:11 UTC
Reply-To: krzysztof.h1@poczta.fm

On Fri, 6 Nov 2009 19:30:06 +0100
"Rafael J. Wysocki" <rjw@sisk.pl> wrote:

> On Friday 06 November 2009, Linus Torvalds wrote:
> > 
> > [ Added Rafael to the cc due to the suspend/resume thing, and Krzysztof 
> >   due to him having touched fbcon locking - you touched it last, so 
> >   "tag, you're it!"
> > 
> >   Basically, I have a laptop with some odd resume problems with intel 
> >   graphics, and while it now works for me, it doesn't work for some 
> >   others, so it's likely timing-related or something ]
> 
> Actually, I'm currently fighting with a strange resume-from-hibernation issue
> which causes my box (Toshiba Portege R500) to hang hard, apparently during
> the final switch from a VT to X after devices have been resumed already.
> 

I have tried to reproduce this but I got different results on Asus A3500L + Ubuntu 9.04 + git kernel updated today:
- fbcon + intelfb: the laptop boots and works (tried only one suspend/resume cycle)
- fbcon + i915 drm (no fb console): the laptop boots, X starts, mouse works but no menu is displayed
- fbcon + i915 kms: the laptop boots, blank screen, ctrl+alt+del does not work

It seems that there is also an i915 drm problem regardless resume issue.

Regards,
Krzysztof

-----------------------------------------------------------------------------
Dzwonki na telefon!
Pobierz >> http://link.interia.pl/f2411
Comment 23 Riccardo Magliocchetti 2009-11-14 20:30:30 UTC
Created attachment 23785 [details]
2.6.32-rc7 messages

Update to 2.6.32-rc7 Now i have an interesting line 835:

Nov 14 19:21:57 montag kernel: [   81.775991] PM: Device PNP0C0D:00 failed to resume: error 1

where PNP0C0D:00 is the lid switch:

Nov 14 19:20:35 montag kernel: [    2.167094] input: Lid Switch as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0D:00/input/input1
Nov 14 19:20:35 montag kernel: [    2.167186] ACPI: Lid Switch [LID0]
Comment 24 Andy Isaacson 2009-11-20 03:01:17 UTC
(In reply to comment #21)
> --- harmless patch that looks ok, and doesn't matter ---
>  drivers/video/console/fbcon.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/video/console/fbcon.c b/drivers/video/console/fbcon.c
> index 5a686ce..0cc4e83 100644
> --- a/drivers/video/console/fbcon.c
> +++ b/drivers/video/console/fbcon.c
> @@ -2841,7 +2841,8 @@ static void fbcon_resumed(struct fb_info *info)
>          return;
>      vc = vc_cons[ops->currcon].d;
> 
> -    update_screen(vc);
> +    if (CON_IS_VISIBLE(vc) && vc->vc_mode == KD_TEXT)
> +        update_screen(vc);
>  }
> 
>  static void fbcon_modechanged(struct fb_info *info)

Even with this patch I'm seeing hangs (no capslock, no network traffic) and failure to restore video modes on a Dell E4300 with GM45 graphics.

I'm running 66b00a7 with the above patch.

If I boot with nomodeset=1, everything works as expected -- I can close the laptop to suspend, open it to resume.

If I boot with KMS and come to the GDK login screen, then close the lid, the laptop will suspend (the power LED goes to the "breathing" state) and on resume the screen stays black.  Logging in over the network and "killall Xorg" results in GDM starting a new Xorg which fixes it.

If I boot with KMS and log in to the Gnome desktop, then close the lid, the laptop hangs hard with the LCD black and the fans running.  The keyboard LEDs are not blinking and the machine isn't responding to the network.  running ps in a tight loop reported that pm-suspend is running when it crashed, but running pm-suspend without closing the lid does not trigger a crash.  netconsole doesn't report anything interesting.

Twice out of perhaps 20 boots I've seen the following via netconsole:

[  100.575511] [drm] TV-25: set mode NTSC 480i 0
[  101.026415] [drm] TV-25: set mode NTSC 480i 0
[  103.216031] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[  103.216043] render error detected, EIR: 0x00000010
[  103.216048]   IPEIR: 0x00000000
[  103.216051]   IPEHR: 0x01800020
[  103.216054]   INSTDONE: 0xffffffff
[  103.216057]   INSTPS: 0x8001e020
[  103.216060]   INSTDONE1: 0xbfffffff
[  103.216063]   ACTHD: 0x0281c014
[  103.216067] page table error
[  103.216070]   PGTBL_ER: 0x00100000
[  103.216075] [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking
[  103.216073] render error detected, EIR: 0x00000010
[  103.216075]   IPEIR: 0x00000000
[  103.216077]   IPEHR: 0x01800020
[  103.216078]   INSTDONE: 0xffffffff
[  103.216079]   INSTPS: 0x8001e020
[  103.216080]   INSTDONE1: 0xbfffffff
[  103.216081]   ACTHD: 0x0281c014
[  103.216083] page table error
[  103.216084]   PGTBL_ER: 0x00100000
[  103.216086] [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking
[  103.216144] [drm:i915_wait_request] *ERROR* i915_wait_request returns -5 (awaiting 4929 at 4925)

but it's not consistent.

I *do* see
[  169.388562] PM: Device PNP0C0D:00 failed to resume: error 1
after a successful pm-suspend invocation and resume.

If I revert back to 2.6.31.4 (actually Ubuntu's tree with some local patches) then close-lid-to-suspend works, and I don't see PNP0C0D failure reported after resume.
Comment 25 Jesse Barnes 2009-11-20 16:39:14 UTC
What about with Linus's revert from comment #10?  (though maybe that's already in the tree, I haven't checked).
Comment 26 Andy Isaacson 2009-11-20 17:17:46 UTC
(In reply to comment #25)
> What about with Linus's revert from comment #10?  (though maybe that's
> already
> in the tree, I haven't checked).

The patch in comment #10 doesn't apply to linus' tip because of c9354c85c1c7bac788ce57d3c17f2016c1c45b1d, which is a pretty thorough rework of the same codepath and was hoped to resolve this issue.
Comment 27 Andy Isaacson 2009-11-20 18:40:52 UTC
Aha, if I turn off CONFIG_DMAR then a8a8a66 stops hanging on lid-close-suspend -- instead on resume I consistently get a black screen that can be recovered by restarting Xorg.

I do have a ton of this in dmesg:
[  470.951471] [drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged

I'll attach my complete log.
Comment 28 Andy Isaacson 2009-11-20 18:41:35 UTC
Created attachment 23845 [details]
dmesg-2.6.32-rc8-cvp20091022-00011-ga8a8a66
Comment 29 Andy Isaacson 2009-11-20 19:17:06 UTC
Created attachment 23846 [details]
serialconsole-2.6.32-rc8-cvp20091022-00011-ga8a8a66

Got serial console configured and got more output.  There's a corrupt mode name in this dmesg:

00000000  5b 20 20 34 37 34 2e 30  34 39 32 32 30 5d 20 5b  |[  474.049220] [|
00000010  64 72 6d 5d 20 4c 56 44  53 2d 38 3a 20 73 65 74  |drm] LVDS-8: set|
00000020  20 6d 6f 64 65 20 ef bf  bd ef bf bd ef bf bd 20  | mode ......... |
00000030  32 37 0a                                          |27.|
00000033

I thought that was expected to be resolved by recent patches?  This boot is with Linus' a8a8a66 without my local merge of comment #21's patch.
Comment 30 Andy Isaacson 2009-11-20 23:20:32 UTC
OK, I ran 14 reboot cycles with a8a8a66, docked, with serial console hooked up. In every case I had GDM auto-login to an actual desktop, using compiz.

In every case I got the following during the suspend sequence:
[   51.932719] [drm:i915_wait_request] *ERROR* i915_wait_request returns -5 (awaiting 3457 at 3447)
[   52.264040] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung

In almost every case (I'm not sure about the first 4 boots because I didn't notice it was variable, but it was present in the last 10) I got at least one of the following before suspend:
[   52.286074] [drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged

In every case the machine successfully printed "PM: Entering mem sleep" on serial.

In every case but one it successfully went to sleep (the power LED went to the "breathing" state and the fans shut off).  In one case (boot #11) the fans and the power LED stayed on, with no additional output on the serial console.  I had to power cycle the machine by holding down the power button.

In 6 cases (boots 1, 2, 4, 5, 6, and 10) the machine did not resume from sleep -- the BIOS seemed to come out of sleep (the power LED is solid, the fans are on, the CD drive did its startup seek) but the kernel never printed anything after "Entering mem sleep".

In the remaining 7 cases the kernel resumed successfully to a black screen.  I get a stream of:
[   71.293840] [drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
[   71.300030] [drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
[   71.306254] [drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
[   71.312551] [drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
[   71.319326] [drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
if I kill Xorg it's restarted by GDM, and I get a cursor (that moves when I touch the touchpad), but in all 7 cases I never saw the desktop get painted.  I can Ctrl-Alt-F1 to switch back to tty1 and that seems to work fine.  While X is running I get sporadic additional "Execbuf while wedged" messages.

I'm running Karmic with:

xserver-xorg-video-intel             2:2.9.0-1ubuntu2
libdrm-intel1                        2.4.14-1ubuntu1
xserver-xorg-core                    2:1.6.4-2ubuntu4
Comment 31 Andy Isaacson 2009-11-20 23:50:11 UTC
I just ran 6 more suspends.  Three of them resumed OK and did the "Execbuf while wedged" failure mode, one suspended and failed to resume, and two failed to suspend.
Comment 32 Andy Isaacson 2009-11-21 00:17:06 UTC
Running pm-suspend directly (and resuming by pressing the power button), I just completed 13 suspend/resume cycles without a hiccup; the display came back correctly every time.
Comment 33 Jesse Barnes 2009-12-10 21:41:37 UTC
I wonder if this patch helps at all?  Without it, we may have ended up reading or writing random memory (or invalid memory).  The fact that behavior changes when DMAR is disabled made me think of this one.

commit fc61901373987ad61851ed001fe971f3ee8d96a3
Author: David Woodhouse <dwmw2@infradead.org>
Date:   Wed Dec 2 11:00:05 2009 +0000

    agp/intel-agp: Clear entire GTT on startup
Comment 34 Andy Isaacson 2009-12-11 01:34:01 UTC
(In reply to comment #33)
> I wonder if this patch helps at all?  Without it, we may have ended up
> reading
> or writing random memory (or invalid memory).  The fact that behavior changes
> when DMAR is disabled made me think of this one.
> 
> commit fc61901373987ad61851ed001fe971f3ee8d96a3
> Author: David Woodhouse <dwmw2@infradead.org>
> Date:   Wed Dec 2 11:00:05 2009 +0000
> 
>     agp/intel-agp: Clear entire GTT on startup

I tried fc6190, no significant change to my failure -- I still see GPU hangs on lid close, whether or not DMAR is enabled and whether or not I choose "suspend" or "blank screen" for the gnome-power-manager "what do I do on lid close" option.

If I shortcircuit intel_lvds_detect to always return connector_status_connected, the problem goes away entirely.

I've also updated to the latest BIOS for this Dell E4300, and I think that improves the DMAR situation but it has no impact on the GPU hangs.  I don't see hangs on Lenovo hardware.

Here's drm.debug=4 for a GPU hang with fc6190:

 380.810967] [drm:drm_mode_getconnector], connector id 5:
[  380.810974] [drm:drm_helper_probe_single_connector_modes], VGA-1
[  380.824213] [drm:drm_helper_probe_single_connector_modes], VGA-1 is disconnected
[  380.824220] [drm:drm_mode_getconnector], connector id 5:
[  380.824224] [drm:drm_helper_probe_single_connector_modes], VGA-1
[  380.832046] [drm:drm_helper_probe_single_connector_modes], VGA-1 is disconnected
[  380.832090] [drm:drm_mode_getconnector], connector id 7:
[  380.832095] [drm:drm_helper_probe_single_connector_modes], LVDS-1
[  380.834173] [drm:drm_helper_probe_single_connector_modes], LVDS-1 is disconnected
[  380.834176] [drm:drm_mode_debug_printmodeline], Modeline 33:"1280x800" 60 82500 1280 1352 1480 1655 800 803 809 831 0x48 0x9
[  380.834180] [drm:drm_mode_prune_invalid], Not using 1280x800 mode -3
[  380.834184] [drm:drm_mode_debug_printmodeline], Modeline 34:"1280x800" 40 56300 1280 1352 1480 1694 800 803 809 831 0x40 0x9
[  380.834188] [drm:drm_mode_prune_invalid], Not using 1280x800 mode -3
[  380.834192] [drm:drm_mode_getconnector], connector id 7:
[  380.834194] [drm:drm_helper_probe_single_connector_modes], LVDS-1
[  380.836140] [drm:drm_helper_probe_single_connector_modes], LVDS-1 is disconnected
[  380.836226] [drm:drm_mode_getconnector], connector id 14:
[  380.836229] [drm:drm_helper_probe_single_connector_modes], DisplayPort-1
[  380.836232] [drm:drm_helper_probe_single_connector_modes], DisplayPort-1 is disconnected
[  380.836235] [drm:drm_mode_getconnector], connector id 14:
[  380.836237] [drm:drm_helper_probe_single_connector_modes], DisplayPort-1
[  380.836240] [drm:drm_helper_probe_single_connector_modes], DisplayPort-1 is disconnected
[  380.836253] [drm:drm_mode_getconnector], connector id 16:
[  380.836256] [drm:drm_helper_probe_single_connector_modes], DisplayPort-2
[  380.836259] [drm:drm_helper_probe_single_connector_modes], DisplayPort-2 is disconnected
[  380.836261] [drm:drm_mode_getconnector], connector id 16:
[  380.836264] [drm:drm_helper_probe_single_connector_modes], DisplayPort-2
[  380.836266] [drm:drm_helper_probe_single_connector_modes], DisplayPort-2 is disconnected
[  380.836278] [drm:drm_mode_getconnector], connector id 18:
[  380.836281] [drm:drm_helper_probe_single_connector_modes], SVIDEO-1
[  380.836290] [drm:intel_update_watermarks], plane A (pipe 0) clock: 107520
[  380.836292] [drm:intel_update_watermarks], plane B (pipe 1) clock: 82500
[  380.864634] [drm:intel_crtc_mode_set], Mode for pipe A:
[  380.864637] [drm:drm_mode_debug_printmodeline], Modeline 0:"NTSC 480i" 0 107520 1280 1368 1496 1712 1024 1027 1034 1104 0x40 0x0
[  380.865088] [drm:intel_crtc_mode_set], disabling CxSR downclocking
[  380.892195] [drm:intel_pipe_set_base], No FB bound
[  380.892199] [drm:intel_update_watermarks], plane A (pipe 0) clock: 107520
[  380.892202] [drm:intel_update_watermarks], plane B (pipe 1) clock: 82500
[  380.892208] [drm] TV-19: set mode NTSC 480i 0
[  380.916054] [drm:intel_update_watermarks], plane A (pipe 0) clock: 107520
[  380.916058] [drm:intel_update_watermarks], plane B (pipe 1) clock: 82500
[  380.988192] [drm:intel_tv_detect_type], No TV connection detected
[  381.258822] [drm:intel_update_watermarks], plane B (pipe 1) clock: 82500
[  381.304036] [drm:g4x_disable_fbc], disabled FBC
[  381.328189] [drm:drm_helper_probe_single_connector_modes], SVIDEO-1 is disconnected
[  381.328222] [drm:drm_mode_getconnector], connector id 18:
[  381.328226] [drm:drm_helper_probe_single_connector_modes], SVIDEO-1
[  381.328237] [drm:intel_update_watermarks], plane A (pipe 0) clock: 107520
[  381.352481] [drm:intel_crtc_mode_set], Mode for pipe A:
[  381.352483] [drm:drm_mode_debug_printmodeline], Modeline 0:"NTSC 480i" 0 107520 1280 1368 1496 1712 1024 1027 1034 1104 0x40 0x0
[  381.352935] [drm:intel_crtc_mode_set], disabling CxSR downclocking
[  381.376065] [drm:intel_pipe_set_base], No FB bound
[  381.376068] [drm:intel_update_watermarks], plane A (pipe 0) clock: 107520
[  381.376074] [drm] TV-19: set mode NTSC 480i 0
[  381.400282] [drm:intel_update_watermarks], plane A (pipe 0) clock: 107520
[  381.472204] [drm:intel_tv_detect_type], No TV connection detected
[  381.520205] [drm:g4x_disable_fbc], disabled FBC
[  381.544356] [drm:drm_helper_probe_single_connector_modes], SVIDEO-1 is disconnected
[  381.561972] [drm:drm_mode_getconnector], connector id 5:
[  381.561978] [drm:drm_helper_probe_single_connector_modes], VGA-1
[  381.569055] [drm:drm_helper_probe_single_connector_modes], VGA-1 is disconnected
[  381.569062] [drm:drm_mode_getconnector], connector id 5:
[  381.569066] [drm:drm_helper_probe_single_connector_modes], VGA-1
[  381.577236] [drm:drm_helper_probe_single_connector_modes], VGA-1 is disconnected
[  381.577275] [drm:drm_mode_getconnector], connector id 7:
[  381.577280] [drm:drm_helper_probe_single_connector_modes], LVDS-1
[  381.578987] [drm:drm_helper_probe_single_connector_modes], LVDS-1 is disconnected
[  381.578991] [drm:drm_mode_getconnector], connector id 7:
[  381.578993] [drm:drm_helper_probe_single_connector_modes], LVDS-1
[  381.581001] [drm:drm_helper_probe_single_connector_modes], LVDS-1 is disconnected
[  381.581022] [drm:drm_mode_getconnector], connector id 14:
[  381.581025] [drm:drm_helper_probe_single_connector_modes], DisplayPort-1
[  381.581029] [drm:drm_helper_probe_single_connector_modes], DisplayPort-1 is disconnected
[  381.581032] [drm:drm_mode_getconnector], connector id 14:
[  381.581035] [drm:drm_helper_probe_single_connector_modes], DisplayPort-1
[  381.581038] [drm:drm_helper_probe_single_connector_modes], DisplayPort-1 is disconnected
[  381.581053] [drm:drm_mode_getconnector], connector id 16:
[  381.581056] [drm:drm_helper_probe_single_connector_modes], DisplayPort-2
[  381.581060] [drm:drm_helper_probe_single_connector_modes], DisplayPort-2 is disconnected
[  381.581063] [drm:drm_mode_getconnector], connector id 16:
[  381.581066] [drm:drm_helper_probe_single_connector_modes], DisplayPort-2
[  381.581069] [drm:drm_helper_probe_single_connector_modes], DisplayPort-2 is disconnected
[  381.581082] [drm:drm_mode_getconnector], connector id 18:
[  381.581085] [drm:drm_helper_probe_single_connector_modes], SVIDEO-1
[  381.581095] [drm:intel_update_watermarks], plane A (pipe 0) clock: 107520
[  381.604667] [drm:intel_crtc_mode_set], Mode for pipe A:
[  381.604670] [drm:drm_mode_debug_printmodeline], Modeline 0:"NTSC 480i" 0 107520 1280 1368 1496 1712 1024 1027 1034 1104 0x40 0x0
[  381.605121] [drm:intel_crtc_mode_set], disabling CxSR downclocking
[  381.628243] [drm:intel_pipe_set_base], No FB bound
[  381.628247] [drm:intel_update_watermarks], plane A (pipe 0) clock: 107520
[  381.628253] [drm] TV-19: set mode NTSC 480i 0
[  381.652225] [drm:intel_update_watermarks], plane A (pipe 0) clock: 107520
[  381.724256] [drm:intel_tv_detect_type], No TV connection detected
[  381.772230] [drm:g4x_disable_fbc], disabled FBC
[  381.796370] [drm:drm_helper_probe_single_connector_modes], SVIDEO-1 is disconnected
[  381.796379] [drm:drm_mode_getconnector], connector id 18:
[  381.796382] [drm:drm_helper_probe_single_connector_modes], SVIDEO-1
[  381.796387] [drm:intel_update_watermarks], plane A (pipe 0) clock: 107520
[  381.820629] [drm:intel_crtc_mode_set], Mode for pipe A:
[  381.820631] [drm:drm_mode_debug_printmodeline], Modeline 0:"NTSC 480i" 0 107520 1280 1368 1496 1712 1024 1027 1034 1104 0x40 0x0
[  381.821082] [drm:intel_crtc_mode_set], disabling CxSR downclocking
[  381.844198] [drm:intel_pipe_set_base], No FB bound
[  381.844202] [drm:intel_update_watermarks], plane A (pipe 0) clock: 107520
[  381.844207] [drm] TV-19: set mode NTSC 480i 0
[  381.868201] [drm:intel_update_watermarks], plane A (pipe 0) clock: 107520
[  381.940148] [drm:intel_tv_detect_type], No TV connection detected
[  381.988125] [drm:g4x_disable_fbc], disabled FBC
[  382.012261] [drm:drm_helper_probe_single_connector_modes], SVIDEO-1 is disconnected
[  383.216325] [drm:intel_crtc_cursor_set], 
[  383.216329] [drm:intel_crtc_cursor_set], cursor off
[  384.120035] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[  384.120045] render error detected, EIR: 0x00000000
[  384.120063] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 14504 at 14502)
[  384.145043] [drm:g4x_disable_fbc], disabled FBC
[  384.389227] wlan0: deauthenticating from 00:0b:0e:46:5c:00 by local choice (reason=3)
[  384.389274] wlan0: deauthenticating from 00:0b:0e:46:5c:00 by local choice (reason=3)
[  384.452536] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[  384.452540] render error detected, EIR: 0x00000000
[  384.452568] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 14510 at 14502)
[  384.477052] [drm:g4x_disable_fbc], disabled FBC
[  384.665352] PM: Syncing filesystems ... done.
[  384.697214] PM: Preparing system for mem sleep
[  384.776198] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[  384.776203] render error detected, EIR: 0x00000000
[  384.776225] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 14514 at 14502)
[  384.801233] [drm:g4x_disable_fbc], disabled FBC
[  385.100090] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[  385.100095] render error detected, EIR: 0x00000000
[  385.100112] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 14517 at 14502)
[  385.125083] [drm:g4x_disable_fbc], disabled FBC
[  385.596197] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[  385.596202] render error detected, EIR: 0x00000000
[  385.596218] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 14519 at 14502)
[  385.621213] [drm:g4x_disable_fbc], disabled FBC
[  385.622225] PM: Entering mem sleep
cv
Comment 35 Riccardo Magliocchetti 2009-12-12 15:11:18 UTC
(In reply to comment #33)
> I wonder if this patch helps at all?  Without it, we may have ended up
> reading
> or writing random memory (or invalid memory).  The fact that behavior changes
> when DMAR is disabled made me think of this one.
> 
> commit fc61901373987ad61851ed001fe971f3ee8d96a3
> Author: David Woodhouse <dwmw2@infradead.org>
> Date:   Wed Dec 2 11:00:05 2009 +0000
> 
>     agp/intel-agp: Clear entire GTT on startup

It does not help and i've always had DMAR disabled.
Comment 36 Santi 2010-01-07 23:21:33 UTC
I think I had the same issue, but it seems to be fixed with the latest git version v2.6.33-rc3-97-g2c1f189. In particular, the following commit has all the points:

commit a256537
Author: Zhao Yakui <yakui.zhao@intel.com>
Date:   Fri Dec 11 02:26:11 2009

    drm/i915: Update LVDS connector status when receiving ACPI LID event

    Dirk reports that nothing is displayed on LVDS when using ubuntu 9.1 after
    close/reopen the LID. And I also reproduce this issue on another laptop.
    After some tests and debug, it seems that it is related with that the
    LVDS status is not updated in time in course of suspend/resume.
    
    Now the LID state is used to check whether the LVDS is connected or
    disconnected. And when the LID is closed, it means that the LVDS is
    disconnected. When it is reopened, it means that the LVDS is connected.
    At the same time on some distributions the LID event is also used to put
    the system into suspend state. When the LID is closed, the system will enter
    the suspend state. When the LID is reopened, the system will be resumed.
    
    In such case when the LID is closed, user-space script will receive the LID
    notification event and  detect the LVDS as disconnected. Then the system will
    enter the suspended state. When the LID is reopened, the system will be
    resumed. As the LVDS status is not updated in course of resume, it will cause
    that the LVDS connector is marked as unused and disabled. After the resume is
    finished,user-space script will try to configure the display mode for LVDS.
    But unfortunately as the LVDS status is not updated in time and it is still
    marked as disconnected, the LVDS and its corresponding CRTC will be disabled
    again in the function of drm_helper_disable_unused_functions after changing
    mode for LVDS.
    
    So we had better check and update the status of LVDS connector after receiving
    the LID notication event. Then after the system is resumed from suspended
    state, we can set the display mode for LVDS correctly.
Comment 37 Riccardo Magliocchetti 2010-01-08 14:59:59 UTC
(In reply to comment #36)
> I think I had the same issue, but it seems to be fixed with the latest git
> version v2.6.33-rc3-97-g2c1f189. In particular, the following commit has all
> the points:
> 
> commit a256537
> Author: Zhao Yakui <yakui.zhao@intel.com>
> Date:   Fri Dec 11 02:26:11 2009
> 
>     drm/i915: Update LVDS connector status when receiving ACPI LID event

Santi, i've seen the commit and i'm already running latest git; unfortunately now gnome power manager refuses to suspend the machine on lid close (also with older kernels) so i need to find out another way to suspend the machine when closing lid. Thanks.
Comment 38 Riccardo Magliocchetti 2010-01-14 12:50:31 UTC
(In reply to comment #37)
> (In reply to comment #36)
> > I think I had the same issue, but it seems to be fixed with the latest git
> > version v2.6.33-rc3-97-g2c1f189. In particular, the following commit has
> all
> > the points:
> > 
> > commit a256537
> > Author: Zhao Yakui <yakui.zhao@intel.com>
> > Date:   Fri Dec 11 02:26:11 2009
> > 
> >     drm/i915: Update LVDS connector status when receiving ACPI LID event
> 
> Santi, i've seen the commit and i'm already running latest git; unfortunately
> now gnome power manager refuses to suspend the machine on lid close (also
> with
> older kernels) so i need to find out another way to suspend the machine when
> closing lid. Thanks.

So i managed to suspend my machine closing the lid, unfortunately still not sucessfully. I see that the machine suspend, then i click the power button to resume and i see a login prompt (so some progress here) then the machine goes back to suspension again. I click the power button again and i have a black screen. I see that i have some messages warning about not having a working touchpad / mouse (i8042) after resume.

I've updated the software a bit:
- linux is 2.6.33-rc4
- xserver is 1.7.4
- intel video driver 2.10.0
- libdrm is git  06a2d6567e5aadc2e109942f71afae76a8398969

I'm attaching kernel and xorg logs.
Comment 39 Riccardo Magliocchetti 2010-01-14 12:51:45 UTC
Created attachment 24557 [details]
Boot and suspend / resume cycle logs with kernel 2.6.33-rc4

For easy grep suspend starts at 12:00:22
Comment 40 Riccardo Magliocchetti 2010-01-14 12:52:54 UTC
Created attachment 24558 [details]
xorg logs, don't know if they are useful
Comment 41 Jesse Barnes 2010-02-11 18:59:10 UTC
Assuming you're getting a bunch of extra LID events, this patch should work around the problem, care to give it a try?

diff --git a/drivers/gpu/drm/i915/intel_lvds.c b/drivers/gpu/drm/i915/intel_lvds
index 75a9772..9f04788 100644
--- a/drivers/gpu/drm/i915/intel_lvds.c
+++ b/drivers/gpu/drm/i915/intel_lvds.c
@@ -643,6 +643,8 @@ static enum drm_connector_status intel_lvds_detect(struct dr
 {
        enum drm_connector_status status = connector_status_connected;
 
+       return status;
+
        if (!acpi_lid_open() && !dmi_check_system(bad_lid_status))
                status = connector_status_disconnected;
 
@@ -702,6 +704,8 @@ static int intel_lid_notify(struct notifier_block *nb, unsig
        struct drm_device *dev = dev_priv->dev;
        struct drm_connector *connector = dev_priv->int_lvds_connector;
 
+       return NOTIFY_OK;
+
        /*
         * check and update the status of LVDS connector after receiving
         * the LID nofication event.
Comment 42 Andy Isaacson 2010-02-12 05:44:13 UTC
On Thu, Feb 11, 2010 at 10:59:46AM -0800, bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=14484
> --- Comment #41 from Jesse Barnes <jbarnes@virtuousgeek.org>  2010-02-11
> 18:59:10 ---
> Assuming you're getting a bunch of extra LID events, this patch should work
> around the problem, care to give it a try?
> 
> diff --git a/drivers/gpu/drm/i915/intel_lvds.c
> b/drivers/gpu/drm/i915/intel_lvds
> index 75a9772..9f04788 100644
> --- a/drivers/gpu/drm/i915/intel_lvds.c
> +++ b/drivers/gpu/drm/i915/intel_lvds.c
> @@ -643,6 +643,8 @@ static enum drm_connector_status intel_lvds_detect(struct
> dr
>  {
>         enum drm_connector_status status = connector_status_connected;
> 
> +       return status;
> +
>         if (!acpi_lid_open() && !dmi_check_system(bad_lid_status))
>                 status = connector_status_disconnected;
> 
> @@ -702,6 +704,8 @@ static int intel_lid_notify(struct notifier_block *nb,
> unsig
>         struct drm_device *dev = dev_priv->dev;
>         struct drm_connector *connector = dev_priv->int_lvds_connector;
> 
> +       return NOTIFY_OK;
> +
>         /*
>          * check and update the status of LVDS connector after receiving
>          * the LID nofication event.

bugzilla's web UI is broken ("Can't connect to the database.  Error: Too
many connections") so here's some email. :)


(In reply to comment #41)
> Assuming you're getting a bunch of extra LID events, this patch should
> work around the problem, care to give it a try?

I tried out this patch on top of Linus' tip, e28cab4.  Since I've seen
some inconsistency on the exact behavior in recent versions, first I
tried git tip, and two variables: on AC or on battery, and logging
into GDM versus logging into getty and then running startx.

Booting into GDM, e28cab4 works correctly.  Closing the lid triggers
the appropriate action from the Gnome power management control panel,
and it's correctly behaving according to my power settings when I'm on
AC versus battery.

If I stop GDM, log in at getty, and "startx", then close the lid,
e28cab4 fails with:
[ 5578.904249] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer
elapsed... GPU hung
when I open the lid.  This is quite reliable -- 5/5 failures when X was
started by startx as a non-root user, and dozens of successful lid
closes when running X from GDM.

Then I applied your patch.  Now, I cannot get Gnome to suspend the
laptop, either on AC power or on battery power.  I still see lid
events on /dev/input/event0 using evtest, but the suspend-on-lid-close
does not trigger.  Also, I have not triggered "GPU hung" in a few
dozen tries, but since I suspect that's connected to the
suspend-on-lid-close behavior, that's not suprising.

-andy
Comment 43 Riccardo Magliocchetti 2010-02-12 11:07:41 UTC
(In reply to comment #41)
> Assuming you're getting a bunch of extra LID events, this patch should work
> around the problem, care to give it a try?

Sure, here's the results:

v2.6.33-rc7-199-g676ad58 -> X hang but resume works fine the first time
v2.6.33-rc7-199-g676ad58 + your patch -> everything ok
Comment 44 Matthias Hopf 2010-02-12 11:11:29 UTC
I think this is closely related to bug 14997. The patch in comment #41 works around the issue there. Haven't seen negative side effects yet (apart from probably slightly higher power consumption due to CRTC still running). Lid events still work, also I don't see any negative side effects regarding suspend or hibernate.
Comment 45 Jesse Barnes 2010-02-12 21:14:08 UTC
Andy, I think you're running into multiple errors.  libdrm has received some major fixes recently for crashes and GPU hangs, can you update that just to rule out the bugs fixed there?

I don't understand why my patch would keep your machine from suspending at all...  What if you change the "return NOTIFY_OK;" to "return NOTIFY_DONE;"?  Maybe something in the LID event handling breaks with a simple NOTIFY_OK...
Comment 46 Andy Isaacson 2010-02-16 19:40:31 UTC
(In reply to comment #45)
> Andy, I think you're running into multiple errors.  libdrm has received some
> major fixes recently for crashes and GPU hangs, can you update that just to
> rule out the bugs fixed there?

Agreed, I think I am seeing multiple problems with similar failure modes.

I've updated to the latest bits (as of last week) in the xorg-edgers PPA:

xserver-xorg             1:7.4+3ubuntu10
xserver-xorg-core        2:1.6.5+git20091107+server-1.6-branch.2dbcb06a-0ubuntu0sarvatt~karmic
xserver-xorg-video-intel 2:2.10.0+git20100210.f0d760bf-0ubuntu0sarvatt2~karmic
libdrm-intel1            2.4.17+git20100210.4f0f8717-0ubuntu0sarvatt~karmic
libdrm2                  2.4.17+git20100210.4f0f8717-0ubuntu0sarvatt~karmic

and 2.6.33-rc8-00026-g0813e22, and with this configuration I'm now seeing GPU hangs on lid close again.  (Booted to GDM, running on AC, Gnome configured to blank LCD on lid close on AC.)  Switching to tty1 then back to X seems to have revived it.

[    0.679470] [drm] Initialized drm 1.1.0 20060810
[    0.679799] i915 0000:00:02.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[    0.680225] i915 0000:00:02.0: setting latency timer to 64
[    0.695534] [drm] MTRR allocation failed.  Graphics performance may suffer.
[    0.696037] i915 0000:00:02.0: irq 27 for MSI/MSI-X
[    0.696042] [drm] set up 31M of stolen space
[    1.751699] fb0: inteldrmfb frame buffer device
[    1.784417] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0
[  177.749238] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[  177.762042] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 2461 at 2457)


Rebooting now with drm.debug=7...
Comment 47 Riccardo Magliocchetti 2010-04-21 12:21:55 UTC
I'm using patch in comment #41 without the first hunk since a few because intel_lvds_detect now returns connector_status_connected everytime.

Today I've tried 2.6.34-rc5 without the second hunk ("return NOTIFY_OK;" in intel_lid_notify()) and had no problems after suspending. Will do a few suspend cycles and report back later.
Comment 48 Riccardo Magliocchetti 2010-04-26 17:57:16 UTC
Confirm that vanilla kernel works fine for me, thanks.
Comment 49 Rafael J. Wysocki 2010-04-26 18:37:09 UTC
Closing, then, right?