Bug 27722 - SNB/ILK: VT-switch causes lock-up
Summary: SNB/ILK: VT-switch causes lock-up
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - Intel) (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: drivers_video-dri-intel@kernel-bugs.osdl.org
URL:
Keywords:
: 17181 (view as bug list)
Depends on:
Blocks:
 
Reported: 2011-01-28 10:55 UTC by Takashi Iwai
Modified: 2011-02-19 23:45 UTC (History)
7 users (show)

See Also:
Kernel Version: 2.6.37
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Kernel mesasge (7.49 KB, text/plain)
2011-01-28 11:00 UTC, Takashi Iwai
Details
Is this the fix? Finally? (1.63 KB, patch)
2011-01-28 11:59 UTC, Chris Wilson
Details | Diff
A band-aid fix (2.65 KB, patch)
2011-01-28 19:25 UTC, Takashi Iwai
Details | Diff
Avoid BUG_ON() in i915_gem_object_pin/unpin (1.56 KB, patch)
2011-01-28 19:31 UTC, Takashi Iwai
Details | Diff
0000 - simplify NOFB (1.31 KB, patch)
2011-01-31 11:20 UTC, Chris Wilson
Details | Diff
0001 - Disable attached encoders for NOFB (1.14 KB, patch)
2011-01-31 11:21 UTC, Chris Wilson
Details | Diff
Simplify NOFB (1.58 KB, patch)
2011-01-31 12:11 UTC, Chris Wilson
Details | Diff

Description Takashi Iwai 2011-01-28 10:55:14 UTC
After plugging/unplugging the external VGA monitor, changing between VT1 and VT7 repeatedly triggers a lock-up on SandyBridge and IronLake machines.

The procedure is:

- Start X (not necessarily GNOME; twm or bare X also suffers)
- Plug a VGA monitor
- run "xrandr --auto"
- Switch to VT1, back to VT7
- Unplug the VGA
- run "xrandr --auto"
- Switch to VT1, back to VT7
- Repeat to switch to VT1/V7; at once or second repeat, the machine locks up.

Tested on two SandyBridge laptops and one IronLake laptop.

This happens with the vanilla 2.6.37 kernel.  And, it doesn't happen with
2.6.36 kernel.  So, it's yet-another-2.6.37 regression.

I quickly tested 2.6.38-rc2+ as of commit 6fb1b30, but the problem still exists.
Comment 1 Takashi Iwai 2011-01-28 10:59:21 UTC
I could capture an Oops from 2.6.38-rc2.  Attached below.
Comment 2 Takashi Iwai 2011-01-28 11:00:03 UTC
Created attachment 45342 [details]
Kernel mesasge
Comment 3 Chris Wilson 2011-01-28 11:06:29 UTC
This was first reported against 2.6.35.
Comment 4 Takashi Iwai 2011-01-28 11:20:08 UTC
Just by a wild guess (bisecting 2.6.37 for SandyBridge is too painful), I tried to revert the commit below:

commit cdd59983118c02d9c5ba0c116ded1faef47ec452
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Sep 8 16:30:16 2010 +0100

    drm/i915: Drop crtc->fb pin on disable.
    
    In order to handle disable_functions() where the framebuffer is
    decoupled from the crtc we need to unpin the fb in order to prevent a
    leak.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=29857
    Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Bingo, now no more lock-up.
Comment 5 Chris Wilson 2011-01-28 11:23:07 UTC
That's still a symptom of the same bug, reverting will just cause it BUG later when the pin count overflows.
Comment 6 Takashi Iwai 2011-01-28 11:32:52 UTC
But not at switching VT twice or so :)
It's too easy to reproduce (in 5 seconds).
 
So, something still wrong in the refcounting, right?
Comment 7 Chris Wilson 2011-01-28 11:59:10 UTC
Created attachment 45352 [details]
Is this the fix? Finally?

Found another, and this looks like it could well be the root of all evil.
Comment 8 Takashi Iwai 2011-01-28 19:24:07 UTC
Unfortunately not.  It caused a black screen at boot.  But it's close.
It seems that somewhere still needs crtc->fb even though crtc is not used.

I fixed temporarily by adding a new flag.  Yes, it's ugly.
Comment 9 Takashi Iwai 2011-01-28 19:25:04 UTC
Created attachment 45392 [details]
A band-aid fix
Comment 10 Takashi Iwai 2011-01-28 19:26:37 UTC
BTW, can we replace BUG_ON() with WARN_ON() and return in i915_gem_object_{pin|unpin}()?  It's really painful to have a freeze in the graphic driver although this could be a non-critical at this moment.  Then you'll have a better chance to survive a bit longer and saves proper logs.
Comment 11 Takashi Iwai 2011-01-28 19:31:05 UTC
Created attachment 45402 [details]
Avoid BUG_ON() in i915_gem_object_pin/unpin
Comment 12 Chris Wilson 2011-01-28 20:23:57 UTC
Which is the setup with the black screen on boot? Any clues?

I agree that OOPSing in the middle of the boot process is a bad thing to do, but it is telling us that we have critically unpinned the buffer too early and who knows what damage the GPU may do whilst it points at the wrong memory.
Comment 13 Takashi Iwai 2011-01-28 20:36:47 UTC
(In reply to comment #12)
> Which is the setup with the black screen on boot? Any clues?

When i915 module is loaded and X starts, LVDS gets off soon while the backlight is kept on.  And somewhere it locks up.  It didn't give proper Oops via netconsole, unfortunately.

I have no machine here at home, so I can test first on the next Monday.

> I agree that OOPSing in the middle of the boot process is a bad thing to do,
> but it is telling us that we have critically unpinned the buffer too early
> and
> who knows what damage the GPU may do whilst it points at the wrong memory.

The problem isn't only at boot.  The worst case is during S3/S4.  Since the serial port is a rare precious on laptop, it's very hard to get a log when BUG_ON() is triggered during S3/S4.
Comment 14 Takashi Iwai 2011-01-29 15:25:49 UTC
(In reply to comment #13)
> (In reply to comment #12)
> > Which is the setup with the black screen on boot? Any clues?
> 
> When i915 module is loaded and X starts, LVDS gets off soon while the
> backlight
> is kept on.  And somewhere it locks up.  It didn't give proper Oops via
> netconsole, unfortunately.

Sorry, my fault: after double-check, I seem to have checked the wrongly patched kernel.  I rebuilt cleanly the kernel with your patch, and now it starts working.

The patch looks promising.  I'm going to test more on Monday.
Comment 15 Chris Wilson 2011-01-29 16:27:11 UTC
Takashi, thanks for double checking!

I was getting extremely worried there, and I look forward for a confirmation on Monday... :)
Comment 16 Takashi Iwai 2011-01-31 10:08:13 UTC
Confirmed the patch in comment 7 works.

Though, I'm not entire sure whether it's safe to skip the NULL check of set->mode at that point.  Shouldn't we keep the check there?

Also, after looking over the whole code, I think it's cleaner to implement drm_crtc_helper_set_mode() to pass the new fb instead of the old fb.  Then crtc->fb is changed solely in that function, and you can omit set->crtc->enabled check in drm_crtc_helper_set_config() but just calls drm_crtc_helper_set_mode() directly.

Thanks.
Comment 17 Chris Wilson 2011-01-31 11:20:38 UTC
Created attachment 45622 [details]
0000 - simplify NOFB
Comment 18 Chris Wilson 2011-01-31 11:21:36 UTC
Created attachment 45632 [details]
0001 - Disable attached encoders for NOFB

This prevents the almighty bang when set->mode is NULL...
Comment 19 Chris Wilson 2011-01-31 11:24:39 UTC
Takashi, I'd appreciate your review on those two patches (in conjunction with the earlier path) to check that they are the simplest fix.

I'm hesitating to modify the calling semantics of drm_crtc_helper_set_mode() since it is used elsewhere... And I think we now are correct, and the patch is so small...
Comment 20 Chris Wilson 2011-01-31 11:40:42 UTC
On review, the second patch was broken and *unnecessary*. No connectors decouples all encoders, hence disabling the crtc.
Comment 21 Takashi Iwai 2011-01-31 11:56:29 UTC
Hmm, I'm slightly confused.

The patch in comment 17 sets mode->set to NULL forcibly when no fb is set.
But, you removed mode->set NULL check in the patch in comment 7.
And if I understand correctly, mode_changed still can become true, thus it may pass NULL to drm_crtc_helper_set_mode().

Or does drm_helper_crtc_in_use() always return false in such a case?

Regarding the change of drm_crtc_helper_set_mode(): I don't suggest to rewrite for the regression fixes, of course :)  I suggested it in case you refactor the code, as it'd be cleaner in that way.
Comment 22 Chris Wilson 2011-01-31 12:10:07 UTC
It's not the simplest of functions to keep track of...

Let's focus on the NOFB, disabling the CRTC case.

set->fb = NULL (and by convention set->mode = 0 et al)

=> full modeset

a) traverse passed in connector list and get encoders for them
=> no change, set->num_connectors == 0, so new_encoder === connector->encoder (and so not reset)


b) if (connector->encoder->crtc == set->crtc) connector->encoder->crtc = NULL

So given set->fb == NULL, the crtc is decoupled from all encoders and so drm_crtc_is_enabled() will always return false.

So in order to prevent the set->mode NULL deference we need to ensure set->fb is NULL when ever set->mode is NULL and vice versa.

Slight adjustment to simplify NOFB required...
Comment 23 Chris Wilson 2011-01-31 12:11:36 UTC
Created attachment 45642 [details]
Simplify NOFB
Comment 24 Takashi Iwai 2011-01-31 12:26:39 UTC
(In reply to comment #22)
> It's not the simplest of functions to keep track of...
> 
> Let's focus on the NOFB, disabling the CRTC case.
> 
> set->fb = NULL (and by convention set->mode = 0 et al)
> 
> => full modeset
> 
> a) traverse passed in connector list and get encoders for them
> => no change, set->num_connectors == 0, so new_encoder === connector->encoder
> (and so not reset)
> 
> 
> b) if (connector->encoder->crtc == set->crtc) connector->encoder->crtc = NULL
> 
> So given set->fb == NULL, the crtc is decoupled from all encoders and so
> drm_crtc_is_enabled() will always return false.
> 
> So in order to prevent the set->mode NULL deference we need to ensure set->fb
> is NULL when ever set->mode is NULL and vice versa.
> 
> Slight adjustment to simplify NOFB required...

Thanks, this enlightens me well!

Meanwhile, I tested the patch with various mode changes, and it seems working, no Oops or such.

For both patches,
    Tested-by: Takashi Iwai <tiwai@suse.de>
Comment 25 Chris Wilson 2011-01-31 14:45:09 UTC
*** Bug 17181 has been marked as a duplicate of this bug. ***
Comment 27 Florian Mickler 2011-02-19 23:45:58 UTC
both merged in .38-rc4.

Note You need to log in before you can comment on or make changes to this bug.