After plugging/unplugging the external VGA monitor, changing between VT1 and VT7 repeatedly triggers a lock-up on SandyBridge and IronLake machines. The procedure is: - Start X (not necessarily GNOME; twm or bare X also suffers) - Plug a VGA monitor - run "xrandr --auto" - Switch to VT1, back to VT7 - Unplug the VGA - run "xrandr --auto" - Switch to VT1, back to VT7 - Repeat to switch to VT1/V7; at once or second repeat, the machine locks up. Tested on two SandyBridge laptops and one IronLake laptop. This happens with the vanilla 2.6.37 kernel. And, it doesn't happen with 2.6.36 kernel. So, it's yet-another-2.6.37 regression. I quickly tested 2.6.38-rc2+ as of commit 6fb1b30, but the problem still exists.
I could capture an Oops from 2.6.38-rc2. Attached below.
Created attachment 45342 [details] Kernel mesasge
This was first reported against 2.6.35.
Just by a wild guess (bisecting 2.6.37 for SandyBridge is too painful), I tried to revert the commit below: commit cdd59983118c02d9c5ba0c116ded1faef47ec452 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Wed Sep 8 16:30:16 2010 +0100 drm/i915: Drop crtc->fb pin on disable. In order to handle disable_functions() where the framebuffer is decoupled from the crtc we need to unpin the fb in order to prevent a leak. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=29857 Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Bingo, now no more lock-up.
That's still a symptom of the same bug, reverting will just cause it BUG later when the pin count overflows.
But not at switching VT twice or so :) It's too easy to reproduce (in 5 seconds). So, something still wrong in the refcounting, right?
Created attachment 45352 [details] Is this the fix? Finally? Found another, and this looks like it could well be the root of all evil.
Unfortunately not. It caused a black screen at boot. But it's close. It seems that somewhere still needs crtc->fb even though crtc is not used. I fixed temporarily by adding a new flag. Yes, it's ugly.
Created attachment 45392 [details] A band-aid fix
BTW, can we replace BUG_ON() with WARN_ON() and return in i915_gem_object_{pin|unpin}()? It's really painful to have a freeze in the graphic driver although this could be a non-critical at this moment. Then you'll have a better chance to survive a bit longer and saves proper logs.
Created attachment 45402 [details] Avoid BUG_ON() in i915_gem_object_pin/unpin
Which is the setup with the black screen on boot? Any clues? I agree that OOPSing in the middle of the boot process is a bad thing to do, but it is telling us that we have critically unpinned the buffer too early and who knows what damage the GPU may do whilst it points at the wrong memory.
(In reply to comment #12) > Which is the setup with the black screen on boot? Any clues? When i915 module is loaded and X starts, LVDS gets off soon while the backlight is kept on. And somewhere it locks up. It didn't give proper Oops via netconsole, unfortunately. I have no machine here at home, so I can test first on the next Monday. > I agree that OOPSing in the middle of the boot process is a bad thing to do, > but it is telling us that we have critically unpinned the buffer too early > and > who knows what damage the GPU may do whilst it points at the wrong memory. The problem isn't only at boot. The worst case is during S3/S4. Since the serial port is a rare precious on laptop, it's very hard to get a log when BUG_ON() is triggered during S3/S4.
(In reply to comment #13) > (In reply to comment #12) > > Which is the setup with the black screen on boot? Any clues? > > When i915 module is loaded and X starts, LVDS gets off soon while the > backlight > is kept on. And somewhere it locks up. It didn't give proper Oops via > netconsole, unfortunately. Sorry, my fault: after double-check, I seem to have checked the wrongly patched kernel. I rebuilt cleanly the kernel with your patch, and now it starts working. The patch looks promising. I'm going to test more on Monday.
Takashi, thanks for double checking! I was getting extremely worried there, and I look forward for a confirmation on Monday... :)
Confirmed the patch in comment 7 works. Though, I'm not entire sure whether it's safe to skip the NULL check of set->mode at that point. Shouldn't we keep the check there? Also, after looking over the whole code, I think it's cleaner to implement drm_crtc_helper_set_mode() to pass the new fb instead of the old fb. Then crtc->fb is changed solely in that function, and you can omit set->crtc->enabled check in drm_crtc_helper_set_config() but just calls drm_crtc_helper_set_mode() directly. Thanks.
Created attachment 45622 [details] 0000 - simplify NOFB
Created attachment 45632 [details] 0001 - Disable attached encoders for NOFB This prevents the almighty bang when set->mode is NULL...
Takashi, I'd appreciate your review on those two patches (in conjunction with the earlier path) to check that they are the simplest fix. I'm hesitating to modify the calling semantics of drm_crtc_helper_set_mode() since it is used elsewhere... And I think we now are correct, and the patch is so small...
On review, the second patch was broken and *unnecessary*. No connectors decouples all encoders, hence disabling the crtc.
Hmm, I'm slightly confused. The patch in comment 17 sets mode->set to NULL forcibly when no fb is set. But, you removed mode->set NULL check in the patch in comment 7. And if I understand correctly, mode_changed still can become true, thus it may pass NULL to drm_crtc_helper_set_mode(). Or does drm_helper_crtc_in_use() always return false in such a case? Regarding the change of drm_crtc_helper_set_mode(): I don't suggest to rewrite for the regression fixes, of course :) I suggested it in case you refactor the code, as it'd be cleaner in that way.
It's not the simplest of functions to keep track of... Let's focus on the NOFB, disabling the CRTC case. set->fb = NULL (and by convention set->mode = 0 et al) => full modeset a) traverse passed in connector list and get encoders for them => no change, set->num_connectors == 0, so new_encoder === connector->encoder (and so not reset) b) if (connector->encoder->crtc == set->crtc) connector->encoder->crtc = NULL So given set->fb == NULL, the crtc is decoupled from all encoders and so drm_crtc_is_enabled() will always return false. So in order to prevent the set->mode NULL deference we need to ensure set->fb is NULL when ever set->mode is NULL and vice versa. Slight adjustment to simplify NOFB required...
Created attachment 45642 [details] Simplify NOFB
(In reply to comment #22) > It's not the simplest of functions to keep track of... > > Let's focus on the NOFB, disabling the CRTC case. > > set->fb = NULL (and by convention set->mode = 0 et al) > > => full modeset > > a) traverse passed in connector list and get encoders for them > => no change, set->num_connectors == 0, so new_encoder === connector->encoder > (and so not reset) > > > b) if (connector->encoder->crtc == set->crtc) connector->encoder->crtc = NULL > > So given set->fb == NULL, the crtc is decoupled from all encoders and so > drm_crtc_is_enabled() will always return false. > > So in order to prevent the set->mode NULL deference we need to ensure set->fb > is NULL when ever set->mode is NULL and vice versa. > > Slight adjustment to simplify NOFB required... Thanks, this enlightens me well! Meanwhile, I tested the patch with various mode changes, and it seems working, no Oops or such. For both patches, Tested-by: Takashi Iwai <tiwai@suse.de>
*** Bug 17181 has been marked as a duplicate of this bug. ***
drm-intel-fixes: http://git.kernel.org/?p=linux/kernel/git/ickle/drm-intel.git;a=commit;h=9334ef755f060e251f3f395caeda1a58b6834ea3 http://git.kernel.org/?p=linux/kernel/git/ickle/drm-intel.git;a=commit;h=ede3ff5204b0117d00609f4980df3b864cefe96f
both merged in .38-rc4.