Bug 12899

Summary: Crash in i915.ko: i915_driver_irq_handler
Product: Drivers Reporter: Helge Bahmann (helge.bahmann)
Component: Video(DRI - Intel)Assignee: drivers_video-dri-intel (drivers_video-dri-intel)
Status: RESOLVED CODE_FIX    
Severity: high CC: akpm, eric, gordon.jin, jbarnes, rjw
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.29-rc8 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 12398    

Description Helge Bahmann 2009-03-20 07:13:06 UTC
linux-2.6.29-rc8, no kernel modesetting, X server running with legacy DRI

I observe frequent kernel NULL pointer dereferences in i915_driver_irq_handler while switching from a running X server back into the VGA text console. Machine locks hard very shortly afterwards, so all backtraces via serial console are truncated, the last messages are at best:

[drm:gm45_get_vblank_counter] *ERROR* trying to get vblank count for disabled pipe 0
BUG: unable to handle kernel NULL pointer dereference at 00000084
IP: [<f90b736b>] i915_driver_irq_handler+0x135/0x1b7 [i915]

From the disassembly I guess that the culprit is:

	if (dev->primary->master) {
		master_priv = dev->primary->master->driver_priv;
		if (master_priv->sarea_priv)
			master_priv->sarea_priv->last_dispatch =
				READ_BREADCRUMB(dev_priv);    <---- CRASH
	}

and after applying the following small patch:

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 87b6b60..d7fe821 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -249,7 +249,7 @@ irqreturn_t i915_driver_irq_handler(DRM_IRQ_ARGS)
 
 		if (dev->primary->master) {
 			master_priv = dev->primary->master->driver_priv;
-			if (master_priv->sarea_priv)
+			if (master_priv->sarea_priv && dev_priv->hw_status_page)
 				master_priv->sarea_priv->last_dispatch =
 					READ_BREADCRUMB(dev_priv);
 		}


the problem goes away. The patch is most certainly *wrong*, but that it hides the problem hints that there must somewhere be a race between clearing out "->hw_status_page" and the interrupt handler. Someone with better understanding of the driver should have a look.
Comment 1 Eric Anholt 2009-03-20 13:06:45 UTC
Looks sane to me.  Could you send the patch to lkml or dri-devel@lists.sf.net with a commit message and Signed-off-by:?
Comment 2 Helge Bahmann 2009-03-22 06:38:56 UTC
Are you sure this is sane? I mean, i915_gem_cleanup_hws sets dev_priv->hw_status_page=NULL without any locks held (that I am aware of, at least), and the irq handler is accessing the value without any locks either. The check for dev_priv->hw_status_page!=0 added into the irq handler IMHO just "shortens" the race window to the few instructions between performing the test and the subsequent READ_BREADCRUMB and thus "hides" the problem better.

Is there any guard between i915_gem_cleanup_hws and the irq handler racing which I am missing so far?
Comment 3 Andrew Morton 2009-03-30 22:00:37 UTC
afacit this hasn't been fixed, and we presumably have a regression
in 2.6.29?
Comment 4 Helge Bahmann 2009-04-01 11:40:40 UTC
To my knowledge, the code is unchanged, but maybe I can invest some time into investigating further tomorrow to produce a patch I am more confident of
Comment 5 Rafael J. Wysocki 2009-04-26 11:21:27 UTC
Notify-Also : DRI <dri-devel@lists.sourceforge.net>
Comment 6 Jesse Barnes 2009-04-27 16:54:54 UTC
Another possibility would be to move the gem_cleanup_hws call to occur after the interrupt handler is removed.  That should ensure we don't get any interrupts before clearing the hardware status page pointer.
Comment 7 Gordon Jin 2009-09-17 07:50:07 UTC
Does this issue still exist in 2.6.31?
Comment 8 Jesse Barnes 2009-09-17 15:43:16 UTC
This one is fixed.
Comment 9 Gordon Jin 2009-09-18 02:31:21 UTC
closing then