linux-2.6.29-rc8, no kernel modesetting, X server running with legacy DRI I observe frequent kernel NULL pointer dereferences in i915_driver_irq_handler while switching from a running X server back into the VGA text console. Machine locks hard very shortly afterwards, so all backtraces via serial console are truncated, the last messages are at best: [drm:gm45_get_vblank_counter] *ERROR* trying to get vblank count for disabled pipe 0 BUG: unable to handle kernel NULL pointer dereference at 00000084 IP: [<f90b736b>] i915_driver_irq_handler+0x135/0x1b7 [i915] From the disassembly I guess that the culprit is: if (dev->primary->master) { master_priv = dev->primary->master->driver_priv; if (master_priv->sarea_priv) master_priv->sarea_priv->last_dispatch = READ_BREADCRUMB(dev_priv); <---- CRASH } and after applying the following small patch: diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c index 87b6b60..d7fe821 100644 --- a/drivers/gpu/drm/i915/i915_irq.c +++ b/drivers/gpu/drm/i915/i915_irq.c @@ -249,7 +249,7 @@ irqreturn_t i915_driver_irq_handler(DRM_IRQ_ARGS) if (dev->primary->master) { master_priv = dev->primary->master->driver_priv; - if (master_priv->sarea_priv) + if (master_priv->sarea_priv && dev_priv->hw_status_page) master_priv->sarea_priv->last_dispatch = READ_BREADCRUMB(dev_priv); } the problem goes away. The patch is most certainly *wrong*, but that it hides the problem hints that there must somewhere be a race between clearing out "->hw_status_page" and the interrupt handler. Someone with better understanding of the driver should have a look.
Looks sane to me. Could you send the patch to lkml or dri-devel@lists.sf.net with a commit message and Signed-off-by:?
Are you sure this is sane? I mean, i915_gem_cleanup_hws sets dev_priv->hw_status_page=NULL without any locks held (that I am aware of, at least), and the irq handler is accessing the value without any locks either. The check for dev_priv->hw_status_page!=0 added into the irq handler IMHO just "shortens" the race window to the few instructions between performing the test and the subsequent READ_BREADCRUMB and thus "hides" the problem better. Is there any guard between i915_gem_cleanup_hws and the irq handler racing which I am missing so far?
afacit this hasn't been fixed, and we presumably have a regression in 2.6.29?
To my knowledge, the code is unchanged, but maybe I can invest some time into investigating further tomorrow to produce a patch I am more confident of
Notify-Also : DRI <dri-devel@lists.sourceforge.net>
Another possibility would be to move the gem_cleanup_hws call to occur after the interrupt handler is removed. That should ensure we don't get any interrupts before clearing the hardware status page pointer.
Does this issue still exist in 2.6.31?
This one is fixed.
closing then