Bug 13819

Summary: system freeze when switching to console
Product: Drivers
Reporter: Reinette Chatre (reinette.chatre)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Severity: normal
CC: akpm, rjw
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.31-rc3
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 13615

Description Reinette Chatre 2009-07-23 17:57:33 UTC
I have a Sony Vaio VGN-Z540 laptop. This machine was able to suspend/resume without problems in 2.6.31-rc2, but in 2.6.31-rc3 it started freezing when I try to suspend. The system freezes with a blank screen and the keyboard LEDs start blinking. I have to use the system power button to shut it down. The problem is not limited to suspend, though: the same type of freeze occurs when I try to switch from X to a console (ctrl-alt-f1). The same problem exists in 2.6.31-rc4 (head 4be3bd7849165e7efa6b0b35a23d6a3598d97465). When I boot without gdm and trigger suspend with "echo disk > /sys/power/state" it works fine.

I was able to bisect the problem to the commit below:

commit 44c695b13bee558c73a89bc79f6253a4ba637386
Merge: eee33ab... 0611254...
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Fri Jul 10 19:14:48 2009 -0700

    Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6
    * 'linux-next' of git://git.infradead.org/ubifs-2.6:
      UBIFS: fix corruption dump
      UBIFS: clean up free space checking
      UBIFS: small amendments in the LEB scanning code
      UBIFS: dump a little more in case of corruptions
      MAINTAINERS: update ahunter's e-mail address
      UBIFS: allow more than one volume to be mounted
      UBIFS: fix assertion warning
      UBIFS: minor spelling and grammar fixes
      UBIFS: fix 64-bit divisions in debug print
      UBIFS: few spelling fixes
      UBIFS: set write-buffer timout to 3-5 seconds
      UBIFS: slightly optimize write-buffer timer usage
      UBIFS: improve debugging messaged
      UBIFS: fix integer overflow warning

I don't know how accurate it is, though. As a sanity check I tested a tree with this commit as HEAD and then the same tree moved back to HEAD^. The problem was present in both.

Comment 1 Andrew Morton 2009-07-23 18:06:09 UTC
The commit result seems unlikely to be correct.  Are you actually using UBIFS?
Comment 2 Reinette Chatre 2009-07-23 18:09:48 UTC
I thought the same thing, and that is why I did that sanity check (which failed). My kernel is not compiled with CONFIG_UBIFS_FS.
Comment 3 Reinette Chatre 2009-07-23 18:22:20 UTC
I am starting a fresh bisect now. I will report back the results when it is complete.
Comment 4 Reinette Chatre 2009-07-23 23:47:14 UTC
I think I know why my previous bisect result was wrong. During the bisect I had to test a revision that did not boot on my system, so I guessed it as "bad" and proceeded. At the next bisect step the kernel version changed to 2.6.31-rc1, and since I knew that 2.6.31-rc2 worked, I changed my guess to "good" to get bisect to go back to 2.6.31-rc2. This was wrong: I see now that the new bisect also has a kernel version of 2.6.31-rc1 even though 2.6.31-rc2 works fine. It must have something to do with how the trees are merged.

Anyway, I reran the bisect with a different good commit and was able to get a reliable first bad commit. This commit makes more sense as I am using i915.

commit 6ff4fd05676bc5b5c930bef25901e489f7843660
Author: ling.ma@intel.com <ling.ma@intel.com>
Date:   Thu Jun 25 10:59:22 2009 +0800

    drm/i915: Set SSC frequency for 8xx chips correctly
    All 8xx class chips have the 66/48 split, not just 855.
    Signed-off-by: Ma Ling <ling.ma@intel.com>
    Reviewed-by: Jesse Barnes <jbarnes@virtuousgeek.org>
    Signed-off-by: Eric Anholt <eric@anholt.net>

My device:
00:02.0 0300: 8086:2a42 (rev 07)
        Subsystem: 104d:9025
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 31
        Region 0: Memory at e8400000 (64-bit, non-prefetchable) [size=4M]
        Region 2: Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Region 4: I/O ports at 8130 [size=8]
        Capabilities: <access denied>
Comment 5 Andrew Morton 2009-07-23 23:59:07 UTC
OK, thanks, I'll reassign it to DRI.

This is a post-2.6.31-rc2 regression.
Comment 6 Reinette Chatre 2009-07-24 16:01:05 UTC
I am now trying to use 2.6.31-rc4 but would like to use X also. I thus reverted this commit from 2.6.31-rc4 but the freeze problem still exists. The above commit was identified clearly in a git-bisect that ran without problems, but reverting it does not fix the issue. This is weird. Is there a way in which I can obtain any logs to help debug this?
Comment 7 Reinette Chatre 2009-07-29 22:05:41 UTC
This problem has me very confused. I have tried bisecting it three times
now and every time I end up with a patch from a merge of the
'drm-intel-next' branch, but I do not always get the same commit from
that branch as the "first bad commit". I went ahead and "rolled my own"
bisect by using rc4 and reverting all the patches from that branch
merge. As expected, that gave me a working setup again. I then did a
manual bisect and found that "drm/i915: enable error detection & state
collection" was the bad commit. I wanted to confirm this with a sanity
check, but could not revert it on its own; I had to revert the following
to get a working setup based off the current linux-2.6:

drm/i915: Don't update display FIFO watermark on IGDNG
drm/i915: add FIFO watermark support
drm/i915: enable error detection & state collection
Comment 8 Reinette Chatre 2009-08-17 16:30:44 UTC
Same problem in 2.6.31-rc6. Unfortunately the patches I previously reverted to get a working system do not revert cleanly anymore.
Comment 9 Rafael J. Wysocki 2009-08-20 14:54:11 UTC
On Thursday 20 August 2009, reinette chatre wrote:
> On Wed, 2009-08-19 at 13:26 -0700, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> > 
> > The following bug entry is on the current list of known regressions
> > from 2.6.30.  Please verify if it still should be listed and let me know
> > (either way).
> > 
> > 
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=13819
> > Subject             : system freeze when switching to console
> > Submitter   : Reinette Chatre <reinette.chatre@intel.com>
> > Date                : 2009-07-23 17:57 (28 days old)
> This issue is still present in 2.6.31-rc6. Unfortunately the patches I
> reverted to get a working system does not revert cleanly anymore.
Comment 10 Reinette Chatre 2009-09-10 22:22:54 UTC
This is fixed by:

commit e6890f6f3dc2d9024a08b1a149d9bd5208eea350
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Tue Sep 8 17:09:24 2009 -0700

    i915: disable interrupts before tearing down GEM state
    Reinette Chatre reports a frozen system (with blinking keyboard LEDs)
    when switching from graphics mode to the text console, or when
    suspending (which does the same thing). With netconsole, the oops
    turned out to be
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000084
        IP: [<ffffffffa03ecaab>] i915_driver_irq_handler+0x26b/0xd20 [i915]
    and it's due to the i915_gem.c code doing drm_irq_uninstall() after
    having done i915_gem_idle(). And the i915_gem_idle() path will do
      i915_gem_idle() ->
        i915_gem_cleanup_ringbuffer() ->
          i915_gem_cleanup_hws() ->
            dev_priv->hw_status_page = NULL;
    but if an i915 interrupt comes in after this stage, it may want to
    access that hw_status_page, and gets the above NULL pointer dereference.
    And since the NULL pointer dereference happens from within an interrupt,
    and with the screen still in graphics mode, the common end result is
    simply a silently hung machine.
    Fix it by simply uninstalling the irq handler before idling rather than
    after. Fixes
    Reported-and-tested-by: Reinette Chatre <reinette.chatre@intel.com>
    Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>