Bug 16488

Summary: [i915] Framebuffer ID error after suspend/hibernate leading to X crash
Product: Drivers
Reporter: Milan Bouchet-Valat (nalimilan)
Component: Video(DRI - Intel)
Assignee: drivers_video-dri-intel (drivers_video-dri-intel)
Status: CLOSED INVALID
Severity: high
CC: chris, maciej.rutecki, rjw, s_chriscollins
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.36rc3
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 7216, 16055

Description Milan Bouchet-Valat 2010-08-01 08:55:45 UTC
I've been experiencing X freezes and crashes for more than a year, and with every kernel version the cause of the bug changes. After Linus pushed 985b823b919273fe1327d56d2196b4f92e5d0fae to 2.6.35rc6 (see below [2]), I'm now getting an "invalid framebuffer id" error that kills my X server. Before that commit, I was getting an oops, which was reported in bugs.fd.o as [1].


/var/log/kern.log:
[ 1467.408347] PM: Finishing wakeup.
[ 1467.408350] Restarting tasks ... done.
[ 1467.434616] [drm:drm_mode_getfb] *ERROR* invalid framebuffer id
[ 1467.747233] sky2 0000:02:00.0: eth0: enabling interface [...]
[ 1512.204160] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[ 1512.205452] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 11072 at 11071)


At this point, the X server is killed, and won't restart:
Fatal server error:
Failed to submit batchbuffer: Input/output error
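
The "-5" returned by i915_do_wait_request and the "Input/output error" reported by the X server are plausibly the same failure seen from both sides: on Linux, errno 5 is EIO, and kernel functions conventionally return negated errno values. A minimal user-space sketch of that mapping follows; the helper names errno_from_kernel_ret and describe_kernel_ret are hypothetical and exist only for illustration.

```c
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper: kernel-style return values are negated errno
 * codes, so flip the sign to recover the errno that userspace sees.
 * Non-negative values are treated as "no error".
 */
static int errno_from_kernel_ret(int ret)
{
    return ret < 0 ? -ret : 0;
}

/* Hypothetical helper: text for a kernel-style return value, via strerror(). */
static const char *describe_kernel_ret(int ret)
{
    return strerror(errno_from_kernel_ret(ret));
}
```

Calling describe_kernel_ret(-5) returns the C library's message for EIO; the exact wording depends on the C library and locale.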


Excerpt from lspci -vnn:
00:02.1 Display controller [0380]: Intel Corporation Mobile 915GM/GMS/910GML Express Graphics Controller [8086:2792] (rev 03)
        Subsystem: Toshiba America Info Systems Device [1179:ff00]
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Region 0: Memory at 64000000 (32-bit, non-prefetchable) [disabled] [size=512K]
        Capabilities: [d0] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-


1: https://bugs.freedesktop.org/show_bug.cgi?id=26974
2:
commit 985b823b919273fe1327d56d2196b4f92e5d0fae
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Fri Jul 2 10:04:42 2010 +1000

    drm/i915: fix hibernation since i915 self-reclaim fixes

    Since commit 4bdadb9785696439c6e2b3efe34aa76df1149c83 ("drm/i915:
    Selectively enable self-reclaim"), we've been passing GFP_MOVABLE to the
    i915 page allocator where we weren't before due to some over-eager
    removal of the page mapping gfp_flags games the code used to play.

    This caused hibernate on Intel hardware to result in a lot of memory
    corruptions on resume.  See for example

      http://bugzilla.kernel.org/show_bug.cgi?id=13811
Comment 1 Chris Wilson 2010-08-01 09:23:52 UTC
May be connected with https://bugs.freedesktop.org/show_bug.cgi?id=29230
Comment 2 Andrew Morton 2010-08-02 23:55:44 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Sun, 1 Aug 2010 08:55:49 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=16488

Innocuous-looking one-liner is said to have made Milan's X server even
worse than normal.

>            Summary: [i915] Framebuffer ID error after suspend/hibernate
>                     leading to X crash
>            Product: Drivers
>            Version: 2.5
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: Video(DRI - Intel)
>         AssignedTo: drivers_video-dri-intel@kernel-bugs.osdl.org
>         ReportedBy: nalimilan@club.fr
>                 CC: chris@chris-wilson.co.uk
>         Regression: Yes
> 
> 
> I've been experiencing X freezes and crashes for more than a year, and with
> every kernel version the cause of the bug changes. After Linus pushed
> 985b823b919273fe1327d56d2196b4f92e5d0fae to 2.6.35rc6 (see below [2]), I'm
> now getting an "invalid framebuffer id" error that kills my X server. Before that
> commit, I was getting an oops, which was reported in bugs.fd.o as [1].
> 
> 
> /var/log/kern.log:
> [ 1467.408347] PM: Finishing wakeup.
> [ 1467.408350] Restarting tasks ... done.
> [ 1467.434616] [drm:drm_mode_getfb] *ERROR* invalid framebuffer id
> [ 1467.747233] sky2 0000:02:00.0: eth0: enabling interface [...]
> [ 1512.204160] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
> [ 1512.205452] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 11072 at 11071)
> 
> 
> At this point, the X server is killed, and won't restart:
> Fatal server error:
> Failed to submit batchbuffer: Input/output error
> 
> 
> Excerpt from lspci -vnn:
> 00:02.1 Display controller [0380]: Intel Corporation Mobile 915GM/GMS/910GML Express Graphics Controller [8086:2792] (rev 03)
>         Subsystem: Toshiba America Info Systems Device [1179:ff00]
>         Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>         Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Region 0: Memory at 64000000 (32-bit, non-prefetchable) [disabled] [size=512K]
>         Capabilities: [d0] Power Management version 2
>                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
>                 Status: D0 PME-Enable- DSel=0 DScale=0 PME-
> 
> 
> 1: https://bugs.freedesktop.org/show_bug.cgi?id=26974
> 2:
> commit 985b823b919273fe1327d56d2196b4f92e5d0fae
> Author: Linus Torvalds <torvalds@linux-foundation.org>
> Date:   Fri Jul 2 10:04:42 2010 +1000
> 
>     drm/i915: fix hibernation since i915 self-reclaim fixes
> 
>     Since commit 4bdadb9785696439c6e2b3efe34aa76df1149c83 ("drm/i915:
>     Selectively enable self-reclaim"), we've been passing GFP_MOVABLE to the
>     i915 page allocator where we weren't before due to some over-eager
>     removal of the page mapping gfp_flags games the code used to play.
> 
>     This caused hibernate on Intel hardware to result in a lot of memory
>     corruptions on resume.  See for example
> 
>       http://bugzilla.kernel.org/show_bug.cgi?id=13811
Comment 3 Chris Wilson 2010-08-03 08:44:22 UTC
On Mon, 2 Aug 2010 16:55:03 -0700, Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> On Sun, 1 Aug 2010 08:55:49 GMT
> bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> > https://bugzilla.kernel.org/show_bug.cgi?id=16488
> 
> Innocuous-looking one-liner is said to have made Milan's X server even
> worse than normal.

We go from a random OOPS to a consistent error (and a failing userspace).
It sounds more likely that we have uncovered a real bug, probably in
the ddx.
Comment 4 Linus Torvalds 2010-08-03 16:55:44 UTC
On Tue, Aug 3, 2010 at 12:25 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Mon, 2 Aug 2010 16:55:03 -0700, Andrew Morton <akpm@linux-foundation.org>
> wrote:
>>
>> (switched to email.  Please respond via emailed reply-to-all, not via the
>> bugzilla web interface).
>>
>> On Sun, 1 Aug 2010 08:55:49 GMT
>> bugzilla-daemon@bugzilla.kernel.org wrote:
>>
>> > https://bugzilla.kernel.org/show_bug.cgi?id=16488
>>
>> Innocuous-looking one-liner is said to have made Milan's X server even
>> worse than normal.
>
> We go from a random OOPS to a consistent error (and a failing userspace).
> It sounds more likely that we have uncovered a real bug, probably in
> the ddx.

I can't really imagine that that one-liner made the difference. Not
under any normal load. I suspect it just changes some allocation
pattern very subtly, and then the memory scribble (or whatever) that
really causes the bug perhaps changes.

The original oops reported in launchpad was

  BUG: unable to handle kernel NULL pointer dereference at 00000108
  IP: [<f8578b97>] intel_release_load_detect_pipe+0x27/0xb0 [i915]

and as far as I can tell, that's due to a load off a NULL crtc, here:

        struct drm_crtc_helper_funcs *crtc_funcs = crtc->helper_private;

the disassembly is

   0:	55                   	push   %ebp
   1:	89 e5                	mov    %esp,%ebp
   3:	83 ec 14             	sub    $0x14,%esp
   6:	89 5d f4             	mov    %ebx,-0xc(%ebp)
   9:	89 75 f8             	mov    %esi,-0x8(%ebp)
   c:	89 7d fc             	mov    %edi,-0x4(%ebp)
   f:	0f 1f 44 00 00       	nopl   0x0(%eax,%eax,1)
  14:	8b b0 ec 02 00 00    	mov    0x2ec(%eax),%esi   # crtc = encoder->crtc
  1a:	89 c3                	mov    %eax,%ebx
  1c:	8b 80 f4 02 00 00    	mov    0x2f4(%eax),%eax   # dev = encoder->dev
  22:	89 d7                	mov    %edx,%edi
  24:	89 45 f0             	mov    %eax,-0x10(%ebp)
  27:*	8b 8e 08 01 00 00    	mov    0x108(%esi),%ecx     <-- trapping instruction (crtc_funcs = crtc->helper_private)
  2d:	80 bb 04 03 00 00 00 	cmpb   $0x0,0x304(%ebx)   # intel_encoder->load_detect_temp
  34:	75 2a                	jne    0x60
  36:	0f b6 46 18          	movzbl 0x18(%esi),%eax  # crtc->enabled
  3a:	84 c0                	test   %al,%al

in case anybody cares. However, I have no idea how crtc would be NULL
in the first place there, it comes from

        struct drm_encoder *encoder = &intel_encoder->enc;
        ...
        struct drm_crtc *crtc = encoder->crtc;

and I don't know the setup code. It _does_ strike me that the C code does:

        ...
        struct drm_crtc *crtc = encoder->crtc;
        struct drm_encoder_helper_funcs *encoder_funcs = encoder->helper_private;
        struct drm_crtc_helper_funcs *crtc_funcs = crtc->helper_private;

        if (intel_encoder->load_detect_temp) {
                encoder->crtc = NULL;
                connector->encoder = NULL;
        ....

where I react to the fact that first we load "crtc = encoder->crtc"
and dereference that pointer (crtc->helper_private) without checking
whether it might be NULL, and then in some case we clear that field
(encoder->crtc = NULL), so clearly the whole "encoder->crtc" field
_can_ be NULL.

However, I don't see why it should only show up for some people...

                          Linus
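
The fault address in the oops (00000108) equals the displacement in the trapping instruction (0x108(%esi)), which is what one would expect if %esi, holding crtc, was 0 at that point. A minimal user-space sketch of that arithmetic follows. The struct fake_crtc is a hypothetical stand-in, padded so that helper_private lands at 0x108 as in the disassembly; it is not the real struct drm_crtc layout, which varies by kernel version and configuration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for struct drm_crtc. The padding is chosen only
 * so that helper_private sits at offset 0x108, matching the displacement
 * seen in the disassembly; the real structure is laid out differently.
 */
struct fake_crtc {
    unsigned char pad[0x108];
    void *helper_private;
};

/*
 * Address the CPU touches when it reads a field located field_offset
 * bytes into an object that starts at base.
 */
static uintptr_t field_address(uintptr_t base, size_t field_offset)
{
    return base + field_offset;
}
```

With a base of 0 (a NULL crtc), field_address(0, offsetof(struct fake_crtc, helper_private)) evaluates to 0x108, the address named in the "unable to handle kernel NULL pointer dereference" line.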
Comment 5 Milan Bouchet-Valat 2010-08-30 14:09:14 UTC
So, does the above comment from Linus help? As I understand it, it could be useful to check these pointers for NULL and gather more details about when it happens. Would it help if I built a kernel with a few debug lines added? If so, please give me a few hints!
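
A minimal user-space sketch of the kind of NULL check discussed in comment 4 follows. It is an illustration only, not a proposed kernel patch: mock_crtc, mock_encoder and crtc_helper_or_null are hypothetical names that mimic the encoder->crtc->helper_private chain from the quoted C code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the DRM structures quoted above. */
struct mock_crtc {
    void *helper_private;
    bool enabled;
};

struct mock_encoder {
    struct mock_crtc *crtc;   /* may be NULL when no CRTC is attached */
};

/*
 * Return crtc->helper_private, or NULL when the encoder has no CRTC.
 * The unguarded version reads crtc->helper_private directly, which
 * faults when encoder->crtc is NULL.
 */
static void *crtc_helper_or_null(const struct mock_encoder *encoder)
{
    struct mock_crtc *crtc;

    if (encoder == NULL)
        return NULL;

    crtc = encoder->crtc;
    if (crtc == NULL)         /* the case that would otherwise oops */
        return NULL;

    return crtc->helper_private;
}
```

In a debug build, the NULL branch is also where a log line could be added to record when and how often the encoder has no CRTC.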
Comment 6 Milan Bouchet-Valat 2010-09-08 20:44:57 UTC
With 2.6.36rc3, the "invalid framebuffer ID" error happens directly when resuming, rather than after a few minutes. X doesn't reappear at all, and the X server isn't killed. In dmesg, the other i915 errors from the bug description no longer appear.

All in all, the bug seems to be getting easier to reproduce reliably with every release. Does this help?
Comment 7 Milan Bouchet-Valat 2010-09-09 09:53:40 UTC
Actually, the error message even appeared without suspending today, and several times. X froze, so I restarted it, and everything is fine now (contrary to what happens when the freeze occurs after suspend).
[  553.546002] [drm:drm_mode_getfb] *ERROR* invalid framebuffer id
[  553.546014] [drm:drm_mode_getfb] *ERROR* invalid framebuffer id
[  573.370038] [drm:drm_mode_getfb] *ERROR* invalid framebuffer id
[  573.370051] [drm:drm_mode_getfb] *ERROR* invalid framebuffer id


There are other errors in dmesg. The first ones are apparently harmless (as I read on LKML), but maybe the page table error is interesting:
[   14.117320] [drm] initialized overlay support
[   14.740849] [drm:intel_calculate_wm] *ERROR* Insufficient FIFO for plane, expect flickering: entries required = 36, available = 31.
[   14.923219] Console: switching to colour frame buffer device 106x30
[   14.928031] render error detected, EIR: 0x00000010
[   14.928031] page table error
[   14.928031]   PGTBL_ER: 0x00000100
[   14.928031] [drm:i915_report_and_clear_eir] *ERROR* EIR stuck: 0x00000010, masking
[   14.928031] render error detected, EIR: 0x00000010
[   14.928031] page table error
[   14.928031]   PGTBL_ER: 0x00000100
[   14.935625] fb0: inteldrmfb frame buffer device
[   14.935627] drm: registered panic notifier
Comment 8 S. Christian Collins 2010-12-27 04:26:57 UTC
Is this bug still present in kernel 2.6.37?
Comment 9 Milan Bouchet-Valat 2010-12-30 15:08:18 UTC
It isn't! I still see some minor glyph corruption, but that's another story.

So it took approximately 10 cycles to get this fixed... ;-)
Comment 10 Chris Wilson 2010-12-30 15:30:04 UTC
(In reply to comment #9) 
> So it took approximately 10 cycles to get this fixed... ;-)

Because it wasn't a kernel bug.