Bug 15659

Summary: [Regresion] [2.6.34-rc1] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Product: Drivers
Reporter: Maciej Rutecki (maciej.rutecki)
Component: Video(DRI - Intel)
Assignee: drivers_video-dri-intel (drivers_video-dri-intel)
Status: CLOSED CODE_FIX
Severity: normal
CC: andrey.sofronov, bjorn.helgaas, chris, docekal, info, jbarnes, maciej.rutecki, matorola, mihai.dontu, radist.morse, rjw, tmezzadra, wrar, yakui.zhao
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.33
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: Acpidump from hp/compaq nx6310
Log form X.org after crash
Config from 2.6.34-rc3
Debug information
dump from '/sys/kernel/debug/dri/*'

Description Maciej Rutecki 2010-03-31 19:29:00 UTC
Subject    : [Regresion] [2.6.34-rc1] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Submitter  : Maciej Rutecki <maciej.rutecki@gmail.com>
Date       : 2010-03-25 20:04
Message-ID : 201003252104.24965.maciej.rutecki@gmail.com
References : http://marc.info/?l=linux-kernel&m=126954749618319&w=2

This entry is being used for tracking a regression from 2.6.33.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 Maciej Rutecki 2010-03-31 19:42:02 UTC
*** Bug 15660 has been marked as a duplicate of this bug. ***
Comment 2 ykzhao 2010-04-02 07:08:28 UTC
Hi, Maciej
    Could you please confirm whether the issue can be worked around by adding the boot option "pci=nocrs"?
    If so, it may be caused by no _CRS object being found under the scope of "\_SB.PCI0".
    

thanks.
Comment 3 Maciej Rutecki 2010-04-02 18:47:50 UTC
No it doesn't help:
http://marc.info/?l=linux-kernel&m=126954818419307&w=2

Regards

PS. I'm offline from Saturday to Tuesday, so I can test kernel after Wednesday.
Comment 4 Maciej Rutecki 2010-04-06 18:20:24 UTC
(In reply to comment #3)
> No it doesn't help:
[...]

Hmm. Maybe it was a coincidence; I have been running -rc3 for 5 days now and it seems that pci=nocrs helps.

Regards
Comment 5 Maciej Rutecki 2010-04-06 18:21:18 UTC
Created attachment 25884 [details]
Acpidump from hp/compaq nx6310
Comment 6 Maciej Rutecki 2010-04-07 20:03:11 UTC
Unfortunately, the problem still occurs on 2.6.34-rc3 with the pci=nocrs option. The system very often hangs while starting KDE4 (after KDM login):

Apr  7 21:54:01 gumis kernel: [   82.428051] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Apr  7 21:54:01 gumis kernel: [   82.428244] render error detected, EIR: 0x00000000
Apr  7 21:54:01 gumis kernel: [   82.428467] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 703 at 693)
Apr  7 21:54:02 gumis kdm[2265]: X server for display :0 terminated unexpectedly
Apr  7 21:54:02 gumis acpid: client 2274[0:0] has disconnected
Apr  7 21:54:02 gumis acpid: client connected from 3296[0:0]
Apr  7 21:54:02 gumis acpid: 1 client rule loaded
Apr  7 21:54:04 gumis kdm[2265]: X server died during startup
Apr  7 21:54:04 gumis kdm[2265]: X server for display :0 cannot be started, session disabled
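
The pair of numbers in the "awaiting N at M" message can be pulled out of a log mechanically when comparing several hangs. The following is only a minimal sketch: the function name and the sample text are illustrative, and reading N as the awaited request number and M as the last one reached is an interpretation of the message wording, not something stated in this report. The -5 return value matches errno 5 (EIO, "Input/output error") on Linux.

```python
import re

# Matches the driver message quoted in this report, e.g.
# "i915_do_wait_request returns -5 (awaiting 703 at 693)".
# \s* after "(" tolerates log lines that were wrapped at that point.
WAIT_RE = re.compile(
    r"i915_do_wait_request returns (-?\d+) \(\s*awaiting (\d+) at (\d+)\)"
)


def parse_wait_errors(log_text):
    """Return (return_code, awaited, current) tuples found in log_text."""
    return [tuple(int(g) for g in m.groups()) for m in WAIT_RE.finditer(log_text)]


# Sample text copied from the excerpt above.
sample = (
    "[   82.428051] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung\n"
    "[   82.428467] [drm:i915_do_wait_request] *ERROR* "
    "i915_do_wait_request returns -5 (awaiting 703 at 693)\n"
)

for code, awaited, current in parse_wait_errors(sample):
    print(f"return={code} awaited={awaited} current={current} gap={awaited - current}")
```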
Comment 7 Maciej Rutecki 2010-04-07 20:04:05 UTC
Created attachment 25907 [details]
Log form X.org after crash
Comment 8 Chris Wilson 2010-04-07 22:43:58 UTC
So it doesn't match a recent kernel regression, but it could be any number of bugs in xf86-video-intel and friends. If the hang recurs, could you please upload the /sys/kernel/debug/dri/0/i915_error_state? Then I can see if it is similar to one of the recently fixed userspace bugs.
Comment 9 Maciej Rutecki 2010-04-09 13:03:17 UTC
/sys/kernel/debug is empty

ls -la /sys/kernel/debug/
razem 0
drwxr-xr-x 2 root root 0 04-09 15:01 .
drwxr-xr-x 5 root root 0 2010-04-09  ..

Which kernel option do I have to enable?

Regards
Comment 10 Maciej Rutecki 2010-04-09 13:03:57 UTC
Created attachment 25929 [details]
Config from 2.6.34-rc3
Comment 11 Rafael J. Wysocki 2010-04-21 04:59:03 UTC
On Tuesday 20 April 2010, Maciej Rutecki wrote:
> On Tuesday, 20 April 2010 at 05:19:20 Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a summary report
> > of recent regressions.
> > 
> > The following bug entry is on the current list of known regressions
> > from 2.6.33.  Please verify if it still should be listed and let the
> >  tracking team know (either way).
> > 
> > 
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=15659
> > Subject     : [Regresion] [2.6.34-rc1] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
> > Submitter   : Maciej Rutecki <maciej.rutecki@gmail.com>
> > Date                : 2010-03-25 20:04 (26 days old)
> > Message-ID  : <201003252104.24965.maciej.rutecki@gmail.com>
> > References  : http://marc.info/?l=linux-kernel&m=126954749618319&w=2
> > 
> 
> Bug still exists in 2.6.34-rc4
Comment 12 Bjorn Helgaas 2010-05-04 22:36:33 UTC
Oops, I think we dropped the ball on this one.

It looks like CONFIG_DEBUG_FS=y is what you need for /sys/kernel/debug/dri/0/i915_error_state, and your config in comment #10 has that set.  Make sure you have debugfs mounted with "mount -t debugfs none /sys/kernel/debug/".

Can you reproduce this reliably?

Do you have a reliable workaround?

Please attach a dmesg log from current kernel (e.g., 34-rc6) where the problem occurs.  I don't see a "pci=use_crs"-related problem in the old logs, but there have been several fixes in that area, so let's try a current kernel just in case it's related.
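
An empty /sys/kernel/debug usually just means debugfs is not mounted. As a rough sketch (the function name is made up for illustration), the check can be scripted by looking for a debugfs entry in /proc/mounts-style text:

```python
def debugfs_mount_points(mounts_text):
    """Return mount points of type debugfs from /proc/mounts-style text."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[2] == "debugfs":
            points.append(fields[1])
    return points


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as f:
            found = debugfs_mount_points(f.read())
    except OSError:
        # Not on Linux, or /proc is unavailable.
        found = []
    if found:
        print("debugfs mounted at:", ", ".join(found))
    else:
        print("debugfs not mounted; try: mount -t debugfs none /sys/kernel/debug/")
```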
Comment 13 Maciej Rutecki 2010-05-05 14:24:11 UTC
(In reply to comment #12)
> Oops, I think we dropped the ball on this one.
> 
> It looks like CONFIG_DEBUG_FS=y is what you need for
> /sys/kernel/debug/dri/0/i915_error_state, and your config in comment #10 has
> that set.  Make sure you have debugfs mounted with "mount -t debugfs none
> /sys/kernel/debug/".
> 
> Can you reproduce this reliably?

I will try with -rc6

> 
> Do you have a reliable workaround?

Yes: i915.modeset=0 (disable framebuffer) has helped since -rc1.

Regards
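
When reporting which workarounds were active for a given boot, it helps to read them back from the kernel command line rather than from memory. A small sketch, assuming a simple space-separated command line (quoted values containing spaces are not handled); the helper name and the example string are illustrative only:

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


# Example string only; on a running system read /proc/cmdline instead.
example = "root=/dev/sda1 ro i915.modeset=0 pci=nocrs quiet"
params = parse_cmdline(example)

for option in ("i915.modeset", "pci"):
    print(option, "=", params.get(option, "<not set>"))
```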
Comment 14 Maciej Rutecki 2010-05-10 14:09:29 UTC
-rc6 seems to be very stable. I will test -rc7 and report how it works. If it is OK, I will close the bug.

Regards
Comment 15 Maciej Rutecki 2010-05-10 20:31:47 UTC
In -rc7 the bug still exists:
May 10 22:01:01 gumis kernel: [   93.940050] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
May 10 22:01:01 gumis kernel: [   93.940257] render error detected, EIR: 0x00000000
May 10 22:01:01 gumis kernel: [   93.940486] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 579 at 577)
May 10 22:01:02 gumis kdm[2370]: X server for display :0 terminated unexpectedly
Comment 16 Maciej Rutecki 2010-05-10 20:33:10 UTC
Created attachment 26323 [details]
Debug information
Comment 17 Morse 2010-05-21 20:56:12 UTC
CC
Same bug.

May 21 22:32:10 morsebook kernel: [ 3301.783038] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
May 21 22:32:10 morsebook kernel: [ 3301.783198] render error detected, EIR: 0x00000000
May 21 22:32:10 morsebook kernel: [ 3301.783222] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 520648 at 520645)
May 21 22:32:11 morsebook gdm[6392]: WARNING: gdm_slave_xioerror_handler: Fatal error X - Restart :0

(the last line is a guess - I have a localized version)

freshly compiled 2.6.34

Can provide some info, if you tell me what to do - I never participated in kernel debugging before.
Comment 18 Michal Docekal 2010-05-25 07:49:12 UTC
I can confirm this bug on my netbook with Mobile 945GME as well:

May 25 07:45:00 basilisk kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
May 25 07:45:00 basilisk kernel: render error detected, EIR: 0x00000000
May 25 07:45:00 basilisk kernel: [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 20046 at 20044)

Happens on freshly compiled 2.6.34 with i915 KMS enabled. In my case, it happens mainly after manipulating the video outputs (LVDS1 and VGA1) with xrandr: cycling them on and off, or switching between them.

In my case, X11 doesn't go down, but the image on screen freezes and doesn't change whatever I do. I do, however, see the mouse cursor moving. I can switch to the console (where everything is fine) and restart X, but that does not help (black screen with a moving cursor); the only thing I can do is reboot.

I don't know if it's connected in any way, but suspending and resuming with external VGA on results in violent screen flickering. Cycling the external VGA output off and back on helps.

Anyway, I'm willing to provide any assistance necessary, test new patches, etc.
Comment 19 Anatoly Pugachev 2010-05-26 13:03:07 UTC
Cross-linking https://bugzilla.redhat.com/show_bug.cgi?id=573177 since the bug seems to be the same.
Comment 20 James Ettle 2010-05-26 14:34:17 UTC
Seeing this on a Dell Optiplex 760 (BIOS A00) with Intel X4500 graphics (Q45/Q43, 8086:2e12), since about 2.6.32. Occasionally, X restarts with

  Fatal server error:
  Failed to submit batchbuffer: Input/output error

in its logs, accompanied by the GPU hung messages. Intervals between this happening vary from minutes to days. The monitor is on a standard VGA outlet.

Conversely, I've *never* seen this on my laptop with Intel X3100 graphics and the same software.
Comment 21 Mihai Donțu 2010-06-02 16:05:41 UTC
It happens on my fairly old DELL Latitude D520. Works flawlessly on 2.6.33 but hangs every now and then on 2.6.34. I'm waiting for a new xorg-driver-intel though, maybe that will take care of it.

From syslog:
kernel: [18151.450071] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung                        
kernel: [18151.450198] render error detected, EIR: 0x00000000                                                          
kernel: [18151.450265] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 1108629 at 1108626)

Fortunately, I can switch to a console and initiate a cold reboot.
Comment 22 Mihai Donțu 2010-06-03 13:54:48 UTC
Created attachment 26633 [details]
dump from '/sys/kernel/debug/dri/*'

This archive contains everything I was able to dump from /sys/kernel/debug/dri/* and the dmesg. The last two warnings from dmesg are because I tried to dump /sys/kernel/debug/dri/*/vma and the kernel did not agree with me.
Comment 23 Chris Wilson 2010-06-06 12:12:24 UTC
Mihai, the cause for your hangs appears to be a missing cache-line in the batch buffers:
...
0x0954a1ac:      0x7d800003: 3DSTATE_DRAWING_RECTANGLE
0x0954a1b0:      0x00000000:    dword 1
0x0954a1b4:      0x00000000:    dword 2
0x0954a1b8:      0x041a0578:    dword 3
0x0954a1bc:      0x00000000:    dword 4
0x0954a1c0:      0x00000000: MI_NOOP <-- start
0x0954a1c4:      0x00000000: MI_NOOP
0x0954a1c8:      0x00000000: MI_NOOP
0x0954a1cc:      0x00000000: MI_NOOP
0x0954a1d0:      0x00000000: MI_NOOP
0x0954a1d4:      0x00000000: MI_NOOP
0x0954a1d8:      0x00000000: MI_NOOP
0x0954a1dc:      0x09541eb0: MI_LOAD_SCAN_LINES_INCL
Bad length (50) in MI_LOAD_SCAN_LINES_INCL, [2, 2]
0x0954a1e0: HEAD 0x00000000:    dword 1
0x0954a1e4:      0x00000000:    dword 2
0x0954a1e8:      0x00000000:    dword 3
0x0954a1ec:      0x00000000:    dword 4
0x0954a1f0:      0x00000000:    dword 5
0x0954a1f4:      0x00000000:    dword 6
0x0954a1f8:      0x00000000:    dword 7
0x0954a1fc:      0x00000000:    dword 8
0x0954a200:      0x00028566:    dword 9 <-- end
0x0954a204:      0x6ba00966:    dword 10
...
0x0954aebc:      0x7d040031: 3DSTATE_LOAD_STATE_IMMEDIATE_1
0x0954aec0:      0x09542b10:    S0
0x0954aec4:      0x00000000:    S1   <--- start
0x0954aec8:      0x00000000: MI_NOOP
0x0954aecc:      0x00000000: MI_NOOP
0x0954aed0:      0x00000000: MI_NOOP
0x0954aed4:      0x00000000: MI_NOOP
0x0954aed8:      0x00000000: MI_NOOP
0x0954aedc:      0x00000000: MI_NOOP
0x0954aee0:      0x00000000: MI_NOOP
0x0954aee4:      0x00000000: MI_NOOP
0x0954aee8:      0x00000000: MI_NOOP
0x0954aeec:      0x00000000: MI_NOOP
0x0954aef0:      0x00000000: MI_NOOP
0x0954aef4:      0x00000000: MI_NOOP
0x0954aef8:      0x00000000: MI_NOOP
0x0954aefc:      0x00000000: MI_NOOP <-- end
0x0954af00:      0x03402000: MI UNKNOWN
0x0954af04:      0x01000000: MI_USER_INTERRUPT

So this appears to be a bad interaction between the kernel driver and the h/w.
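
The "missing cache-line" reading can be sanity-checked with a little address arithmetic on the marked regions. This sketch assumes a 64-byte cache line, which is not stated in this thread; the helper name is illustrative.

```python
CACHE_LINE_BYTES = 64  # assumption: not stated in the thread


def cache_line_base(addr, line=CACHE_LINE_BYTES):
    """Round addr down to the start of its cache line (line must be a power of two)."""
    return addr & ~(line - 1)


# Addresses copied from the "<-- start" / "<-- end" markers in the dump above.
regions = [
    (0x0954A1C0, 0x0954A200),
    (0x0954AEC4, 0x0954AEFC),
]

for start, end in regions:
    print(
        f"start={start:#010x} end={end:#010x} "
        f"span={end - start} bytes "
        f"line_of_start={cache_line_base(start):#010x} "
        f"line_of_end={cache_line_base(end):#010x}"
    )
```

Under that assumption, the first marked span is exactly 64 bytes and begins on a 64-byte boundary, and both markers of the second region fall inside the single line that starts at 0x0954aec0.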
Comment 24 Rafael J. Wysocki 2010-06-13 12:02:40 UTC
Handled-By : Chris Wilson <chris@chris-wilson.co.uk>
Comment 25 James Ettle 2010-06-15 10:47:19 UTC
(In reply to comment #20)
> Seeing this on a Dell Optiplex 760 (BIOS A00) with Intel X4500 graphics
> (Q45/Q43, 8086:2e12), since about 2.6.32.

Since upgrading to Fedora 13 (xorg-x11-drv-intel-2.11.0-4.fc13.x86_64, kernel-2.6.33.5-112.fc13.x86_64, libdrm-2.4.20-1.fc13.x86_64), I've not seen this yet on the above-mentioned workstation.
Comment 26 Mihai Donțu 2010-06-15 11:54:44 UTC
I also suspected the xorg driver was at fault when I saw these in '/var/log/kdm.log':

intel_bufmgr_gem.c:1234: Error setting memory domains 346 (00000040 00000000): Input/output error
X: intel_bufmgr_gem.c:900: drm_intel_gem_bo_unreference_locked_timed: Assertion `((&bo_gem->refcount)->atomic) > 0' failed.

xf86-video-intel: 2.11.0
libdrm: 2.4.20

I'm running the 2.6.33 kernel now and while the problem still reproduces, it does so far less often.
Comment 27 Rafael J. Wysocki 2010-06-15 12:56:04 UTC
OK, since the problem is reproducible with 2.6.33, it certainly is not a regression from that kernel, so dropping from the list of recent regressions.
Comment 28 Alois Nespor 2010-07-05 22:24:34 UTC
same problem - kernel 2.6.34.1, xf86-video-intel 2.12, xorg-server 1.8.2, libdrm 2.4.21.

X4500MHD (G45)
Comment 29 Andrey Rahmatullin 2010-07-08 09:02:23 UTC
The crash is 100% reproducible here, by switching the wallpaper in KDE4.
It's an Eee PC 901, i945GME I suppose. Kernel is 2.6.34 vanilla, software is from Debian testing (or unstable, they are the same versions now: xorg-server 1.7.7, intel driver 2.9.1, Mesa 7.7.1, libdrm 2.4.18). I first noticed this on Jun 4, so it was reproducible with the respective software versions.

The kernel says

[  329.050105] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[  329.050413] render error detected, EIR: 0x00000000
[  329.050606] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 6890 at 6839)

The X says

Fatal server error:
Failed to submit batchbuffer: Input/output error
X: ../../src/i830_batchbuffer.h:79: intel_batch_emit_dword: Assertion `pI830->batch_ptr != ((void *)0)' failed.

The X driver message turns up very often on Google, including in bug reports in at least 3 distro bug trackers, though I'm not sure it was ever reported upstream.
Comment 30 Anatoly Pugachev 2010-07-08 09:31:06 UTC
you can run "x11perf -copywinwin500" as a test case, not depending on DE (like kde or gnome)
Comment 31 Andrey Rahmatullin 2010-07-08 09:48:02 UTC
(In reply to comment #30)
> you can run "x11perf -copywinwin500" as a test case, not depending on DE
> (like
> kde or gnome)

It didn't crash, though apparently not all wallpaper changes trigger the crash (maybe switching to a 800x600 wallpaper will trigger it with less probability, maybe that's just coincidence).
Comment 32 Jesse Barnes 2010-07-23 20:23:00 UTC
Should be fixed by Dave's last pull request:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=944001201ca0196bcdb088129e5866a9f379d08c
Comment 33 Morse 2010-07-24 19:51:35 UTC
So, will it be in 2.6.34.2?
Comment 34 Chris Wilson 2010-07-24 19:54:38 UTC
It's earmarked for stable, so it should be included in the next 2.6.34.y release.
Comment 35 Alois Nespor 2010-07-26 09:48:08 UTC
problem persist with G45 @ kernel 2.6.35-rc6.

Jul 26 11:39:50 lenovo kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Jul 26 11:39:50 lenovo kernel: [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 39243 at 39238)
Jul 26 11:39:50 lenovo kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Jul 26 11:39:50 lenovo kernel: [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 39253 at 39238)
Comment 36 Chris Wilson 2010-07-26 10:07:40 UTC
(In reply to comment #35)
> problem persist with G45 @ kernel 2.6.35-rc6.

Unlikely to be the same problem as the original was a gen3 specific bug and you have a gen4. More likely you have hit one of the countless userspace bugs, but that is impossible to tell since you haven't once uploaded any debug information that is necessary to diagnose your bug.

Please open a new bug report and attach dmesg (with drm.debug=4), Xorg.0.log and /sys/kernel/debug/dri/0/i915_error_state [following a hang].
Comment 37 Andrey Sofronov 2010-07-26 10:27:03 UTC
With latest (26-07-2010) git it still crashes:
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 355446 at 355443)

# lspci
00:00.0 (8086:2560) Host bridge: Intel Corporation 82845G/GL[Brookdale-G]/GE/PE DRAM Controller/Host-Hub Interface (rev 01)
00:02.0 (8086:2562) VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)
Comment 38 Chris Wilson 2010-07-26 10:37:37 UTC
(In reply to comment #37)
> With latest (26-07-2010) git it still crashes:
> [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
> [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting
> 355446 at 355443)
> 
> # lspci
> 00:00.0 (8086:2560) Host bridge: Intel Corporation
> 82845G/GL[Brookdale-G]/GE/PE
> DRAM Controller/Host-Hub Interface (rev 01)
> 00:02.0 (8086:2562) VGA compatible controller: Intel Corporation
> 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)

This is a completely different bug, most likely the 845G h/w erratum that we fail to take account of or the general gen2 incoherency issues. Impossible to tell without the debug information, and the conflation of bug reports does not help.
Comment 39 Alois Nespor 2010-07-26 11:08:24 UTC
(In reply to comment #36)
> (In reply to comment #35)
> > problem persist with G45 @ kernel 2.6.35-rc6.
> 
> Unlikely to be the same problem as the original was a gen3 specific bug and
> you
> have a gen4. More likely you have hit one of the countless userspace bugs,
> but
> that is impossible to tell since you haven't once uploaded any debug
> information that is necessary to diagnose your bug.
> 
> Please open a new bug report and attach dmesg (with drm.debug=4), Xorg.0.log
> and /sys/kernel/debug/dri/0/i915_error_state [following a hang].

OK, thank you.
Comment 40 Morse 2010-07-26 14:20:09 UTC
I've been running 2.6.35-rc6 on 945GME for some time now.

Nothing has happened so far, so the fix certainly addresses at least some of the issues.
Comment 41 Maciej Rutecki 2010-08-11 18:14:18 UTC
Confirmed; 2.6.35 seems to be very stable.

Regards