Bug 15659
Summary: | [Regresion] [2.6.34-rc1] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung | ||
---|---|---|---|
Product: | Drivers | Reporter: | Maciej Rutecki (maciej.rutecki) |
Component: | Video(DRI - Intel) | Assignee: | drivers_video-dri-intel (drivers_video-dri-intel) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | andrey.sofronov, bjorn.helgaas, chris, docekal, info, jbarnes, maciej.rutecki, matorola, mihai.dontu, radist.morse, rjw, tmezzadra, wrar, yakui.zhao |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.33 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
Acpidump from hp/compaq nx6310
Log form X.org after crash
Config from 2.6.34-rc3
Debug information
dump from '/sys/kernel/debug/dri/*' |
Description
Maciej Rutecki
2010-03-31 19:29:00 UTC
*** Bug 15660 has been marked as a duplicate of this bug. ***

Hi, Maciej
Will you please confirm whether the issue can be worked around by adding the boot option "pci=nocrs"? If so, maybe it is caused by no _CRS object being found under the scope of "\_SB.PCI0".
Thanks.

No, it doesn't help: http://marc.info/?l=linux-kernel&m=126954818419307&w=2
Regards
PS. I'm offline from Saturday to Tuesday, so I can test the kernel after Wednesday.

(In reply to comment #3)
> No it doesn't help: [...]
Hmm. Maybe it was a coincidence; I have already been running -rc3 for 5 days and it seems that pci=nocrs helps.
Regards

Created attachment 25884 [details]
Acpidump from hp/compaq nx6310
Unfortunately, the problem still occurs on 2.6.34-rc3 with the pci=nocrs option. The system very often hangs while starting KDE4 (after KDM login):
Apr 7 21:54:01 gumis kernel: [ 82.428051] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Apr 7 21:54:01 gumis kernel: [ 82.428244] render error detected, EIR: 0x00000000
Apr 7 21:54:01 gumis kernel: [ 82.428467] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 703 at 693)
Apr 7 21:54:02 gumis kdm[2265]: X server for display :0 terminated unexpectedly
Apr 7 21:54:02 gumis acpid: client 2274[0:0] has disconnected
Apr 7 21:54:02 gumis acpid: client connected from 3296[0:0]
Apr 7 21:54:02 gumis acpid: 1 client rule loaded
Apr 7 21:54:04 gumis kdm[2265]: X server died during startup
Apr 7 21:54:04 gumis kdm[2265]: X server for display :0 cannot be started, session disabled

Created attachment 25907 [details]
Log form X.org after crash
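Every report in this thread shows the same kernel message pair, so it can help to pull the numbers out of a syslog mechanically before comparing hangs. The following is a minimal Python sketch, not part of the original report. The names `WaitError`, `parse_wait_error` and `hang_reports` are hypothetical, and reading the two numbers as "awaited" and "current" sequence numbers is an assumption based only on the wording of the message. The return value -5 is EIO on Linux, which matches the "Input/output error" text X prints elsewhere in this report.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional

# Matches the i915_do_wait_request error line quoted in this report.
# The optional whitespace after "(" covers the "( awaiting 579 at 577)" variant.
WAIT_RE = re.compile(
    r"i915_do_wait_request returns (-?\d+) \(\s*awaiting (\d+) at (\d+)\)"
)


class WaitError(NamedTuple):
    ret: int       # value returned by i915_do_wait_request (-5 in every log here)
    awaited: int   # first number in the message (assumed: awaited sequence number)
    current: int   # second number in the message (assumed: sequence number reached)

    @property
    def gap(self) -> int:
        """Difference between the awaited and the reached number."""
        return self.awaited - self.current


def parse_wait_error(line: str) -> Optional[WaitError]:
    """Return the numbers from one log line, or None if the line does not match."""
    m = WAIT_RE.search(line)
    if m is None:
        return None
    ret, awaited, current = (int(g) for g in m.groups())
    return WaitError(ret, awaited, current)


def hang_reports(lines: Iterable[str]) -> Iterator[WaitError]:
    """Yield a WaitError for every matching line in a log."""
    for line in lines:
        err = parse_wait_error(line)
        if err is not None:
            yield err
```

Feeding it a saved log, for example `list(hang_reports(open("syslog")))`, gives one entry per hang, which makes it easier to see whether the gap between the two numbers is similar across machines.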
So it doesn't match a recent kernel regression, but it could be any number of bugs in xf86-video-intel and friends. If the hang reoccurs, could you please upload /sys/kernel/debug/dri/0/i915_error_state? Then I can see if it is similar to one of the recently fixed userspace bugs.

/sys/kernel/debug is empty:
ls -la /sys/kernel/debug/
razem 0
drwxr-xr-x 2 root root 0 04-09 15:01 .
drwxr-xr-x 5 root root 0 2010-04-09 ..
Which kernel option do I have to enable?
Regards

Created attachment 25929 [details]
Config from 2.6.34-rc3
On Tuesday 20 April 2010, Maciej Rutecki wrote:
> On Tuesday, 20 April 2010 at 05:19:20 Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a summary report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.33. Please verify if it still should be listed and let the
> > tracking team know (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15659
> > Subject : [Regresion] [2.6.34-rc1] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
> > Submitter : Maciej Rutecki <maciej.rutecki@gmail.com>
> > Date : 2010-03-25 20:04 (26 days old)
> > Message-ID : <201003252104.24965.maciej.rutecki@gmail.com>
> > References : http://marc.info/?l=linux-kernel&m=126954749618319&w=2
> >
>
> Bug still exists in 2.6.34-rc4
Oops, I think we dropped the ball on this one.
It looks like CONFIG_DEBUG_FS=y is what you need for /sys/kernel/debug/dri/0/i915_error_state, and your config in comment #10 has that set. Make sure you have debugfs mounted with "mount -t debugfs none /sys/kernel/debug/".
Can you reproduce this reliably?
Do you have a reliable workaround?

Please attach a dmesg log from a current kernel (e.g., 34-rc6) where the problem occurs. I don't see a "pci=use_crs"-related problem in the old logs, but there have been several fixes in that area, so let's try a current kernel just in case it's related.

(In reply to comment #12)
> Oops, I think we dropped the ball on this one.
>
> It looks like CONFIG_DEBUG_FS=y is what you need for
> /sys/kernel/debug/dri/0/i915_error_state, and your config in comment #10 has
> that set. Make sure you have debugfs mounted with "mount -t debugfs none
> /sys/kernel/debug/".
>
> Can you reproduce this reliably?
I will try with -rc6
>
> Do you have a reliable workaround?
Yes: i915.modeset=0 (disable framebuffer) helps since -rc1.
Regards

-rc6 seems to be very stable. I will test -rc7 and report how it works. If OK, I will close the bug.
Regards

In -rc7 the bug still exists:
May 10 22:01:01 gumis kernel: [ 93.940050] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
May 10 22:01:01 gumis kernel: [ 93.940257] render error detected, EIR: 0x00000000
May 10 22:01:01 gumis kernel: [ 93.940486] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 ( awaiting 579 at 577)
May 10 22:01:02 gumis kdm[2370]: X server for display :0 terminated unexpectedly

Created attachment 26323 [details]
Debug information
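Comment #12 above asks for debugfs to be mounted before reading i915_error_state, and an empty /sys/kernel/debug (as seen earlier in this report) usually means that step is missing. Below is a minimal Python sketch, under stated assumptions, that checks the mount table for a debugfs entry and builds the path the developers asked for. The helper names `debugfs_mountpoints` and `error_state_path` are hypothetical; the only facts taken from this report are the filesystem type and the file path.

```python
import os
from pathlib import Path
from typing import List


def debugfs_mountpoints(mounts_text: str) -> List[str]:
    """Return the mount points whose filesystem type is debugfs.

    Each line of /proc/mounts has the whitespace-separated fields:
    device, mount point, filesystem type, options, dump, pass.
    """
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "debugfs":
            points.append(fields[1])
    return points


def error_state_path(mountpoint: str, card: int = 0) -> Path:
    """Build <mountpoint>/dri/<card>/i915_error_state (card 0 as in this report)."""
    return Path(mountpoint) / "dri" / str(card) / "i915_error_state"


if __name__ == "__main__":
    proc_mounts = Path("/proc/mounts")
    text = proc_mounts.read_text() if proc_mounts.exists() else ""
    points = debugfs_mountpoints(text)
    if not points:
        print("debugfs is not mounted; see the mount command in comment #12")
    for point in points:
        path = error_state_path(point)
        # os.path.exists returns False instead of raising when access is denied,
        # which is typical for debugfs when not running as root.
        print(path, "present" if os.path.exists(path) else "not readable or missing")
```

Run as root after a hang, the script should report the file as present if the kernel was built with CONFIG_DEBUG_FS=y and debugfs is mounted.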
CC

Same bug.
May 21 22:32:10 morsebook kernel: [ 3301.783038] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
May 21 22:32:10 morsebook kernel: [ 3301.783198] render error detected, EIR: 0x00000000
May 21 22:32:10 morsebook kernel: [ 3301.783222] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 520648 at 520645)
May 21 22:32:11 morsebook gdm[6392]: WARNING: gdm_slave_xioerror_handler: Fatal error X - Restart :0
(the last line is a guess - I have a localized version)
Freshly compiled 2.6.34.
I can provide some info if you tell me what to do - I have never participated in kernel debugging before.

I can confirm this bug on my netbook with Mobile 945GME as well:
May 25 07:45:00 basilisk kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
May 25 07:45:00 basilisk kernel: render error detected, EIR: 0x00000000
May 25 07:45:00 basilisk kernel: [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 20046 at 20044)
Happens on freshly compiled 2.6.34 with i915 KMS enabled. In my case, it happens mainly after manipulating video outputs (LVDS1 and VGA1) with xrandr: cycling them on and off, switching between them. In my case, X11 doesn't go down, but the image on screen freezes and whatever I do, it doesn't change. I do, however, see the mouse cursor moving. I can switch to a console (where everything is fine) and restart X, which does not help (black screen with moving cursor); the only thing I can do is reboot.
I don't know if it's connected in any way, but suspending and resuming with external VGA on results in violent screen flickering. Cycling the external VGA output off and back on helps.
Anyway, I'm willing to provide any assistance necessary, test new patches, etc.

Cross-linking https://bugzilla.redhat.com/show_bug.cgi?id=573177 since the bug seems to be the same.

Seeing this on a Dell Optiplex 760 (BIOS A00) with Intel X4500 graphics (Q45/Q43, 8086:2e12), since about 2.6.32.
Occasionally, X restarts with
Fatal server error: Failed to submit batchbuffer: Input/output error
in its logs, accompanied by the GPU hung messages. Intervals between this happening vary from minutes to days. The monitor is on a standard VGA outlet.
Conversely, I've *never* seen this on my laptop with Intel X3100 graphics and the same software.

It happens on my fairly old DELL Latitude D520. Works flawlessly on 2.6.33 but hangs every now and then on 2.6.34. I'm waiting for a new xorg-driver-intel though, maybe that will take care of it.
From syslog:
kernel: [18151.450071] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
kernel: [18151.450198] render error detected, EIR: 0x00000000
kernel: [18151.450265] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 1108629 at 1108626)
Fortunately, I can switch to a console and initiate a cold reboot.

Created attachment 26633 [details]
dump from '/sys/kernel/debug/dri/*'
This archive contains everything I was able to dump from /sys/kernel/debug/dri/* and the dmesg. The last two warnings from dmesg are because I tried to dump /sys/kernel/debug/dri/*/vma and the kernel did not agree with me.
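For anyone else asked to attach the same kind of dump, here is a minimal Python sketch of collecting a debugfs directory into one archive. It is an illustration only and not the method used for the attachment above. The function name `archive_tree` and its parameters are hypothetical. Skipping entries named `vma` by default is a choice based solely on the comment above, where reading that file produced kernel warnings. File contents are read explicitly instead of using `tarfile.add`, on the assumption that pseudo-filesystems commonly report a file size of 0, which would produce empty archive members.

```python
import io
import os
import tarfile
from typing import List, Sequence, Tuple


def archive_tree(
    root: str,
    out_path: str,
    skip_names: Sequence[str] = ("vma",),
    max_bytes: int = 16 * 1024 * 1024,
) -> Tuple[List[str], List[str]]:
    """Copy every readable file below root into a gzip-compressed tar archive.

    Returns (saved, skipped) as lists of paths relative to root.
    out_path should lie outside root so the archive does not include itself.
    """
    saved: List[str] = []
    skipped: List[str] = []
    with tarfile.open(out_path, "w:gz") as tar:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                rel = os.path.relpath(full, root).replace(os.sep, "/")
                if name in skip_names:
                    skipped.append(rel)
                    continue
                try:
                    with open(full, "rb") as fh:
                        data = fh.read(max_bytes)  # cap the size of any single entry
                except OSError:
                    # Unreadable entries are recorded and left out of the archive.
                    skipped.append(rel)
                    continue
                info = tarfile.TarInfo(name=rel)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
                saved.append(rel)
    return saved, skipped
```

A call such as `archive_tree("/sys/kernel/debug/dri", "/tmp/dri-dump.tar.gz")`, run as root right after a hang, would produce a single file to attach; the output path here is only an example.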
Mihai, the cause for your hangs appears to be a missing cache-line in the batch buffers:
...
0x0954a1ac: 0x7d800003: 3DSTATE_DRAWING_RECTANGLE
0x0954a1b0: 0x00000000: dword 1
0x0954a1b4: 0x00000000: dword 2
0x0954a1b8: 0x041a0578: dword 3
0x0954a1bc: 0x00000000: dword 4
0x0954a1c0: 0x00000000: MI_NOOP <-- start
0x0954a1c4: 0x00000000: MI_NOOP
0x0954a1c8: 0x00000000: MI_NOOP
0x0954a1cc: 0x00000000: MI_NOOP
0x0954a1d0: 0x00000000: MI_NOOP
0x0954a1d4: 0x00000000: MI_NOOP
0x0954a1d8: 0x00000000: MI_NOOP
0x0954a1dc: 0x09541eb0: MI_LOAD_SCAN_LINES_INCL
Bad length (50) in MI_LOAD_SCAN_LINES_INCL, [2, 2]
0x0954a1e0: HEAD 0x00000000: dword 1
0x0954a1e4: 0x00000000: dword 2
0x0954a1e8: 0x00000000: dword 3
0x0954a1ec: 0x00000000: dword 4
0x0954a1f0: 0x00000000: dword 5
0x0954a1f4: 0x00000000: dword 6
0x0954a1f8: 0x00000000: dword 7
0x0954a1fc: 0x00000000: dword 8
0x0954a200: 0x00028566: dword 9 <-- end
0x0954a204: 0x6ba00966: dword 10
...
0x0954aebc: 0x7d040031: 3DSTATE_LOAD_STATE_IMMEDIATE_1
0x0954aec0: 0x09542b10: S0
0x0954aec4: 0x00000000: S1 <--- start
0x0954aec8: 0x00000000: MI_NOOP
0x0954aecc: 0x00000000: MI_NOOP
0x0954aed0: 0x00000000: MI_NOOP
0x0954aed4: 0x00000000: MI_NOOP
0x0954aed8: 0x00000000: MI_NOOP
0x0954aedc: 0x00000000: MI_NOOP
0x0954aee0: 0x00000000: MI_NOOP
0x0954aee4: 0x00000000: MI_NOOP
0x0954aee8: 0x00000000: MI_NOOP
0x0954aeec: 0x00000000: MI_NOOP
0x0954aef0: 0x00000000: MI_NOOP
0x0954aef4: 0x00000000: MI_NOOP
0x0954aef8: 0x00000000: MI_NOOP
0x0954aefc: 0x00000000: MI_NOOP <-- end
0x0954af00: 0x03402000: MI UNKNOWN
0x0954af04: 0x01000000: MI_USER_INTERRUPT

So this appears to be a bad interaction between the kernel driver and the h/w.

Handled-By : Chris Wilson <chris@chris-wilson.co.uk>

(In reply to comment #20)
> Seeing this on a Dell Optiplex 760 (BIOS A00) with Intel X4500 graphics
> (Q45/Q43, 8086:2e12), since about 2.6.32.
Since upgrading to Fedora 13 (xorg-x11-drv-intel-2.11.0-4.fc13.x86_64, kernel-2.6.33.5-112.fc13.x86_64, libdrm-2.4.20-1.fc13.x86_64), I've not seen this yet on the above-mentioned workstation.

I suspected the xorg driver is at fault too, when I saw these in '/var/log/kdm.log':
intel_bufmgr_gem.c:1234: Error setting memory domains 346 (00000040 00000000): Input/output error
X: intel_bufmgr_gem.c:900: drm_intel_gem_bo_unreference_locked_timed: Assertion `((&bo_gem->refcount)->atomic) > 0' failed.
xf86-video-intel: 2.11.0
libdrm: 2.4.20
I'm running the 2.6.33 kernel now and while the problem still reproduces, it does so far less often.

OK, since the problem is reproducible with 2.6.33, it certainly is not a regression from that kernel, so dropping from the list of recent regressions.

same problem - kernel 2.6.34.1, xf86-video-intel 2.12, xorg-server 1.8.2, libdrm 2.4.21. X4500MHD (G45)

The crash is 100% reproducible here, by switching the wallpaper in KDE4. It's Eee PC 901, i945GME I suppose. Kernel is 2.6.34 vanilla, software is from Debian testing (or unstable, they are the same versions now: xorg-server 1.7.7, intel driver 2.9.1, Mesa 7.7.1, libdrm 2.4.18). I've noticed this for the first time on Jun 4, so it was reproducible with respective software versions.
The kernel says
[ 329.050105] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[ 329.050413] render error detected, EIR: 0x00000000
[ 329.050606] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 6890 at 6839)
The X says
Fatal server error: Failed to submit batchbuffer: Input/output error
X: ../../src/i830_batchbuffer.h:79: intel_batch_emit_dword: Assertion `pI830->batch_ptr != ((void *)0)' failed.
The X driver message is very popular in Google, including bugs in at least 3 distro BTS'es, though I'm not sure it was ever reported to the upstream.
you can run "x11perf -copywinwin500" as a test case, not depending on DE (like kde or gnome)

(In reply to comment #30)
> you can run "x11perf -copywinwin500" as a test case, not depending on DE
> (like
> kde or gnome)
It didn't crash, though apparently not all wallpaper changes trigger the crash (maybe switching to a 800x600 wallpaper will trigger it with less probability, maybe that's just coincidence).

Should be fixed by Dave's last pull request:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=944001201ca0196bcdb088129e5866a9f379d08c

So, will it be in 2.6.34.2?

It's earmarked for stable, so it should be included in the next 2.6.34.y release.

problem persist with G45 @ kernel 2.6.35-rc6.
Jul 26 11:39:50 lenovo kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Jul 26 11:39:50 lenovo kernel: [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 39243 at 39238)
Jul 26 11:39:50 lenovo kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Jul 26 11:39:50 lenovo kernel: [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 39253 at 39238)

(In reply to comment #35)
> problem persist with G45 @ kernel 2.6.35-rc6.
Unlikely to be the same problem as the original was a gen3 specific bug and you have a gen4. More likely you have hit one of the countless userspace bugs, but that is impossible to tell since you haven't once uploaded any debug information that is necessary to diagnose your bug.
Please open a new bug report and attach dmesg (with drm.debug=4), Xorg.0.log and /sys/kernel/debug/dri/0/i915_error_state [following a hang].

With latest (26-07-2010) git it still crashes:
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 355446 at 355443)
# lspci
00:00.0 (8086:2560) Host bridge: Intel Corporation 82845G/GL[Brookdale-G]/GE/PE DRAM Controller/Host-Hub Interface (rev 01)
00:02.0 (8086:2562) VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)

(In reply to comment #37)
> With latest (26-07-2010) git it still crashes:
> [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
> [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting
> 355446 at 355443)
>
> # lspci
> 00:00.0 (8086:2560) Host bridge: Intel Corporation
> 82845G/GL[Brookdale-G]/GE/PE
> DRAM Controller/Host-Hub Interface (rev 01)
> 00:02.0 (8086:2562) VGA compatible controller: Intel Corporation
> 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)
This is a completely different bug, most likely the 845G h/w erratum that we fail to take account of or the general gen2 incoherency issues. Impossible to tell without the debug information and the conflation of bug reports.

(In reply to comment #36)
> (In reply to comment #35)
> > problem persist with G45 @ kernel 2.6.35-rc6.
>
> Unlikely to be the same problem as the original was a gen3 specific bug and
> you
> have a gen4. More likely you have hit one of the countless userspace bugs,
> but
> that is impossible to tell since you haven't once uploaded any debug
> information that is necessary to diagnose your bug.
>
> Please open a new bug report and attach dmesg (with drm.debug=4), Xorg.0.log
> and /sys/kernel/debug/dri/0/i915_error_state [following a hang].
OK, thank you.

I have been running 2.6.35-rc6 for some time on 945GME. Nothing has happened so far, so the fix is certainly fixing at least some issues.

Confirmed; 2.6.35 seems to be very stable.
Regards |