Bug 31102

Summary: Corrupted display, unusable X-Server after using Chromium for some minutes.
Product: Drivers
Reporter: Stefan Bauer (stefan.andreas.bauer)
Component: Video(DRI - Intel)
Assignee: drivers_video-dri-intel (drivers_video-dri-intel)
Status: CLOSED UNREPRODUCIBLE
Severity: high
CC: aanisimov, chris, florian, maciej.rutecki, rjw
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.38-rc8
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 27352
Attachments: Fix tiling corruption.

Description Stefan Bauer 2011-03-15 03:31:47 UTC
The problem exists with Linux 2.6.38-rc* (tested rc5 to rc8), but not with Linux 2.6.37 or earlier.

Reproduce:
Browsing the web with Chromium (tested v9 and v10). After a few minutes, the screen partly freezes, colors are mixed up, fonts aren't rendered correctly, windows can't be closed, the mouse cursor is corrupted, etc. It is hardly possible to shut down the X server. Even if this succeeds, the server cannot be restarted; the whole system has to be rebooted.

dmesg extract:
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -11 (awaiting 118947 at 118945, next 118948)
[drm:i915_reset] *ERROR* Failed to reset chip.
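
The wait-request line above carries the most detail, so a small parser can help when comparing several hangs. The following is a minimal sketch, not part of the original report. It assumes the three numbers mean "seqno being awaited", "seqno the ring last reported" and "next seqno to be issued", which is a reading of the message wording rather than something stated in this thread. The names `WaitFailure` and `parse_wait_failure` are hypothetical. The return value -11 corresponds to -EAGAIN on Linux.

```python
import re
from typing import NamedTuple, Optional

# Matches both spellings seen in this report:
# i915_do_wait_request (2.6.38-rc8) and i915_wait_request (2.6.39).
WAIT_RE = re.compile(
    r"i915_(?:do_)?wait_request returns (-?\d+) "
    r"\(awaiting (\d+) at (\d+), next (\d+)\)"
)


class WaitFailure(NamedTuple):
    errno: int       # negative errno returned by the wait, e.g. -11
    awaited: int     # assumed: seqno the driver was waiting for
    current: int     # assumed: seqno the ring had reached
    next_seqno: int  # assumed: next seqno to be handed out

    @property
    def outstanding(self) -> int:
        """How far the ring was behind the awaited seqno."""
        return self.awaited - self.current


def parse_wait_failure(line: str) -> Optional[WaitFailure]:
    """Return the parsed fields, or None if the line is not a wait failure."""
    m = WAIT_RE.search(line)
    if m is None:
        return None
    return WaitFailure(*(int(g) for g in m.groups()))


if __name__ == "__main__":
    sample = (
        "[drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -11 "
        "(awaiting 118947 at 118945, next 118948)"
    )
    print(parse_wait_failure(sample))
```

Under that reading, the ring stopped two requests short of the one being awaited when the hangcheck timer fired.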

Software: Gentoo Linux
xorg-server-1.9.4
xf86-video-intel-2.14.0
libdrm-2.4.23
mesa-7.9.1
KDE 4.4.5

Hardware: Thinkpad Z60m
lspci
00:00.0 Host bridge: Intel Corporation Mobile 915GM/PM/GMS/910GML Express Processor to DRAM Controller (rev 03)
00:02.0 VGA compatible controller: Intel Corporation Mobile 915GM/GMS/910GML Express Graphics Controller (rev 03)
00:02.1 Display controller: Intel Corporation Mobile 915GM/GMS/910GML Express Graphics Controller (rev 03)
00:1b.0 Audio device: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) High Definition Audio Controller (rev 03)
00:1c.0 PCI bridge: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) PCI Express Port 1 (rev 03)
00:1c.1 PCI bridge: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) PCI Express Port 2 (rev 03)
00:1c.2 PCI bridge: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) PCI Express Port 3 (rev 03)
00:1c.3 PCI bridge: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) PCI Express Port 4 (rev 03)
00:1d.0 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) USB UHCI #1 (rev 03)
00:1d.1 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) USB UHCI #2 (rev 03)
00:1d.2 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) USB UHCI #3 (rev 03)
00:1d.3 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) USB UHCI #4 (rev 03)
00:1d.7 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) USB2 EHCI Controller (rev 03)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev d3)
00:1f.0 ISA bridge: Intel Corporation 82801FBM (ICH6M) LPC Interface Bridge (rev 03)
00:1f.2 IDE interface: Intel Corporation 82801FBM (ICH6M) SATA Controller (rev 03)
00:1f.3 SMBus: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) SMBus Controller (rev 03)
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5751M Gigabit Ethernet PCI Express (rev 11)
14:00.0 CardBus bridge: Ricoh Co Ltd RL5c476 II (rev b3)
14:00.1 FireWire (IEEE 1394): Ricoh Co Ltd R5C552 IEEE 1394 Controller (rev 08)
14:00.2 SD Host controller: Ricoh Co Ltd R5C822 SD/SDIO/MMC/MS/MSPro Host Adapter (rev 17)
14:00.3 System peripheral: Ricoh Co Ltd R5C592 Memory Stick Bus Host Adapter (rev 08)
14:02.0 Network controller: Intel Corporation PRO/Wireless 2915ABG [Calexico2] Network Connection (rev 05)
Comment 1 Chris Wilson 2011-03-19 12:46:54 UTC
Created attachment 51252 [details]
Fix tiling corruption.

This is the v2.6.38 version of the patch to fix the tiling corruption from which Chromium suffers.
Comment 2 Stefan Bauer 2011-03-19 17:03:42 UTC
Thank you. Unfortunately, the patch does not apply to 2.6.38:

$ git am ../0001-drm-i915-Fix-tiling-corruption-from-pipelined-fencin.patch 
Applying: drm/i915: Fix tiling corruption from pipelined fencing
error: patch failed: drivers/gpu/drm/i915/i915_gem.c:2601
error: drivers/gpu/drm/i915/i915_gem.c: patch does not apply
Patch failed at 0001 drm/i915: Fix tiling corruption from pipelined fencing


A more recent kernel from Linus' tree (5bab188a316718a26346cdb25c4cc6b319f8f907) crashes on my system even before X starts, so I can't test your patches based on this version.
Comment 3 Stefan Bauer 2011-03-28 06:20:04 UTC
Today I upgraded Chromium from 10.0.648.133 to 10.0.648.204, and now the problem is not reproducible anymore. I tested Kernels 2.6.38-rc8 and 2.6.38 from Linus' tree without any additional patches.

I can't decide whether there is still need for a fix in the kernel.
Comment 4 Chris Wilson 2011-03-29 08:13:22 UTC
*** Bug 32102 has been marked as a duplicate of this bug. ***
Comment 5 Eddy Petrișor 2011-03-29 20:17:57 UTC
Is this bug the same as bug 30512 ?
Comment 6 Chris Wilson 2011-03-30 07:41:58 UTC
(In reply to comment #5)
> Is this bug the same as bug 30512 ?

No. Different class of hardware unaffected by the pipelined fencing bug. That smells like a userspace driver bug cross-posted to kernel.org.
Comment 7 Chris Wilson 2011-04-05 12:51:16 UTC
(In reply to comment #3)
> Today I upgraded Chromium from 10.0.648.133 to 10.0.648.204, and now the
> problem is not reproducible anymore. I tested Kernels 2.6.38-rc8 and 2.6.38
> from Linus' tree without any additional patches.
> 
> I can't decide whether there is still need for a fix in the kernel.

Well your description matched others where the patch was useful, so assuming that it would have helped here:

commit 29c5a587284195278e233eec5c2234c24fb2c204
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Mar 17 15:23:22 2011 +0000

    drm/i915: Fix tiling corruption from pipelined fencing
    
    ... even though it was disabled. A mistake in the handling of fence reuse
    caused us to skip the vital delay of waiting for the object to finish
    rendering before changing the register. This resulted in us changing the
    fence register whilst the bo was active and so causing the blits to
    complete using the wrong stride or even the wrong tiling. (Visually the
    effect is that small blocks of the screen look like they have been
    interlaced). The fix is to wait for the GPU to finish using the memory
    region pointed to by the fence before changing it.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=34584
    Cc: Andy Whitcroft <apw@canonical.com>
    Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
    Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    [Note for 2.6.38-stable, we need to reintroduce the interruptible passing]
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Tested-by: Dave Airlie <airlied@linux.ie>
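
The ordering problem described in the commit message can be illustrated with a toy model. This is only a sketch of the idea (wait for the old owner to go idle before rewriting the fence register), not the kernel code and not the actual patch. Every name here (`BufferObject`, `FenceRegister`, `reuse_fence`, `wait_for_idle`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BufferObject:
    name: str
    stride: int
    active: bool = False  # True while the GPU is still rendering with it

    def wait_rendering(self) -> None:
        """Stand-in for blocking until the GPU has finished with this object."""
        self.active = False


@dataclass
class FenceRegister:
    owner: Optional[BufferObject] = None
    stride: int = 0


def reuse_fence(reg: FenceRegister, new_owner: BufferObject,
                wait_for_idle: bool = True) -> bool:
    """Hand `reg` to `new_owner`.

    Returns True if the previous owner was still active when the register
    was rewritten, i.e. in-flight blits could see the wrong stride or tiling.
    """
    old = reg.owner
    if wait_for_idle and old is not None and old.active:
        old.wait_rendering()  # the delay the commit message says was skipped
    hazard = old is not None and old.active
    reg.owner = new_owner
    reg.stride = new_owner.stride
    return hazard


if __name__ == "__main__":
    busy = BufferObject("busy", stride=512, active=True)
    reg = FenceRegister(owner=busy, stride=busy.stride)
    print(reuse_fence(reg, BufferObject("new", stride=2048), wait_for_idle=False))
```

With `wait_for_idle=False` the model reports a hazard because the register changes under an active object; with the default it does not.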
Comment 8 Stefan Bauer 2011-05-05 04:05:59 UTC
(In reply to comment #7)
> (In reply to comment #3)
> > Today I upgraded Chromium from 10.0.648.133 to 10.0.648.204, and now the
> > problem is not reproducible anymore. I tested Kernels 2.6.38-rc8 and 2.6.38
> > from Linus' tree without any additional patches.

I was wrong, I can still reproduce this bug with 2.6.38 and 2.6.39-rc6.

> > I can't decide whether there is still need for a fix in the kernel.
> 
> Well your description matched others where the patch was useful, so assuming
> that it would have helped here:
> 
> commit 29c5a587284195278e233eec5c2234c24fb2c204
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Thu Mar 17 15:23:22 2011 +0000
> 
>     drm/i915: Fix tiling corruption from pipelined fencing
> 
>     ... even though it was disabled. A mistake in the handling of fence reuse
>     caused us to skip the vital delay of waiting for the object to finish
>     rendering before changing the register. This resulted in us changing the
>     fence register whilst the bo was active and so causing the blits to
>     complete using the wrong stride or even the wrong tiling. (Visually the
>     effect is that small blocks of the screen look like they have been
>     interlaced). The fix is to wait for the GPU to finish using the memory
>     region pointed to by the fence before changing it.
> 
>     Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=34584
>     Cc: Andy Whitcroft <apw@canonical.com>
>     Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>     Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
>     [Note for 2.6.38-stable, we need to reintroduce the interruptible
>     passing]
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>     Tested-by: Dave Airlie <airlied@linux.ie>

Since the problem still exists in 2.6.39-rc6 and I cannot see this patch in either mainline or 2.6.38-stable, what will be the next step? Please also note that I wasn't able to test it, since it does not apply against 2.6.38.
Comment 9 Rafael J. Wysocki 2011-05-14 22:29:19 UTC
Fixed by commit 29c5a587284195278e233eec5c2234c24fb2c204 .
Comment 10 Stefan Bauer 2011-05-23 15:54:36 UTC
(In reply to comment #9)
> Fixed by commit 29c5a587284195278e233eec5c2234c24fb2c204 .

I'm sorry, but this issue isn't solved completely. I can still reproduce errors like the following just by using Chromium on some JavaScript- and image-intensive websites:

[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 59733 at 59731, next 59734)
[drm:i915_reset] *ERROR* Failed to reset chip.

I'm currently using:
Linux 2.6.39
xorg-server 1.9.5
libdrm 2.4.25
xf86-video-intel 2.14.0
mesa 7.10.2
chromium 11.0.696.68
Comment 11 Stefan Bauer 2011-05-30 05:43:45 UTC
Just to complete the picture: since 2.6.39, the effects of these GPU hangs aren't that drastic anymore; the system remains responsive. But here is what I just experienced:
- surfed with Chromium on google.com
- suddenly: X-Server completely blocked for about 2 sec
- X-Server recovered somehow, but lots of colors are confused, fonts are rendered incorrectly
- But windows still could be moved, closed etc.
- dmesg says "GPU hung" like in comment #10
- to be evil, now try to run glxgears
- X/kdm (?) crashed and restarted immediately, I log back in
- colors / fonts look fine now!
- try to run glxgears again: works fine now!
- the only thing that's still not working is XVideo

I will be testing 3.0-rcX sooner or later; hopefully we get this issue fixed for 3.0 ;)
Comment 12 Stefan Bauer 2012-01-30 12:11:21 UTC
This issue has been solved in the meantime.
Comment 13 Florian Mickler 2012-01-30 22:15:38 UTC
Do you have a pointer to the fix?
Else we rather close this as unreproducible.
Comment 14 Stefan Bauer 2012-01-31 09:30:31 UTC
(In reply to comment #13)
> Do you have a pointer to the fix?

Not really. v3.0 works, v2.6.39 doesn't, but I don't know which exact commit fixed it.

> Else we rather close this as unreproducible.

I'm fine with unreproducible, thanks.
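
With a known-broken version (v2.6.39) and a known-working one (v3.0), the fixing commit could in principle be located by bisection, which is what `git bisect` automates. The sketch below only shows the search logic on a linear list of commits; real kernel history is not linear, and the function name `first_fixed`, the commit labels and the `is_fixed` callback (standing in for "build, boot, browse with Chromium, check dmesg") are all hypothetical.

```python
from typing import Callable, Sequence


def first_fixed(commits: Sequence[str],
                is_fixed: Callable[[str], bool]) -> str:
    """Find the first commit at which the bug no longer reproduces.

    `commits` is ordered oldest to newest; commits[0] must be known broken
    and commits[-1] known fixed. Each call to `is_fixed` stands for one
    manual test run, so the search keeps the number of calls logarithmic.
    """
    if len(commits) < 2:
        raise ValueError("need at least one broken and one fixed commit")
    lo, hi = 0, len(commits) - 1  # lo: known broken, hi: known fixed
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_fixed(commits[mid]):
            hi = mid  # fix landed at mid or earlier
        else:
            lo = mid  # still broken at mid, fix is later
    return commits[hi]


if __name__ == "__main__":
    history = ["v2.6.39", "c1", "c2", "c3", "v3.0"]
    fixed = {"c2", "c3", "v3.0"}  # made-up outcome for the demo
    print(first_fixed(history, fixed.__contains__))
```

Because each test step here means minutes of browsing until a hang does or does not appear, a "fixed" verdict is weaker than a "broken" one, and a real bisection would need generous test time per step.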