Bug 15276

Summary: latest git kernel: general protection fault: 0000 [#1]
Product: Drivers
Reporter: Maciej Rutecki (maciej.rutecki)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: CLOSED CODE_FIX
Severity: normal
CC: alexdeucher, andreas.wallberg, florian, glisse, maciej.rutecki, markus, neuro, papadako, rjw, shawn.starr, thellstrom
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.33-rc7
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 14885
Attachments:
  Xorg.0.log
  lspci
  Kernel 2.6.33-rc7 configuration
  message.log after applying the patch
  dmesg output regarding radeon_fence.c
  dmesg output for two screen black-outs
  Rendering artifacts in Firefox when compositing is enabled in KDE/Kwin

Description Maciej Rutecki 2010-02-11 19:37:58 UTC
Subject    : latest git kernel: general protection fault: 0000 [#1]
Submitter  : Markus Trippelsdorf <markus@trippelsdorf.de>
Date       : 2010-02-09 8:36
Message-ID : 20100209083605.GA1766@arch.tripp.de
References : http://marc.info/?l=linux-kernel&m=126570498804223&w=2

This entry is being used for tracking a regression from 2.6.32.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 Markus Trippelsdorf 2010-02-12 07:36:21 UTC
Other people are also seeing the same problem; see:
http://bbs.archlinux.org/viewtopic.php?pid=706873#p706873
Comment 2 Jérôme Glisse 2010-02-12 09:34:45 UTC
The bug is different; the kms code is a rewrite. You are using 2.6.33-rc7 without any modification, right? (i.e. you did not merge another branch into it). Also, please attach an Xorg log and the output of lspci -vvnn.
Comment 3 Markus Trippelsdorf 2010-02-12 09:51:57 UTC
Created attachment 24997 [details]
Xorg.0.log
Comment 4 Markus Trippelsdorf 2010-02-12 09:52:49 UTC
Created attachment 24998 [details]
lspci
Comment 5 Markus Trippelsdorf 2010-02-12 09:53:53 UTC
(In reply to comment #2)
> Bug is different, the kms code is a rewrite. You are using 2.6.33-rc7 without
> any modification right ? (ie you did not merge other branch in it).

Yes, I'm running a plain Linus' git kernel without modifications (kms is enabled).
Comment 6 Michał Witkowski 2010-02-12 19:04:16 UTC
I'm having similar issues to the ones described in the linked post:
http://bbs.archlinux.org/viewtopic.php?pid=706873#p706873

I have a Radeon Mobility 3450. The weird hangs started to appear after applying the patch to fix bug 15186 to vanilla 2.6.33-rc6. Before that I never had hangs with KMS.

Packages:
mesa-git 20100119
xf86-video-ati-git 20100119

I just compiled kernel 2.6.33-rc7, I'll report if the hangs appear here too.
Comment 7 Michał Witkowski 2010-02-12 21:05:20 UTC
Ok, full system specs:
Lenovo U330
Arch Linux current (as of 20100212) x86_64
Radeon Mobility 3450
Kernel 2.6.33-rc7 (and before 2.6.33-rc6 with fix for bug 15186 and vanilla 2.6.33-rc6)
mesa, libdrm, glproto, xf86-video-ati etc. from git 20100212 (and before 20100119)
KDE 4.4.0 (from Arch's KDEmod stable)
using radeon KMS without any additional patches 
no settings in xorg.conf

Problem:
System freezes on various graphics actions. The system is unresponsive to keyboard, can't SSH in, external (HDMI-connected) monitor reports lack of signal.

Occurrences:
1. When displaying "Present windows"
2. When hovering over window thumbnails on the taskbar
3. On link hover in Firefox 3.6
4. Playing various flash-based games (numerous times) like Hex Empire (yea, killing time :P)

Curious thing is, that these crashes didn't happen in vanilla 2.6.33-rc6. They started appearing in a kernel built with the patch for bug 15186. It still happens in kernel 2.6.33-rc7 (one crash in Hex Empire, one in present windows).

Please note: these crashes occur in Kwin with and _without_ compositing enabled. A plain-old 2D desktop is also prone to this problem.

Unfortunately I've got no log information or any segfaults to show. If there is a way I could assist with information, please let me know.
Comment 8 Alex Deucher 2010-02-12 22:27:39 UTC
(In reply to comment #7)
> Curious thing is, that these crashes didn't happen in vanilla 2.6.33-rc6.
> They
> started appearing in a kernel built with the patch for bug 15186. It still
> happens in kernel 2.6.33-rc7 (one crash in Hex Empire, one in present
> windows).

Any chance you can use git bisect to track down the problematic commit?
Comment 9 Andreas Wallberg 2010-02-12 22:30:56 UTC
I think I better post my Archlinux information here as well:

I run kernel + X + ATI git packages which are now a few days old (since I am
unable to compile mesa-full) but for more than a week, my machine has crashed
at least once or twice a day, sometimes more. I can not ssh to it. I run KDE SC
4.4 with compositing and crashes have a tendency to happen on these occasions:

i) Leaving the machine unattended for a while, so that both the laptop and the
external monitor are suspended or powered off. When I return I have found it
dead a number of times, it just does not wake up.

This has not happened after I turned off "Let PowerDevil manage screen
powersaving", but that was only yesterday so I am not sure if that really made
any difference.

ii) Using "Present Windows" as an effect for window switching with Kwin and
pressing Alt+Tab two or more times VERY quickly has crashed the computer on at
least four occasions.

iii) Pressing (maybe even just hovering, I am not sure) a tab in a KDE app like
Dolphin or GTK app like Firefox has brought the machine down as well.

My laptop is a Thinkpad W500 with a FireGL 5700/ATI Radeon Mobility 3650. Of
the various triggers reported above, the "Present Windows" one has been the
most reliable one for me. I never managed to catch anything in logs though.

I should add though that after turning Compositing off and rebooting, I have
not had any crash so far in about 20 hours with the same usage pattern as
before.
Comment 10 Michał Witkowski 2010-02-13 07:41:38 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > Curious thing is, that these crashes didn't happen in vanilla 2.6.33-rc6.
> They
> > started appearing in a kernel built with the patch for bug 15186. It still
> > happens in kernel 2.6.33-rc7 (one crash in Hex Empire, one in present
> windows).
> 
> Any chance you can use git bisect to track down the problematic commit?

Could you please clarify how this should be done? A small howto would be nice, as I'm not good with git.

I presume this means building numerous kernels at various git points, right? Is there a way to narrow it down to only the radeon patches so that the number of kernels rebuilt will be lower?

The problem is, I can't find a 100% reproducible case. The closest I got is the flash Hex Empire game, which crashes after 2-5 minutes. Testing it that way would be problematic though.

Just a question: If I revert to vanilla 2.6.33-rc6 and see that there are no hard-locks, would that be enough to pinpoint the problem to the bug 15186 fix? I started encountering problems after applying the fix to vanilla 2.6.33-rc6.
Comment 11 Markus Trippelsdorf 2010-02-13 08:08:14 UTC
Michael, the first thing I would try is to revert the bug 15186 fix:
git revert 062b389c8704e539e234cfd67c7e034a514f50bf
Then rebuild your kernel and try to reproduce the problem.
If the bug goes away, we know the cause...

If the bug is still there you can try to git bisect the problem:
First reset git:
git reset --hard v2.6.33-rc8
Then start bisecting by:
git bisect start
git bisect bad      (Current version is bad)
git bisect good v2.6.33-rc6

Rebuild the kernel and start testing.
If the build is bug free, run: git bisect good
If the kernel still has the bug, run: git bisect bad 
Then repeat the above until you have pinpointed the problem.

See man git-bisect for more info.
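
As a rough guide to how many kernels a bisect like the one above needs: git bisect is a binary search over the commit range, so the number of builds grows with the logarithm of the number of candidate commits. The Python sketch below only illustrates that idea. The function names and the figure of 1000 commits are illustrative assumptions, not the real size of the v2.6.33-rc6..v2.6.33-rc8 range.

```python
def max_bisect_builds(num_commits):
    """Upper bound on builds needed to find the first bad commit
    among `num_commits` candidates (ceil(log2(n)))."""
    if num_commits < 1:
        raise ValueError("need at least one candidate commit")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1.
    return (num_commits - 1).bit_length()


def find_first_bad(commits, is_bad):
    """Binary-search `commits` (oldest first) for the first bad one.

    Assumes the newest commit is bad and that every commit after
    the first bad one is also bad. Returns (commit, tests_run).
    """
    lo, hi = 0, len(commits) - 1  # first bad commit is in [lo, hi]
    tests = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tests += 1  # one kernel build plus one test session
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid + 1
    return commits[lo], tests


if __name__ == "__main__":
    # Hypothetical range of 1000 commits with commit 637 as the culprit.
    culprit, builds = find_first_bad(list(range(1000)), lambda c: c >= 637)
    print(culprit, builds, max_bisect_builds(1000))
```

To reduce the number of builds further, git bisect start accepts a path limit (for example, git bisect start -- drivers/gpu/drm), which restricts the search to commits touching those paths. Note that the search assumes each verdict is reliable; with an intermittent lockup, a kernel should probably be tested for a long time before it is marked good.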
Comment 12 Michał Witkowski 2010-02-13 09:41:32 UTC
I did the following:
1. I've reverted to 2.6.33rc6 with bug 15186 fix applied. It crashed 3 minutes into playing Hex Empire.

2. I've reverted to 2.6.33rc6 vanilla (no patches). I've successfully run 3-4 full games of Hex Empire (10-15 minutes of play each) and used compositing for 2 hours without a single crash. What's curious is that I don't see the artifacts the bug 15186 fix was there to solve. The ugly semi-transparent yellowish strips that appeared instead of KWin shadows are gone. I can only see little artifacts on the KDE panel clock when doing the "x11perf -create" test. It seems that something in mesa/libdrm/glproto/xf86-video-ati fixed most of the corruption between 20100119 and 20100212.

Of course this doesn't mean that the bug 15186 fix is responsible for the crashes. I'll run this (2.6.33rc6 vanilla) kernel for the weekend. If I see any crashes I'll report.

Also, I've tried building 2.6.33-rc7 with the patch provided by Jerome in bug 15186 reversed. Unfortunately, I get patching rejects. Could anyone point me to the patchfile from git which would revert cleanly in 2.6.33-rc7?
Comment 13 Michał Witkowski 2010-02-13 10:41:33 UTC
Ok, never mind what I said above.

I've just had a crash with vanilla 2.6.33-rc6 playing Hex Empire. It seems that the bug 15186 fix isn't the source of the crash, it just causes it to happen more often.
Comment 14 Panagiotis Papadakos 2010-02-13 13:43:07 UTC
It seems that I also have the same bug, as in http://bugzilla.kernel.org/show_bug.cgi?id=15276#c7
In my case it is a hard lockup, with keyboard leds blinking.
But in my case it seems that it happens only with compositing
on in KDE SC 4.4. It seems it is fairly easy to reproduce
when hovering over chromium and firefox URLs. Maybe it is related
to the shadow effect? I will disable it and report back if I get it
again.

KMS: on
kernel: 2.6.33-rc8
radeon: today's git
mesa: today's git
drm: today's git
compositing: on
Comment 15 Michał Witkowski 2010-02-13 14:44:12 UTC
(In reply to comment #14)
> But in my case it seem that it happens only with compositing
> on in KDE SC 4.4. It seems it is fairly easy to reproduce
> when hovering over chromium and firefox URLs. 

Well, for me it happened two or three times without compositing while playing flash-based Hex Empire. The numerous other times it froze were with compositing.
I only once had a crash when hovering links in firefox. I just played around in chromium and nothing happened. Could you provide an example page/exact link?
Comment 16 Panagiotis Papadakos 2010-02-13 15:44:28 UTC
Well, it seems that the shadow lockup I reported above is a different one from what you have reported (I just play with many URLs in chromium, for example, and
after some minutes it locks).

I was able though to reproduce your problem, by simply playing some flv videos with mplayer, with composite (and no shadows) and without compositing. In this lockup there are no blinking keyboard leds. 

This is on an X2300 card (which should not be affected by http://bugzilla.kernel.org/show_bug.cgi?id=15186 ).
Comment 17 Michał Witkowski 2010-02-13 17:18:09 UTC
(In reply to comment #16)
> I was able though to reproduce your problem, by simply playing some flv
> videos
> with mplayer, with composite (and no shadows) and without compositing. In
> this
> lockup there are no blinking keyboard leds. 

Could you please post links to the flv files which caused your problems?
Also which mplayer renderer are you using?

I wasn't able to generate these lockups even when playing big 1080p files.

I really hope we can find a 100% reproducible example of the lockups.
Comment 18 Panagiotis Papadakos 2010-02-13 17:27:21 UTC
I don't think it is related to some specific flv, since I can't reproduce it here with some specific one. It seems that I have to play a number of flvs and then it randomly locks up (or visit a number of sites that use flash).

For the mplayer I was using the xv renderer.
Comment 19 Michał Witkowski 2010-02-14 08:39:07 UTC
Alright, a fellow Arch user was able to provide a test case which causes the crash to appear:
http://bbs.archlinux.org/viewtopic.php?pid=708448#p708448

Basically, you need radeon KMS, KDE SC 4.4, 4x konsole open (with transparency) and compositing enabled in KWin with these effect settings (OpenGL, thumbnails only for shown windows, texture from pixmap, bilinear, direct rendering, use vsync, smooth scaling).

Steps to reproduce:
1. Open 4 konsole windows (preferably each on a separate virtual desktop) with transparency enabled
2. Open Present Windows
3. Move the mouse pointer in a circle (I did this anti-clockwise) over each window, so that they take turns being highlighted. Try to complete five or six circles. If you suffer from the same kind of bug I do, your system is expected to crash here.
4. If it didn't crash, quit Present windows and go to 2.
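
The circular pointer motion in step 3 could in principle be scripted, so that the test case runs without a person at the mouse. The Python sketch below is only an illustration: it assumes the xdotool utility is installed and that Present Windows is already open, and the screen centre, radius and step count are made-up values that would need adjusting to the actual window layout. It only prints the commands; it does not move the pointer itself.

```python
import math


def circle_points(cx, cy, radius, steps_per_lap, laps):
    """Integer (x, y) pointer positions tracing `laps` circles around
    (cx, cy), anti-clockwise as seen on screen."""
    points = []
    for i in range(steps_per_lap * laps):
        angle = 2 * math.pi * (i % steps_per_lap) / steps_per_lap
        x = round(cx + radius * math.cos(angle))
        # Screen y grows downward, so subtract the sine term to get
        # anti-clockwise motion on screen.
        y = round(cy - radius * math.sin(angle))
        points.append((x, y))
    return points


def xdotool_commands(points):
    """One 'xdotool mousemove X Y' command line per pointer position."""
    return [f"xdotool mousemove {x} {y}" for x, y in points]


if __name__ == "__main__":
    # Hypothetical 1280x800 screen: six laps of 36 steps around the centre.
    for cmd in xdotool_commands(circle_points(640, 400, 200, 36, 6)):
        print(cmd)
```

The printed lines can be piped to a shell, ideally with a short sleep between moves so that KWin has time to highlight each window.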

korpenkraxar (the method reporter) managed to crash his system like this five times. I've tried it and 3 times out of 3 I managed to crash my system.

Could others please try it, so that we can be sure it's a 100% test case?

Devs, what would be the next steps? How can we gather logs or anything else useful?
Comment 20 Panagiotis Papadakos 2010-02-14 17:46:41 UTC
Well, it seems that I can't reproduce it with the above steps (I also can't get transparency in the konsoles, although I enable it in the profiles). Again I got a lockup (without compositing) while watching a flash video in the browser, with the keyboard leds blinking. This means a kernel Oops if I am not wrong, but unfortunately I could not trace it in the messages or any other of my system's logs.

P.S.
1) Maybe the lockups are irrelevant to the radeon driver? syslog shows some weird messages like the one below:

TCP: Peer 77.49.168.178:26010/57630 unexpectedly shrunk window 2809809291:2809816491 (repaired),

which happens for example when watching a flash video from the web. Do you have such messages? Are your systems online during the lockup?

2) My next step is to try to get the lockup with the laptop offline.
Comment 21 Andreas Wallberg 2010-02-14 23:37:45 UTC
@Panagiotis Papadakos: the KMS + compositing (+pointer interaction?) hard lock produced by the "KDE4" test case has, as far as I have noticed, never resulted in blinking keyboard leds on my Thinkpad. Maybe the flash video crash is something different?
Comment 22 Andreas Wallberg 2010-02-15 15:53:46 UTC
The "KDE4" lock-up occurs in rc8 + today's git as well.
Comment 23 Andreas Wallberg 2010-02-15 19:22:51 UTC
I had my first "flash" crash today, playing back a news video in flash in full-screen video over at http://www.svt.se . They have updated that particular video during the day so I can not point to an exact address. When this crash happened the sound continued stuttering/repeating what was in some small buffer over and over again. No blinking leds.
Comment 24 Michał Witkowski 2010-02-15 19:47:54 UTC
(In reply to comment #23)
> When this crash
> happened the sound continued stuttering/repeating what was in some small
> buffer
> over and over again. No blinking leds.

That's exactly what's happening here. Audio from Amarok is playing from a small buffer repeatedly, no blinking leds.
Comment 25 Shawn Starr 2010-02-15 20:33:12 UTC
GPU: RV635

I have been able to reproduce this using the steps from Comment #19 every time. Sometimes I have to switch away from Present Windows and back, but it will lock up the GPU. I get no kernel panic, no flashing keyboard LEDs on the laptop.

If audio is playing, it repeats a buffer of some part of the audio over and over.
Comment 26 Jérôme Glisse 2010-02-16 14:24:36 UTC
I can't reproduce this (4 different r6xx/r7xx cards tested so far), please attach your kernel configuration. I have been trying with KDE 4.4, which is in Fedora 13.
Comment 27 Jérôme Glisse 2010-02-16 14:26:28 UTC
Also please report which version of ddx & xorg server you are using.
Comment 28 Panagiotis Papadakos 2010-02-16 16:36:56 UTC
Well, I got again the lockup, during some normal browsing (no flash this time,
and no blinking leds). This is with 2.6.33-rc8-git, today's git tree for ddx and xorg server 1.7.99.2. Well, since Michal says that he gets the lockup with rc6, I am going to test rc5.
Comment 29 Michał Witkowski 2010-02-16 17:35:54 UTC
Created attachment 25065 [details]
Kernel 2.6.33-rc7 configuration

As for other versions:


kernel26-rc 2.6.33rc7-2

xorg-apps 7.5-3
xorg-server 1.7.4.901-1
xorg-server-utils 7.5-3
xorg-utils 7.5-1
xorg-xinit 1.2.0-1
xorg-xkb-utils 7.5-2

xf86-video-ati-git 20100212-1
xf86-video-intel-git 20100212-1
glproto-git 20100212-1
libdrm-git 20100212-1
mesa-full 20100212-1 (AUR git package of mesa)

System is x86_64 with Radeon 3450. I've also had the crashes with packages from 20100119.
Comment 30 Jérôme Glisse 2010-02-16 22:21:31 UTC
Are you all using laptops? If so, please give the brand and model.
If you are using a desktop computer, please give the motherboard brand and model, the cpu model and how much memory you have. Thanks.
Comment 31 Markus Trippelsdorf 2010-02-17 05:13:54 UTC
In my case it's a normal desktop computer:
Asus M4A78T-E motherboard, AMD Phenom II X4 955 Cpu, 4Gb Memory (non-ECC).

These GPU hangs happen 1-3 times per day here, mostly while running
flash-based browser applications (google-earth, videos, etc.).

[drm] Initialized drm 1.1.0 20060810
[drm] radeon defaulting to kernel modesetting.
[drm] radeon kernel modesetting enabled.
radeon 0000:01:05.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
radeon 0000:01:05.0: setting latency timer to 64
[drm] radeon: Initializing kernel modesetting.
[drm] register mmio base: 0xFBEE0000
[drm] register mmio size: 65536
ATOM BIOS: 113
[drm] Clocks initialized !
[drm] Detected VRAM RAM=192M, BAR=256M
[drm] RAM width 32bits DDR
[TTM] Zone  kernel: Available graphics memory: 1993064 kiB.
[drm] radeon: 192M of VRAM memory ready
[drm] radeon: 512M of GTT memory ready.
[drm] radeon: irq initialized.
[drm] GART: num cpu pages 131072, num gpu pages 131072
[drm] Loading RS780 Microcode
platform radeon_cp.0: firmware: using built-in firmware radeon/RS780_pfp.bin
platform radeon_cp.0: firmware: using built-in firmware radeon/RS780_me.bin
platform radeon_cp.0: firmware: using built-in firmware radeon/R600_rlc.bin
[drm] ring test succeeded in 1 usecs
[drm] radeon: ib pool ready.
[drm] ib test succeeded in 0 usecs
[drm] Enabling audio support
[drm] Radeon Display Connectors
[drm] Connector 0:
[drm]   VGA
[drm]   DDC: 0x7e40 0x7e40 0x7e44 0x7e44 0x7e48 0x7e48 0x7e4c 0x7e4c
[drm]   Encoders:
[drm]     CRT1: INTERNAL_KLDSCP_DAC1
[drm] Connector 1:
[drm]   DVI-D
[drm]   HPD3
[drm]   DDC: 0x7e50 0x7e50 0x7e54 0x7e54 0x7e58 0x7e58 0x7e5c 0x7e5c
[drm]   Encoders:
[drm]     DFP3: INTERNAL_KLDSCP_LVTMA
[drm] fb mappable at 0xD0141000
[drm] vram apper at 0xD0000000
[drm] size 7257600
[drm] fb depth is 24
[drm]    pitch is 6912
Comment 32 Panagiotis Papadakos 2010-02-17 10:04:23 UTC
In my case it is a Sony CR-21 laptop, with 2GB memory and an ATI X2300 card with
64M VRAM, running 64bit Ubuntu lucid.
Comment 33 Andreas Wallberg 2010-02-17 10:30:56 UTC
Lenovo Thinkpad W500
ATI FireGL 5700, RV635
4GB RAM

dmesg | grep "drm":

[drm] Initialized drm 1.1.0 20060810  
[drm] radeon kernel modesetting enabled.
[drm] radeon: Initializing kernel modesetting.
[drm] register mmio base: 0xCFFF0000          
[drm] register mmio size: 65536               
[drm] Clocks initialized !                    
[drm] Detected VRAM RAM=256M, BAR=256M        
[drm] RAM width 128bits DDR                   
[drm] radeon: 256M of VRAM memory ready       
[drm] radeon: 512M of GTT memory ready.       
[drm] radeon: using MSI.                      
[drm] radeon: irq initialized.                
[drm] GART: num cpu pages 131072, num gpu pages 131072
[drm] Loading RV635 Microcode                         
[drm] ring test succeeded in 1 usecs                  
[drm] radeon: ib pool ready.                          
[drm] ib test succeeded in 0 usecs                    
[drm] Enabling audio support                          
[drm] Radeon Display Connectors                       
[drm] Connector 0:                                    
[drm]   DVI-I                                         
[drm]   HPD3                                          
[drm]   DDC: 0x7e60 0x7e60 0x7e64 0x7e64 0x7e68 0x7e68 0x7e6c 0x7e6c
[drm]   Encoders:                                                   
[drm]     DFP1: INTERNAL_UNIPHY                                     
[drm] Connector 1:                                                  
[drm]   LVDS                                                        
[drm]   DDC: 0x7e40 0x7e40 0x7e44 0x7e44 0x7e48 0x7e48 0x7e4c 0x7e4c
[drm]   Encoders:                                                   
[drm]     LCD1: INTERNAL_KLDSCP_LVTMA                               
[drm] Connector 2:                                                  
[drm]   DisplayPort                                                 
[drm]   HPD1                                                        
[drm]   DDC: 0x7e20 0x7e20 0x7e24 0x7e24 0x7e28 0x7e28 0x7e2c 0x7e2c
[drm]   Encoders:                                                   
[drm]     DFP2: INTERNAL_UNIPHY                                     
[drm] Connector 3:                                                  
[drm]   VGA                                                         
[drm]   DDC: 0x7e50 0x7e50 0x7e54 0x7e54 0x7e58 0x7e58 0x7e5c 0x7e5c
[drm]   Encoders:                                                   
[drm]     CRT1: INTERNAL_KLDSCP_DAC1                                
[drm] fb mappable at 0xD0141000                                     
[drm] vram apper at 0xD0000000                                      
[drm] size 9216000                                                  
[drm] fb depth is 24                                                
[drm]    pitch is 7680

I use a DisplayPort->DVI adapter to hook my laptop to an external monitor and keep having these errors reported in dmesg, but I guess they are unrelated to lock-ups:

[drm:atom_dp_get_link_status] *ERROR* displayport link status failed
[drm:dp_link_train] *ERROR* clock recovery failed
Comment 34 Andreas Wallberg 2010-02-18 23:08:34 UTC
I have been experimenting with various kernel and userland versions this evening, but without much progress I must add. I can trigger the crash with:

Kernel:

kernel26-git: 2.6.33rc8 as of 20100227

Xorg: (this is what we have in Archlinux atm and the same versions as Michał posted before):

xorg-apps 7.5-3
xorg-server 1.7.4.901-1
xorg-server-utils 7.5-3
xorg-utils 7.5-1
xorg-xinit 1.2.0-1
xorg-xkb-utils 7.5-2

DRM and drivers:

xf86-video-ati-git 20100217
glproto-git 20100217
driproto-git 20100217
libdrm-git 20100217
mesa-full 20100217

mesa-full is an AUR git package of mesa which provides:
libgl mesa freeglut glut ati-dri intel-dri mach64-dri mga-dri r128-dri savage-dri tdfx-dri unichrome-dri

Swapping the kernel for an older 2.6.33rc5 version built on 20100126:
... crash!

Swapping other git packages to older versions:

xf86-video-ati-git 20100118
glproto-git 20100206
driproto-git 20100116
libdrm-git 20100118
mesa-full 20100118

and trying different kernels:

kernel26-git 20100227: ... crash!
kernel26-git 20100126: ... crash!

I do not remember having these crashes back in the second half of January so I wonder if there is something else going on here as well.

Do you all use KDE SC 4.4 and/or Kwin? Do you have these problems with other window managers as well?

Could it be that the recent builds of KDE have exposed an old bug that was there all along?
Comment 35 Panagiotis Papadakos 2010-02-19 14:36:30 UTC
A user has posted a backtrace using 2.6.32 in the phoronix post about this problem.

See http://www.phoronix.com/forums/showpost.php?p=113530&postcount=9
Comment 36 Michał Witkowski 2010-02-21 17:40:16 UTC
(In reply to comment #30)
> Are you all using laptop ? if so please give the brand and model.
> If you are using desktop computer please give motherboard brand model, cpu
> model and how much memory you have thanks.

Yes, I'm using a Lenovo Ideapad U330 laptop. It has 4GB DDR3 memory, Intel P8400 and Radeon 3450.

Jerome, do you need any more info? How can we help?
Comment 37 Michał Witkowski 2010-02-22 11:57:12 UTC
I've just had a similar crash to the one described above in Inkscape drawing some Bezier curves with KWin's compositing turned _off_. This crashing is a real problem in terms of reliability if you want to use the open-source KMS drivers for real work.
Comment 38 Jérôme Glisse 2010-02-22 12:20:12 UTC
I don't know how to guide someone through debugging such an issue. I don't think I will have anything until I am able to reproduce it on one of my configurations.
Comment 39 Jérôme Glisse 2010-02-22 12:22:26 UTC
The best thing you can do is to try to find an automatic test case, for instance using gtkperf or any other program that can run automatically and will trigger the lockup.
Comment 40 Michał Witkowski 2010-02-23 22:49:18 UTC
Both x11perf and gtkperf work flawlessly here :(
Any other tests I could try?
Comment 42 Andreas Wallberg 2010-02-24 00:15:42 UTC
patch -Np1 -i kms_fix.patch
patching file drivers/gpu/drm/ttm/ttm_tt.c
Hunk #1 FAILED at 467.
Hunk #2 FAILED at 486.
Hunk #3 FAILED at 510.
Hunk #4 FAILED at 522.
Hunk #5 FAILED at 544.
Hunk #6 FAILED at 556.
patch unexpectedly ends in middle of line
Hunk #7 FAILED at 582.
7 out of 7 hunks FAILED -- saving rejects to file drivers/gpu/drm/ttm/ttm_tt.c.rej

Could someone please provide the correct patch command for this one?
Comment 43 Pauli 2010-02-24 03:30:12 UTC
> --- Comment #42 from Andreas Wallberg <andreas.wallberg@gmail.com> 
> 2010-02-24 00:15:42 ---
> patch -Np1 -i kms_fix.patch
> patching file drivers/gpu/drm/ttm/ttm_tt.c
> Hunk #1 FAILED at 467.
> Hunk #2 FAILED at 486.
> Hunk #3 FAILED at 510.
> Hunk #4 FAILED at 522.
> Hunk #5 FAILED at 544.
> Hunk #6 FAILED at 556.
> patch unexpectedly ends in middle of line
> Hunk #7 FAILED at 582.
> 7 out of 7 hunks FAILED -- saving rejects to file
> drivers/gpu/drm/ttm/ttm_tt.c.rej
>
> Could someone please provide the correct patch command for this one?

git apply is more robust. But you will probably need newer kernel source.
Comment 44 Michał Witkowski 2010-02-24 17:42:08 UTC
Well, first of all I haven't touched the ttm_tt patch yet.

However, after upgrading userland mesa, libdrm, glproto and  xf86-video-ati to latest git from 20100217 to 20100222 I stopped having the crashes using the KWin Present Windows test case. Kernel is still 2.6.33-rc7. 

To be clear: with version 20100222 and kernel 2.6.33-rc7 I can't reproduce the crashes using the Present Windows test case. The test case caused crashes with version 20100217 with the same kernel.
Comment 45 Michał Witkowski 2010-02-24 18:35:03 UTC
(In reply to comment #44)
> Well, first of all I haven't touched the ttm_tt patch yet.
> 
> However, after upgrading userland mesa, libdrm, glproto and  xf86-video-ati
> to
> latest git from 20100217 to 20100222 I stopped having the crashes using the
> KWin Present Windows test case. Kernel is still 2.6.33-rc7. 
> 
> To be clear: with version 20100222 and kernel 2.6.33-rc7 I can't reproduce
> the
> crashes using the Present Windows test case. The test case caused crashes
> with
> version 20100217 with the same kernel.

It seems that it only stopped crashing in Present Windows. I've just had two crashes: in Inkscape and in a flash game.
Comment 46 Andreas Wallberg 2010-02-24 23:44:04 UTC
Well, I have tried git apply on everything from 2.6.33 stable all the way back to rc5, with this result:
 
git apply ttm_fix.patch
error: patch failed: drivers/gpu/drm/ttm/ttm_tt.c:467
error: drivers/gpu/drm/ttm/ttm_tt.c: patch does not apply

I have to admit I have no idea which version of ttm_tt.c this patch is meant for.
Comment 47 Andreas Wallberg 2010-02-26 10:39:08 UTC
(In reply to comment #44)
> Well, first of all I haven't touched the ttm_tt patch yet.
> 
> However, after upgrading userland mesa, libdrm, glproto and  xf86-video-ati
> to
> latest git from 20100217 to 20100222 I stopped having the crashes using the
> KWin Present Windows test case. Kernel is still 2.6.33-rc7. 
> 
> To be clear: with version 20100222 and kernel 2.6.33-rc7 I can't reproduce
> the
> crashes using the Present Windows test case. The test case caused crashes
> with
> version 20100217 with the same kernel.

Interesting, my system is just as crash-prone as before, running 2.6.33 stable and a graphics stack from 20100224.
Comment 48 Michał Witkowski 2010-03-06 11:00:41 UTC
I'm running:
kernel 2.6.33 final (Arch testing stock version)
mesa GIT 20100306
libdrm GIT 20100306
glproto GIT 20100306
xf86-video-ati GIT 20100306

I can now again reproduce the crashes from #19, however this time the screen doesn't go blank and the monitor doesn't report "no signal". Now it's more of a freeze. I can see the original image of the "Present Windows" frozen, but I can't do anything except move the mouse (I can't click).
Comment 49 Andreas Wallberg 2010-03-06 21:37:00 UTC
For me, the bug never went away, although I noticed that the test case may be taking a little longer to trigger recently.

As I wrote above, I am using a dual screen arrangement. When the crash happens, the external screen always goes black. This mostly happens with the internal laptop LCD too, but every now and then it may instead, over a couple of seconds, go from whatever was displayed on the screen into being completely white. On one occasion I also got thin green stripes on the white screen.
Comment 50 Michał Witkowski 2010-03-07 08:06:31 UTC
Oh yeah, I forgot to mention I've also got a laptop with an external monitor attached through HDMI.
Comment 51 Michał Witkowski 2010-03-07 09:44:47 UTC
At last some log info!

Tested using:
kernel 2.6.33 final (Arch testing stock version)
mesa GIT 20100306
libdrm GIT 20100306
glproto GIT 20100306
xf86-video-ati GIT 20100306

The crash occurred during the #19 test case. This time the picture was off, and my external monitor (connected via HDMI) said "Mode not supported, recommended 1980x1200 60Hz". After that I needed to hard-off the machine.
I've checked the logs, /var/log/messages.log states:
Mar  7 10:37:03 hermes kernel: radeon 0000:01:00.0: GPU softreset 
Mar  7 10:37:03 hermes kernel: radeon 0000:01:00.0:   R_008010_GRBM_STATUS=0xA0003030
Mar  7 10:37:03 hermes kernel: radeon 0000:01:00.0:   R_008014_GRBM_STATUS2=0x00000003
Mar  7 10:37:03 hermes kernel: radeon 0000:01:00.0:   R_000E50_SRBM_STATUS=0x200000C0
Mar  7 10:37:03 hermes kernel: radeon 0000:01:00.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
Mar  7 10:37:03 hermes kernel: radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
Mar  7 10:37:03 hermes kernel: radeon 0000:01:00.0:   R_000E60_SRBM_SOFT_RESET=0x00000402
Mar  7 10:37:03 hermes kernel: radeon 0000:01:00.0:   R_008010_GRBM_STATUS=0x00003030
Mar  7 10:37:03 hermes kernel: radeon 0000:01:00.0:   R_008014_GRBM_STATUS2=0x00000003
Mar  7 10:37:03 hermes kernel: radeon 0000:01:00.0:   R_000E50_SRBM_STATUS=0x200000C0
Mar  7 10:37:06 hermes kernel: radeon 0000:01:00.0: GPU softreset 
Mar  7 10:37:06 hermes kernel: radeon 0000:01:00.0:   R_008010_GRBM_STATUS=0xA0003030
Mar  7 10:37:06 hermes kernel: radeon 0000:01:00.0:   R_008014_GRBM_STATUS2=0x00000003
Mar  7 10:37:06 hermes kernel: radeon 0000:01:00.0:   R_000E50_SRBM_STATUS=0x200000C0
Mar  7 10:37:06 hermes kernel: radeon 0000:01:00.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
Mar  7 10:37:06 hermes kernel: radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
Mar  7 10:37:06 hermes kernel: radeon 0000:01:00.0:   R_000E60_SRBM_SOFT_RESET=0x00000402
Mar  7 10:37:06 hermes kernel: radeon 0000:01:00.0:   R_008010_GRBM_STATUS=0x00003030
Mar  7 10:37:06 hermes kernel: radeon 0000:01:00.0:   R_008014_GRBM_STATUS2=0x00000003
Mar  7 10:37:06 hermes kernel: radeon 0000:01:00.0:   R_000E50_SRBM_STATUS=0x200000C0
Mar  7 10:37:08 hermes kernel: radeon 0000:01:00.0: GPU softreset 
Mar  7 10:37:08 hermes kernel: radeon 0000:01:00.0:   R_008010_GRBM_STATUS=0xA0003030
Mar  7 10:37:08 hermes kernel: radeon 0000:01:00.0:   R_008014_GRBM_STATUS2=0x00000003
Mar  7 10:37:08 hermes kernel: radeon 0000:01:00.0:   R_000E50_SRBM_STATUS=0x200000C0
Mar  7 10:37:08 hermes kernel: radeon 0000:01:00.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
Mar  7 10:37:08 hermes kernel: radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
Mar  7 10:37:08 hermes kernel: radeon 0000:01:00.0:   R_000E60_SRBM_SOFT_RESET=0x00000402
Mar  7 10:37:08 hermes kernel: radeon 0000:01:00.0:   R_008010_GRBM_STATUS=0x00003030
Mar  7 10:37:08 hermes kernel: radeon 0000:01:00.0:   R_008014_GRBM_STATUS2=0x00000003
Mar  7 10:37:08 hermes kernel: radeon 0000:01:00.0:   R_000E50_SRBM_STATUS=0x200000C0
Mar  7 10:37:11 hermes kernel: radeon 0000:01:00.0: GPU softreset 
Mar  7 10:37:11 hermes kernel: radeon 0000:01:00.0:   R_008010_GRBM_STATUS=0xA0003030
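
The repeated register dumps above are easier to compare when grouped per reset. The following is a minimal sketch, not part of the driver or of any reporter's tooling: it assumes the syslog line format shown in this excerpt (a "GPU softreset" line followed by `R_xxxxxx_NAME=0x...` lines), and the names `summarize_resets`, `status_change` and `SAMPLE` are hypothetical helpers introduced only for illustration. It reports the first and last GRBM_STATUS value logged for each reset and the bits that differ, without interpreting what those bits mean.

```python
import re

# Timestamp prefix as it appears in the excerpt, e.g. "Mar  7 10:37:06".
TS = r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})"
# A "GPU softreset" marker line starts a new group.
RESET_RE = re.compile(TS + r".*GPU softreset")
# Register dump lines such as "R_008010_GRBM_STATUS=0xA0003030".
REG_RE = re.compile(TS + r".*?\b(?P<reg>R_[0-9A-F]{6}_\w+)=(?P<val>0x[0-9A-Fa-f]+)")

# Sample copied from the log excerpt in this report.
SAMPLE = """\
Mar  7 10:37:06 hermes kernel: radeon 0000:01:00.0: GPU softreset 
Mar  7 10:37:06 hermes kernel: radeon 0000:01:00.0:   R_008010_GRBM_STATUS=0xA0003030
Mar  7 10:37:06 hermes kernel: radeon 0000:01:00.0:   R_008014_GRBM_STATUS2=0x00000003
Mar  7 10:37:06 hermes kernel: radeon 0000:01:00.0:   R_000E50_SRBM_STATUS=0x200000C0
Mar  7 10:37:06 hermes kernel: radeon 0000:01:00.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
Mar  7 10:37:06 hermes kernel: radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
Mar  7 10:37:06 hermes kernel: radeon 0000:01:00.0:   R_000E60_SRBM_SOFT_RESET=0x00000402
Mar  7 10:37:06 hermes kernel: radeon 0000:01:00.0:   R_008010_GRBM_STATUS=0x00003030
Mar  7 10:37:06 hermes kernel: radeon 0000:01:00.0:   R_008014_GRBM_STATUS2=0x00000003
Mar  7 10:37:06 hermes kernel: radeon 0000:01:00.0:   R_000E50_SRBM_STATUS=0x200000C0
"""


def summarize_resets(lines):
    """Group register lines under the 'GPU softreset' line that precedes them."""
    resets = []
    current = None
    for line in lines:
        marker = RESET_RE.search(line)
        if marker:
            current = {"time": marker.group("ts"), "regs": []}
            resets.append(current)
            continue
        reg = REG_RE.search(line)
        # Register lines seen before any reset marker are ignored.
        if reg and current is not None:
            current["regs"].append((reg.group("reg"), int(reg.group("val"), 16)))
    return resets


def status_change(reset, reg="R_008010_GRBM_STATUS"):
    """Return (first, last, first XOR last) for one register, or None if logged < 2 times."""
    values = [value for name, value in reset["regs"] if name == reg]
    if len(values) < 2:
        return None
    return values[0], values[-1], values[0] ^ values[-1]


if __name__ == "__main__":
    for reset in summarize_resets(SAMPLE.splitlines()):
        change = status_change(reset)
        if change is not None:
            before, after, diff = change
            print(f"{reset['time']}: GRBM_STATUS {before:#010x} -> {after:#010x} "
                  f"(changed bits {diff:#010x})")
```

To run it against a full log, pass the lines of the file instead of `SAMPLE`, for example `summarize_resets(open("messages.log"))`; the file name here is only a placeholder.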
Comment 52 Jérôme Glisse 2010-03-10 15:41:55 UTC
I pushed a tree with a rework of GPU reset and GPU lockup detection. Some users indicate that it works better than the current code, so please give it a try:

Full tree:
git://git.kernel.org/pub/scm/linux/kernel/git/glisse/drm-radeon.git
branch drm-radeon-next

Or apply these 3 patches:
http://patchwork.kernel.org/patch/84310/
http://patchwork.kernel.org/patch/84312/
http://patchwork.kernel.org/patch/84311/

Thanks for testing
Comment 53 Michał Witkowski 2010-03-11 09:44:37 UTC
Will these patches apply to 2.6.33 or 2.6.34-rc1?

Cheers,
Michal
Comment 54 Jérôme Glisse 2010-03-11 15:35:21 UTC
They should, but I haven't tested them against those.
Comment 55 Michał Witkowski 2010-03-12 19:34:40 UTC
Created attachment 25492 [details]
message.log after applying the patch

I tried 2.6.34-rc1, KDE 4.4.1, mesa/libdrm/xf86-video-ati from 20100306. Tested with the Present Windows testcase.

Observations:
After a few seconds of moving the mouse around the presented windows, the screen goes out for a second but later comes back. Please note: there is _no more_ permanent freeze after this. Each "mini freeze" is noted in messages.log (given in attachment).

The above patches are a huge improvement :) They don't eliminate the bug yet, but they make the freezes not permanent :) Awesome work Jerome!
Comment 56 Andreas Wallberg 2010-03-12 20:56:28 UTC
(In reply to comment #55)
> Created an attachment (id=25492) [details]
> message.log after applying the patch
> 
> I tried 2.6.34-rc1, KDE 4.4.1, mesa/libdrm/xf86-video-ati from 20100306.
> Tested
> with the Present Windows testcase.
> 
> Observations:
> After a few seconds of moving the mouse around the presented windows, the
> screen goes out for a second but later comes back. Please note: there is _no
> more_ permanent freeze after this. Each "mini freeze" is noted in
> messages.log
> (given in attachment).
> 
> The above patches are a huge improvement :) They don't eliminate the bug,
> yet,
> but make the freezes not permanent :) Awesome work Jerome!

Wow, that sounds absolutely fantastic. Just for the record, could you please provide the commands to get this patched easily? We're both using Arch. Did you hack the PKGBUILD? Yes, I am a git n00b, and lazy.
Comment 57 Andreas Wallberg 2010-03-13 00:59:46 UTC
Created attachment 25493 [details]
dmesg output regarding radeon_fence.c

This appears in the log as video output goes black for about a second, using patches applied to Linus' git tree as of 2010-03-13.
Comment 58 Michał Witkowski 2010-03-21 15:02:26 UTC
Jerome, will the fix be merged into the mainline?

Cheers,
Michal
Comment 59 Rafael J. Wysocki 2010-03-21 19:45:21 UTC
Handled-By : Jérôme Glisse <glisse@freedesktop.org>
Comment 60 Andreas Wallberg 2010-03-27 22:51:48 UTC
Created attachment 25736 [details]
dmesg output for two screen black-outs

I just updated the kernel/mesa/radeon-stack to current git versions and had the screen black out on me two times within a minute or so just after logging into KDE. The first one happened while scrolling in Firefox, the other happened when starting a Konsole session as the window appeared with the "Scale In" effect. Both times the graphics recovered. An output from dmesg is attached.
Comment 61 Andreas Wallberg 2010-03-27 23:20:52 UTC
Created attachment 25737 [details]
Rendering artifacts in Firefox when compositing is enabled in KDE/Kwin

Scrolling in Firefox produces rendering artifacts with duplicated chunks of text and graphics. This seems to be associated with more kernel messages:

[drm:radeon_cs_ioctl] *ERROR* Failed to parse relocation !
[drm:radeon_cs_ioctl] *ERROR* Failed to parse relocation !
CE: hpet increasing min_delta_ns to 15000 nsec
[drm:radeon_cs_ioctl] *ERROR* Failed to parse relocation !
[drm:radeon_cs_ioctl] *ERROR* Failed to parse relocation !

A screenshot is attached.
Comment 62 Jérôme Glisse 2010-03-29 09:39:12 UTC
Andreas, please avoid mixing issues; the last issue you reported is a userspace bug. See:
https://bugs.freedesktop.org/show_bug.cgi?id=27284

On the lockup side, the patches should hopefully hit Linus' tree soon. Note that those patches don't fix the root cause of the lockup, which is hopefully an incorrect command stream, but they are a first step toward fixing the userspace issue.
Comment 63 Andreas Wallberg 2010-03-29 13:17:37 UTC
(In reply to comment #62)
> Andreas, please avoid mixing issues; the last issue you reported is a
> userspace bug. See:
> https://bugs.freedesktop.org/show_bug.cgi?id=27284
> 
> On the lockup side, the patches should hopefully hit Linus' tree soon. Note
> that those patches don't fix the root cause of the lockup, which is hopefully
> an incorrect command stream, but they are a first step toward fixing the
> userspace issue.

Yeah, sorry. I posted too soon. I noticed this too when I started downgrading packages.
Comment 64 Michał Witkowski 2010-04-13 10:19:32 UTC
Hi,

What's the state of the lockup detection in 2.6.34? Any chance of getting this in? I successfully used your patches on 2.6.34-rc2, but I fail to apply them on 2.6.34-rc4. Are they already there? If not, could you please provide patches against 2.6.34-rc4?

Cheers,
Michal
Comment 65 Jérôme Glisse 2010-04-13 15:41:53 UTC
So it won't get into 2.6.34: it's considered too big and leads to issues on other GPUs, so it's scheduled for inclusion in 2.6.35. There's not much I can do in the meantime; I am trying to fix the issues on other GPUs, sorry.
Comment 66 Florian Mickler 2010-12-07 20:30:46 UTC
commit 225758d8ba4fdcc1e8c9cf617fd89529bd4a9596
Author: Jerome Glisse <jglisse@redhat.com>
Date:   Tue Mar 9 14:45:10 2010 +0000

    drm/radeon/kms: fence cleanup + more reliable GPU lockup detection V4