Bug 44121 - Reproducible GPU lockup CP stall on Radeon HD 6450
Summary: Reproducible GPU lockup CP stall on Radeon HD 6450
Status: RESOLVED INVALID
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-07-01 19:54 UTC by Jean Delvare
Modified: 2012-07-06 13:13 UTC

See Also:
Kernel Version: 3.5-rc5
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
properly disable render backend (1.17 KB, patch)
2012-07-03 15:42 UTC, Jérôme Glisse
properly disable render backend (5.89 KB, patch)
2012-07-03 17:09 UTC, Jérôme Glisse
possible fix (5.90 KB, patch)
2012-07-03 17:31 UTC, Alex Deucher
possible fix (6.31 KB, patch)
2012-07-03 19:03 UTC, Alex Deucher

Description Jean Delvare 2012-07-01 19:54:00 UTC
With kernels 3.5-rc3 to 3.5-rc5, I hit a GPU lockup CP stall whenever I perform certain actions in Firefox, such as authenticating to access a site or triggering the download target selection window. I'm running Gnome 3.2 on openSUSE 12.1.

When this happens, the whole Gnome interface freezes, with gnome-shell stuck at 100% CPU. In the kernel logs I see the following:

radeon 0000:08:00.0: GPU lockup CP stall for more than 10000msec
radeon 0000:08:00.0: GPU lockup (waiting for 0x00000000000113f3 last fence id 0x00000000000113f0)
radeon 0000:08:00.0: GPU softreset 
radeon 0000:08:00.0:   GRBM_STATUS=0xE55008A0
radeon 0000:08:00.0:   GRBM_STATUS_SE0=0xEC000001
radeon 0000:08:00.0:   GRBM_STATUS_SE1=0x00000007
radeon 0000:08:00.0:   SRBM_STATUS=0x200000C0
radeon 0000:08:00.0:   GRBM_SOFT_RESET=0x00007F6B
radeon 0000:08:00.0:   GRBM_STATUS=0x00003828
radeon 0000:08:00.0:   GRBM_STATUS_SE0=0x00000007
radeon 0000:08:00.0:   GRBM_STATUS_SE1=0x00000007
radeon 0000:08:00.0:   SRBM_STATUS=0x200000C0
radeon 0000:08:00.0: GPU reset succeed
[drm] PCIE GART of 512M enabled (table at 0x0000000000040000).
radeon 0000:08:00.0: WB enabled
radeon 0000:08:00.0: fence driver on ring 0 use gpu addr 0x0000000020000c00 and cpu addr 0xffff88013557bc00
[drm] ring test on 0 succeeded in 0 usecs
[drm] ib test on ring 0 succeeded in 0 usecs
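
For anyone comparing several of these log samples, the two fence values on the "GPU lockup (waiting for ...)" line can be pulled out and subtracted to see how far the ring was behind when the stall was detected. The following is only a minimal sketch: the regular expression is written against the exact line format shown above, and the helper name `fence_gap` is hypothetical, not part of any radeon tooling.

```python
import re

# Pattern matches the radeon lockup line format seen in the log above.
LOCKUP_RE = re.compile(
    r"GPU lockup \(waiting for (0x[0-9a-fA-F]+) last fence id (0x[0-9a-fA-F]+)\)"
)


def fence_gap(line):
    """Return (waiting, last, waiting - last), or None if the line does not match."""
    match = LOCKUP_RE.search(line)
    if match is None:
        return None
    waiting = int(match.group(1), 16)
    last = int(match.group(2), 16)
    return waiting, last, waiting - last


line = (
    "radeon 0000:08:00.0: GPU lockup (waiting for 0x00000000000113f3 "
    "last fence id 0x00000000000113f0)"
)
waiting, last, gap = fence_gap(line)
print(f"waiting for fence {waiting:#x}, last signalled {last:#x}, gap {gap}")
```

In the sample above the gap is 3, i.e. the driver was waiting on a fence three ids past the last one reported as signalled.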

I have more samples if needed.

No problem when doing the same with kernel 3.4.4.

I ran "git bisect" and found that reverting the following commit fixes the problem:

commit 416a2bd274566a6f607a271f524b2dc0b84d9106
Author: Alex Deucher <alexander.deucher@amd.com>
Date:   Thu May 31 19:00:25 2012 -0400

    drm/radeon: fixup tiling group size and backendmap on r6xx-r9xx (v4)

    Tiling group size is always 256bits on r6xx/r7xx/r8xx/9xx. Also fix and
    simplify render backend map. This now properly sets up the backend map
    on r6xx-9xx which should improve 3D performance.

    Vadim benchmarked also:
    Some benchmarks on juniper (5750), fullscreen 1920x1080,
    first result - kernel 3.4.0+ (fb21affa), second - with these patches:

    Lightsmark:   91 fps => 123 fps    +35%
    Doom3:        74 fps => 101 fps    +36%

    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    Signed-off-by: Jerome Glisse <jglisse@redhat.com>
    Signed-off-by: Dave Airlie <airlied@redhat.com>

Let me know if you need more debugging information, I'll do whatever I can to help.
Comment 1 Alex Deucher 2012-07-03 14:53:04 UTC
Can you dump the following registers using radeonreg or avivotool (http://cgit.freedesktop.org/~airlied/radeontool/), once with the commit applied and once with it reverted, and attach both results?

CC_RB_BACKEND_DISABLE (0x98F4)
CC_SYS_RB_BACKEND_DISABLE (0x3F88)
GC_USER_RB_BACKEND_DISABLE (0x9B7C)
CC_GC_SHADER_PIPE_CONFIG (0x8950)
GB_BACKEND_MAP (0x98FC)

(as root):
radeonreg regmatch 0x98F4
etc.
Comment 2 Jean Delvare 2012-07-03 15:27:00 UTC
With the 3.5-rc5 kernel (failing):

0x98F4	0x00000001 (1)
0x3F88	0x00000001 (1)
0x9B7C	0x00000000 (0)
0x8950	0xfffcf001 (-200703)
0x98FC	0x00000000 (0)

With commit 416a2bd2 reverted (working):

0x98F4	0x00000001 (1)
0x3F88	0x00000001 (1)
0x9B7C	0x00fe0000 (16646144)
0x8950	0xfffcf001 (-200703)
0x98FC	0x00000000 (0)

So the value of register GC_USER_RB_BACKEND_DISABLE (0x9B7C) differs.
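
The comparison above can also be done mechanically, which helps when more registers are dumped. This is a rough sketch that assumes each dump line has the form "address value (decimal)" as pasted in this comment; `parse_dump`, `diff_dumps` and `set_bits` are hypothetical helper names, and the register names are the ones listed in comment #1.

```python
# Register names as given in comment #1.
NAMES = {
    0x98F4: "CC_RB_BACKEND_DISABLE",
    0x3F88: "CC_SYS_RB_BACKEND_DISABLE",
    0x9B7C: "GC_USER_RB_BACKEND_DISABLE",
    0x8950: "CC_GC_SHADER_PIPE_CONFIG",
    0x98FC: "GB_BACKEND_MAP",
}

FAILING = """
0x98F4 0x00000001 (1)
0x3F88 0x00000001 (1)
0x9B7C 0x00000000 (0)
0x8950 0xfffcf001 (-200703)
0x98FC 0x00000000 (0)
"""

REVERTED = """
0x98F4 0x00000001 (1)
0x3F88 0x00000001 (1)
0x9B7C 0x00fe0000 (16646144)
0x8950 0xfffcf001 (-200703)
0x98FC 0x00000000 (0)
"""


def parse_dump(text):
    """Map register address -> value, reading the first two hex fields per line."""
    regs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            regs[int(fields[0], 16)] = int(fields[1], 16)
    return regs


def diff_dumps(a, b):
    """Return {address: (value_in_a, value_in_b)} for registers that differ."""
    return {addr: (a[addr], b[addr])
            for addr in sorted(a.keys() & b.keys()) if a[addr] != b[addr]}


def set_bits(value):
    """List the bit positions that are set in a 32-bit value."""
    return [bit for bit in range(32) if (value >> bit) & 1]


for addr, (bad, good) in diff_dumps(parse_dump(FAILING), parse_dump(REVERTED)).items():
    print(f"{NAMES.get(addr, 'unknown')} (0x{addr:04X}): "
          f"failing=0x{bad:08x} working=0x{good:08x} "
          f"differing bits={set_bits(bad ^ good)}")
```

Only 0x9B7C is reported, with bits 17 through 23 set in the working case and clear in the failing one. What those bits mean is not stated in this report, so the sketch does not try to interpret them.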
Comment 3 Jérôme Glisse 2012-07-03 15:42:09 UTC
Created attachment 74671
properly disable render backend

Does this patch fix it?
Comment 4 Jean Delvare 2012-07-03 16:21:27 UTC
I tested the patch from comment #3, but unfortunately it doesn't solve the problem.
Comment 5 Jean Delvare 2012-07-03 16:39:19 UTC
With this patch applied, I get:

0x98F4	0x00000001 (1)
0x3F88	0x00000001 (1)
0x9B7C	0x00fe0000 (16646144)
0x8950	0xfffcf001 (-200703)
0x98FC	0x00000000 (0)
0x8954	0x00000000 (0)

So the value of register 0x9B7C is correct now, but this was not sufficient.
Comment 6 Jérôme Glisse 2012-07-03 17:09:41 UTC
Created attachment 74701
properly disable render backend

This one?
Comment 7 Alex Deucher 2012-07-03 17:31:40 UTC
Created attachment 74711
possible fix

Or this variant, although AFAIK programming the USER register variants shouldn't be necessary, as the default values (0) are valid.
Comment 8 Alex Deucher 2012-07-03 18:00:24 UTC
Does booting up a clean kernel without any patches applied or reverted work if you manually set the following registers to their "patch reverted" values using radeonreg? Just to be sure, write all of them even if the values are the same.  Do this without X running.

0x98F4
0x3F88
0x9B7C
0x8950
0x98FC
0x8954

e.g.,
radeonreg regset 0x8950 0xfffcf001
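
To avoid typos when writing all of them by hand, the command lines can be generated from the "patch reverted" dump. This is only a sketch: it prints the commands rather than running them, `regset_commands` is a hypothetical helper name, and the values are the five from the reverted dump in comment #2. No reverted value for 0x8954 appears in this report, so it is left out here and would have to be read on the reverted kernel first.

```python
# "Patch reverted" values from comment #2 (0x8954 was not dumped there).
REVERTED_VALUES = {
    0x98F4: 0x00000001,
    0x3F88: 0x00000001,
    0x9B7C: 0x00FE0000,
    0x8950: 0xFFFCF001,
    0x98FC: 0x00000000,
}


def regset_commands(values):
    """Build one 'radeonreg regset <address> <value>' line per register."""
    return [f"radeonreg regset 0x{addr:04X} 0x{val:08x}"
            for addr, val in values.items()]


# Print only; run the resulting lines as root, without X running.
for command in regset_commands(REVERTED_VALUES):
    print(command)
```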
Comment 9 Jean Delvare 2012-07-03 18:08:29 UTC
The patch from comment #6 doesn't work; testing the patch from comment #7 now.
Comment 10 Jean Delvare 2012-07-03 18:56:15 UTC
The patch from comment #7 did not work either. I then followed the instructions from comment #8, but that did not help either.
Comment 11 Alex Deucher 2012-07-03 19:03:52 UTC
Created attachment 74771
possible fix

Another possible fix, but I don't think it will help as it touches things never previously touched.  I don't think the issue is the USER registers, but it's worth a shot I suppose.
Comment 12 Jean Delvare 2012-07-03 21:40:52 UTC
The patch from comment #11 didn't work at all: not only did it fail to fix the original issue, it also caused additional trouble (gdm wouldn't even show up).
Comment 13 Jean Delvare 2012-07-05 07:30:03 UTC
Reproducibility information:

* I cannot reproduce the GPU lockup on a Radeon HD 4350 card.

* On the Radeon HD 6450, I can reproduce the GPU lockup with applications other than Firefox. I was able to do so with Claws Mail for example. The parent window has to be maximized for it to happen. Then, as soon as a title-less dialog box is opened (for example by pressing Ctrl+S for "Save As..."), the GPU lockup happens.
Comment 14 Jean Delvare 2012-07-06 12:09:24 UTC
I managed to fix the problem with a user-space stack update. I updated:

* libdrm from version 2.4.26 to 2.4.33
* Mesa from version 7.11 to 8.0.3
* libX11 from xorg-x11-libX11 version 7.6 to libX11 version 1.5.0

and I no longer see the GPU lockup. So I guess I can close this bug as invalid, since the actual bug appears to have been in user-space.
Comment 15 Alex Deucher 2012-07-06 13:13:17 UTC
That's the problem with GPU drivers.  It's impossible to test every combination of userspace and kernel drivers and there can be very subtle bugs with certain combinations like this one that are almost impossible to track down.
