Bug 52491

Summary: radeon massive screen corruption BARTS HD6870
Product: Drivers
Reporter: Bruno Jacquet (maxijac)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED CODE_FIX
Severity: high
CC: alexdeucher, florian, glisse
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.8-rc
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments:
  return to desktop after bad rendering
  When the rendering is bad inside the game
  kernel crash I got from older commits while bisecting
  Exclude system placement

Description Bruno Jacquet 2013-01-08 18:42:53 UTC
Created attachment 90801
return to desktop after bad rendering

Hi,

I'm experiencing major rendering corruption with Linux 3.8-rc and my HD 6870.

Software:
latest kernel from Linus as of 2013-01-08
latest Mesa git as of 2013-01-08
latest LLVM from tstellar git as of 2013-01-08
latest DDX from git as of 2013-01-08
libdrm 2.4.40

Symptoms:

I triggered this several times running Heroes of Newerth. When a match starts, sometimes the textures are all black, and sometimes my cursor is missing.
(My Mesa builds with LLVM enabled for the GLSL compiler seem to trigger the black textures more often.)

When this happens and I quit the game and return to my desktop, everything is garbled and things do not refresh correctly. See screenshot.

Keeping the same userland and just downgrading to Linux 3.7 solves everything.

Nothing gets added to dmesg...

I don't have much time for bisecting this. I'll try as soon as I can, but it won't be for a few days, so if someone has similar hardware, please try to reproduce it. HoN is free to play and runs natively on Linux (http://www.heroesofnewerth.com).
Comment 1 Bruno Jacquet 2013-01-08 18:45:10 UTC
Created attachment 90811
When the rendering is bad inside the game

This is right before I quit the game, after which all the bad stuff happens.
Comment 2 Bruno Jacquet 2013-01-10 23:00:51 UTC
Still happening in 3.8-rc3.
Comment 3 Bruno Jacquet 2013-01-12 18:27:13 UTC
I bisected it, though it looks like the behavior changed halfway: I was getting kernel crashes at this commit, for example.

d2ead3eaf8a4bf92129eda69189ce18a6c1cc8bd is the first bad commit                                                    
commit d2ead3eaf8a4bf92129eda69189ce18a6c1cc8bd                                                                     
Author: Alex Deucher <alexander.deucher@amd.com>                                                                    
Date:   Thu Dec 13 09:55:45 2012 -0500                                                                              
                                                                                                                    
    drm/radeon/kms: add evergreen/cayman CS parser for async DMA (v2)                                               
                                                                                                                    
    Allows us to use the DMA ring from userspace.                                                                   
    DMA doesn't have a good NOP packet in which to embed the                                                        
    reloc idx, so userspace has to add a reloc for each                                                             
    buffer used and order them to match the command stream.                                                         
                                                                                                                    
    v2: fix address bounds checking                                                                                 
                                                                                                                    
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>                                                         
                                                                                                                    
:040000 040000 7183de0d56e5c01b40775244d5dc4b5441406786 f3abce52c375cc4598cd23739df825771e6fb46e M      drivers
Comment 4 Alex Deucher 2013-01-12 22:27:22 UTC
Probably a bad bisect.  That commit just enables userspace accel drivers to utilize the DMA engine, but the userspace drivers do not take advantage of that yet so the code is currently never called.
Comment 5 Bruno Jacquet 2013-01-12 22:41:42 UTC
Hm, I was expecting something like that.
I'm experiencing 2 bugs in 1.
This one was a plain crash, while the first one, which I opened the report about, was a corruption with no crash.

Maybe I should re-bisect the kernel and flag "good" when it just crashes without corruption?
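
A minimal sketch of that re-bisect rule, assuming the tester records each kernel's result by hand as one of three hypothetical labels ("ok", "corruption", "crash"). The returned numbers follow the exit-code convention of `git bisect run` (0 is good, 1 to 127 except 125 is bad, 125 is skip); the function name and the labels are illustrative and not part of any tool.

```python
# Exit codes understood by `git bisect run`:
# 0 = good, 1..127 (except 125) = bad, 125 = skip this revision.
GOOD, BAD, SKIP = 0, 1, 125


def bisect_exit_code(outcome: str, crash_counts_as: str = "good") -> int:
    """Map a manually observed outcome to a `git bisect run` exit code.

    outcome: "ok", "corruption" or "crash" (hypothetical labels).
    crash_counts_as: "good" follows the proposal above (a crash without
    corruption is not the bug being hunted); "skip" leaves crashing
    revisions out of the search instead.
    """
    if crash_counts_as not in ("good", "skip"):
        raise ValueError(f"unknown crash policy: {crash_counts_as!r}")
    if outcome == "corruption":
        return BAD
    if outcome == "ok":
        return GOOD
    if outcome == "crash":
        return GOOD if crash_counts_as == "good" else SKIP
    raise ValueError(f"unknown outcome: {outcome!r}")
```

Treating a crash as "skip" rather than "good" may be the safer choice when a crashing kernel never gets far enough to show whether the corruption is present.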
Comment 6 Bruno Jacquet 2013-01-12 23:15:25 UTC
After further testing, it seems that there are two bugs, as I said.
The first one I'm seeing (and the one this report is about) seems to be caused by commit dd54fee7d440c4a9756cce2c24a50c15e4c17ccb.

The crash I also got probably comes from an older commit, but as the crash is pretty random, I got the bisect wrong.

Judging by its commit message, dd54fee7d440c4a9756cce2c24a50c15e4c17ccb fixes a kernel crash that looks like mine. I'll attach a screenshot of mine (poor quality :/).

==> So maybe dd54fee DID fix the kernel crash but replaced it with the corruption I'm seeing?
Comment 7 Bruno Jacquet 2013-01-12 23:18:59 UTC
Created attachment 91101
kernel crash I got from older commits while bisecting
Comment 8 Michel Dänzer 2013-01-15 14:36:15 UTC
(In reply to comment #6)
> ==> So maybe dd54fee DID fix the kernel crash but replaced it with the
> corruption I'm seeing ?

Does the corruption also occur with dd54fee7d440c4a9756cce2c24a50c15e4c17ccb applied manually on top of 0d0b3e7443bed6b49cb90fe7ddc4b5578a83a88d?
Comment 9 Alex Deucher 2013-01-15 17:37:44 UTC
Does reverting the following commit fix the issue?

commit d025e9e2b890db679f1246037bf65bd4be512627
Author: Jerome Glisse <jglisse@redhat.com>
Date:   Thu Nov 29 10:35:41 2012 -0500

    drm/radeon: do not move bo to different placement at each cs

    The bo creation placement is where the bo will be. Instead of trying
    to move bo at each command stream let this work to another worker
    thread that will use more advance heuristic.

    agd5f: remove leftover unused variable

    Signed-off-by: Jerome Glisse <jglisse@redhat.com>
    Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Comment 10 Bruno Jacquet 2013-01-15 19:26:08 UTC
(In reply to comment #8)
> (In reply to comment #6)
> > ==> So maybe dd54fee DID fix the kernel crash but replaced it with the
> > corruption I'm seeing ?
> 
> Does the corruption also occur with dd54fee7d440c4a9756cce2c24a50c15e4c17ccb
> applied manually on top of 0d0b3e7443bed6b49cb90fe7ddc4b5578a83a88d?

With patch dd54fee7d applied on top of 0d0b3e7, I see no corruption.
Comment 11 Bruno Jacquet 2013-01-15 19:38:05 UTC
(In reply to comment #9)
> Does reverting the following commit fix the issue?
> 
> commit d025e9e2b890db679f1246037bf65bd4be512627
> Author: Jerome Glisse <jglisse@redhat.com>
> Date:   Thu Nov 29 10:35:41 2012 -0500
> 
>     drm/radeon: do not move bo to different placement at each cs
> 
>     The bo creation placement is where the bo will be. Instead of trying
>     to move bo at each command stream let this work to another worker
>     thread that will use more advance heuristic.
> 
>     agd5f: remove leftover unused variable
> 
>     Signed-off-by: Jerome Glisse <jglisse@redhat.com>
>     Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

It does fix the corruption.
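
For anyone repeating the two experiments from comments 8 and 9, the following sketch builds the git command sequences involved. It assumes a kernel git checkout is the current directory; the helper names are hypothetical, and the `head` argument (for example a tag such as `v3.8-rc3`) is an assumption, since the thread does not say which tree the revert was tested on. The functions only return the commands; nothing is executed.

```python
from typing import List


def cherry_pick_test(base: str, fix: str) -> List[List[str]]:
    """Commands to test `fix` applied on top of `base` (comment 8)."""
    return [
        ["git", "checkout", "--detach", base],
        # --no-commit applies the change to the working tree only.
        ["git", "cherry-pick", "--no-commit", fix],
    ]


def revert_test(head: str, suspect: str) -> List[List[str]]:
    """Commands to test `head` with `suspect` reverted (comment 9)."""
    return [
        ["git", "checkout", "--detach", head],
        # --no-edit keeps the default revert commit message.
        ["git", "revert", "--no-edit", suspect],
    ]


# Commit ids taken from this report.
COMMENT_8 = cherry_pick_test(
    "0d0b3e7443bed6b49cb90fe7ddc4b5578a83a88d",
    "dd54fee7d440c4a9756cce2c24a50c15e4c17ccb",
)
```

Each command list can be passed to `subprocess.run(cmd, check=True)` before building and booting the resulting kernel.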
Comment 12 Alex Deucher 2013-01-15 20:57:27 UTC
Same issue as:
https://bugs.freedesktop.org/show_bug.cgi?id=58659
Comment 13 Jérôme Glisse 2013-01-16 22:31:46 UTC
Created attachment 91421
Exclude system placement

Does applying this patch without reverting anything fix the issue?
Comment 14 Jérôme Glisse 2013-01-17 00:20:44 UTC
Better to try this patch first instead:
http://people.freedesktop.org/~glisse/0001-drm-radeon-exclude-system-placement-when-validating-.patch
Comment 15 Bruno Jacquet 2013-01-17 19:26:36 UTC
(In reply to comment #14)
> Better to try this patch first instead:
>
> http://people.freedesktop.org/~glisse/0001-drm-radeon-exclude-system-placement-when-validating-.patch

With this patch, my game froze before I could even check the rendering. My cursor still moved and I could switch to tty1.
I checked dmesg: nothing was added. I went back to tty7 (X), and then it was stuck there.
Comment 16 Florian Mickler 2013-01-26 10:50:36 UTC
A patch referencing this bug report has been merged in Linux v3.8-rc5:

commit 20707874fd4fd37e09513f508e642fa8bd06365a
Author: Alex Deucher <alexander.deucher@amd.com>
Date:   Thu Jan 17 13:10:50 2013 -0500

    Revert "drm/radeon: do not move bo to different placement at each cs"
Comment 17 Bruno Jacquet 2013-01-27 14:04:31 UTC
Indeed, Linux 3.8-rc5 with no patch applied is working now; I see no corruption.
Comment 18 Bruno Jacquet 2013-03-05 14:41:47 UTC
Final 3.8 is working.