Bug 75701

Summary: Radeon: GPU recovery is unable to recover from GPU lockups (HD5770 - OpenCL example).
Product: Drivers Reporter: t3st3r
Component: Video(DRI - non Intel) Assignee: drivers_video-dri
Status: NEW ---    
Severity: high CC: mirh, szg00000
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 3.15-rc4 Subsystem:
Regression: No Bisected commit-id:
Attachments: Unsuccessful GPU recovery attempt - kernel log

Description t3st3r 2014-05-08 01:46:51 UTC
Created attachment 135351
Unsuccessful GPU recovery attempt - kernel log

There are cases where Radeon GPUs lock up because of MESA errors and the like. While those are MESA bugs, I believe there is a kernel-side bug as well.

The kernel-side problem is how the kernel handles the GPU recovery procedure. Right now GPU recovery fails most of the time on virtually any MESA bug and any GPU, and the system is left in a completely unusable state due to the lack of graphics output.


A couple of recent examples will be filed, covering 2 GPU families.
*This* bug is for a GPU deadlock on HD5770 (Evergreen - JUNIPER) triggered by buggy MESA OpenCL operations.

To reproduce:
1) Install Ubuntu 14.04.
2) Add the "oibaf PPA" to get recent MESA-based drivers.
3) Update the GPU drivers from the Oibaf PPA.
4) Install the mesa-opencl-icd library for OpenCL (ICD-based) support.
5) Boot with a 3.15-rc4 kernel (self-compiled or taken from the kernel PPA; this does not affect the bug).
6) Get the "Clpeak" tool, an OpenCL VRAM benchmark (https://github.com/krrishnarraj/clpeak.git), and build it.
7) Try to run it.
8) The program runs some benchmarks, then the GPU locks up.
9) The kernel then attempts recovery, which fails every time.
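
To check steps 8 and 9 against a saved kernel log, something like the rough Python sketch below could be used. It is only an illustration: the function name summarize_recovery is hypothetical, and the matched substrings ("GPU lockup", "GPU reset failed", "ring N test failed", "startup failed on resume") are assumptions about what the radeon driver prints, since the attached log is not quoted in this report. Adjust the patterns to whatever the actual log contains.

```python
import re

# ASSUMPTION: these substrings are guesses at typical radeon driver messages;
# the attached kernel log is not quoted here, so adapt them to the real log.
LOCKUP_RE = re.compile(r"GPU lockup", re.IGNORECASE)
FAILURE_RE = re.compile(
    r"GPU reset failed|ring \d+ test failed|startup failed on resume",
    re.IGNORECASE,
)


def summarize_recovery(log_text):
    """Count lockup reports and recovery-failure lines in kernel log text.

    Only lines mentioning "radeon" are considered. The result is a heuristic
    summary, not a definitive verdict on what the driver actually did.
    """
    lockups = 0
    failure_lines = 0
    for line in log_text.splitlines():
        if "radeon" not in line.lower():
            continue  # skip messages from unrelated subsystems
        if LOCKUP_RE.search(line):
            lockups += 1
        elif FAILURE_RE.search(line):
            failure_lines += 1
    return {
        "lockups": lockups,
        "failure_lines": failure_lines,
        "recovery_failed": failure_lines > 0,
    }
```

Pass it the text of a log captured over SSH or from the previous boot (the local screen has no output by then), for example summarize_recovery(open("kernel.log").read()), where kernel.log is a placeholder file name.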

Result:
 GPU locks up. Recovery fails. The system is left in an unusable state due to the lack of graphics output.

Expected:
 More or less sane GPU recovery. Some data could be lost, the picture can be distorted, some OpenCL/OpenGL calls can return errors, and some programs can crash. But leaving the GPU in a faulty state and trying to restore that very same faulty state (obviously without success) isn't an option. What happens now is the absolute worst kind of GPU recovery, as it leaves the system unusable, with a GPU that can't be brought back without a reboot (there is no screen output at this point).