Bug 18242

Summary: kernel should forbid app from using GPU if it causes lockups
Product: Drivers
Reporter: Török Edwin (edwin+bugs)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: CLOSED INVALID
Severity: enhancement
CC: alan
Priority: P1
Hardware: All
OS: Linux
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:

Description Török Edwin 2010-09-11 07:28:52 UTC
During testing of r600g (still in development), some Mesa demos and games lock up the GPU.
Most of the time the kernel recovers from this nicely, but the app then locks up the GPU over and over again.

The kernel doesn't even know the correct process. This time, for example, the lockup was caused by the gloss demo (using direct rendering), yet the log names Xorg:
[ 3070.276433] GPU lockup (waiting for 0x0003708B last fence id 0x00037088)
...
[ 3070.276467] Pid: 4274, comm: Xorg Not tainted 2.6.36-rc3-phenom #96

It could have been an app using LIBGL_ALWAYS_INDIRECT=1, though, so just killing the process blamed for the lockup is not a good idea (it could easily lead to X getting killed).

I think the kernel should forbid a process from sending any more GPU commands (perhaps by returning failure for every ioctl it makes?) once it determines that the process locked up the GPU. In this case it would first forbid Xorg, then see another lockup, then forbid gloss.
It should probably print a message like:
Process '<processname>' (pid <pid>) caused a GPU lockup, forbidding GPU commands for 'N minutes'. To reenable do 'echo <pid> >/sys/kernel/..../gpu_reenable'.
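
The proposal above could be modelled roughly as follows. This is only a user-space Python sketch of the policy, not kernel code: the class `GpuBanList`, its method names, and the pid used for gloss are hypothetical, and `reenable()` merely stands in for the suggested sysfs write.

```python
import time


class GpuBanList:
    """Toy model of the proposed policy (hypothetical names, not a kernel API)."""

    def __init__(self, ban_seconds=300.0, clock=time.monotonic):
        self.ban_seconds = ban_seconds   # the 'N minutes' from the message, in seconds
        self.clock = clock               # injectable so the policy can be tested
        self._banned_until = {}          # pid -> time at which its ban expires

    def report_lockup(self, pid):
        """A lockup was attributed to pid: start (or restart) its ban."""
        self._banned_until[pid] = self.clock() + self.ban_seconds

    def reenable(self, pid):
        """Stands in for 'echo <pid> > .../gpu_reenable': lift the ban early."""
        self._banned_until.pop(pid, None)

    def may_submit(self, pid):
        """False models 'return failure for every ioctl' while pid is banned."""
        expiry = self._banned_until.get(pid)
        if expiry is None:
            return True
        if self.clock() >= expiry:
            # The ban has run out, so forget it and allow submissions again.
            del self._banned_until[pid]
            return True
        return False


# Example run with a fake clock: Xorg (pid 4274 in the log) is blamed first,
# then gloss (4300 is a made-up pid) after the next lockup.
now = [0.0]
bans = GpuBanList(ban_seconds=60.0, clock=lambda: now[0])
bans.report_lockup(4274)
bans.report_lockup(4300)
bans.reenable(4274)      # the admin re-enables Xorg by hand
```

Keying the ban on the pid mirrors the suggested message. It also shows the weakness described above: with indirect rendering the process that gets banned first may be X rather than the app that actually caused the lockup.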



Of course it would be best if the kernel refused to accept GPU commands that lead to a lockup in the first place, but it might not be possible to determine in general whether certain GPU instructions will cause a lockup.
Comment 1 Alan 2012-05-12 15:48:53 UTC
Closing: this doesn't tell anyone anything they don't already know.