Bug 69901

Summary: intel ivy bridge/radeonsi PRIME hang since 3.14
Product: Drivers
Reporter: Christoph Haag (haagch.christoph)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED CODE_FIX
Severity: normal
CC: chris, intel-gfx-bugs, thellstrom
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.14-rc1
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
  sysprof output: X hanging after rendering with PRIME
  sysprof output: kwin hanging after rendering with PRIME
  Patch that may fix the problem

Description Christoph Haag 2014-02-03 13:50:26 UTC
I have these two GPUs in my laptop:

00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Wimbledon XT [Radeon HD 7970M]

With xrandr --setprovideroffloadsink radeon Intel and then DRI_PRIME=1 glxgears it works fine on 3.13.

On 3.14-rc1 it works for a short time, but then hangs occur.
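
The reproduction steps above can be sketched as a small script. This is only an illustration, not part of the original report: the helper names run and reproduce and the DRY_RUN variable are hypothetical, and the provider names radeon and Intel are the ones given in the report (on another machine they would come from xrandr --listproviders). Because the real commands may hang X on an affected kernel, the sketch only prints them unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Reproduction sketch (hypothetical helper names; not from the original report).
# DRY_RUN defaults to 1, so the commands are printed rather than executed.
# Set DRY_RUN=0 to really run them; on an affected kernel this may hang X.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

reproduce() {
    # Make the radeon provider render for the Intel provider's outputs.
    run xrandr --setprovideroffloadsink radeon Intel
    # Start a GL client on the offload GPU.
    run env DRI_PRIME=1 glxgears
}

reproduce
```

Defaulting to a dry run is a deliberate choice here, since the report describes hangs that can block the whole graphical session.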

With kwin, it is kwin that hangs; with compton, it is X itself.
Hanging means the process is unkillable and uses 100% CPU (all red in htop's detailed view), and the graphical output in X is completely blocked. Whether switching to a tty still works seems to be a matter of luck.

I filed this with Intel because sysprof showed the CPU usage originating from libdrm_intel.so when kwin hung. In the other sysprof log I did not see anything from Intel, so maybe it's not actually Intel's problem.
Comment 1 Christoph Haag 2014-02-03 13:51:39 UTC
Created attachment 124311 [details]
sysprof output: X hanging after rendering with PRIME
Comment 2 Christoph Haag 2014-02-03 13:52:23 UTC
Created attachment 124321 [details]
sysprof output: kwin hanging after rendering with PRIME
Comment 3 Chris Wilson 2014-02-03 21:21:44 UTC
It looks to be memory corruption striking the shmemfs used to back swappable GEM objects (in both drivers). In both profiles, it is a deferred file cleanup hitting an infinite loop (my guess is that the cleanup itself is started by an OOPS and SIGKILL). So it looks like the stuck CPU is another symptom. Please enable all the mm/vm and lockdep kernel debugging options and see if that generates any clues.
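
As a rough illustration of that request, a kernel .config fragment along the following lines would turn on common memory-management and lock-debugging checks. Which options were actually meant is an assumption; this is a sketch of a plausible selection, not a list from the thread.

```
# Memory-management sanity checks (assumed selection, not from the thread)
CONFIG_DEBUG_VM=y
CONFIG_DEBUG_LIST=y
CONFIG_DEBUG_PAGEALLOC=y
CONFIG_SLUB_DEBUG=y

# Lock debugging; PROVE_LOCKING is the lockdep validator
CONFIG_PROVE_LOCKING=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_DEBUG_ATOMIC_SLEEP=y
```

These options slow the kernel down noticeably, so they are normally enabled only while chasing a bug like this one.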
Comment 4 Christoph Haag 2014-02-04 16:22:56 UTC
I first did a bisect and I think (!) this is the result:

58aa6622d32af7d2c08d45085f44c54554a16ed7 is the first bad commit
Comment 5 Thomas Hellstrom 2014-02-04 17:25:26 UTC
This is probably TTM clearing page::mapping and page::index members of the Intel pages. I don't have time to put together a patch tonight, but probably tomorrow.

/Thomas
Comment 6 Thomas Hellstrom 2014-02-05 08:24:00 UTC
Created attachment 124621 [details]
Patch that may fix the problem

Could you try the attached patch out to see if it fixes the problem?
Comment 7 Christoph Haag 2014-02-05 11:32:17 UTC
Yes, it fixes it; no lockups anymore.
Comment 8 Thomas Hellstrom 2014-02-05 12:00:30 UTC
Great. I'll include the patch in my next pull request.
Comment 9 Christoph Haag 2014-02-10 11:39:49 UTC
Thanks, fixed in rc2.