Bug 93711

Summary: BUG in i915_gem.c froze the console
Product: Drivers
Reporter: Martin Ziegler (ziegler)
Component: Video(DRI - Intel)
Assignee: intel-gfx-bugs (intel-gfx-bugs)
Status: RESOLVED CODE_FIX
Severity: high
CC: igor.raits, intel-gfx-bugs, nickross, xry111
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 4.0.0-rc1
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: from the syslog
from the syslog (2)
from the syslog
Linus' patch
my patch
my patch (2)
damien's patch
Daniel's patch
Josh's patch

Description Martin Ziegler 2015-02-23 20:03:12 UTC
Created attachment 168081 [details]
from the syslog

When I switched from X to the console, the
console became unresponsive. I had to use SysRq to reboot.
I have attached the relevant part of the syslog.

Regards

Martin
Comment 1 Martin Ziegler 2015-02-28 23:50:47 UTC
Created attachment 168501 [details]
from the syslog (2)
Comment 2 Martin Ziegler 2015-02-28 23:52:02 UTC
The bug is still present after

commit ae1aa797e0ace9bbce055e31de1f641e422a082a
Merge: a015d33 21689a4
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Sat Feb 28 10:36:48 2015 -0800

Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux

I have attached the syslog (2).

regards
Martin
Comment 3 Ruoyao Xi 2015-03-02 02:02:05 UTC
Created attachment 168531 [details]
from the syslog
Comment 4 Ruoyao Xi 2015-03-02 02:04:23 UTC
I hit this bug too. I've attached a syslog that looks similar to Martin's.
Comment 5 Ruoyao Xi 2015-03-03 02:47:32 UTC
Linus has fixed the bug. I think the patch will be merged into the kernel tree soon.

In fact the bug is not in i915_gem.c but in intel_atomic_plane.c. See <https://people.freedesktop.org/patch/43712/>.
Comment 6 Ruoyao Xi 2015-03-03 02:52:08 UTC
Created attachment 168621 [details]
Linus' patch
Comment 7 Ruoyao Xi 2015-03-03 08:59:51 UTC
My bad. After actually testing, I found that none of the patches on freedesktop.org solve the problem.
Comment 8 Ruoyao Xi 2015-03-04 08:10:46 UTC
Still occurs in 4.0.0-rc2.
Comment 9 Martin Ziegler 2015-03-05 00:52:55 UTC
I can confirm that the bug is still present in 4.0.0-rc2.
Comment 10 Ruoyao Xi 2015-03-06 14:55:41 UTC
After some debugging I found the problem:

In drivers/gpu/drm/i915/intel_display.c:9733, intel_crtc_page_flip does the following:
1) Adds a work item to the DRM device's workqueue. This work will unpin the old framebuffer object (crtc->primary->fb).
2) Tries to lock the DRM device's mutex.
3) Assigns another framebuffer object (the argument fb) to crtc->primary->fb.
4) Pins the newly assigned crtc->primary->fb.
5) Unlocks the DRM device's mutex.

Unfortunately, if the mutex is already locked in step 2, the routine will sleep. The kernel may then run the work queued in step 1, which may drop the pin_count of the old framebuffer object to zero. However, since step 3 has not run yet, crtc->primary->fb still points to the old framebuffer. If we are unlucky, the code that switches from X to the console runs in this "intermediate" state and tries to unpin the old framebuffer again. Then a kernel BUG occurs.

I am trying to move step 2 before step 1 to solve the problem. However, I am afraid of deadlocks, so I will analyze the kernel code further.
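
The interleaving suspected above can be replayed in a small userspace model. This is only a sketch: toy_fb, toy_pin, toy_unpin and replay_suspected_race are hypothetical names, not kernel APIs, and the real driver code is considerably more involved. An unpin with a zero pin count is reported as an error here, standing in for the kernel BUG.

```c
#include <assert.h>
#include <stdio.h>

/* Toy stand-in for a framebuffer object; not the kernel's struct. */
struct toy_fb {
	const char *name;
	int pin_count;
};

void toy_pin(struct toy_fb *fb)
{
	fb->pin_count++;
}

/* Returns 0 on success, -1 on an unbalanced unpin (where the kernel would BUG). */
int toy_unpin(struct toy_fb *fb)
{
	if (fb->pin_count == 0) {
		fprintf(stderr, "unbalanced unpin of %s\n", fb->name);
		return -1;
	}
	fb->pin_count--;
	return 0;
}

/*
 * Replays the suspected ordering: the queued unpin work (step 1) runs while
 * the flip sleeps on the mutex (step 2), and the VT switch then unpins
 * whatever the primary plane still points to, before step 3 has run.
 */
int replay_suspected_race(void)
{
	struct toy_fb old_fb = { "old_fb", 0 };
	struct toy_fb *primary_fb = &old_fb;

	toy_pin(primary_fb);            /* old fb is pinned while displayed */

	toy_unpin(&old_fb);             /* step 1's work item runs early */
	/* step 3 has not happened: primary_fb still points to old_fb */
	return toy_unpin(primary_fb);   /* VT switch unpins it a second time */
}
```

In this model the second unpin is the one that fails, which matches the symptom described, but it says nothing about whether this ordering is what actually happens in the driver.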
Comment 11 Ruoyao Xi 2015-03-08 01:20:03 UTC
My previous comment seems to be incorrect, since I am not a professional display driver developer :(

But now I've found and fixed the bug:

In intel_crtc_page_flip, we changed crtc->primary->fb but forgot to change crtc->primary->state->fb.
The fix is simple: call drm_atomic_set_fb_for_plane to update crtc->primary->state->fb so that it stays the same as crtc->primary->fb.
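
As a rough illustration of why the two pointers have to move together, here is a userspace toy model. It is a sketch under assumptions: the toy_* types and helpers are hypothetical stand-ins, and toy_set_fb_for_plane only mimics, as far as I understand it, the reference handling of drm_atomic_set_fb_for_plane (drop the reference on the old state->fb, take one on the new fb). It is not the actual patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the DRM structures; not the kernel definitions. */
struct toy_fb {
	int refcount;                    /* counts only the reference held by the state */
};

struct toy_plane_state {
	struct toy_fb *fb;
};

struct toy_plane {
	struct toy_fb *fb;               /* legacy pointer */
	struct toy_plane_state *state;   /* atomic state */
};

/* Mimics the helper: swap state->fb and move the reference with it. */
void toy_set_fb_for_plane(struct toy_plane_state *state, struct toy_fb *fb)
{
	if (state->fb)
		state->fb->refcount--;
	if (fb)
		fb->refcount++;
	state->fb = fb;
}

/* Page flip as described for the buggy code: only plane->fb is updated. */
void toy_flip_buggy(struct toy_plane *plane, struct toy_fb *new_fb)
{
	plane->fb = new_fb;
}

/* Page flip with the fix: plane->state->fb follows plane->fb. */
void toy_flip_fixed(struct toy_plane *plane, struct toy_fb *new_fb)
{
	plane->fb = new_fb;
	toy_set_fb_for_plane(plane->state, new_fb);
}

/* Returns 1 when both pointers name the same framebuffer. */
int toy_plane_in_sync(const struct toy_plane *plane)
{
	return plane->fb == plane->state->fb;
}
```

After toy_flip_buggy the two pointers disagree, so any code that cleans up through state->fb would act on the old framebuffer. After toy_flip_fixed they agree again.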

To administrator: (1) Mark this as FIXED. (2) Fix it in kernel tree.
Comment 12 Ruoyao Xi 2015-03-08 01:23:24 UTC
Created attachment 169661 [details]
my patch
Comment 13 Ruoyao Xi 2015-03-08 01:26:03 UTC
Sorry, the previous patch includes some debug output code I added, so it will produce warnings when applied to the original kernel source. I'll upload a new patch.
Comment 14 Ruoyao Xi 2015-03-08 01:27:56 UTC
Created attachment 169671 [details]
my patch (2)
Comment 15 Martin Ziegler 2015-03-08 11:36:03 UTC
Thanks. I will test the patch.
Martin
Comment 16 Jani Nikula 2015-03-13 13:42:36 UTC
Fixed by

commit 2dccc9898d45cd552f372c3f0b4a7f42126312f1
Author: Xi Ruoyao <xry111@outlook.com>
Date:   Thu Mar 12 20:16:32 2015 +0800

    drm/i915: Ensure plane->state->fb stays in sync with plane->fb

in drm-intel-fixes. Will reach some v4.0-rcN.

Thanks for the report and the fix.
Comment 17 Nick Ross 2015-03-24 19:55:34 UTC
4.0.0-rc5 won't boot for me on an intel core i7 4770. I get eight penguins and no further. I have bisected the issue to (on Linus's tree):
commit 319c1d420a0b62d9dbb88104afebaabc968cdbfa
Author: Xi Ruoyao <xry111@outlook.com>
Date:   Thu Mar 12 20:16:32 2015 +0800

which seems to be the above patch. Any ideas? Anything I can try to help?
Comment 18 Ruoyao Xi 2015-03-25 03:30:12 UTC
(In reply to Nick Ross from comment #17)
> 4.0.0-rc5 won't boot for me on an intel core i7 4770. I get eight penguins
> and no further. I have bisected the issue to (on Linus's tree):
> commit 319c1d420a0b62d9dbb88104afebaabc968cdbfa
> Author: Xi Ruoyao <xry111@outlook.com>
> Date:   Thu Mar 12 20:16:32 2015 +0800
> 
> which seems to be the above patch. Any ideas? Anything I can try to help?

I received many emails about this today. Daniel recommended cherry-picking
commit f55548b5af87ebfc586ca75748947f1c1b1a4a52
Author: Damien Lespiau <damien.lespiau@intel.com>
Date:   Thu Feb 5 18:30:20 2015 +0000

    drm/i915: Don't try to reference the fb in get_initial_plane_config()

from linux-next.

I haven't built and tested rc5 yet, but with rc4+ my patch works well (on my machine).
I'll build an rc5 and test again immediately.
Comment 19 Ruoyao Xi 2015-03-25 04:14:21 UTC
> I haven't built and tested rc5 yet, but with rc4+ my patch works well (on my
> machine).
> I'll build an rc5 and test again immediately.

rc5 still works well on my machine, but I found some WARNINGs in the kernel
log which may be related to your problem.

I tried Damien's solution. It solved the WARNING.
Comment 20 Ruoyao Xi 2015-03-25 04:16:14 UTC
Created attachment 172221 [details]
damien's patch
Comment 21 Nick Ross 2015-03-25 06:02:05 UTC
Damien's patch fixes my boot problem. Thank you.
Comment 22 Nick Ross 2015-03-25 06:11:23 UTC
Damien's patch fixes my problem. Thank you.
Comment 23 Martin Ziegler 2015-03-26 07:00:13 UTC
I applied Damien's patch to 4.0-rc5.

Before the patch I had two warnings during boot:

   WARNING: CPU: 3 PID: 1 at include/linux/kref.h:47 drm_framebuffer_reference+0x56/0x5f()
   WARNING: CPU: 2 PID: 6 at drivers/gpu/drm/drm_atomic.c:482 drm_atomic_check_only+0x3a9/0x3cf()

After the patch, only the following warning remains:

   Mar 25 21:46:01 zertz kernel: [drm] GMBUS [i915 gmbus dpb] timed out, falling back to bit banging on pin 5
   ...
   fbcon: inteldrmfb (fb0) is primary device
   ------------[ cut here ]------------
   WARNING: CPU: 0 PID: 6 at drivers/gpu/drm/drm_atomic.c:482 drm_atomic_check_only+0x3a9/0x3cf()
   Modules linked in:
   CPU: 0 PID: 6 Comm: kworker/u8:0 Tainted: G     U          4.0.0-rc5-00003-gad5de1d #77
   Hardware name: LENOVO 4349WK7/4349WK7, BIOS 6MET81WW (1.41) 10/26/2010
   Workqueue: events_unbound async_run_entry_fn
   0000000000000000 0000000000000009 ffffffff813d2040 0000000000000000
   ffffffff81036594 ffff8801334c2710 ffffffff81241786 0000000000000088
   ffff8800b6c66180 ffff8800b6c93000 0000000000000002 ffff8800b6c5a180
   Call Trace:
   [<ffffffff813d2040>] ? dump_stack+0x40/0x50
   [<ffffffff81036594>] ? warn_slowpath_common+0x93/0xab
   [<ffffffff81241786>] ? drm_atomic_check_only+0x3a9/0x3cf
   [<ffffffff81241786>] ? drm_atomic_check_only+0x3a9/0x3cf
   [<ffffffff812417ba>] ? drm_atomic_commit+0xe/0x4d
   [<ffffffff812264da>] ? drm_atomic_helper_plane_set_property+0x68/0xa3
   [<ffffffff81240c43>] ? modeset_lock+0x8f/0xf2
   [<ffffffff812345ee>] ? drm_mode_plane_set_obj_prop+0x28/0x49
   [<ffffffff81227a8c>] ? restore_fbdev_mode+0x5f/0xc7
   [<ffffffff8122934c>] ? drm_fb_helper_restore_fbdev_mode_unlocked+0x1e/0x54
   [<ffffffff812293b0>] ? drm_fb_helper_set_par+0x2e/0x32
   [<ffffffff8129fc1b>] ? intel_fbdev_set_par+0x11/0x55
   [<ffffffff811bc715>] ? fbcon_init+0x327/0x442
   [<ffffffff81209ec1>] ? visual_init+0xaf/0x102
   [<ffffffff8120b422>] ? do_bind_con_driver+0x18e/0x295
   [<ffffffff8120b9db>] ? do_take_over_console+0x150/0x179
   [<ffffffff811b90df>] ? do_fbcon_takeover+0x58/0x94
   [<ffffffff8104a8de>] ? notifier_call_chain+0x35/0x59
   [<ffffffff8104aaff>] ? __blocking_notifier_call_chain+0x42/0x5d
   [<ffffffff811c0017>] ? register_framebuffer+0x281/0x2b4
   [<ffffffff81229632>] ? drm_fb_helper_initial_config+0x27e/0x330
   [<ffffffff81001684>] ? __switch_to+0x1ff/0x45d
   [<ffffffff8104bb5a>] ? async_run_entry_fn+0x2d/0xbf
   [<ffffffff810467a5>] ? process_one_work+0x142/0x214
   [<ffffffff81046ce3>] ? worker_thread+0x1c3/0x26d
   [<ffffffff81046b20>] ? rescuer_thread+0x284/0x284
   [<ffffffff8104a0c8>] ? kthread+0xab/0xb3
   [<ffffffff81040000>] ? get_signal+0x2e0/0x4ce
   [<ffffffff8104a01d>] ? __kthread_parkme+0x5d/0x5d
   [<ffffffff813d60c8>] ? ret_from_fork+0x58/0x90
   [<ffffffff8104a01d>] ? __kthread_parkme+0x5d/0x5d
   ---[ end trace 8d8be2074054d8dc ]---

Martin
Comment 24 Ruoyao Xi 2015-03-26 08:17:17 UTC
The fix for bug 93711 in rc5 was cherry-picked from linux-next, but it caused a lot of trouble.
We have now found other fixes from linux-next that are needed to keep everything working correctly (>_<).

The solution from the intel-gfx mailing list is:
(1) Cherry-pick Daniel Vetter's commit
    8218c3f4df3bb1c637c17552405039a6dd3c1ee1
    drm: Fixup racy refcounting in plane_force_disable
from linux-next.

(2) Cherry-pick Damien Lespiau's commit
    f55548b5af87ebfc586ca75748947f1c1b1a4a52
    drm/i915: Don't try to reference the fb in get_initial_plane_config()
from linux-next.

(3) Apply Josh Boyer's patch from the intel-gfx mailing list.

I'll upload the patches from Daniel and Josh.

Maintaining something between the mainline tree and linux-next always causes trouble... I hope
this combined solution puts an end to the whole mess!
Comment 25 Ruoyao Xi 2015-03-26 08:18:01 UTC
Created attachment 172411 [details]
Daniel's patch
Comment 26 Ruoyao Xi 2015-03-26 08:18:21 UTC
Created attachment 172421 [details]
Josh's patch
Comment 27 Martin Ziegler 2015-03-26 12:30:12 UTC
No more warnings during boot after the three patches.

Thanks
Martin
Comment 28 Jani Nikula 2015-03-26 13:42:40 UTC
Also fixed in drm-intel-fixes, hopefully on its way to v4.0-rc6.
Comment 29 Jani Nikula 2015-06-16 10:18:28 UTC
*** Bug 93991 has been marked as a duplicate of this bug. ***