Bug 71701 - [ivb dp aux dp-to-vga dongle regression] i915 DisplayPort link training via DP-to-VGA dongle, blank screen
Status: CLOSED DOCUMENTED
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - Intel)
Hardware: All Linux
Importance: P3 normal
Assignee: intel-gfx-bugs@lists.freedesktop.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-03-07 22:31 UTC by Kris Karas
Modified: 2023-05-26 14:23 UTC

See Also:
Kernel Version: 3.13.6
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
dmesg output booting 3.13.5, drm.debug=0xe (57.93 KB, text/plain)
2014-03-10 17:32 UTC, Kris Karas
dmesg output booting 3.13.6, drm.debug=0xe (65.63 KB, text/plain)
2014-03-10 17:39 UTC, Kris Karas
Patch against 3.13.x to extend timeout to 100ms (2.67 KB, patch)
2014-03-28 20:08 UTC, Kris Karas
Updated patch against 3.15.x (2.77 KB, patch)
2014-06-14 00:31 UTC, Kris Karas

Description Kris Karas 2014-03-07 22:31:51 UTC
A regression introduced in mainline kernel 3.13.6 (not present in simultaneously-released 3.10.33) has caused the i915 mode-setting driver to fail to properly initialize, disabling one of two monitors in a dual-head setup.

After X.org 1.14.3 takes over the display, the monitor remains disabled, even though the X.org log says it found the monitor and EDID and ostensibly set it up for use.  Using KDE's monitor-configuration tool, manually disabling the monitor and re-enabling it will bring it back to life - temporarily, until the next boot of course.

Chipset: Intel Core i7-3770 internal graphics controller.
         ("Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor Graphics Controller (rev 09)")
Hardware: Dell OptiPlex 9010 SFF desktop, one VGA connector, two DisplayPort
Monitor 1 (left) : Port VGA1, unaffected by bug
Monitor 2 (right): Port DP2, affected by bug

As an aside, the machine boots to BIOS (or LILO prompt) with port DP2 (right monitor) enabled, and VGA1 disabled.  After passing control to kernel 3.13.6, left monitor powers up and right monitor enters sleep.

Kernel boot (dmesg) log says quite a bit about the issue:

Linux agpgart interface v0.103
[drm] Initialized drm 1.1.0 20060810
[drm] Memory usable by graphics device = 2048M
i915 0000:00:02.0: irq 40 for MSI/MSI-X
[drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[drm] Driver supports precise vblank timestamp query.
vgaarb: device changed decodes: PCI:0000:00:02.0,olddecodes=io+mem,decodes=io+me
fbcon: inteldrmfb (fb0) is primary device
[drm:intel_dp_aux_native_write] *ERROR* too many retries, giving up
[drm:intel_dp_start_link_train] *ERROR* failed to enable link training
[drm:intel_dp_aux_native_write] *ERROR* too many retries, giving up
[drm:intel_dp_complete_link_train] *ERROR* failed to start channel equalization
[drm:cpt_verify_modeset] *ERROR* mode set failed: pipe B stuck
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1 at drivers/gpu/drm/i915/intel_display.c:9226 intel_modese
encoder's hw state doesn't match sw tracking (expected 1, found 0)
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.6 #1
Hardware name: Dell Inc. OptiPlex 9010/051FJ8, BIOS A08 09/19/2012
 0000000000000009 ffffffff816bfa8b ffff88011929f7a8 ffffffff810883ed
 ffff88011929f810 ffff88011929f7f8 0000000000000001 ffff880118830b10
 ffff880118c71000 ffffffff81088457 ffffffff818766b8 ffff880100000028
Call Trace:
 [<ffffffff816bfa8b>] ? dump_stack+0x41/0x51
 [<ffffffff810883ed>] ? warn_slowpath_common+0x6d/0x90
 [<ffffffff81088457>] ? warn_slowpath_fmt+0x47/0x50
 [<ffffffff813931fd>] ? intel_modeset_check_state+0x5fd/0x750
 [<ffffffff813933dd>] ? intel_set_mode+0x1d/0x30
 [<ffffffff81393c4d>] ? intel_crtc_set_config+0x77d/0x970
 [<ffffffff8134bef5>] ? drm_mode_set_config_internal+0x55/0xd0
 [<ffffffff8133d2e6>] ? drm_fb_helper_set_par+0x66/0xe0
 [<ffffffff812d1f6c>] ? fbcon_init+0x50c/0x5a0
 [<ffffffff8131eb6e>] ? visual_init+0xae/0x110
 [<ffffffff81321125>] ? do_bind_con_driver+0x155/0x330
 [<ffffffff813218a3>] ? do_take_over_console+0xf3/0x1a0
 [<ffffffff812cdbe3>] ? do_fbcon_takeover+0x53/0xb0
 [<ffffffff810a74a4>] ? notifier_call_chain+0x44/0x70
 [<ffffffff810a7742>] ? __blocking_notifier_call_chain+0x42/0x60
 [<ffffffff812c5d95>] ? register_framebuffer+0x1d5/0x300
 [<ffffffff8133cfbd>] ? drm_fb_helper_initial_config+0x35d/0x530
 [<ffffffff8139635d>] ? intel_modeset_setup_hw_state+0x89d/0xb30
 [<ffffffff813c4c0d>] ? gen6_write32+0x1d/0x80
 [<ffffffff8135db68>] ? i915_driver_load+0xe38/0xe90
 [<ffffffff81346461>] ? drm_get_minor+0x181/0x1e0
 [<ffffffff81346556>] ? drm_dev_register+0x96/0x1d0
 [<ffffffff813480b4>] ? drm_get_pci_dev+0x84/0x130
 [<ffffffff812b4309>] ? local_pci_probe+0x19/0x60
 [<ffffffff812b507a>] ? pci_device_probe+0xca/0x120
 [<ffffffff813d01c1>] ? really_probe+0x51/0x200
 [<ffffffff813d0441>] ? __driver_attach+0x81/0x90
 [<ffffffff813d03c0>] ? __device_attach+0x50/0x50
 [<ffffffff813ce5e3>] ? bus_for_each_dev+0x53/0x90
 [<ffffffff813cf9d8>] ? bus_add_driver+0x168/0x220
 [<ffffffff819d350c>] ? drm_core_init+0x104/0x104
 [<ffffffff813d0a16>] ? driver_register+0x56/0xd0
 [<ffffffff819d350c>] ? drm_core_init+0x104/0x104
 [<ffffffff8100030a>] ? do_one_initcall+0xea/0x130
 [<ffffffff810a1da2>] ? parse_args+0x1f2/0x320
 [<ffffffff819b1e07>] ? kernel_init_freeable+0x106/0x188
 [<ffffffff819b17c5>] ? do_early_param+0x81/0x81
 [<ffffffff816b8120>] ? rest_init+0x70/0x70
 [<ffffffff816b8125>] ? kernel_init+0x5/0x120
 [<ffffffff816c767c>] ? ret_from_fork+0x7c/0xb0
 [<ffffffff816b8120>] ? rest_init+0x70/0x70
---[ end trace 5e14cc86efb031cf ]---
[drm:intel_pipe_config_compare] *ERROR* mismatch in has_dp_encoder (expected 1, found 0)


The same BUG output as above repeats once more (elided here) before the kernel finishes its boot sequence.
Comment 1 Daniel Vetter 2014-03-08 15:17:35 UTC
Since this is a regression, what is the last working kernel? Can you attempt to do a git bisect?

Also please boot with drm.debug=0xe added to your kernel cmdline on both good and bad kernels and attach dmesg.
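
For readers unfamiliar with the drm.debug bitmask, the following is a minimal userspace sketch of what 0xe selects. It assumes the debug category bit values used by 3.13-era kernels (core, driver, KMS, PRIME); the helper drm_debug_enabled() is hypothetical and exists only for this illustration.

```c
/* Debug category bits, assumed to match 3.13-era DRM headers. */
#define DRM_UT_CORE   0x01
#define DRM_UT_DRIVER 0x02
#define DRM_UT_KMS    0x04
#define DRM_UT_PRIME  0x08

/* Hypothetical helper: nonzero if `category` is enabled in `mask`. */
static int drm_debug_enabled(unsigned int mask, unsigned int category)
{
	return (mask & category) != 0;
}
```

Under that assumption, drm.debug=0xe enables the driver, KMS and PRIME categories while leaving the noisier core category off.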
Comment 2 Kris Karas 2014-03-10 17:32:14 UTC
Created attachment 128851
dmesg output booting 3.13.5, drm.debug=0xe
Comment 3 Kris Karas 2014-03-10 17:39:00 UTC
Created attachment 128861
dmesg output booting 3.13.6, drm.debug=0xe

Kernel 3.13.5 is the most recent working kernel.
Kernel 3.13.6 introduces the regression.

I was hoping, given the small change-set between 3.13.5 and 3.13.6 versus i915 drm, that the patch submitter would have a good hunch as to which patch was the culprit.  If the attached dmesg output doesn't help the submitter, I can bisect, but will have to load all of the kernel git onto my home machine and transfer to work.
Comment 4 Kris Karas 2014-03-10 19:54:16 UTC
OK, bisecting isn't really necessary, as there are only four commits in patch-3.13.6 related to i915.  Easy enough to try 'em all!  :-)

Got it!  The problem is split across two commits:
	05524f5b61c3c15859477a2165cff81e3d682743, and
	12b96a3843431abcb29fe8f2cd04f17e32366de0
Backing out the first is sufficient to restore my dual-head display to proper function.  I have no idea why adjusting cache-line alignment (presumably we mean "aligning" the cache-line?) would break things.  I guess the real issue is somewhat deeper in the logic.

I'm typing this on 3.13.6 with the first of those commits backed out, both displays working nicely.

As the commit is fairly small, I'll post it here in its entirety:

===============================================================================

From 05524f5b61c3c15859477a2165cff81e3d682743 Mon Sep 17 00:00:00 2001
From: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date: Tue, 11 Feb 2014 17:52:06 +0000
Subject: drm/i915: Prevent MI_DISPLAY_FLIP straddling two cachelines on IVB

commit f66fab8e1cd6b3127ba4c5c0d11539fbe1de1e36 upstream.

According to BSpec the entire MI_DISPLAY_FLIP packet must be contained
in a single cacheline. Make sure that happens.

v2: Use intel_ring_begin_cacheline_safe()
v3: Use intel_ring_cacheline_align() (Chris)

Cc: Bjoern C <lkml@call-home.ch>
Cc: Alexandru DAMIAN <alexandru.damian@intel.com>
Cc: Enrico Tagliavini <enrico.tagliavini@gmail.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=74053
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 2bde35d..3c5ff7a 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -8335,6 +8335,20 @@ static int intel_gen7_queue_flip(struct drm_device *dev,
        if (ring->id == RCS)
                len += 6;                                                       
                                                                                
+       /*                                                                      
+        * BSpec MI_DISPLAY_FLIP for IVB:                                       
+        * "The full packet must be contained within the same cache line."      
+        *                                                                      
+        * Currently the LRI+SRM+MI_DISPLAY_FLIP all fit within the same        
+        * cacheline, if we ever start emitting more commands before            
+        * the MI_DISPLAY_FLIP we may need to first emit everything else,       
+        * then do the cacheline alignment, and finally emit the                
+        * MI_DISPLAY_FLIP.                                                     
+        */                                                                     
+       ret = intel_ring_cacheline_align(ring);                                 
+       if (ret)                                                                
+               goto err_unpin;                                                 
+                                                                               
        ret = intel_ring_begin(ring, len);                                      
        if (ret)                                                                
                goto err_unpin;                                                 
--                                                                              
Comment 5 Chris Wilson 2014-03-10 21:23:16 UTC
That's an unexpected result. There is one issue in the patch, but it should be harmless:

diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 10fd19f..bc7abf4 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -1638,14 +1638,16 @@ int intel_ring_begin(struct intel_ring_buffer *ring,
 }
 
 /* Align the ring tail to a cacheline boundary */
+#define CACHELINE_ALIGN 64
 int intel_ring_cacheline_align(struct intel_ring_buffer *ring)
 {
-       int num_dwords = (64 - (ring->tail & 63)) / sizeof(uint32_t);
+       int num_dwords = (ring->tail & (CACHELINE_ALIGN - 1)) / sizeof(uint32_t);
        int ret;
 
        if (num_dwords == 0)
                return 0;
 
+       num_dwords = CACHELINE_ALIGN / sizeof(uint32_t) - num_dwords;
        ret = intel_ring_begin(ring, num_dwords);
        if (ret)
                return ret;
Comment 6 Kris Karas 2014-03-11 22:46:05 UTC
Please nix comment #4.
It turns out that when booting 3.13.6, the BUG/WARNING only shows up 86% of the time.  And when the warning is issued, the screen may still work one out of every few times.  As luck would have it, when I booted after backing out the commit in comment #4, the bug/warning was not issued and the screen worked; luckier still, rebooting once more to confirm resulted in the screen working yet again (though the bug/warning actually was issued, I just didn't see it).

Commit 05524f5b61c3c15859477a2165cff81e3d682743 is not at fault.
Chris Wilson's test patch had no effect; failure rate with his patch was 43 out of 50 boots, coincidentally identical to vanilla 3.13.6 (86 failures out of 100 boots).

The REAL problem commit is 87c3c227f580e2965feceae5f8973e9a38644963
                          "drm/i915/dp: add native aux defer retry limit"
Backing that out, I get 50 good boots out of 50 tries, or 100%.

Narrowing it down, I alternately applied the first and second halves of the patch (each turning for(;;){} into for(retry=0;retry<7;++retry){}), and the culprit is the first half, in the function intel_dp_aux_native_write().
Patching the second function, intel_dp_aux_native_read(), does not trigger the bug.

The bug is that we've gone from an infinite loop waiting for AUX_NATIVE_REPLY_ACK to be set in intel_dp, to a paltry 3500 microseconds.  I don't know what the hardware does with this bit, so I'm guessing into the wind here. If the "ack" awaits a status change in the displayport hardware, and the hardware is connected to a possibly-sleeping monitor, does the ACK wait for the monitor to wake up or communicate in some way?  If so, the timeout is orders of magnitude short.
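
To put numbers on that timeout, here is a minimal userspace sketch (not driver code) of the sleep budget of a bounded retry loop. It assumes the loop gives up after 7 tries and that every DEFER reply costs one usleep_range(400, 500), as described in this comment and visible in the diff context of comment #8. The struct and function names are hypothetical.

```c
/* Hypothetical helper, not part of i915: lower and upper bounds on the
 * total time (in microseconds) spent sleeping by a loop that makes
 * `retries` attempts and sleeps usleep_range(min_us, max_us) after
 * every DEFER reply. */
struct sleep_bounds {
	unsigned long min_us;
	unsigned long max_us;
};

static struct sleep_bounds fixed_retry_bounds(unsigned int retries,
					      unsigned long min_us,
					      unsigned long max_us)
{
	struct sleep_bounds b;

	b.min_us = retries * min_us;
	b.max_us = retries * max_us;
	return b;
}
```

With fixed_retry_bounds(7, 400, 500) the budget comes out between 2800 and 3500 microseconds, so the 3500 figure above is the upper bound of the range. The time taken by the AUX transactions themselves is not counted.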
Comment 7 Kris Karas 2014-03-11 23:03:49 UTC
[fiddle, fiddle]  Emacs [reboot]

*RETRYCOUNT-READ* = 0
*RETRYCOUNT-READ* = 0
*RETRYCOUNT-READ* = 0
*RETRYCOUNT-READ* = 0
*RETRYCOUNT-READ* = 0
*RETRYCOUNT-WRITE* = 0
*RETRYCOUNT-WRITE* = 0
*RETRYCOUNT-WRITE* = 0
*RETRYCOUNT-WRITE* = 0
*RETRYCOUNT-WRITE* = 51             <= Gotcha!
*RETRYCOUNT-READ* = 0
*RETRYCOUNT-WRITE* = 1
*RETRYCOUNT-READ* = 0
*RETRYCOUNT-WRITE* = 2
*RETRYCOUNT-READ* = 0
*RETRYCOUNT-WRITE* = 2
*RETRYCOUNT-READ* = 0
*RETRYCOUNT-WRITE* = 2
*RETRYCOUNT-READ* = 0
*RETRYCOUNT-WRITE* = 2
*RETRYCOUNT-READ* = 0
*RETRYCOUNT-WRITE* = 2
*RETRYCOUNT-READ* = 0
*RETRYCOUNT-WRITE* = 0
Comment 8 Jani Nikula 2014-03-12 11:36:34 UTC
Please try this:

diff --git a/drivers/gpu/drm/i915/intel_dp.c b/drivers/gpu/drm/i915/intel_dp.c
index bb66f9301cd9..faf377443ef5 100644
--- a/drivers/gpu/drm/i915/intel_dp.c
+++ b/drivers/gpu/drm/i915/intel_dp.c
@@ -597,7 +597,7 @@ intel_dp_aux_native_write(struct intel_dp *intel_dp,
 		if ((ack & DP_AUX_NATIVE_REPLY_MASK) == DP_AUX_NATIVE_REPLY_ACK)
 			return send_bytes;
 		else if ((ack & DP_AUX_NATIVE_REPLY_MASK) == DP_AUX_NATIVE_REPLY_DEFER)
-			usleep_range(400, 500);
+			usleep_range((retry + 1) * 400, (retry + 1) * 500);
 		else
 			return -EIO;
 	}
@@ -652,7 +652,7 @@ intel_dp_aux_native_read(struct intel_dp *intel_dp,
 			return ret - 1;
 		}
 		else if ((ack & DP_AUX_NATIVE_REPLY_MASK) == DP_AUX_NATIVE_REPLY_DEFER)
-			usleep_range(400, 500);
+			usleep_range((retry + 1) * 400, (retry + 1) * 500);
 		else
 			return -EIO;
 	}
Comment 9 Kris Karas 2014-03-12 16:11:32 UTC
There is no point in trying the patch suggested in comment #8, as mathematically, the total delay imparted by 1*500 + 2*500 + 3*500 ... + 7*500 = 28*500 < 51*500 as per comment #7
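
The arithmetic can be checked with a small sketch. It assumes the upper bound of each usleep_range() call from the comment #8 patch, (retry + 1) * 500 microseconds, summed over 7 retries; the function name is hypothetical.

```c
/* Hypothetical helper: total sleep for a linear back-off,
 * step_us * (1 + 2 + ... + retries), in microseconds. */
static unsigned long linear_backoff_total_us(unsigned int retries,
					     unsigned long step_us)
{
	unsigned long total = 0;
	unsigned int i;

	for (i = 1; i <= retries; i++)
		total += i * step_us;
	return total;
}
```

linear_backoff_total_us(7, 500) gives 14000 microseconds, against 51 * 500 = 25500 microseconds for the 51 fixed-delay retries logged in comment #7.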
Comment 10 Jani Nikula 2014-03-12 16:57:08 UTC
(In reply to Kris Karas from comment #9)
> There is no point in trying the patch suggested in comment #8, as
> mathematically, the total delay imparted by 1*500 + 2*500 + 3*500 ... +
> 7*500 = 28*500 < 51*500 as per comment #7

While that is true, I'm interested in the impact of increased timeout *between* retries.
Comment 11 Kris Karas 2014-03-12 17:46:53 UTC
OK, fair enough, patchlet from comment #8 applied.
Failure rate dropped from 86% down to 55%.

Note that the "51 loops" I measured in comment #7 was just one sample point.  The value seems to vary from boot to boot, albeit not by much.

If the 25ms time required is an average, then the code should accommodate values at least a couple of standard deviations out from the curve of required time.  Perhaps adjust the code to continue to use usleep((retry + 1) * 500) but up the for() to be for(retry=0; retry < 13; ++retry) {...}
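
As a rough check on that retry count, the sketch below finds the smallest number of linearly growing sleeps that covers a given time budget. It assumes a 500 microsecond step and counts sleep time only; the function name is hypothetical.

```c
/* Hypothetical helper: smallest number of retries such that
 * step_us * (1 + 2 + ... + retries) >= target_us.
 * Returns 0 if step_us is 0, since no budget can be reached. */
static unsigned int min_retries_for_budget(unsigned long step_us,
					   unsigned long target_us)
{
	unsigned long total = 0;
	unsigned int retries = 0;

	if (step_us == 0)
		return 0;
	while (total < target_us) {
		retries++;
		total += retries * step_us;
	}
	return retries;
}
```

Under these assumptions, 10 retries are the minimum to reach 25 ms (27.5 ms of sleep), while 13 retries allow up to 45.5 ms, a little under twice the measured figure.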
Comment 12 Kris Karas 2014-03-12 18:45:36 UTC
More experiments:

Using Jani Nikula's logic from comment #8, but without imposing any upper limit on loop count, the loop exits with retry == 14 most of the time.  And again, that value 14 is probably specific to my monitor's DisplayPort interface circuitry.

Next, I tried a quadratic delay series rather than a linear one:
	for (retry = 1; retry < 50; ++retry) {
	  ...
	  usleep_range(retry*retry*400, retry*retry*500);
	}
This failed spectacularly, many BUG/WARNING in kernel log about failures in link training.  Puzzling.  That made me think about the delta in the usleep range.

My next experiment was to use the same retry*retry delay series, but limit the delta in the usleep range, like so:
	for (retry = 1; retry < 50; ++retry) {
	  ...
	  usleep_range(retry*retry*400, retry*retry*400 + 100);
	}
This succeeds repeatedly, though I don't know why relative to experiment #2.
Approximately 40% of the time, the loop exits with retry == 3, and the other 60% of the time, the loop exits with retry == 8.
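
To translate those exit counts into elapsed time, here is a minimal sketch. It assumes the ACK check happens at the top of each iteration, so a loop that exits with retry == N has slept in iterations 1 through N-1, and it uses only the lower bound of each usleep_range() call. The function name is hypothetical.

```c
/* Hypothetical helper: lower bound on the time slept (microseconds)
 * before the loop exits at iteration `exit_retry`, assuming iterations
 * 1 .. exit_retry-1 each slept retry * retry * base_us. */
static unsigned long quadratic_sleep_before_exit_us(unsigned int exit_retry,
						    unsigned long base_us)
{
	unsigned long total = 0;
	unsigned int r;

	for (r = 1; r < exit_retry; r++)
		total += (unsigned long)r * r * base_us;
	return total;
}
```

Under those assumptions, an exit at retry == 3 corresponds to about 2 ms of sleep and an exit at retry == 8 to about 56 ms, not counting the AUX transactions themselves.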
Comment 13 Jani Nikula 2014-03-13 12:46:38 UTC
Just to double check, it's a native display port monitor, without adapters in between?
Comment 14 Kris Karas 2014-03-13 15:29:03 UTC
There is an adapter.  Both monitors are VGA only, so the first plugs into port VGA1, and the second plugs into DP2 via an adapter dongle.  The name on the dongle is BizLink, but I don't have a part number handy as I'm out-of-office at the moment.
Comment 15 Kris Karas 2014-03-13 17:33:38 UTC
FWIW, the P/N on the DP->VGA dongle is "BizLink ORN699"
Comment 16 Kris Karas 2014-03-28 20:08:57 UTC
Created attachment 130941
Patch against 3.13.x to extend timeout to 100ms
Comment 17 Kris Karas 2014-03-28 20:28:22 UTC
As no activity has been seen on this bug in more than 2 weeks, I am taking the liberty of submitting a tested patch, attachment 130941 above, which fixes the regression.

Apparently, https://bugs.freedesktop.org/show_bug.cgi?id=71267 reported a case wherein DisplayPort handshaking never completes for some combination of Haswell chipset and a docking station.  The fix for that bug was to introduce a timeout into the aux_native_read and aux_native_write routines.  Unfortunately, the "fix" used 2,800 microseconds as a timeout, sufficient perhaps for the Haswell chipset itself but far too aggressive to deal with other devices that may be connected to the DisplayPort.  In particular, a Dell E176 VGA monitor connected via a BizLink ORN699 DP->VGA adapter dongle required an average of 25 milliseconds for the ACK to be received.

The patch submitted above extends the timeout delay to 100ms, hopefully sufficient to accommodate an esoteric assortment of downstream hardware.

Last but not least, as the cause for this bug is now well known, I am changing the status of this bug to VERIFIED.
Comment 18 Jani Nikula 2014-03-31 07:56:40 UTC
Kris, thanks for being active on this one.

We keep bugs open until the fixes have actually been merged to our code base. For submitting code to our driver, please mail the patches with proper commit message and signed-off-by to the intel-gfx mailing list. Thanks.
Comment 19 Kris Karas 2014-04-01 19:24:12 UTC
Hi Jani,

I confess I don't even know where the intel-gfx mailing list is.  :-)
I am not on it, and don't have a git tree as I am not a regular kernel devo.
Perhaps you could forward my patch to intel-gfx on my behalf?
The patch (attachment 130941) includes a description and a Signed-Off-By
and a Tested-By, so I think it's in the right format, albeit not with any GIT refs.  Thanks.

Kris
Comment 20 Jani Nikula 2014-04-04 05:49:21 UTC
Kris, please try this patch:

http://mid.gmane.org/1396575332-26898-1-git-send-email-airlied@gmail.com
Comment 21 Jani Nikula 2014-04-04 06:13:14 UTC
(In reply to Jani Nikula from comment #20)
> Kris, please try this patch:
> 
> http://mid.gmane.org/1396575332-26898-1-git-send-email-airlied@gmail.com

Nevermind, -ENOCOFFEE.
Comment 22 Daniel Vetter 2014-04-05 10:22:00 UTC
We've reworked the dp aux code completely and are now using the shared code in drm/drm_dp_helper.c.

Can you please retest with latest drm-intel-nightly branch from http://cgit.freedesktop.org/drm-intel

and if the bug persists update your patch?

For patch submission simply send the git patch to intel-gfx@lists.freedesktop.org with git send-email:

$ git send-email -1 <sha1-of-patch> --to intel-gfx@lists.freedesktop.org

List is moderated but I clean up the queue roughly every day, so no need to subscribe at all.
Comment 23 Jani Nikula 2014-05-07 12:13:23 UTC
(In reply to Daniel Vetter from comment #22)
> We've reworked the dp aux code completely and are now using the shared code
> in drm/drm_dp_helper.c.
> 
> Can you please retest with latest drm-intel-nightly branch from
> http://cgit.freedesktop.org/drm-intel

Please retest with current drm-intel-nightly.
Comment 24 Kris Karas 2014-05-13 02:15:40 UTC
Sorry for the delay in testing.  Pulling a whole GIT repository proved quaintly problematic here, and had to be deferred.

The new drm-intel-nightly works!  I'm seeing 100 good out of 100 boots.  New AUX code is dealing with dongle delays just fine.

However, this appears to be new code suitable for mainline kernels 3.15 or even 3.16.  We need a simple patch to address the regression in the old code from 3.13.6 onwards.  And from what I gather, this new code is not likely to be backported into those stable trees.  Hence, the patch which I proposed above, in comment #19 is, I think, the simplest fix for those old trees, and really should be included by Greg KH; failing that, the ideologically correct fix would be to revoke the regression (87c3c227f580e2965feceae5f8973e9a38644963) until such time as a non-regressing patch were applied.
Comment 25 Jani Nikula 2014-05-13 07:14:28 UTC
(In reply to Kris Karas from comment #24)
> The new drm-intel-nightly works!  I'm seeing 100 good out of 100 boots.  New
> AUX code is dealing with dongle delays just fine.

Great! ...although I am not sure I understand why. The code should be similar to what we had in the driver. :o

> However, this appears to be new code suitable for mainline kernels 3.15 or
> even 3.16.

Please try 3.15-rc5 for completeness. Now that you have the git tree, it should be simple.

> We need a simple patch to address the regression in the old code
> from 3.13.6 onwards.  And from what I gather, this new code is not likely to
> be backported into those stable trees.  Hence, the patch which I proposed
> above, in comment #19 is, I think, the simplest fix for those old trees, and
> really should be included by Greg KH; failing that, the ideologically
> correct fix would be to revoke the regression
> (87c3c227f580e2965feceae5f8973e9a38644963) until such time as a
> non-regressing patch were applied.

I'm afraid we can't backport your fix to stable, because it's not upstream, and really there's not even a similar fix upstream. We need to figure out what it is about the current upstream that fixes your issue, and see if we can backport a similar fix.
Comment 26 Daniel Vetter 2014-05-15 15:21:42 UTC
I think it would be good to attempt a reverse bisect between working and broken kernels to figure this out. Then we can decide what to do about this bug.

But first testing 3.15-rc seems like the right option.
Comment 27 Kris Karas 2014-05-23 00:29:20 UTC
I just finished bisecting drm-intel-nightly to see what it is that makes it "work" in my case.  Seems that waiting for vblank at the right moments is the solution.   I backported the commit into mainline 3.14.0 and it fixed the problem for me.  In any case, here it is:

From be6a6f8ec707f2e446e445ad4b8cc93cc85d5d54 Mon Sep 17 00:00:00 2001
From: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Tue, 15 Apr 2014 18:41:22 +0200
Subject: [PATCH] drm/i915: Don't vblank wait on ilk-ivb after pipe enable
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Like on hsw/bdw the pipe isn't actually running yet at this point.              
This holds for both pch ports and the cpu edp port according to my              
testing on ilk, snb and ivb.                                                    
                                                                                
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=77297                    
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>                      
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>                           
---                                                                             
 drivers/gpu/drm/i915/intel_display.c | 14 ++++----------                       
 1 file changed, 4 insertions(+), 10 deletions(-)
Comment 28 Kris Karas 2014-05-23 01:08:19 UTC
I just updated my drm-intel-nightly to May 22, which has kernel version 3.15.0-rc3, and it too works just fine in my case.

Per Daniel Vetter's suggestion, I also tested mainline 3.15.0-rc6; it fails.
Comment 29 Kris Karas 2014-06-02 20:37:45 UTC
Updated the "Subject:" to better reflect this bug's actual cause.

Made two patches based on comment #27, one for mainline 3.13.6+ and the other 3.14.0+.  Went out for lunch, came back to discover that the second panel (our affected one) would not unblank from screen-save.  Shut down X into console mode, and the second panel came back.  So, it seems as if the intel_wait_for_vblank() in i9xx_crtc_enable() and valleyview_crtc_enable() is only a partial fix.

Alas, I am back to running my patch from comment #19, which works nicely.  Any suggestions from the Intel folks on what else to try?
Comment 30 Kris Karas 2014-06-14 00:31:23 UTC
Created attachment 139681
Updated patch against 3.15.x

Although mainline 3.15 has a good bit of rework in the drm/i915 area, it still suffers from the regressed logic introduced in 3.13.6 - a loop that times out prior to completion of link training, evidently due to delays introduced by a DP=>VGA dongle.

This patch merely adds a more sensible timeout value of roughly 100 milliseconds to the loop, longer than the 2,800 microseconds introduced in 3.13.6, and shorter than the infinite timeout of 3.13.5.
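
The idea of a time budget rather than a fixed retry count can be sketched in userspace as follows. This is not the attached patch; it is an illustration with simulated time, in which the sink model, the reply enum and all function names are hypothetical.

```c
enum aux_reply { AUX_REPLY_ACK, AUX_REPLY_DEFER, AUX_REPLY_NACK };

/* Hypothetical stand-in for a sink behind a slow dongle: it answers
 * DEFER until ready_at_us of simulated time has passed, then ACK. */
struct fake_sink {
	unsigned long now_us;
	unsigned long ready_at_us;
};

static enum aux_reply fake_sink_xfer(const struct fake_sink *sink)
{
	return sink->now_us >= sink->ready_at_us ? AUX_REPLY_ACK
						 : AUX_REPLY_DEFER;
}

/* Retry on DEFER until ACK, or until budget_us of simulated time has
 * been used up. Returns 0 on ACK, -1 on timeout, NACK or a zero sleep.
 * Time advances only through the simulated sleep, so the outcome is
 * deterministic. */
static int aux_write_with_budget(struct fake_sink *sink,
				 unsigned long sleep_us,
				 unsigned long budget_us)
{
	unsigned long deadline = sink->now_us + budget_us;

	if (sleep_us == 0)
		return -1;

	for (;;) {
		switch (fake_sink_xfer(sink)) {
		case AUX_REPLY_ACK:
			return 0;
		case AUX_REPLY_DEFER:
			if (sink->now_us >= deadline)
				return -1;
			/* Stands in for usleep_range() in real code. */
			sink->now_us += sleep_us;
			break;
		default:
			return -1;
		}
	}
}
```

With a simulated sink that needs 25 ms before it ACKs, a 2800 microsecond budget times out while a 100000 microsecond budget succeeds, which mirrors the behaviour described in this comment.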
Comment 31 Kris Karas 2014-06-14 00:35:06 UTC
Changing status back to "verified" because we know what the cause is and how to mitigate it, even if the logic inside the dongle is somewhat obscure.
Comment 32 Jani Nikula 2014-06-16 08:08:04 UTC
(In reply to Kris Karas from comment #31)
> Changing status back to "verified" because we know what the cause is and how
> to mitigate it, even if the logic inside the dongle is somewhat obscure.

Let's keep this open for now, in case there are other dongles hitting the same issue.
Comment 33 Kris Karas 2014-09-09 19:20:29 UTC
Recently, Slackware upgraded its XFree video driver to xf86-video-intel-2.99.914, which it would seem also suffers from the same bug affecting the kernel console driver.  In particular, after the monitor is powered off due to keyboard inactivity, it will not unblank / power on.

Fortunately, my kernel is patched to fix this bug; a Ctrl-Alt-F1 to get to a console awakens the dongle-connected monitor, followed by a Ctrl-Alt-F7 to get back to X11.
Comment 34 Kris Karas 2014-09-25 00:22:46 UTC
As an addendum to comment #33, it appears that the Intel X11 driver is able to unblank the monitor roughly 50% of the time, which is statistically better than the kernel console driver.

Running kernel 3.16, my kernel syslog is peppered with output as shown below.

 [drm:ironlake_disable_pch_transcoder] *ERROR* failed to disable transcoder B
 [drm:cpt_set_fifo_underrun_reporting] *ERROR* uncleared pch fifo underrun on pch transcoder B
 [drm:cpt_serr_int_handler] *ERROR* PCH transcoder B FIFO underrun
 [drm:cpt_set_fifo_underrun_reporting] *ERROR* uncleared pch fifo underrun on pch transcoder B
 [drm:cpt_serr_int_handler] *ERROR* PCH transcoder B FIFO underrun
 [drm:intel_dp_start_link_train] *ERROR* failed to enable link training
 [drm:intel_dp_complete_link_train] *ERROR* failed to start channel equalization
 [drm:cpt_verify_modeset] *ERROR* mode set failed: pipe B stuck
 ------------[ cut here ]------------
 WARNING: CPU: 0 PID: 1303 at drivers/gpu/drm/i915/intel_display.c:10101 intel_modeset_check_state+0x595/0x740()
 encoder's hw state doesn't match sw tracking (expected 1, found 0)
 Modules linked in:
 CPU: 0 PID: 1303 Comm: X Tainted: G        W     3.16.3 #1
 Hardware name: Dell Inc. OptiPlex 9010/051FJ8, BIOS A08 09/19/2012
  0000000000000009 ffffffff8170cbd0 ffff8800d838fc28 ffffffff810a290d
  ffff8800d838fc90 ffff8800d838fc78 0000000000000001 ffff8800da0f1b38
  ffff8800da0f5800 ffffffff810a2977 ffffffff818e0e28 ffff880000000028
 Call Trace:
  [<ffffffff8170cbd0>] ? dump_stack+0x41/0x51
  [<ffffffff810a290d>] ? warn_slowpath_common+0x6d/0x90
  [<ffffffff810a2977>] ? warn_slowpath_fmt+0x47/0x50
  [<ffffffff813d8385>] ? intel_modeset_check_state+0x595/0x740
  [<ffffffff813d85bd>] ? intel_set_mode+0x1d/0x30
  [<ffffffff813d942f>] ? intel_crtc_set_config+0x8cf/0xd50
  [<ffffffff81378629>] ? drm_mode_set_config_internal+0x59/0xd0
  [<ffffffff8137bc00>] ? drm_mode_setcrtc+0xd0/0x560
  [<ffffffff8136d998>] ? drm_ioctl+0x178/0x580
  [<ffffffff8117ab9f>] ? do_vfs_ioctl+0x2cf/0x4b0
  [<ffffffff810cb7f7>] ? vtime_account_user+0x37/0x50
  [<ffffffff8117adb6>] ? SyS_ioctl+0x36/0x80
  [<ffffffff817155ec>] ? tracesys+0x7e/0xe2
  [<ffffffff8171564b>] ? tracesys+0xdd/0xe2
 ---[ end trace dce3bc891453170d ]---
 [drm:intel_pipe_config_compare] *ERROR* mismatch in has_dp_encoder (expected 1, found 0)
Comment 35 Jani Nikula 2015-01-28 14:16:56 UTC
This patch against current kernels might be helpful

http://mid.gmane.org/1422373429-10137-1-git-send-email-simon.farnsworth@onelan.co.uk
Comment 36 Kris Karas 2015-02-26 02:33:29 UTC
The patch suggested by Jani in Comment #35 is getting close.

Here's a status of what works and what doesn't by kernel version:
    Linux <= 3.13.5
        Bug-free, no warnings.
    Linux >= 3.13.6 < 3.17.x
        Better initialization on boot most of the time.
    Linux >= 3.18.x
        Initialized correctly on boot all of the time.
        Fails to re-initialize after VESA power-off about 50% of the time.
        Switching from X-server to console and back to X (c-a-F1 + c-a-F7)
        will correctly re-initialize all the time.
    Linux 3.18.7 with patch from Comment #35
        No visual bugs, display (re)initializes 100% of the time.
        About 75% of the time, re-initializing after VESA power-off shows:
        "[drm:intel_dp_start_link_train] *ERROR* failed to enable link training"
        in the kernel log, although the screen/dongle appears to work properly.
Comment 37 Kris Karas 2015-02-26 02:37:54 UTC
Whoops, I mis-edited Comment #36
It should have said that Linux kernel version 3.17.x had "better initialization on boot most of the time" but that Kernel version 3.13.6 through 3.16.x failed nearly all of the time.
Comment 38 Jani Nikula 2015-06-16 09:33:22 UTC
There's been a number of fixes in the DP aux code since the kernels tested here, please try v4.1-rc8.
Comment 39 Kris Karas 2015-06-18 20:50:10 UTC
Kernel 4.1.0-rc8 shows the same behavior as kernels 3.19.x and above.
Screen (re)initializes visually, but kernel logs continue to be peppered with:
    [drm:intel_dp_start_link_train] *ERROR* failed to enable link training

Although there may be a forthcoming fix in the new codebase that will work for kernel 4.1.x and above, this does not change the fact that this is a regression introduced in kernel 3.13.6.  A patch to fix that regression has been available for more than one year, and still remains unapplied!

Kernels 3.14 and 3.18 are both affected by this bug.  And they are in long-term maintenance, requiring a patch.  Kernel 3.14's i915 code is sufficiently different from 4.1 that a backport looks unlikely, necessitating a different patch - which is, again, available and well-tested.
