Bug 208947 - amdgpu DisplayPort won't recognize all display modes after 5.9 merges
Summary: amdgpu DisplayPort won't recognize all display modes after 5.9 merges
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel) (show other bugs)
Hardware: x86-64 Linux
: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-08-17 21:02 UTC by Coleman Kane
Modified: 2021-01-10 22:59 UTC (History)
4 users (show)

See Also:
Kernel Version: 5.9 staging-testing, rc1, and latest linux.git
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Working dmesg from Arch "linux" kernel 5.8.1-1 (86.29 KB, text/plain)
2020-08-18 18:22 UTC, Coleman Kane
Details
dmesg on faliure case from linux-next.git as of 2020-08-18 00:00:00 (pending-fixes branch) (86.12 KB, text/plain)
2020-08-18 18:24 UTC, Coleman Kane
Details
Xorg.log from working case (Arch 5.8.1-1 kernel) (57.85 KB, text/plain)
2020-08-18 18:27 UTC, Coleman Kane
Details
Xorg log from failure case (same 5.9-next kernel) (100.41 KB, text/plain)
2020-08-18 18:29 UTC, Coleman Kane
Details
Proposed patch (add amdgpu.dp_cll_status bool, default: on) (3.55 KB, patch)
2020-08-22 22:52 UTC, Coleman Kane
Details | Diff
dmesg from 5.10.5 kernel (110.99 KB, text/plain)
2021-01-10 22:59 UTC, JerryD
Details

Description Coleman Kane 2020-08-17 21:02:51 UTC
I've got an AMD Ryzen 7 1700 and a POLARIS10 Radeon RX 580M GPU in an ASUS ROG GL702ZC laptop.

I'm using the regular 5.8.1 kernel on Arch linux, and also have been tracking in-development kernel changes via the "staging-testing" branch on the gregkh/staging.git Linux tree as well as the torvalds/linux.git tree.

Sometime during the course of the 5.9 updates getting merged into those trees, the 5.9 kernels I build no longer recognize all of the graphics modes for my 4K (3840x2160) monitor when it is plugged into the DisplayPort. Instead, the highest mode reported available is 1024x768, and Xorg also gets limited to these choices.

This behavior appears limited to the monitor being plugged in via its DisplayPort input. When I use an HDMI input, all of the supported graphics modes are reported properly. I am using /sys/class/drm/card0-DP-1/modes to display the available modes, and also starting Xorg to test it as well.

Would be happy to test patches if this is a known regression, or would appreciate some help/direction in tracking down potential likely culprits of the problem so I can try to diagnose it myself. I am somewhat familiar with kernel development, so I'm capable of hacking on it a bit if anyone can point me in a good direction. I'm just not too familiar with the inner workings of the AMDGPU, KMS, or DRM code. Not even certain if this is an issue specific to AMDGPU or something agnostic to the specific video hw.

I also tried dumping the /sys/class/drm/card0-DP-1/edid after booting into 5.8.1 and then 5.9-latest and I get exactly the same data for both, so it seems like the EDID data is at least being fetched properly. I also tried saving that EDID data to a *.bin file and manually loading that via kernel arguments (drm.edid_firmware): verified it was loading properly for DP-1, but I still got the same results (no high-resolution video modes).

Let me know what data would be helpful to attach to this issue, as well.

Thanks,
Coleman Kane
Comment 1 Alex Deucher 2020-08-18 13:28:57 UTC
Please attach your dmesg output and xorg log (if using X) in both the working and non-working cases.  Can you bisect?
Comment 2 Coleman Kane 2020-08-18 18:22:09 UTC
Created attachment 292011 [details]
Working dmesg from Arch "linux" kernel 5.8.1-1

Working dmesg from Arch "linux" kernel 5.8.1-1
Comment 3 Coleman Kane 2020-08-18 18:24:56 UTC
Created attachment 292013 [details]
dmesg on faliure case from linux-next.git as of 2020-08-18 00:00:00 (pending-fixes branch)

I built a bootable kernel from the linux-next "pending-fixes" based upon sources synchronized with upstream git repo as of 2020-08-18 morning. Did this to verify that there is no pending fix to this issue.
Comment 4 Coleman Kane 2020-08-18 18:27:02 UTC
Created attachment 292015 [details]
Xorg.log from working case (Arch 5.8.1-1 kernel)

Xorg from working case on Arch 5.8.1-1 kernel package.

You can see from the following line that it is booting to 3840x2160 resolution:
[    25.694] (II) AMDGPU(0): Output DisplayPort-0 using initial mode 3840x2160 +1920+0
Comment 5 Coleman Kane 2020-08-18 18:29:14 UTC
Created attachment 292017 [details]
Xorg log from failure case (same 5.9-next kernel)

Xorg.log from trying to run Xorg under kernel that fails to set resolutions properly. The DisplayPort-0 displays at 1024x768, and I can not set the resolution any higher than that from within Xorg (using the enlightenment "screen" settings panel).

You can see from the following line that it boots into 1024x768:
[    38.608] (II) AMDGPU(0): Output DisplayPort-0 using initial mode 1024x768 +0+0
Comment 6 Coleman Kane 2020-08-18 18:30:55 UTC
Thanks Alex - I have added the four attachments you requested. I haven't done a git bisect before, but I am very familiar with using git and understand the concept of a bisect, so I could probably figure it out using some documentation such as https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html.

Now is as good a time as any to learn!
Comment 7 Coleman Kane 2020-08-18 23:30:58 UTC
Thanks for the tip - I think I understand how this works now, so I will go through the cycles of bisecting and follow up with what it finds, or a patch if I think I'm able to address it myself.
Comment 8 Coleman Kane 2020-08-19 22:22:11 UTC
Update - below is the finding from git bisect:

471c1dd9546df81d259664ac3e2ab0e99169f755 is the first bad commit
commit 471c1dd9546df81d259664ac3e2ab0e99169f755
Author: Reza Amini <Reza.Amini@amd.com>
Date:   Wed Jul 15 11:33:23 2020 -0400

    drm/amd/display: Allow asic specific FSFT timing optimization

    [Why]
    Each asic can optimize best based on its capabilities

    [How]
    Optimizing timing for a new pixel clock

    Signed-off-by: Reza Amini <Reza.Amini@amd.com>
    Reviewed-by: Anthony Koo <Anthony.Koo@amd.com>
    Acked-by: Eryk Brol <eryk.brol@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

 drivers/gpu/drm/amd/display/dc/core/dc_stream.c    | 18 +++++++--------
 drivers/gpu/drm/amd/display/dc/dc_stream.h         |  4 ++--
 drivers/gpu/drm/amd/display/dc/dcn20/dcn20_hwseq.c | 27 ++++++++++++++++++++++
 drivers/gpu/drm/amd/display/dc/dcn20/dcn20_hwseq.h |  5 ++++
 drivers/gpu/drm/amd/display/dc/dcn20/dcn20_init.c  |  3 +++
 drivers/gpu/drm/amd/display/dc/dcn21/dcn21_init.c  |  3 +++
 drivers/gpu/drm/amd/display/dc/inc/hw_sequencer.h  |  5 ++++
 .../drm/amd/display/modules/freesync/freesync.c    |  5 +++-
 8 files changed, 57 insertions(+), 13 deletions(-)
Comment 9 Coleman Kane 2020-08-19 22:53:28 UTC
I'm going to venture a guess that if I were to reverse the following part of the change that it would likely "undo" the cause of the problem, but it doesn't really solve the problem.

One question I'd have though:
return (dc->hwss.optimize_timing_for_fsft &&
              dc->hwss.optimize_timing_for_fsft(dc, &pStream->timing, max_input_rate_in_khz));

Looks like if dc->hwss.optimize_timing_for_fsft is NULL (function pointer nonexistent) then it would always return false, rather than using the old behavior. It looks like the structure should be populated with that function pointer, based upon changes in dcn20_init.c and dcn21_init.c, but it's probably worth verifying whether its ending up NULL at that point or not.

Aside from that, the only other "return false" I see is here in dcn20_hwseq.c:
+       if (max_input_rate_in_100hz < timing->pix_clk_100hz)
+               return false;

I'll try to also get a read out of what values are being populated for these two values as well.

diff --git a/drivers/gpu/drm/amd/display/dc/core/dc_stream.c b/drivers/gpu/drm/amd/display/dc/core/dc_stream.c
index 10d69ada88e3..0257a900fe2b 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc_stream.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc_stream.c
@@ -246,20 +246,18 @@ struct dc_stream_status *dc_stream_get_status(

 #ifndef TRIM_FSFT
 /**
- * dc_optimize_timing() - dc to optimize timing
+ * dc_optimize_timing_for_fsft() - dc to optimize timing
  */
-bool dc_optimize_timing(
-       struct dc_crtc_timing *timing,
+bool dc_optimize_timing_for_fsft(
+       struct dc_stream_state *pStream,
        unsigned int max_input_rate_in_khz)
 {
-       //optimization is expected to assing a value to these:
-       //timing->pix_clk_100hz
-       //timing->v_front_porch
-       //timing->v_total
-       //timing->fast_transport_output_rate_100hz;
-       timing->fast_transport_output_rate_100hz = timing->pix_clk_100hz;
+       struct dc  *dc;

-       return true;
+       dc = pStream->ctx->dc;
+
+       return (dc->hwss.optimize_timing_for_fsft &&
+               dc->hwss.optimize_timing_for_fsft(dc, &pStream->timing, max_input_rate_in_khz));
 }
 #endif
Comment 10 Coleman Kane 2020-08-20 03:20:55 UTC
Trying to do that small change didn't seem to work, so I'm just going to try seeing if it will work if I just remove that whole commit from the newer staging-testing I've been working off of, and see if it works and the problem goes away.

It looks like this particular commit is part of a long series of changes that went in however...
Comment 11 Coleman Kane 2020-08-20 05:01:47 UTC
Ok scratch those last three messages - I must have mislabeled one of the bisect cycles, or some other screw up. I still narrowed my search space down, but it looks like the commit it finished on is still ahead of the one introducing the bug, so I'm going to attempt it again with different starting points.
Comment 12 Coleman Kane 2020-08-20 23:22:41 UTC
Ok I re-did the bisect and I think I found the actual commit that introduced the regression. I have confirmed this by generating a reverse-patch and then applying that to the latest staging-testing branch I had cloned at the beginning of the week, and I was able to use the newer kernel without the bug occurring.

The bug appears to have been introduced by commit 3fd20292c2352660155bbc11736dd014b2fc6e98:

diff --git a/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c b/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c
index 5cb7b834e459..1a3dbed3becb 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c
@@ -1133,6 +1133,45 @@ static inline enum link_training_result perform_link_training_int(
        return status;
 }

+static enum link_training_result check_link_loss_status(
+       struct dc_link *link,
+       const struct link_training_settings *link_training_setting)
+{
+       enum link_training_result status = LINK_TRAINING_SUCCESS;
+       unsigned int lane01_status_address = DP_LANE0_1_STATUS;
+       union lane_status lane_status;
+       uint8_t dpcd_buf[4] = {0};
+       uint32_t lane;
+
+       core_link_read_dpcd(
+               link,
+               lane01_status_address,
+               (uint8_t *)(dpcd_buf),
+               sizeof(dpcd_buf));
+
+       /*parse lane status*/
+       for (lane = 0; lane < link->cur_link_settings.lane_count; lane++) {
+               /*
+                * check lanes status
+                */
+               lane_status.raw = get_nibble_at_index(&dpcd_buf[0], lane);
+
+               if (!lane_status.bits.CHANNEL_EQ_DONE_0 ||
+                       !lane_status.bits.CR_DONE_0 ||
+                       !lane_status.bits.SYMBOL_LOCKED_0) {
+                       /* if one of the channel equalization, clock
+                        * recovery or symbol lock is dropped
+                        * consider it as (link has been
+                        * dropped) dp sink status has changed
+                        */
+                       status = LINK_TRAINING_LINK_LOSS;
+                       break;
+               }
+       }
+
+       return status;
+}
+
 static void initialize_training_settings(
         struct dc_link *link,
        const struct dc_link_settings *link_setting,
@@ -1372,6 +1411,9 @@ static void print_status_message(
        case LINK_TRAINING_LQA_FAIL:
                lt_result = "LQA failed";
                break;
+       case LINK_TRAINING_LINK_LOSS:
+               lt_result = "Link loss";
+               break;
        default:
                break;
        }
@@ -1531,6 +1573,14 @@ enum link_training_result dc_link_dp_perform_link_training(
                                status);
        }

+       /* delay 5ms after Main Link output idle pattern and then check
+        * DPCD 0202h.
+        */
+       if (link->connector_signal != SIGNAL_TYPE_EDP && status == LINK_TRAINING_SUCCESS) {
+               msleep(5);
+               status = check_link_loss_status(link, &lt_settings);
+       }
+
        /* 6. print status message*/
        print_status_message(link, &lt_settings, status);

diff --git a/drivers/gpu/drm/amd/display/include/link_service_types.h b/drivers/gpu/drm/amd/display/include/link_service_types.h
index 4869d4562e4d..550f46e9b95f 100644
--- a/drivers/gpu/drm/amd/display/include/link_service_types.h
+++ b/drivers/gpu/drm/amd/display/include/link_service_types.h
@@ -66,6 +66,8 @@ enum link_training_result {
        /* other failure during EQ step */
        LINK_TRAINING_EQ_FAIL_EQ,
        LINK_TRAINING_LQA_FAIL,
+       /* one of the CR,EQ or symbol lock is dropped */
+       LINK_TRAINING_LINK_LOSS,
 };

 struct link_training_settings {
Comment 13 Coleman Kane 2020-08-20 23:31:05 UTC
Interestingly enough, the description of a problem attempting to be addressed with this fix sounds similar to behavior I experience with this system:

Namely, the Radeon RX 580M connected over DisplayPort advertises 3840x2160@60Hz but I've never gotten the card to successfully link up to the monitor at that resolution. In console mode, this results in the "black screen" where only the built-in display for the laptop works and the external monitor just stays black (falls asleep). In Xorg, if I set it to 3840x2160@29.97Hz, it works however. Additionally, setting the higher resolutions can take a noticeable moment to display.

I have another laptop with another graphics card and was able to test that the display does work at 3840x2160@60Hz.
Comment 14 Coleman Kane 2020-08-21 02:18:10 UTC
So diving a bit further it looks like a contributor to this behavior might be my misbehaving video hardware.

After some reading I did determine that I can add the following to my linux command line in GRUB and Xorg will recognize the video mode, as well as all the other video modes supported by the monitor:
video=DP-1:3840x2160@29.97me

So I think at this point, I have a solution that will address the problem for me. I am not certain whether you think it would make sense to modify the code to account for this situation or not. The old behavior was still able to get the video modes from the display despite it coming up "inactive" on boot, but it seems like the new behavior causes it to throw away all the video modes and substitute them with some basic SVGA modes.
Comment 15 Coleman Kane 2020-08-21 02:51:09 UTC
Spoke too soon - that addition to the linux command line merely makes it work intermittently. Sometimes I'll get the old behavior (max 1024x768), sometimes I'll get the correct behavior, and sometimes I'll have the video modes available but the display won't turn on if I select them.

So I'm back to patching over the problem by commenting out the call to check_link_loss_status and its assignment to status in drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c

I'll report back after that.
Comment 16 Coleman Kane 2020-08-21 19:52:38 UTC
Pulling the call / reassignment of "status" added in that commit seems to revert the undesirable behavior, including in the latest "staging-testing" code that I've tested on.

My hardware situation seems to be that the "native" resolution of my DP display (3840x2160@60Hz) is also advertised as "supported" by the RX 580m on my laptop, but for whatever reason the GPU is unable to actually drive that resolution. I don't know enough about the hardware to speculate as to why this is, though. Using the "video=" option to the kernel, I can force a functioning video mode, which seems to be consistent with the intent of that variable, when this call is commented out. However, if the call to check_link_loss_status happens, then the inconsistent behavior occurs that I described above where it randomly might work, might only give me 1024x768, or might never display on my DP-connected monitor regardless of the resolution chosen.

That said, I wonder if it makes sense to wrap this call with another kernel command-line boolean variable, so the behavior can be disabled in the event someone has a misbehaving GPU or monitor like I do - unless you think there might be a graceful way to auto-identify this type of defect. Not sure how the auto-selection works for KMS console resolutions, but it would be helpful too if it detected that an attempt at a high refresh rate didn't work, it automatically tried lower refresh rates.
Comment 17 Coleman Kane 2020-08-22 22:52:15 UTC
Created attachment 292061 [details]
Proposed patch (add amdgpu.dp_cll_status bool, default: on)

Attaching a proposed patch to provide a module parameter that can allow a user to control the behavior of the new link status detection code.
Comment 18 JerryD 2021-01-02 03:22:24 UTC
From comment 8 of  bug #210849

[  224.767359] Call Trace:
[  224.767529]  amdgpu_dm_backlight_update_status+0xb4/0xc0 [amdgpu]
[  224.767535]  backlight_suspend+0x6a/0x80
[  224.767538]  ? brightness_store+0x50/0x50
[  224.767541]  dpm_run_callback+0x4f/0x140
[  224.767544]  __device_suspend+0x11c/0x4a0
[  224.767547]  dpm_suspend+0x117/0x250
[  224.767549]  dpm_suspend_start+0x77/0x80
[  224.767554]  suspend_devices_and_enter+0xe6/0x7f0
[  224.767557]  pm_suspend.cold+0x333/0x38c
[  224.767560]  state_store+0x71/0xd0
[  224.767565]  kernfs_fop_write+0xce/0x1b0
[  224.767569]  vfs_write+0xc7/0x1f0
[  224.767572]  ksys_write+0x4f/0xc0
[  224.767576]  do_syscall_64+0x4d/0x90
[  224.767580]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Does this look like the same bug?
Comment 19 John Shand 2021-01-03 08:59:29 UTC
this is the bug I've been waiting for to be fixed for opensuse tumbleweed.

however, I am now using Arch Linux for the time being until the royal stuff up is fixed.  have you not heard of the motto....if it ain't broke don't fix it.

what log files will you require??  i am having no issues with amdgpu driver whatsoever.
Comment 20 John Shand 2021-01-03 09:05:19 UTC
i am using the 5.10.3 kernel with the latest iso.  I have installed it on my machine.  i have no issues with the amdgpu driver at all
Comment 21 JerryD 2021-01-10 22:59:51 UTC
Created attachment 294599 [details]
dmesg from 5.10.5 kernel

I bummped up to the 5.10.5 kernel. The laptop suspends on lid close and comes back when opened. However, there are two call traces from issues ocurring, so I would not consider this resolved, however the kenel is recovering from it.  The actual bug is probably in the amdgpu driver, but leave that to the experts.

Note You need to log in before you can comment on or make changes to this bug.