Bug 208947
Summary: | amdgpu DisplayPort won't recognize all display modes after 5.9 merges | ||
---|---|---|---|
Product: | Drivers | Reporter: | Coleman Kane (ckane) |
Component: | Video(DRI - non Intel) | Assignee: | drivers_video-dri |
Status: | NEW --- | ||
Severity: | normal | CC: | alexdeucher, jshand2013, jvdelisle2, sarah |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | 5.9 staging-testing, rc1, and latest linux.git | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
Working dmesg from Arch "linux" kernel 5.8.1-1
dmesg on faliure case from linux-next.git as of 2020-08-18 00:00:00 (pending-fixes branch) Xorg.log from working case (Arch 5.8.1-1 kernel) Xorg log from failure case (same 5.9-next kernel) Proposed patch (add amdgpu.dp_cll_status bool, default: on) dmesg from 5.10.5 kernel |
Description
Coleman Kane
2020-08-17 21:02:51 UTC
Please attach your dmesg output and xorg log (if using X) in both the working and non-working cases. Can you bisect? Created attachment 292011 [details]
Working dmesg from Arch "linux" kernel 5.8.1-1
Working dmesg from Arch "linux" kernel 5.8.1-1
Created attachment 292013 [details]
dmesg on faliure case from linux-next.git as of 2020-08-18 00:00:00 (pending-fixes branch)
I built a bootable kernel from the linux-next "pending-fixes" based upon sources synchronized with upstream git repo as of 2020-08-18 morning. Did this to verify that there is no pending fix to this issue.
Created attachment 292015 [details]
Xorg.log from working case (Arch 5.8.1-1 kernel)
Xorg from working case on Arch 5.8.1-1 kernel package.
You can see from the following line that it is booting to 3840x2160 resolution:
[ 25.694] (II) AMDGPU(0): Output DisplayPort-0 using initial mode 3840x2160 +1920+0
Created attachment 292017 [details]
Xorg log from failure case (same 5.9-next kernel)
Xorg.log from trying to run Xorg under kernel that fails to set resolutions properly. The DisplayPort-0 displays at 1024x768, and I can not set the resolution any higher than that from within Xorg (using the enlightenment "screen" settings panel).
You can see from the following line that it boots into 1024x768:
[ 38.608] (II) AMDGPU(0): Output DisplayPort-0 using initial mode 1024x768 +0+0
Thanks Alex - I have added the four attachments you requested. I haven't done a git bisect before, but I am very familiar with using git and understand the concept of a bisect, so I could probably figure it out using some documentation such as https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html. Now is as good a time as any to learn! Thanks for the tip - I think I understand how this works now, so I will go through the cycles of bisecting and follow up with what it finds, or a patch if I think I'm able to address it myself. Update - below is the finding from git bisect: 471c1dd9546df81d259664ac3e2ab0e99169f755 is the first bad commit commit 471c1dd9546df81d259664ac3e2ab0e99169f755 Author: Reza Amini <Reza.Amini@amd.com> Date: Wed Jul 15 11:33:23 2020 -0400 drm/amd/display: Allow asic specific FSFT timing optimization [Why] Each asic can optimize best based on its capabilities [How] Optimizing timing for a new pixel clock Signed-off-by: Reza Amini <Reza.Amini@amd.com> Reviewed-by: Anthony Koo <Anthony.Koo@amd.com> Acked-by: Eryk Brol <eryk.brol@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> drivers/gpu/drm/amd/display/dc/core/dc_stream.c | 18 +++++++-------- drivers/gpu/drm/amd/display/dc/dc_stream.h | 4 ++-- drivers/gpu/drm/amd/display/dc/dcn20/dcn20_hwseq.c | 27 ++++++++++++++++++++++ drivers/gpu/drm/amd/display/dc/dcn20/dcn20_hwseq.h | 5 ++++ drivers/gpu/drm/amd/display/dc/dcn20/dcn20_init.c | 3 +++ drivers/gpu/drm/amd/display/dc/dcn21/dcn21_init.c | 3 +++ drivers/gpu/drm/amd/display/dc/inc/hw_sequencer.h | 5 ++++ .../drm/amd/display/modules/freesync/freesync.c | 5 +++- 8 files changed, 57 insertions(+), 13 deletions(-) I'm going to venture a guess that if I were to reverse the following part of the change that it would likely "undo" the cause of the problem, but it doesn't really solve the problem. One question I'd have though: return (dc->hwss.optimize_timing_for_fsft && dc->hwss.optimize_timing_for_fsft(dc, &pStream->timing, max_input_rate_in_khz)); Looks like if dc->hwss.optimize_timing_for_fsft is NULL (function pointer nonexistent) then it would always return false, rather than using the old behavior. It looks like the structure should be populated with that function pointer, based upon changes in dcn20_init.c and dcn21_init.c, but it's probably worth verifying whether its ending up NULL at that point or not. Aside from that, the only other "return false" I see is here in dcn20_hwseq.c: + if (max_input_rate_in_100hz < timing->pix_clk_100hz) + return false; I'll try to also get a read out of what values are being populated for these two values as well. diff --git a/drivers/gpu/drm/amd/display/dc/core/dc_stream.c b/drivers/gpu/drm/amd/display/dc/core/dc_stream.c index 10d69ada88e3..0257a900fe2b 100644 --- a/drivers/gpu/drm/amd/display/dc/core/dc_stream.c +++ b/drivers/gpu/drm/amd/display/dc/core/dc_stream.c @@ -246,20 +246,18 @@ struct dc_stream_status *dc_stream_get_status( #ifndef TRIM_FSFT /** - * dc_optimize_timing() - dc to optimize timing + * dc_optimize_timing_for_fsft() - dc to optimize timing */ -bool dc_optimize_timing( - struct dc_crtc_timing *timing, +bool dc_optimize_timing_for_fsft( + struct dc_stream_state *pStream, unsigned int max_input_rate_in_khz) { - //optimization is expected to assing a value to these: - //timing->pix_clk_100hz - //timing->v_front_porch - //timing->v_total - //timing->fast_transport_output_rate_100hz; - timing->fast_transport_output_rate_100hz = timing->pix_clk_100hz; + struct dc *dc; - return true; + dc = pStream->ctx->dc; + + return (dc->hwss.optimize_timing_for_fsft && + dc->hwss.optimize_timing_for_fsft(dc, &pStream->timing, max_input_rate_in_khz)); } #endif Trying to do that small change didn't seem to work, so I'm just going to try seeing if it will work if I just remove that whole commit from the newer staging-testing I've been working off of, and see if it works and the problem goes away. It looks like this particular commit is part of a long series of changes that went in however... Ok scratch those last three messages - I must have mislabeled one of the bisect cycles, or some other screw up. I still narrowed my search space down, but it looks like the commit it finished on is still ahead of the one introducing the bug, so I'm going to attempt it again with different starting points. Ok I re-did the bisect and I think I found the actual commit that introduced the regression. I have confirmed this by generating a reverse-patch and then applying that to the latest staging-testing branch I had cloned at the beginning of the week, and I was able to use the newer kernel without the bug occurring. The bug appears to have been introduced by commit 3fd20292c2352660155bbc11736dd014b2fc6e98: diff --git a/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c b/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c index 5cb7b834e459..1a3dbed3becb 100644 --- a/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c +++ b/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c @@ -1133,6 +1133,45 @@ static inline enum link_training_result perform_link_training_int( return status; } +static enum link_training_result check_link_loss_status( + struct dc_link *link, + const struct link_training_settings *link_training_setting) +{ + enum link_training_result status = LINK_TRAINING_SUCCESS; + unsigned int lane01_status_address = DP_LANE0_1_STATUS; + union lane_status lane_status; + uint8_t dpcd_buf[4] = {0}; + uint32_t lane; + + core_link_read_dpcd( + link, + lane01_status_address, + (uint8_t *)(dpcd_buf), + sizeof(dpcd_buf)); + + /*parse lane status*/ + for (lane = 0; lane < link->cur_link_settings.lane_count; lane++) { + /* + * check lanes status + */ + lane_status.raw = get_nibble_at_index(&dpcd_buf[0], lane); + + if (!lane_status.bits.CHANNEL_EQ_DONE_0 || + !lane_status.bits.CR_DONE_0 || + !lane_status.bits.SYMBOL_LOCKED_0) { + /* if one of the channel equalization, clock + * recovery or symbol lock is dropped + * consider it as (link has been + * dropped) dp sink status has changed + */ + status = LINK_TRAINING_LINK_LOSS; + break; + } + } + + return status; +} + static void initialize_training_settings( struct dc_link *link, const struct dc_link_settings *link_setting, @@ -1372,6 +1411,9 @@ static void print_status_message( case LINK_TRAINING_LQA_FAIL: lt_result = "LQA failed"; break; + case LINK_TRAINING_LINK_LOSS: + lt_result = "Link loss"; + break; default: break; } @@ -1531,6 +1573,14 @@ enum link_training_result dc_link_dp_perform_link_training( status); } + /* delay 5ms after Main Link output idle pattern and then check + * DPCD 0202h. + */ + if (link->connector_signal != SIGNAL_TYPE_EDP && status == LINK_TRAINING_SUCCESS) { + msleep(5); + status = check_link_loss_status(link, <_settings); + } + /* 6. print status message*/ print_status_message(link, <_settings, status); diff --git a/drivers/gpu/drm/amd/display/include/link_service_types.h b/drivers/gpu/drm/amd/display/include/link_service_types.h index 4869d4562e4d..550f46e9b95f 100644 --- a/drivers/gpu/drm/amd/display/include/link_service_types.h +++ b/drivers/gpu/drm/amd/display/include/link_service_types.h @@ -66,6 +66,8 @@ enum link_training_result { /* other failure during EQ step */ LINK_TRAINING_EQ_FAIL_EQ, LINK_TRAINING_LQA_FAIL, + /* one of the CR,EQ or symbol lock is dropped */ + LINK_TRAINING_LINK_LOSS, }; struct link_training_settings { Interestingly enough, the description of a problem attempting to be addressed with this fix sounds similar to behavior I experience with this system: Namely, the Radeon RX 580M connected over DisplayPort advertises 3840x2160@60Hz but I've never gotten the card to successfully link up to the monitor at that resolution. In console mode, this results in the "black screen" where only the built-in display for the laptop works and the external monitor just stays black (falls asleep). In Xorg, if I set it to 3840x2160@29.97Hz, it works however. Additionally, setting the higher resolutions can take a noticeable moment to display. I have another laptop with another graphics card and was able to test that the display does work at 3840x2160@60Hz. So diving a bit further it looks like a contributor to this behavior might be my misbehaving video hardware. After some reading I did determine that I can add the following to my linux command line in GRUB and Xorg will recognize the video mode, as well as all the other video modes supported by the monitor: video=DP-1:3840x2160@29.97me So I think at this point, I have a solution that will address the problem for me. I am not certain whether you think it would make sense to modify the code to account for this situation or not. The old behavior was still able to get the video modes from the display despite it coming up "inactive" on boot, but it seems like the new behavior causes it to throw away all the video modes and substitute them with some basic SVGA modes. Spoke too soon - that addition to the linux command line merely makes it work intermittently. Sometimes I'll get the old behavior (max 1024x768), sometimes I'll get the correct behavior, and sometimes I'll have the video modes available but the display won't turn on if I select them. So I'm back to patching over the problem by commenting out the call to check_link_loss_status and its assignment to status in drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c I'll report back after that. Pulling the call / reassignment of "status" added in that commit seems to revert the undesirable behavior, including in the latest "staging-testing" code that I've tested on. My hardware situation seems to be that the "native" resolution of my DP display (3840x2160@60Hz) is also advertised as "supported" by the RX 580m on my laptop, but for whatever reason the GPU is unable to actually drive that resolution. I don't know enough about the hardware to speculate as to why this is, though. Using the "video=" option to the kernel, I can force a functioning video mode, which seems to be consistent with the intent of that variable, when this call is commented out. However, if the call to check_link_loss_status happens, then the inconsistent behavior occurs that I described above where it randomly might work, might only give me 1024x768, or might never display on my DP-connected monitor regardless of the resolution chosen. That said, I wonder if it makes sense to wrap this call with another kernel command-line boolean variable, so the behavior can be disabled in the event someone has a misbehaving GPU or monitor like I do - unless you think there might be a graceful way to auto-identify this type of defect. Not sure how the auto-selection works for KMS console resolutions, but it would be helpful too if it detected that an attempt at a high refresh rate didn't work, it automatically tried lower refresh rates. Created attachment 292061 [details]
Proposed patch (add amdgpu.dp_cll_status bool, default: on)
Attaching a proposed patch to provide a module parameter that can allow a user to control the behavior of the new link status detection code.
From comment 8 of bug #210849 [ 224.767359] Call Trace: [ 224.767529] amdgpu_dm_backlight_update_status+0xb4/0xc0 [amdgpu] [ 224.767535] backlight_suspend+0x6a/0x80 [ 224.767538] ? brightness_store+0x50/0x50 [ 224.767541] dpm_run_callback+0x4f/0x140 [ 224.767544] __device_suspend+0x11c/0x4a0 [ 224.767547] dpm_suspend+0x117/0x250 [ 224.767549] dpm_suspend_start+0x77/0x80 [ 224.767554] suspend_devices_and_enter+0xe6/0x7f0 [ 224.767557] pm_suspend.cold+0x333/0x38c [ 224.767560] state_store+0x71/0xd0 [ 224.767565] kernfs_fop_write+0xce/0x1b0 [ 224.767569] vfs_write+0xc7/0x1f0 [ 224.767572] ksys_write+0x4f/0xc0 [ 224.767576] do_syscall_64+0x4d/0x90 [ 224.767580] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Does this look like the same bug? this is the bug I've been waiting for to be fixed for opensuse tumbleweed. however, I am now using Arch Linux for the time being until the royal stuff up is fixed. have you not heard of the motto....if it ain't broke don't fix it. what log files will you require?? i am having no issues with amdgpu driver whatsoever. i am using the 5.10.3 kernel with the latest iso. I have installed it on my machine. i have no issues with the amdgpu driver at all Created attachment 294599 [details]
dmesg from 5.10.5 kernel
I bummped up to the 5.10.5 kernel. The laptop suspends on lid close and comes back when opened. However, there are two call traces from issues ocurring, so I would not consider this resolved, however the kenel is recovering from it. The actual bug is probably in the amdgpu driver, but leave that to the experts.
|