Bug 217545

Summary: Serious regression on amdgpu (drm_display_helper/drm_dp_atomic_find_time_slots) with two DisplayPort displays connected via an HP G5 docking station
Product: Drivers
Reporter: Christoph Biedl (bugzilla.kernel.bpeb)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED ANSWERED
Severity: normal
CC: alexdeucher, harry.wentland, lyude
Priority: P3
Hardware: All
OS: Linux
Kernel Version: 6.1
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: lscpi, dmesg, kernel.log

Description Christoph Biedl 2023-06-12 14:15:03 UTC
Created attachment 304404 [details]
lscpi

This was previously mentioned in comment 69 in https://bugzilla.kernel.org/show_bug.cgi?id=204181 but I reckon it's a different story.

Hardware:

* HP mt45 notebook with

    03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Picasso/Raven 2 [Radeon Vega Series / Radeon Vega Mobile Series] [1002:15d8] (rev d3)

  Also briefly confirmed on a ThinkPad T495 with

    06:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Picasso/Raven 2 [Radeon Vega Series / Radeon Vega Mobile Series] [1002:15d8] (rev d2)

* HP G5 USB-C docking station (HSN-IX02)

Problem:

After upgrading from kernel 6.0 to 6.1, the host crashes once a *second* external display is connected via DisplayPort on that docking station:

    BUG: kernel NULL pointer dereference, address: 0000000000000008
    RIP: 0010:drm_dp_atomic_find_time_slots+0x47/0x250 [drm_display_helper]

(see attachment for full log)
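
For anyone triaging similar reports, here is a minimal sketch of how the RIP line above breaks down into symbol, offset, function size and module. The names `RIP_RE` and `parse_rip_line` are hypothetical helpers made up for illustration, and the pattern only assumes the x86-64 oops format shown in this report.

```python
import re

# Matches an oops line such as:
#   RIP: 0010:drm_dp_atomic_find_time_slots+0x47/0x250 [drm_display_helper]
# i.e. segment, symbol+offset/size, and an optional [module] suffix.
RIP_RE = re.compile(
    r"RIP:\s+(?P<seg>[0-9a-f]{4}):(?P<symbol>[\w.]+)"
    r"\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>[\w-]+)\])?"
)


def parse_rip_line(line):
    """Return symbol, offset, size and module from an oops RIP line, or None."""
    m = RIP_RE.search(line)
    if m is None:
        return None
    return {
        "symbol": m.group("symbol"),
        "offset": int(m.group("offset"), 16),  # byte offset of the faulting instruction
        "size": int(m.group("size"), 16),      # total size of the function in bytes
        "module": m.group("module"),           # None if the symbol is built in
    }


info = parse_rip_line(
    "RIP: 0010:drm_dp_atomic_find_time_slots+0x47/0x250 [drm_display_helper]"
)
print(info)
```

The resulting symbol+offset/size string is the form that the kernel tree's `scripts/faddr2line` accepts together with the matching module object, provided a build with debug info is at hand. The fault address of 0x8 is consistent with reading a member at offset 8 through a NULL pointer, although the log alone does not say which pointer that was.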

Another way to reproduce this:

* Use the Ubuntu 23.04 installer/test image.
* Boot with displays disconnected.
* Select "Try Ubuntu".
* Connect displays.
* Upon connecting the second display, the system freezes.


Insights found after bisecting:

* This was introduced with commit

    commit a5c2c0d164e96d24f73faffcd3b7bbb607e701a9
    Author: Lyude Paul <lyude@redhat.com>
    Date:   Wed Aug 17 15:38:37 2022 -0400

        drm/display/dp_mst: Add nonblocking helpers for DP MST

* Later this was half-fixed with commit

    commit efa4c4df864ecd969670093524d3e8f69188e5eb
    Author: Harry Wentland <harry.wentland@amd.com>
    Date:   Mon Feb 13 17:36:55 2023 -0500

        drm/amd/display: call remove_stream_from_ctx from res_pool funcs

With that commit the notebook no longer crashes. However, the displays remain dark, although from the operating system's point of view they are connected and should show some content. This may or may not be a userland problem. (Aside: This commit should be backported to the 6.1 stable series. It applies there and likewise reduces the problem from "crash" to "dark screens".)

This situation still exists with 6.4-rc6 (released last night).

Please advise if there's anything else we can do.

Find attached:

* lscpi
* dmesg (with drm.debug=4)
* kernel.log (the actual crash)
Comment 1 Christoph Biedl 2023-06-12 14:15:37 UTC
Created attachment 304405 [details]
dmesg
Comment 2 Christoph Biedl 2023-06-12 14:15:59 UTC
Created attachment 304406 [details]
kernel.log
Comment 3 Alex Deucher 2023-06-12 14:22:13 UTC
Please see: https://gitlab.freedesktop.org/drm/amd/-/issues/2492
Comment 4 Christoph Biedl 2023-06-13 07:41:00 UTC
Thanks, it seems my search-fu didn't find that one.

FWIW, the fix mentioned there, https://github.com/jlindgren90/linux/commit/8cf17c25e2d2644fa6dfc3d7de6b3b35689d4db0.patch, solves the problem for me (on top of 6.1.30).
Comment 5 Christoph Biedl 2023-06-16 16:17:49 UTC
Dear future reader, please be reminded that this is still an ongoing issue, or as written in the drm/amd ticket: "this patch is just papering over underlying issues". So while it makes that specific setup usable, the ultimate fix is not available yet. Check the other bug tracker, or retry with newer kernel releases.