Created attachment 301196
kernel dmesg (kernel 5.19-rc2, AMD Ryzen 9 5950X)

Starting with the kernel v5.18 series, I get a severely garbled screen on my system and lots of "[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed in the dependencies handling -1431655766!" messages in dmesg. Kernel v5.19-rc2 still has this flaw.

Kernel dmesg looks like this:

[...]
[drm] Initialized amdgpu 3.47.0 20150101 for 0000:09:00.0 on minor 0
fbcon: amdgpudrmfb (fb0) is primary device
[drm] DSC precompute is not needed.
Console: switching to colour frame buffer device 240x67
amdgpu 0000:09:00.0: [drm] fb0: amdgpudrmfb frame buffer device
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed in the dependencies handling -1431655766!
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed in the dependencies handling -1431655766!
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed in the dependencies handling -1431655766!
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed in the dependencies handling -1431655766!
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed in the dependencies handling -1431655766!
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed in the dependencies handling -22!
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed in the dependencies handling -1431655766!
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed in the dependencies handling -1431655766!
[...]

Apart from that, the machine runs fine without X. According to Xorg.0.log, X also gets started, seemingly without problems.
The kernel v5.17 series was all good, so I did a bisect, which revealed this as the first bad commit:

# git bisect good
c18a2a280c073f70569a91ef0d7434d12e66e200 is the first bad commit
commit c18a2a280c073f70569a91ef0d7434d12e66e200
Merge: 70da382e1c5b 94f4c4965e55
Author: Dave Airlie <airlied@redhat.com>
Date:   Sat Apr 23 15:00:33 2022 +1000

    Merge tag 'drm-misc-fixes-2022-04-22' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes

    Two fixes for the raspberrypi panel initialisation, one fix for a logic inversion in radeon, a build and pm refcounting fix for vc4, two reverts for drm_of_get_bridge that caused a number of regression and a locking regression for amdgpu.

    Signed-off-by: Dave Airlie <airlied@redhat.com>
    From: Maxime Ripard <maxime@cerno.tech>
    Link: https://patchwork.freedesktop.org/patch/msgid/20220422084403.2xrhf3jusdej5yo4@houat

 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 21 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h | 1 +
 drivers/gpu/drm/drm_of.c | 84 +++----
 .../gpu/drm/panel/panel-raspberrypi-touchscreen.c | 13 +-
 drivers/gpu/drm/radeon/radeon_sync.c | 2 +-
 drivers/gpu/drm/vc4/Kconfig | 3 +
 drivers/gpu/drm/vc4/vc4_dsi.c | 2 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_bo.c | 43 ++--
 drivers/gpu/drm/vmwgfx/vmwgfx_drv.c | 8 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_surface.c | 7 +-
 include/linux/dma-buf-map.h | 266 ---------------------
 12 files changed, 94 insertions(+), 358 deletions(-)
 delete mode 100644 include/linux/dma-buf-map.h

Some data about the machine:

# lspci
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Starship/Matisse IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
00:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:05.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 61)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 7
01:00.0 Non-Volatile memory controller: Kingston Technology Company, Inc. A2000 NVMe SSD (rev 03)
02:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset USB 3.1 XHCI Controller (rev 01)
02:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller (rev 01)
02:00.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Bridge (rev 01)
03:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)
03:01.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)
03:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
07:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch (rev c5)
08:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch
09:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 14 [Radeon RX 5500/5500M / Pro 5500M] (rev c5)
09:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
0a:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function
0b:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP
0b:00.1 Encryption controller: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Cryptographic Coprocessor PSPCPP
0b:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
0b:00.4 Audio device: Advanced Micro Devices, Inc. [AMD] Starship/Matisse HD Audio Controller

# lspci -v -s 09:00.0
09:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 14 [Radeon RX 5500/5500M / Pro 5500M] (rev c5) (prog-if 00 [VGA controller])
	Subsystem: ASRock Incorporation Navi 14 [Radeon RX 5500/5500M / Pro 5500M]
	Flags: bus master, fast devsel, latency 0, IRQ 81, IOMMU group 18
	Memory at d0000000 (64-bit, prefetchable) [size=256M]
	Memory at e0000000 (64-bit, prefetchable) [size=2M]
	I/O ports at e000 [size=256]
	Memory at c0300000 (32-bit, non-prefetchable) [size=512K]
	Expansion ROM at 000c0000 [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
	Capabilities: [64] Express Legacy Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150] Advanced Error Reporting
	Capabilities: [200] Physical Resizable BAR
	Capabilities: [240] Power Budgeting <?>
	Capabilities: [270] Secondary PCI Express
	Capabilities: [2a0] Access Control Services
	Capabilities: [2b0] Address Translation Service (ATS)
	Capabilities: [2c0] Page Request Interface (PRI)
	Capabilities: [2d0] Process Address Space ID (PASID)
	Capabilities: [320] Latency Tolerance Reporting
	Capabilities: [400] Data Link Feature <?>
	Capabilities: [410] Physical Layer 16.0 GT/s <?>
	Capabilities: [440] Lane Margining at the Receiver <?>
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu

# inxi -bZ
System:
  Host: supah Kernel: 5.19.0-rc2-Zen3 x86_64 bits: 64 Console: pty pts/0
  Distro: Gentoo Base System release 2.8
Machine:
  Type: Desktop Mobo: ASRock model: B450M Steel Legend serial: M80-D1005301508
  UEFI: American Megatrends v: P4.30 date: 02/25/2022
CPU:
  Info: 16-core AMD Ryzen 9 5950X [MT MCP] speed (MHz): avg: 616 min/max: 550/5084
Graphics:
  Device-1: AMD Navi 14 [Radeon RX 5500/5500M / Pro 5500M] driver: amdgpu v: kernel
  Display: server: X.org 1.21.1.3 driver: loaded: amdgpu,ati unloaded: fbdev,modesetting,radeon tty: 211x54
  Message: Advanced graphics data unavailable in console for root.
Network:
  Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet driver: r8169
Created attachment 301197
kernel .config (kernel 5.19-rc2, AMD Ryzen 9 5950X)
Created attachment 301198
Xorg.0.log
Created attachment 301199
bisect.log
OK, it seems to be commit 94f4c4965e5513ba624488f4b601d6b385635aec ("drm/amdgpu: partial revert "remove ctx->lock" v2") specifically. Reverting it on top of v5.19-rc2 gives me a working X again, and the "[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed in the dependencies handling -1431655766!" errors also disappear from dmesg.
Does this patch help? https://patchwork.freedesktop.org/patch/490475/
It applies neither on top of 5.18.7 nor on top of 5.19-rc4.
(In reply to Alex Deucher from comment #5)
> Does this patch help?
> https://patchwork.freedesktop.org/patch/490475/

I had a closer look at the patch, as it did not apply on top of v5.19-rc4. It seems almost all of the patch is already in upstream v5.19-rc4. The only thing left to patch is:

--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c	2022-07-02 21:59:53.171528202 +0200
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c	2022-07-02 23:12:13.481985665 +0200
@@ -579,16 +579,6 @@ static int amdgpu_cs_parser_bos(struct a
 		e->bo_va = amdgpu_vm_bo_find(vm, bo);
 	}
 
-	/* Move fence waiting after getting reservation lock of
-	 * PD root. Then there is no need on a ctx mutex lock.
-	 */
-	r = amdgpu_ctx_wait_prev_fence(p->ctx, p->entity);
-	if (unlikely(r != 0)) {
-		if (r != -ERESTARTSYS)
-			DRM_ERROR("amdgpu_ctx_wait_prev_fence failed.\n");
-		goto error_validate;
-	}
-
 	amdgpu_cs_get_threshold_for_moves(p->adev, &p->bytes_moved_threshold,
 					  &p->bytes_moved_vis_threshold);
 	p->bytes_moved = 0;
@@ -947,7 +937,7 @@ static int amdgpu_cs_ib_fill(struct amdg
 	if (parser->job->uf_addr && ring->funcs->no_user_fence)
 		return -EINVAL;
 
-	return 0;
+	return amdgpu_ctx_wait_prev_fence(parser->ctx, parser->entity);
 }
 
 static int amdgpu_cs_process_fence_dep(struct amdgpu_cs_parser *p,

But applying this on top of v5.19-rc4 does not help either. I still need to revert 94f4c4965e5513ba624488f4b601d6b385635aec to get X going.
I tried https://cgit.freedesktop.org/drm/drm-misc/commit/?h=drm-misc-fixes&id=925b6e59138cefa47275c67891c65d48d3266d57, as suggested in https://gitlab.freedesktop.org/drm/amd/-/issues/2050#note_1461646, but it did not help. This bug seems to be an entirely different matter.
v5.19-rc7 still affected.
Created attachment 301573
kernel dmesg (kernel 6.0-rc1, AMD Ryzen 9 5950X)

No change with v6.0-rc1.

[...]
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed in the dependencies handling -1431655766!
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed in the dependencies handling -1431655766!
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed in the dependencies handling -22!
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed in the dependencies handling -1431655766!
[...]

Additionally, I get:

[...]
------------[ cut here ]------------
refcount_t: underflow; use-after-free.
WARNING: CPU: 7 PID: 2120 at lib/refcount.c:28 refcount_warn_saturate+0x93/0xf0
Modules linked in: rfkill dm_crypt nhpoly1305_avx2 nhpoly1305 aes_generic aesni_intel libaes crypto_simd cryptd chacha_generic chacha_x86_64 libchacha adiantum libpoly1305 algif_skcipher joydev input_leds hid_generic usbhid hid ext4 mbcache crc16 jbd2 sr_mod amdgpu cdrom snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio dm_mod led_class mfd_core snd_hda_codec_hdmi drm_buddy r8169 gpu_sched evdev wmi_bmof drm_ttm_helper snd_hda_intel ttm snd_intel_dspcfg realtek i2c_algo_bit snd_hda_codec drm_display_helper snd_hwdep mdio_devres drm_kms_helper snd_hda_core sysimgblt syscopyarea snd_pcm sysfillrect libphy fb_sys_fops xhci_pci snd_timer ahci xhci_hcd snd libahci soundcore usbcore libata k10temp usb_common i2c_piix4 gpio_amdpt gpio_generic button pkcs8_key_parser nct6775 hwmon_vid nct6775_core wmi hwmon zram zsmalloc amd_pstate drm fuse drm_panel_orientation_quirks backlight configfs efivarfs
CPU: 7 PID: 2120 Comm: X:cs0 Not tainted 6.0.0-rc1-Zen3 #1
Hardware name: To Be Filled By O.E.M. B450M Steel Legend/B450M Steel Legend, BIOS P4.30 02/25/2022
RIP: 0010:refcount_warn_saturate+0x93/0xf0
Code: c7 c7 6d 4b e9 b2 e8 cc 13 bf ff 0f 0b c3 80 3d 5b fe da 00 00 75 af c6 05 52 fe da 00 01 48 c7 c7 ad 45 ea b2 e8 ad 13 bf ff <0f> 0b c3 80 3d 39 fe da 00 00 75 90 c6 05 30 fe da 00 01 48 c7 c7
RSP: 0018:ffffbc8ac1b7fb38 EFLAGS: 00010246
RAX: d8250f016f21c100 RBX: 0000000000000038 RCX: 0000000000000027
RDX: 00000000ffffbfff RSI: 0000000000000004 RDI: ffffa0db5ebd71c8
RBP: 0000000000000003 R08: 0000000000000000 R09: ffffa0db5e8a0000
R10: 0000000000000419 R11: 0000000000000000 R12: 00000000aaaaaaaa
R13: ffffa0d4f3e20000 R14: ffffa0d5a62ccc00 R15: 0000000000000003
FS: 00007f879006c640(0000) GS:ffffa0db5ebc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000563c67cfb000 CR3: 00000002c3938000 CR4: 0000000000350ee0
Call Trace:
 <TASK>
 amdgpu_cs_ioctl+0x498/0xdd0 [amdgpu]
 ? amdgpu_cs_report_moved_bytes+0x60/0x60 [amdgpu]
 drm_ioctl_kernel+0xdb/0x150 [drm]
 drm_ioctl+0x301/0x440 [drm]
 ? amdgpu_cs_report_moved_bytes+0x60/0x60 [amdgpu]
 amdgpu_drm_ioctl+0x42/0x80 [amdgpu]
 __se_sys_ioctl+0x72/0xc0
 do_syscall_64+0x6a/0x90
 ? do_user_addr_fault+0x2da/0x410
 ? exc_page_fault+0x5f/0x90
 entry_SYSCALL_64_after_hwframe+0x4b/0xb5
RIP: 0033:0x7f879b42496b
Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1b 48 8b 44 24 18 64 48 2b 04 25 28 00
RSP: 002b:00007f879006b590 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00000000c0186444 RCX: 00007f879b42496b
RDX: 00007f879006b8a8 RSI: 00000000c0186444 RDI: 000000000000000d
RBP: 00007f879006b8e0 R08: 00007f879006b970 R09: 0000000000000003
R10: 0000560b4e0cdc40 R11: 0000000000000246 R12: 000000000000000d
R13: 0000560b4e175968 R14: 0000000000000000 R15: 00007f879006b8a8
 </TASK>
---[ end trace 0000000000000000 ]---
Created attachment 301574
kernel .config (kernel 6.0-rc1, AMD Ryzen 9 5950X)
Created attachment 301683
kernel .config (kernel 5.19.4, AMD Ryzen 9 5950X)
Interesting! It turns out this is a GCC vs. Clang issue. A kernel built from the attached 5.19.4 config with clang-14.0.6 shows the issue as described, while a kernel built from the same config with gcc-12.2.0 works just fine! I'll close here as it's clear this is not strictly an AMD driver issue.
(In reply to Erhard F. from comment #13)
> I'll close here as it's clear this is not strictly an AMD driver issue.

Not really clear; there could be buggy amdgpu driver code, which happens not to result in noticeable issues in practice when compiled by GCC.
Agreed. I'll keep it open and check the issue again on new 6.x stable kernel releases and when clang 15 becomes available.
Created attachment 303180
kernel dmesg (kernel 6.1-rc5, AMD Ryzen 9 5950X)

Reinvestigating on kernel 6.1-rc5 built with clang 15.0.3 + lld 15.0.3. So far I have not been able to reproduce the bug! X runs just fine for now. I'll close this once 6.1 is stable and I can confirm that the bug no longer shows up on my other affected machines (AMD PRO A12-8830B, AMD PRO A10-8750B) as well.
Created attachment 303181
kernel .config (kernel 6.1-rc5, AMD Ryzen 9 5950X)
I can confirm the bug is now gone on all my affected systems. Kernels 6.1.x build and boot fine with GCC 12.2.1 and Clang 15.0.7, with no graphical corruption or dmesg errors to be seen. Closing here.