Bug 216143
Summary: | [bisected] garbled screen when starting X + dmesg cluttered with "[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed in the dependencies handling -1431655766!" | ||
---|---|---|---|
Product: | Drivers | Reporter: | Erhard F. (erhard_f) |
Component: | Video(DRI - non Intel) | Assignee: | drivers_video-dri |
Status: | RESOLVED OBSOLETE | ||
Severity: | normal | CC: | airlied, alexdeucher, deathsimple, toadron |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | v5.19-rc2 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
kernel dmesg (kernel 5.19-rc2, AMD Ryzen 9 5950X)
kernel .config (kernel 5.19-rc2, AMD Ryzen 9 5950X)
Xorg.0.log
bisect.log
kernel dmesg (kernel 6.0-rc1, AMD Ryzen 9 5950X)
kernel .config (kernel 6.0-rc1, AMD Ryzen 9 5950X)
kernel .config (kernel 5.19.4, AMD Ryzen 9 5950X)
kernel dmesg (kernel 6.1-rc5, AMD Ryzen 9 5950X)
kernel .config (kernel 6.1-rc5, AMD Ryzen 9 5950X) |
Description
Erhard F.
2022-06-17 21:44:45 UTC
Created attachment 301197 [details]
kernel .config (kernel 5.19-rc2, AMD Ryzen 9 5950X)
Created attachment 301198 [details]
Xorg.0.log
Created attachment 301199 [details]
bisect.log
Ok, seems to be commit 94f4c4965e5513ba624488f4b601d6b385635aec drm/amdgpu: partial revert "remove ctx->lock" v2 specifically. Reverting it on top of v5.19-rc2 gives me working X again and also the "[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed in the dependencies handling -1431655766!" errors disappear from dmesg.

Does this patch help?
https://patchwork.freedesktop.org/patch/490475/

It does not apply on top of 5.18.7 nor on top of 5.19-rc4.

(In reply to Alex Deucher from comment #5)
> Does this patch help?
> https://patchwork.freedesktop.org/patch/490475/
Had a closer look at the patch as it did not apply on top of v5.19-rc4. Seems like almost all of the patch diff is already in upstream v5.19-rc4. Only thing left to patch is:

--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c	2022-07-02 21:59:53.171528202 +0200
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c	2022-07-02 23:12:13.481985665 +0200
@@ -579,16 +579,6 @@ static int amdgpu_cs_parser_bos(struct a
 		e->bo_va = amdgpu_vm_bo_find(vm, bo);
 	}
 
-	/* Move fence waiting after getting reservation lock of
-	 * PD root. Then there is no need on a ctx mutex lock.
-	 */
-	r = amdgpu_ctx_wait_prev_fence(p->ctx, p->entity);
-	if (unlikely(r != 0)) {
-		if (r != -ERESTARTSYS)
-			DRM_ERROR("amdgpu_ctx_wait_prev_fence failed.\n");
-		goto error_validate;
-	}
-
 	amdgpu_cs_get_threshold_for_moves(p->adev, &p->bytes_moved_threshold,
 					  &p->bytes_moved_vis_threshold);
 	p->bytes_moved = 0;
@@ -947,7 +937,7 @@ static int amdgpu_cs_ib_fill(struct amdg
 	if (parser->job->uf_addr && ring->funcs->no_user_fence)
 		return -EINVAL;
 
-	return 0;
+	return amdgpu_ctx_wait_prev_fence(parser->ctx, parser->entity);
 }
 
 static int amdgpu_cs_process_fence_dep(struct amdgpu_cs_parser *p,

But applying this on top of v5.19-rc4 does not help either. I still need to revert 94f4c4965e5513ba624488f4b601d6b385635aec to get X going.
Tried https://cgit.freedesktop.org/drm/drm-misc/commit/?h=drm-misc-fixes&id=925b6e59138cefa47275c67891c65d48d3266d57 suggested in https://gitlab.freedesktop.org/drm/amd/-/issues/2050#note_1461646 but it did not work out. This bug here seems an entirely different matter.

v5.19-rc7 still affected.

Created attachment 301573 [details]
kernel dmesg (kernel 6.0-rc1, AMD Ryzen 9 5950X)
No change with v6.0-rc1.
[...]
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed in the dependencies handling -1431655766!
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed in the dependencies handling -1431655766!
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed in the dependencies handling -22!
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed in the dependencies handling -1431655766!
[...]
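The two distinct values in the log lines above can be read at the bit level. The following is a minimal userspace sketch, not kernel code; the helper names `raw_bits`, `is_repeated_byte` and `looks_like_errno` are hypothetical and exist only for this illustration. It shows that -22 falls in the conventional errno range (22 is EINVAL on Linux), while -1431655766 is the signed 32-bit reading of 0xAAAAAAAA, the same value that appears in R12 of the register dump in this report.

```c
#include <stdbool.h>
#include <stdint.h>

/* Reinterpret a signed 32-bit value as its raw two's-complement bits. */
uint32_t raw_bits(int32_t code)
{
	return (uint32_t)code;
}

/* True if all four bytes of the value are identical, e.g. 0xAAAAAAAA. */
bool is_repeated_byte(uint32_t v)
{
	uint32_t b = v & 0xffu;

	return v == b * 0x01010101u;
}

/* True if the value lies in the range the kernel reserves for errno
 * codes, -4095..-1 (MAX_ERRNO is 4095). */
bool looks_like_errno(int32_t code)
{
	return code < 0 && code >= -4095;
}
```

Passing -1431655766 through `raw_bits` gives 0xAAAAAAAA, which `is_repeated_byte` accepts and `looks_like_errno` rejects. A repeated-byte value that is not an errno may hint at an uninitialized or pattern-filled return value, though nothing in this report confirms that reading.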
Additionally I get:
[...]
------------[ cut here ]------------
refcount_t: underflow; use-after-free.
WARNING: CPU: 7 PID: 2120 at lib/refcount.c:28 refcount_warn_saturate+0x93/0xf0
Modules linked in: rfkill dm_crypt nhpoly1305_avx2 nhpoly1305 aes_generic aesni_intel libaes crypto_simd cryptd chacha_generic chacha_x86_64 libchacha adiantum libpoly1305 algif_skcipher joydev input_leds hid_generic usbhid hid ext4 mbcache crc16 jbd2 sr_mod amdgpu cdrom snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio dm_mod led_class mfd_core snd_hda_codec_hdmi drm_buddy r8169 gpu_sched evdev wmi_bmof drm_ttm_helper snd_hda_intel ttm snd_intel_dspcfg realtek i2c_algo_bit snd_hda_codec drm_display_helper snd_hwdep mdio_devres drm_kms_helper snd_hda_core sysimgblt syscopyarea snd_pcm sysfillrect libphy fb_sys_fops xhci_pci snd_timer ahci xhci_hcd snd libahci soundcore usbcore libata k10temp usb_common i2c_piix4 gpio_amdpt gpio_generic button pkcs8_key_parser nct6775 hwmon_vid nct6775_core wmi hwmon zram zsmalloc amd_pstate drm fuse drm_panel_orientation_quirks backlight configfs efivarfs
CPU: 7 PID: 2120 Comm: X:cs0 Not tainted 6.0.0-rc1-Zen3 #1
Hardware name: To Be Filled By O.E.M. B450M Steel Legend/B450M Steel Legend, BIOS P4.30 02/25/2022
RIP: 0010:refcount_warn_saturate+0x93/0xf0
Code: c7 c7 6d 4b e9 b2 e8 cc 13 bf ff 0f 0b c3 80 3d 5b fe da 00 00 75 af c6 05 52 fe da 00 01 48 c7 c7 ad 45 ea b2 e8 ad 13 bf ff <0f> 0b c3 80 3d 39 fe da 00 00 75 90 c6 05 30 fe da 00 01 48 c7 c7
RSP: 0018:ffffbc8ac1b7fb38 EFLAGS: 00010246
RAX: d8250f016f21c100 RBX: 0000000000000038 RCX: 0000000000000027
RDX: 00000000ffffbfff RSI: 0000000000000004 RDI: ffffa0db5ebd71c8
RBP: 0000000000000003 R08: 0000000000000000 R09: ffffa0db5e8a0000
R10: 0000000000000419 R11: 0000000000000000 R12: 00000000aaaaaaaa
R13: ffffa0d4f3e20000 R14: ffffa0d5a62ccc00 R15: 0000000000000003
FS: 00007f879006c640(0000) GS:ffffa0db5ebc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000563c67cfb000 CR3: 00000002c3938000 CR4: 0000000000350ee0
Call Trace:
<TASK>
amdgpu_cs_ioctl+0x498/0xdd0 [amdgpu]
? amdgpu_cs_report_moved_bytes+0x60/0x60 [amdgpu]
drm_ioctl_kernel+0xdb/0x150 [drm]
drm_ioctl+0x301/0x440 [drm]
? amdgpu_cs_report_moved_bytes+0x60/0x60 [amdgpu]
amdgpu_drm_ioctl+0x42/0x80 [amdgpu]
__se_sys_ioctl+0x72/0xc0
do_syscall_64+0x6a/0x90
? do_user_addr_fault+0x2da/0x410
? exc_page_fault+0x5f/0x90
entry_SYSCALL_64_after_hwframe+0x4b/0xb5
RIP: 0033:0x7f879b42496b
Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1b 48 8b 44 24 18 64 48 2b 04 25 28 00
RSP: 002b:00007f879006b590 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00000000c0186444 RCX: 00007f879b42496b
RDX: 00007f879006b8a8 RSI: 00000000c0186444 RDI: 000000000000000d
RBP: 00007f879006b8e0 R08: 00007f879006b970 R09: 0000000000000003
R10: 0000560b4e0cdc40 R11: 0000000000000246 R12: 000000000000000d
R13: 0000560b4e175968 R14: 0000000000000000 R15: 00007f879006b8a8
</TASK>
---[ end trace 0000000000000000 ]---
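The user-space half of the trace above also identifies the call that triggered the warning. ORIG_RAX is 0x10, which is syscall number 16 (ioctl) on x86-64, and RSI holds the request number 0xc0186444. The sketch below decodes such a request number, assuming the asm-generic ioctl bit layout that x86-64 uses; the struct and function names are hypothetical and chosen only for this example.

```c
#include <stdint.h>

/* Fields of a Linux ioctl request number in the asm-generic layout:
 * bits 0-7 nr, bits 8-15 type, bits 16-29 size, bits 30-31 direction.
 * Some architectures use a different split; x86-64 uses this one. */
struct ioctl_fields {
	unsigned nr;    /* command number within the driver */
	unsigned type;  /* "magic" character of the subsystem */
	unsigned size;  /* size of the argument structure in bytes */
	unsigned dir;   /* 1 = write, 2 = read, 3 = read and write */
};

struct ioctl_fields decode_ioctl(uint32_t req)
{
	struct ioctl_fields f;

	f.nr   = req & 0xffu;
	f.type = (req >> 8) & 0xffu;
	f.size = (req >> 16) & 0x3fffu;
	f.dir  = (req >> 30) & 0x3u;
	return f;
}
```

For 0xc0186444 this yields type 'd' (the DRM ioctl magic), nr 0x44, a 24-byte argument, and read/write direction. Since driver-specific DRM commands start at 0x40, nr 0x44 would be amdgpu command 4, which is consistent with the amdgpu_cs_ioctl frame at the top of the call trace.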
Created attachment 301574 [details]
kernel .config (kernel 6.0-rc1, AMD Ryzen 9 5950X)
Created attachment 301683 [details]
kernel .config (kernel 5.19.4, AMD Ryzen 9 5950X)
Interesting! Found out this is a gcc vs. clang issue. Using a kernel built with the attached 5.19.4 config with clang-14.0.6 leads to the issue as described. Using a kernel built with the same config but with gcc-12.2.0 just works fine!
I'll close here as it's clear this is not strictly an AMD driver issue.

(In reply to Erhard F. from comment #13)
> I'll close here as it's clear this is not strictly an AMD driver issue.
Not really clear; there could be buggy amdgpu driver code, which happens not to result in noticeable issues in practice when compiled by GCC.

Agreed. I'll keep it open and check the issue again on new 6.x stable kernel releases and when clang 15 becomes available.

Created attachment 303180 [details]
kernel dmesg (kernel 6.1-rc5, AMD Ryzen 9 5950X)
Reinvestigating on kernel 6.1-rc5 built with clang 15.0.3 + lld 15.0.3.
So far I was not able to reproduce the bug! X runs just fine for now.
I'll close here once 6.1 is stable and I can confirm the bug no longer shows up on my other affected machines (AMD PRO A12-8830B, AMD PRO A10-8750B) either.
Created attachment 303181 [details]
kernel .config (kernel 6.1-rc5, AMD Ryzen 9 5950X)
I can confirm the bug is gone now on all my affected systems. Kernels 6.1.x build & boot fine with GCC 12.2.1 and Clang 15.0.7, with no graphical corruption or dmesg errors to be seen. Closing here.