Created attachment 299097
kernel dmesg (5.15-rc4, AMD PRO A10-8750B)

Happened during reboot. The machine was able to reboot successfully, however.

[...]
------------[ cut here ]------------
WARNING: CPU: 3 PID: 521 at drivers/gpu/drm/ttm/ttm_bo.c:409 ttm_bo_release+0xb64/0xe40 [ttm]
Modules linked in: rfcomm cmac bnep btusb btrtl btbcm btintel bluetooth jitterentropy_rng sha512_ssse3 sha512_generic drbg ansi_cprng ecdh_generic ecc rfkill dm_crypt nhpoly1305_sse2 nhpoly1305 chacha_generic chacha_x86_64 libchacha adiantum libpoly1305 algif_skcipher input_leds led_class joydev dm_mod hid_generic usbhid hid f2fs evdev crc32_generic lz4hc_compress raid456 async_raid6_recov async_memcpy lz4_compress async_pq async_xor lz4_decompress async_tx crc32_pclmul ohci_pci md_mod aesni_intel libaes crypto_simd cryptd amdgpu ext4 crc16 fam15h_power k10temp snd_hda_codec_hdmi mbcache ohci_hcd ehci_pci jbd2 ehci_hcd i2c_piix4 snd_hda_intel drm_ttm_helper ttm snd_intel_dspcfg mfd_core snd_hda_codec gpu_sched i2c_algo_bit xhci_pci snd_hwdep snd_hda_core xhci_hcd drm_kms_helper snd_pcm usbcore snd_timer usb_common syscopyarea sysfillrect snd sysimgblt fb_sys_fops soundcore acpi_cpufreq video button processor zram zsmalloc nct6775 hwmon_vid hwmon nfsd auth_rpcgss lockd grace drm fuse drm_panel_orientation_quirks backlight configfs sunrpc efivarfs
CPU: 3 PID: 521 Comm: X Not tainted 5.15.0-rc4-bdver3 #2
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./A88M-G/3.1, BIOS P1.40C 11/21/2016
RIP: 0010:ttm_bo_release+0xb64/0xe40 [ttm]
Code: c1 ea 03 80 3c 02 00 0f 85 77 01 00 00 48 8b bb f0 fe ff ff b9 28 23 00 00 31 d2 be 01 00 00 00 e8 81 bc 50 dd e9 d3 fe ff ff <0f> 0b e9 1c f5 ff ff 4c 89 e7 e8 4d 4d 50 dd e9 26 fc ff ff be 03
RSP: 0018:ffffc90001a8fbe0 EFLAGS: 00010206
RAX: 0000000000000007 RBX: ffff888106ade698 RCX: 0000000000000009
RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff888106ade698
RBP: ffff888106ade458 R08: ffffffffc0ba3689 R09: ffff888106ade69b
R10: ffffed1020d5bcd3 R11: 0000000000000001 R12: ffff88814db40010
R13: ffff88814c0ad2f8 R14: ffff88814c0ad340 R15: ffff88816a9fb1c0
FS: 00007f52f84f2dc0(0000) GS:ffff8883d1780000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1fd3d6d078 CR3: 0000000142fda000 CR4: 00000000000506e0
Call Trace:
 ? fsnotify_grab_connector+0xcc/0x190
 ? fsnotify_destroy_marks+0x5f/0x200
 amdgpu_bo_unref+0x2c/0x60 [amdgpu]
 amdgpu_gem_object_free+0x6a/0xa0 [amdgpu]
 ? amdgpu_gem_object_mmap+0xe0/0xe0 [amdgpu]
 ? trace_hardirqs_on+0x1c/0x110
 drm_gem_dmabuf_release+0x82/0xb0 [drm]
 dma_buf_release+0x127/0x230
 __dentry_kill+0x376/0x550
 ? dma_buf_file_release+0x177/0x200
 __fput+0x2c0/0x8c0
 task_work_run+0xc5/0x150
 do_exit+0x799/0x20c0
 ? mm_update_next_owner+0x6d0/0x6d0
 do_group_exit+0xe7/0x290
 __x64_sys_exit_group+0x35/0x40
 do_syscall_64+0x66/0x90
 ? do_syscall_64+0xe/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f52f7d8c2f9
Code: Unable to access opcode bytes at RIP 0x7f52f7d8c2cf.
RSP: 002b:00007ffc11abca58 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 00007f52f7e74920 RCX: 00007f52f7d8c2f9
RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
RBP: 00007f52f7e74920 R08: fffffffffffffd40 R09: 000055d959e97190
R10: 00007f52f75386b8 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000668 R15: 0000000000000000
irq event stamp: 815615493
hardirqs last enabled at (815615499): [<ffffffff9d1f7f4c>] vprintk_emit+0x2dc/0x310
hardirqs last disabled at (815615504): [<ffffffff9d1f7efb>] vprintk_emit+0x28b/0x310
softirqs last enabled at (815614376): [<ffffffff9e62766c>] unix_release_sock+0x23c/0xa70
softirqs last disabled at (815614374): [<ffffffff9e6275f2>] unix_release_sock+0x1c2/0xa70
---[ end trace 4449f17f76814cfa ]---

 # lspci
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 30h-3fh) Processor Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 30h-3fh) I/O Memory Management Unit
00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Kaveri [Radeon R7 Graphics] (rev d7)
00:01.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Kaveri HDMI/DP Audio Controller
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 30h-3fh) Processor Root Port
00:02.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Kaveri P2P Bridge for GFX PCIe Port [1:0]
00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 30h-3fh) Processor Root Port
00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 30h-3fh) Processor Root Port
00:03.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 30h-3fh) Processor Root Port
00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 30h-3fh) Processor Root Port
00:10.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB XHCI Controller (rev 09)
00:10.1 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB XHCI Controller (rev 09)
00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 40)
00:12.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB OHCI Controller (rev 11)
00:12.2 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB EHCI Controller (rev 11)
00:13.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB OHCI Controller (rev 11)
00:13.2 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB EHCI Controller (rev 11)
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 16)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 11)
00:14.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] FCH PCI Bridge (rev 40)
00:14.5 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB OHCI Controller (rev 11)
00:15.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Hudson PCI to PCI bridge (PCIE port 0)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 30h-3fh) Processor Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 30h-3fh) Processor Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 30h-3fh) Processor Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 30h-3fh) Processor Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 30h-3fh) Processor Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 30h-3fh) Processor Function 5
01:00.0 USB controller: ASMedia Technology Inc. ASM2142 USB 3.1 Host Controller
02:00.0 USB controller: ASMedia Technology Inc. ASM1143 USB 3.1 Host Controller
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 11)
05:00.0 Non-Volatile memory controller: Sandisk Corp WD Black SN750 / PC SN730 NVMe SSD

 # lspci -s 00:01.0 -v
00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Kaveri [Radeon R7 Graphics] (rev d7) (prog-if 00 [VGA controller])
        Subsystem: ASRock Incorporation Kaveri [Radeon R7 Graphics]
        Flags: bus master, fast devsel, latency 0, IRQ 62, IOMMU group 0
        Memory at c0000000 (64-bit, prefetchable) [size=256M]
        Memory at d0000000 (64-bit, prefetchable) [size=8M]
        I/O ports at f000 [size=256]
        Memory at feb00000 (32-bit, non-prefetchable) [size=256K]
        Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [58] Express Root Complex Integrated Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [270] Secondary PCI Express
        Capabilities: [2b0] Address Translation Service (ATS)
        Capabilities: [2c0] Page Request Interface (PRI)
        Capabilities: [2d0] Process Address Space ID (PASID)
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu
Created attachment 299099
kernel .config (5.15-rc4, AMD PRO A10-8750B)
*** This bug has been marked as a duplicate of bug 213983 ***
Could you please reproduce it on an Ubuntu 20.04 system? I could not reproduce it on Ubuntu 20.04.

Also, could you please capture the output of the following command before rebooting? Thanks!

$ cat /sys/kernel/debug/dma_buf/bufinfo
(In reply to Lang Yu from comment #3)
> Could you please reproduce it on an Ubuntu 20.04 system? I could not
> reproduce it on Ubuntu 20.04.

Not on this machine, unfortunately. I need it 24/7 as a file server and audio server. Building vanilla or -rc kernels and running them on Gentoo is fine, but installing another distro, getting the configs equal, and running Ubuntu instead for testing purposes is too much effort, sorry.

> Also, could you please capture the output of the following command before
> rebooting? Thanks!
>
> $ cat /sys/kernel/debug/dma_buf/bufinfo

The current kernel is v5.14.14, built with clang 12.0.1 (-Os).

 # cat /sys/kernel/debug/dma_buf/bufinfo

Dma-buf Objects:
size      flags     mode      count     exp_name  ino
03047424  00000002  00080007  00000003  drm       00415409
        Shared fence: drm_sched gfx signalled
        Attached Devices:
Total 0 devices attached

03047424  00000002  00080007  00000003  drm       00414559
        Shared fence: drm_sched gfx signalled
        Attached Devices:
Total 0 devices attached

00004096  00000002  00080007  00000003  drm       00414537
        Attached Devices:
Total 0 devices attached

00004096  00000002  00080007  00000003  drm       00415280
        Shared fence: drm_sched gfx signalled
        Attached Devices:
Total 0 devices attached

07864320  00000002  00080007  00000003  drm       00244896
        Shared fence: drm_sched gfx signalled
        Attached Devices:
Total 0 devices attached

07864320  00000002  00080007  00000003  drm       00242446
        Shared fence: drm_sched gfx signalled
        Attached Devices:
Total 0 devices attached

00065536  00000002  00080007  00000003  drm       00058778
        Attached Devices:
Total 0 devices attached

08355840  00000002  00080007  00000004  drm       00058661
        Attached Devices:
Total 0 devices attached

08355840  00000002  00080007  00000002  drm       00003715
        Shared fence: drm_sched gfx signalled
        Attached Devices:
Total 0 devices attached

Total 9 objects, 38608896 bytes
Created attachment 299383
fix a potential dma-buf release warning

Please give the attached patch a try. Thanks!
(In reply to Lang Yu from comment #5)
> Created attachment 299383
> fix a potential dma-buf release warning
>
> Please give the attached patch a try. Thanks!

Thanks! I applied the patch on top of v5.15 but still get:

[...]
------------[ cut here ]------------
WARNING: CPU: 2 PID: 519 at drivers/gpu/drm/ttm/ttm_bo.c:409 ttm_bo_release+0xb64/0xe40 [ttm]
Modules linked in: rfkill dm_crypt nhpoly1305_sse2 nhpoly1305 chacha_generic chacha_x86_64 libchacha adiantum libpoly1305 algif_skcipher joydev input_leds led_class hid_generic usbhid dm_mod hid ohci_pci raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx evdev f2fs crc32_generic lz4hc_compress lz4_compress lz4_decompress crc32_pclmul amdgpu md_mod aesni_intel libaes crypto_simd cryptd ext4 crc16 mbcache fam15h_power snd_hda_codec_hdmi jbd2 k10temp ehci_pci ohci_hcd snd_hda_intel ehci_hcd snd_intel_dspcfg xhci_pci drm_ttm_helper i2c_piix4 snd_hda_codec ttm mfd_core snd_hwdep snd_hda_core gpu_sched xhci_hcd i2c_algo_bit snd_pcm drm_kms_helper usbcore snd_timer syscopyarea sysfillrect snd sysimgblt usb_common fb_sys_fops soundcore acpi_cpufreq video processor button zram zsmalloc nfsd nct6775 hwmon_vid hwmon auth_rpcgss drm lockd grace fuse drm_panel_orientation_quirks backlight configfs sunrpc efivarfs
CPU: 2 PID: 519 Comm: X Not tainted 5.15.0-bdver3+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./A88M-G/3.1, BIOS P1.40C 11/21/2016
RIP: 0010:ttm_bo_release+0xb64/0xe40 [ttm]
Code: c1 ea 03 80 3c 02 00 0f 85 77 01 00 00 48 8b bb f0 fe ff ff b9 28 23 00 00 31 d2 be 01 00 00 00 e8 81 c9 54 da e9 d3 fe ff ff <0f> 0b e9 1c f5 ff ff 4c 89 e7 e8 4d 5a 54 da e9 26 fc ff ff be 03
RSP: 0018:ffffc900018afb18 EFLAGS: 00010202
RAX: 0000000000000007 RBX: ffff88813d2a7298 RCX: 000000000000001c
RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff88813d2a7298
RBP: ffff88813d2a7000 R08: ffffffffc0b63689 R09: ffff88813d2a729b
R10: ffffed1027a54e53 R11: 0000000000000001 R12: dffffc0000000000
R13: ffff8881748d03a8 R14: ffff8881748d03f0 R15: ffff88810b138b40
FS: 00007fa8bfe7adc0(0000) GS:ffff8883d1700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000562b379a2098 CR3: 0000000142970000 CR4: 00000000000506e0
Call Trace:
 ? fsnotify_grab_connector+0xcc/0x190
 amdgpu_bo_unref+0x2c/0x60 [amdgpu]
 amdgpu_gem_object_free+0xc0/0x100 [amdgpu]
 ? amdgpu_gem_object_mmap+0xe0/0xe0 [amdgpu]
 ? call_rcu+0x37f/0x730
 ? trace_hardirqs_on+0x1c/0x110
 drm_gem_dmabuf_release+0x82/0xb0 [drm]
 dma_buf_release+0x127/0x230
 __dentry_kill+0x376/0x550
 ? dma_buf_file_release+0x177/0x200
 __fput+0x2c0/0x8c0
 task_work_run+0xc5/0x150
 do_exit+0x799/0x20c0
 ? mm_update_next_owner+0x6d0/0x6d0
 do_group_exit+0xe7/0x290
 __x64_sys_exit_group+0x35/0x40
 do_syscall_64+0x66/0x90
 ? do_syscall_64+0xe/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fa8bf6fc2f9
Code: Unable to access opcode bytes at RIP 0x7fa8bf6fc2cf.
RSP: 002b:00007ffc95722778 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 00007fa8bf7e4920 RCX: 00007fa8bf6fc2f9
RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
RBP: 00007fa8bf7e4920 R08: fffffffffffffd40 R09: 000000000098b190
R10: 00007fa8bef086b8 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 000000000000066a R15: 0000000000000000
irq event stamp: 887428545
hardirqs last enabled at (887428551): [<ffffffff9a1f801c>] vprintk_emit+0x2dc/0x310
hardirqs last disabled at (887428556): [<ffffffff9a1f7fcb>] vprintk_emit+0x28b/0x310
softirqs last enabled at (887427644): [<ffffffff9a0d0165>] __irq_exit_rcu+0xe5/0x120
softirqs last disabled at (887427625): [<ffffffff9a0d0165>] __irq_exit_rcu+0xe5/0x120
---[ end trace 1b4ae7cf543ff5f4 ]---
[...]

It does not trigger on every reboot, though; the machine needs to have been running for a few hours.
Yeah, that won't work. As far as I can see the problem is not inside amdgpu, but rather inside the driver which is importing buffers from amdgpu.
(In reply to Christian König from comment #7)
> Yeah, that won't work. As far as I can see the problem is not inside amdgpu,
> but rather inside the driver which is importing buffers from amdgpu.

At the very least, we should call drm_prime_gem_destroy() to detach the dma-buf (if it exists) before WARN_ON_ONCE(bo->pin_count).

And if clients don't unmap/detach the amdgpu dma-buf properly, do you think amdgpu should do that work? Thanks!
(In reply to Lang Yu from comment #8)
> (In reply to Christian König from comment #7)
> > Yeah, that won't work. As far as I can see the problem is not inside
> > amdgpu, but rather inside the driver which is importing buffers from
> > amdgpu.
>
> At the very least, we should call drm_prime_gem_destroy() to detach the
> dma-buf (if it exists) before WARN_ON_ONCE(bo->pin_count).

Nope, that's incorrect. You are mixing things up here.

This is for the case when amdgpu imports a buffer, but the warning happens when amdgpu exports a buffer.

And on import you indeed only want to drop the attachment after the BO is really destroyed, and not when the GEM handle is destroyed. Otherwise you could potentially unmap memory while it is still used by the hardware.

> And if clients don't unmap/detach the amdgpu dma-buf properly, do you think
> amdgpu should do that work? Thanks!

No. That rather looks like the importer is messing up some reference count and forgets to destroy the attachment before the dma-buf. There is absolutely nothing the exporter can do in that situation.

There is a slight chance that the bug is indeed somewhere inside amdgpu or the dma-buf framework itself (Michel and I are hunting a similar issue at the moment), but it does work with other driver combinations.
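The ordering requirement described above (the importer destroys its attachment before the last reference to the dma-buf goes away) can be illustrated with a small userspace model. This is only a sketch: `toy_dmabuf` and every `toy_*` function are hypothetical names invented for the illustration, not kernel APIs, and the model ignores locking, fences, and mappings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct toy_dmabuf {
    int refcount;     /* references held by the exporter and importers */
    int attachments;  /* importer attachments still registered */
    bool warned;      /* set if the final put found leftover attachments */
};

static void toy_init(struct toy_dmabuf *buf)
{
    buf->refcount = 1;   /* the exporter's own reference */
    buf->attachments = 0;
    buf->warned = false;
}

static void toy_get(struct toy_dmabuf *buf)    { buf->refcount++; }
static void toy_attach(struct toy_dmabuf *buf) { buf->attachments++; }

static void toy_detach(struct toy_dmabuf *buf)
{
    assert(buf->attachments > 0);
    buf->attachments--;
}

/* Dropping the last reference releases the buffer and checks for leftovers. */
static void toy_put(struct toy_dmabuf *buf)
{
    assert(buf->refcount > 0);
    if (--buf->refcount == 0 && buf->attachments != 0) {
        buf->warned = true;
        fprintf(stderr, "toy: released with %d attachment(s) left\n",
                buf->attachments);
    }
}

/* Run one importer lifecycle; returns true if the release warned. */
static bool toy_run(bool importer_detaches)
{
    struct toy_dmabuf buf;

    toy_init(&buf);
    toy_get(&buf);       /* importer takes a reference */
    toy_attach(&buf);    /* and attaches to the buffer */
    if (importer_detaches)
        toy_detach(&buf);
    toy_put(&buf);       /* importer drops its reference */
    toy_put(&buf);       /* exporter drops the last reference */
    return buf.warned;
}
```

In this model `toy_run(true)` completes silently, while `toy_run(false)` reaches the final put with one attachment still registered. The exporter side has no information that would let it clean up safely at that point, which matches the argument made in the comment.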
(In reply to Christian König from comment #9)
> (In reply to Lang Yu from comment #8)
> > (In reply to Christian König from comment #7)
> > > Yeah, that won't work. As far as I can see the problem is not inside
> > > amdgpu, but rather inside the driver which is importing buffers from
> > > amdgpu.
> >
> > At the very least, we should call drm_prime_gem_destroy() to detach the
> > dma-buf (if it exists) before WARN_ON_ONCE(bo->pin_count).
>
> Nope, that's incorrect. You are mixing things up here.
>
> This is for the case when amdgpu imports a buffer, but the warning happens
> when amdgpu exports a buffer.
>
> And on import you indeed only want to drop the attachment after the BO is
> really destroyed, and not when the GEM handle is destroyed. Otherwise you
> could potentially unmap memory while it is still used by the hardware.
>
> > And if clients don't unmap/detach the amdgpu dma-buf properly, do you
> > think amdgpu should do that work? Thanks!
>
> No. That rather looks like the importer is messing up some reference count
> and forgets to destroy the attachment before the dma-buf. There is
> absolutely nothing the exporter can do in that situation.
>
> There is a slight chance that the bug is indeed somewhere inside amdgpu
> or the dma-buf framework itself (Michel and I are hunting a similar issue at
> the moment), but it does work with other driver combinations.

Thanks for the clarification. The issue seems hard to reproduce.
Well, it's really appreciated that you are looking into this.

One thing we might want to do is to move the warning in dma_buf_release():

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 3f63d58bf68a..6ecc01585cf4 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -75,6 +75,7 @@ static void dma_buf_release(struct dentry *dentry)
         * dma-buf while still having pending operation to the buffer.
         */
        BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
+       WARN_ON(!list_empty(&dmabuf->attachments));
 
        dma_buf_stats_teardown(dmabuf);
        dmabuf->ops->release(dmabuf);
@@ -82,7 +83,6 @@ static void dma_buf_release(struct dentry *dentry)
        if (dmabuf->resv == (struct dma_resv *)&dmabuf[1])
                dma_resv_fini(dmabuf->resv);
 
-       WARN_ON(!list_empty(&dmabuf->attachments));
        module_put(dmabuf->owner);
        kfree(dmabuf->name);
        kfree(dmabuf);

This way users get the dma-buf warning first and are maybe a bit less confused.
(In reply to Christian König from comment #11)
> Well, it's really appreciated that you are looking into this.
>
> One thing we might want to do is to move the warning in dma_buf_release():
>
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 3f63d58bf68a..6ecc01585cf4 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -75,6 +75,7 @@ static void dma_buf_release(struct dentry *dentry)
>          * dma-buf while still having pending operation to the buffer.
>          */
>         BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
> +       WARN_ON(!list_empty(&dmabuf->attachments));
>
>         dma_buf_stats_teardown(dmabuf);
>         dmabuf->ops->release(dmabuf);
> @@ -82,7 +83,6 @@ static void dma_buf_release(struct dentry *dentry)
>         if (dmabuf->resv == (struct dma_resv *)&dmabuf[1])
>                 dma_resv_fini(dmabuf->resv);
>
> -       WARN_ON(!list_empty(&dmabuf->attachments));
>         module_put(dmabuf->owner);
>         kfree(dmabuf->name);
>         kfree(dmabuf);
>
> This way users get the dma-buf warning first and are maybe a bit less
> confused.

The warning was only merged into mainline 5.15.0 on Tue Nov 2 16:47:49 2021 (commit 56d33754481f). I am not sure whether Erhard F.'s build contains this warning.

We could also add a debug WARN() to amdgpu_dma_buf_pin() to see who pinned the dma-buf.
(In reply to Lang Yu from comment #12)
> The warning was only merged into mainline 5.15.0 on Tue Nov 2 16:47:49
> 2021 (commit 56d33754481f). I am not sure whether Erhard F.'s build contains
> this warning.

Good point.

> We could also add a debug WARN() to amdgpu_dma_buf_pin() to see who pinned
> the dma-buf.

I thought about that as well. The problem is that we call this function very often (e.g. 60 times a second if we play a 60fps video or similar).

That said, I could just as well be misled by the dma_buf_release() function in the call stack. This could potentially also be a bug in DAL/DC where we forget to unpin a BO in some situation.
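One way around the call-frequency concern raised here is to record who pinned a buffer silently and only print when a buffer is released while still pinned. The sketch below is a hypothetical userspace model of that idea, not the contents of any patch in this thread; `toy_bo`, the `toy_bo_*` functions, and the caller tags are all invented for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical model of a buffer object with pin tracking. */
struct toy_bo {
    int pin_count;
    const char *last_pinner;  /* tag of the most recent caller that pinned */
    int reports;              /* number of leak reports printed */
};

/* Pinning only updates bookkeeping, so the hot path produces no log output. */
static void toy_bo_pin(struct toy_bo *bo, const char *who)
{
    bo->pin_count++;
    bo->last_pinner = who;
}

static void toy_bo_unpin(struct toy_bo *bo)
{
    assert(bo->pin_count > 0);
    bo->pin_count--;
}

/* The report is printed once, at release, and only if a pin is left over. */
static void toy_bo_release(struct toy_bo *bo)
{
    if (bo->pin_count != 0) {
        bo->reports++;
        fprintf(stderr, "toy: released with pin_count=%d, last pinned by %s\n",
                bo->pin_count, bo->last_pinner);
    }
}

/*
 * Simulate many balanced pin/unpin cycles, optionally followed by one pin
 * that is never undone. Returns the number of reports; *culprit receives
 * the tag of the last pinner.
 */
static int toy_simulate(int balanced_cycles, int leak_one, const char **culprit)
{
    struct toy_bo bo = { 0, "nobody", 0 };

    for (int i = 0; i < balanced_cycles; i++) {
        toy_bo_pin(&bo, "dma_buf_pin");
        toy_bo_unpin(&bo);
    }
    if (leak_one)
        toy_bo_pin(&bo, "page_flip");
    toy_bo_release(&bo);
    *culprit = bo.last_pinner;
    return bo.reports;
}
```

A limitation of this approach is that the last pinner is not necessarily the caller that forgot to unpin; with several concurrent pinners, a per-caller counter would be needed to attribute the leak reliably.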
(In reply to Lang Yu from comment #12)
> The warning was only merged into mainline 5.15.0 on Tue Nov 2 16:47:49
> 2021 (commit 56d33754481f). I am not sure whether Erhard F.'s build contains
> this warning.

I applied your patch on top of v5.15 as released on 2021-10-31, not on git master.
(In reply to Erhard F. from comment #14)
> (In reply to Lang Yu from comment #12)
> > The warning was only merged into mainline 5.15.0 on Tue Nov 2 16:47:49
> > 2021 (commit 56d33754481f). I am not sure whether Erhard F.'s build
> > contains this warning.
>
> I applied your patch on top of v5.15 as released on 2021-10-31, not on git
> master.

Many thanks for your help! I made a test patch to find out who pinned the amdgpu dma-buf. Could you please apply it on the latest mainline 5.15.0 (commit 7ddb58cb0ecae8e8b6181d736a87667cc9ab8389), then reproduce the warning and collect the full dmesg? I still have not been able to reproduce it on my machine.
(In reply to Lang Yu from comment #15)
> Many thanks for your help! I made a test patch to find out who pinned the
> amdgpu dma-buf.

Did you forget to attach it?
Created attachment 299441
test patch to find who pinned amdgpu dmabuf

The dmesg output may be too large; please add log_buf_len=1024M to the kernel command line. Thanks for your help!
Hi all,

I reproduced the issue. Thanks to Erhard F. for his work!

The problem is that the BO pinned by the last call to amdgpu_display_crtc_page_flip_target() was not unpinned properly.

int amdgpu_display_crtc_page_flip_target(struct drm_crtc *crtc,
                                struct drm_framebuffer *fb,
                                struct drm_pending_vblank_event *event,
                                uint32_t page_flip_flags, uint32_t target,
                                struct drm_modeset_acquire_ctx *ctx)
{
        struct drm_device *dev = crtc->dev;
        struct amdgpu_device *adev = drm_to_adev(dev);
        struct amdgpu_crtc *amdgpu_crtc = to_amdgpu_crtc(crtc);
        struct drm_gem_object *obj;
        struct amdgpu_flip_work *work;
        struct amdgpu_bo *new_abo;
        unsigned long flags;
        u64 tiling_flags;
        int i, r;

        work = kzalloc(sizeof *work, GFP_KERNEL);
        if (work == NULL)
                return -ENOMEM;

        INIT_DELAYED_WORK(&work->flip_work, amdgpu_display_flip_work_func);
        INIT_WORK(&work->unpin_work, amdgpu_display_unpin_work_func);

        work->event = event;
        work->adev = adev;
        work->crtc_id = amdgpu_crtc->crtc_id;
        work->async = (page_flip_flags & DRM_MODE_PAGE_FLIP_ASYNC) != 0;

        /* schedule unpin of the old buffer */
        obj = crtc->primary->fb->obj[0];

        /* take a reference to the old object */
        work->old_abo = gem_to_amdgpu_bo(obj);
        amdgpu_bo_ref(work->old_abo);

        obj = fb->obj[0];
        new_abo = gem_to_amdgpu_bo(obj);

        /* pin the new buffer */
        r = amdgpu_bo_reserve(new_abo, false);
        if (unlikely(r != 0)) {
                DRM_ERROR("failed to reserve new abo buffer before flip\n");
                goto cleanup;
        }

        if (!adev->enable_virtual_display) {
                r = amdgpu_bo_pin(new_abo,
                                  amdgpu_display_supported_domains(adev, new_abo->flags));
                if (unlikely(r != 0)) {
                        DRM_ERROR("failed to pin new abo buffer before flip\n");
                        goto unreserve;
                }
        }
        ......
}

Regards,
Lang
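The pattern in the quoted function (pin the new buffer, schedule an unpin of the old one) leaves exactly one buffer pinned after any number of flips: the one shown last. The following userspace model illustrates that accounting. It is a simplified sketch under stated assumptions: `toy_crtc`, `toy_page_flip()`, and `toy_crtc_disable()` are hypothetical names, the unpin happens immediately instead of in deferred work, and the teardown step only stands in for wherever a real fix would drop the final pin, which this thread does not specify.

```c
#include <assert.h>

#define TOY_NUM_BUFFERS 3

/* Hypothetical model of a CRTC cycling through a few framebuffers. */
struct toy_crtc {
    int pin_count[TOY_NUM_BUFFERS];
    int current;  /* index of the buffer being scanned out, -1 if none */
};

static void toy_crtc_init(struct toy_crtc *crtc)
{
    for (int i = 0; i < TOY_NUM_BUFFERS; i++)
        crtc->pin_count[i] = 0;
    crtc->current = -1;
}

/* One flip: pin the new buffer, then unpin the buffer it replaces. */
static void toy_page_flip(struct toy_crtc *crtc, int new_fb)
{
    assert(new_fb >= 0 && new_fb < TOY_NUM_BUFFERS);
    crtc->pin_count[new_fb]++;
    if (crtc->current >= 0)
        crtc->pin_count[crtc->current]--;
    crtc->current = new_fb;
}

/* Teardown step that no later flip performs: unpin what is still shown. */
static void toy_crtc_disable(struct toy_crtc *crtc)
{
    if (crtc->current >= 0) {
        crtc->pin_count[crtc->current]--;
        crtc->current = -1;
    }
}

/* Flip 'flips' times and return the number of pins left at exit. */
static int toy_leaked_pins(int flips, int disable_at_exit)
{
    struct toy_crtc crtc;
    int total = 0;

    toy_crtc_init(&crtc);
    for (int i = 0; i < flips; i++)
        toy_page_flip(&crtc, i % TOY_NUM_BUFFERS);
    if (disable_at_exit)
        toy_crtc_disable(&crtc);
    for (int i = 0; i < TOY_NUM_BUFFERS; i++)
        total += crtc.pin_count[i];
    return total;
}
```

In the model, every flip is balanced by the next one, so the leak does not grow with uptime; only the final flip has no successor to unpin its buffer. If that buffer is then released at process exit, its pin count is still non-zero, which is the condition the WARN in ttm_bo_release() reports.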
(In reply to Lang Yu from comment #18)
> Hi all,
>
> I reproduced the issue. Thanks to Erhard F. for his work!
>
> The problem is that the BO pinned by the last call to
> amdgpu_display_crtc_page_flip_target() was not unpinned properly.

Thanks for your work on it, Lang! I was rather busy and would only have been able to test it this weekend. But I'm glad you found the cause of the issue anyhow!
Nice work, Lang. The only question now is how to fix it.

We probably need to assign this bug to Harry and Nicholas.
On Fedora 35 GNOME I had the same problem when I logged in with an Xorg session, but there was no problem with Xwayland. Then I rebooted with init 3 and started both sessions without GDM, and the problem was gone!

I will be happy if this information helps. Thanks to all developers for your work!