Bug 214621

Summary: WARNING: CPU: 3 PID: 521 at drivers/gpu/drm/ttm/ttm_bo.c:409 ttm_bo_release+0xb64/0xe40 [ttm]
Product: Drivers
Reporter: Erhard F. (erhard_f)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED DUPLICATE
Severity: normal
CC: alexdeucher, christian.koenig, fdc, frederik, jmprieto, kakha, Lang.Yu, ray.huang
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 5.15-rc4
Subsystem:
Regression: No
Bisected commit-id:
Attachments: kernel dmesg (5.15-rc4, AMD PRO A10-8750B)
kernel .config (5.15-rc4, AMD PRO A10-8750B)
fix a potential dma-buf release warning
test patch to find who pinned amdgpu dmabuf

Description Erhard F. 2021-10-04 23:41:06 UTC
Created attachment 299097 [details]
kernel dmesg (5.15-rc4, AMD PRO A10-8750B)

Happened during reboot. The machine was able to reboot successfully, however.

[...]
------------[ cut here ]------------
WARNING: CPU: 3 PID: 521 at drivers/gpu/drm/ttm/ttm_bo.c:409 ttm_bo_release+0xb64/0xe40 [ttm]
Modules linked in: rfcomm cmac bnep btusb btrtl btbcm btintel bluetooth jitterentropy_rng sha512_ssse3 sha512_generic drbg ansi_cprng ecdh_generic ecc rfkill dm_crypt nhpoly1305_sse2 nhpoly1305 chacha_generic chacha_x86_64 libchacha adiantum libpoly1305 algif_skcipher input_leds led_class joydev dm_mod hid_generic usbhid hid f2fs evdev crc32_generic lz4hc_compress raid456 async_raid6_recov async_memcpy lz4_compress async_pq async_xor lz4_decompress async_tx crc32_pclmul ohci_pci md_mod aesni_intel libaes crypto_simd cryptd amdgpu ext4 crc16 fam15h_power k10temp snd_hda_codec_hdmi mbcache ohci_hcd ehci_pci jbd2 ehci_hcd i2c_piix4 snd_hda_intel drm_ttm_helper ttm snd_intel_dspcfg mfd_core snd_hda_codec gpu_sched i2c_algo_bit xhci_pci snd_hwdep snd_hda_core xhci_hcd drm_kms_helper snd_pcm usbcore snd_timer usb_common syscopyarea sysfillrect snd sysimgblt fb_sys_fops soundcore acpi_cpufreq video button processor zram zsmalloc nct6775 hwmon_vid hwmon nfsd auth_rpcgss lockd grace
 drm fuse drm_panel_orientation_quirks backlight configfs sunrpc efivarfs
CPU: 3 PID: 521 Comm: X Not tainted 5.15.0-rc4-bdver3 #2
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./A88M-G/3.1, BIOS P1.40C 11/21/2016
RIP: 0010:ttm_bo_release+0xb64/0xe40 [ttm]
Code: c1 ea 03 80 3c 02 00 0f 85 77 01 00 00 48 8b bb f0 fe ff ff b9 28 23 00 00 31 d2 be 01 00 00 00 e8 81 bc 50 dd e9 d3 fe ff ff <0f> 0b e9 1c f5 ff ff 4c 89 e7 e8 4d 4d 50 dd e9 26 fc ff ff be 03
RSP: 0018:ffffc90001a8fbe0 EFLAGS: 00010206
RAX: 0000000000000007 RBX: ffff888106ade698 RCX: 0000000000000009
RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff888106ade698
RBP: ffff888106ade458 R08: ffffffffc0ba3689 R09: ffff888106ade69b
R10: ffffed1020d5bcd3 R11: 0000000000000001 R12: ffff88814db40010
R13: ffff88814c0ad2f8 R14: ffff88814c0ad340 R15: ffff88816a9fb1c0
FS:  00007f52f84f2dc0(0000) GS:ffff8883d1780000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1fd3d6d078 CR3: 0000000142fda000 CR4: 00000000000506e0
Call Trace:
 ? fsnotify_grab_connector+0xcc/0x190
 ? fsnotify_destroy_marks+0x5f/0x200
 amdgpu_bo_unref+0x2c/0x60 [amdgpu]
 amdgpu_gem_object_free+0x6a/0xa0 [amdgpu]
 ? amdgpu_gem_object_mmap+0xe0/0xe0 [amdgpu]
 ? trace_hardirqs_on+0x1c/0x110
 drm_gem_dmabuf_release+0x82/0xb0 [drm]
 dma_buf_release+0x127/0x230
 __dentry_kill+0x376/0x550
 ? dma_buf_file_release+0x177/0x200
 __fput+0x2c0/0x8c0
 task_work_run+0xc5/0x150
 do_exit+0x799/0x20c0
 ? mm_update_next_owner+0x6d0/0x6d0
 do_group_exit+0xe7/0x290
 __x64_sys_exit_group+0x35/0x40
 do_syscall_64+0x66/0x90
 ? do_syscall_64+0xe/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f52f7d8c2f9
Code: Unable to access opcode bytes at RIP 0x7f52f7d8c2cf.
RSP: 002b:00007ffc11abca58 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 00007f52f7e74920 RCX: 00007f52f7d8c2f9
RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
RBP: 00007f52f7e74920 R08: fffffffffffffd40 R09: 000055d959e97190
R10: 00007f52f75386b8 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000668 R15: 0000000000000000
irq event stamp: 815615493
hardirqs last  enabled at (815615499): [<ffffffff9d1f7f4c>] vprintk_emit+0x2dc/0x310
hardirqs last disabled at (815615504): [<ffffffff9d1f7efb>] vprintk_emit+0x28b/0x310
softirqs last  enabled at (815614376): [<ffffffff9e62766c>] unix_release_sock+0x23c/0xa70
softirqs last disabled at (815614374): [<ffffffff9e6275f2>] unix_release_sock+0x1c2/0xa70
---[ end trace 4449f17f76814cfa ]---


 # lspci 
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 30h-3fh) Processor Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 30h-3fh) I/O Memory Management Unit
00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Kaveri [Radeon R7 Graphics] (rev d7)
00:01.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Kaveri HDMI/DP Audio Controller
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 30h-3fh) Processor Root Port
00:02.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Kaveri P2P Bridge for GFX PCIe Port [1:0]
00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 30h-3fh) Processor Root Port
00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 30h-3fh) Processor Root Port
00:03.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 30h-3fh) Processor Root Port
00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 30h-3fh) Processor Root Port
00:10.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB XHCI Controller (rev 09)
00:10.1 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB XHCI Controller (rev 09)
00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 40)
00:12.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB OHCI Controller (rev 11)
00:12.2 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB EHCI Controller (rev 11)
00:13.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB OHCI Controller (rev 11)
00:13.2 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB EHCI Controller (rev 11)
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 16)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 11)
00:14.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] FCH PCI Bridge (rev 40)
00:14.5 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB OHCI Controller (rev 11)
00:15.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Hudson PCI to PCI bridge (PCIE port 0)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 30h-3fh) Processor Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 30h-3fh) Processor Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 30h-3fh) Processor Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 30h-3fh) Processor Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 30h-3fh) Processor Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 30h-3fh) Processor Function 5
01:00.0 USB controller: ASMedia Technology Inc. ASM2142 USB 3.1 Host Controller
02:00.0 USB controller: ASMedia Technology Inc. ASM1143 USB 3.1 Host Controller
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 11)
05:00.0 Non-Volatile memory controller: Sandisk Corp WD Black SN750 / PC SN730 NVMe SSD

 # lspci -s 00:01.0 -v
00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Kaveri [Radeon R7 Graphics] (rev d7) (prog-if 00 [VGA controller])
	Subsystem: ASRock Incorporation Kaveri [Radeon R7 Graphics]
	Flags: bus master, fast devsel, latency 0, IRQ 62, IOMMU group 0
	Memory at c0000000 (64-bit, prefetchable) [size=256M]
	Memory at d0000000 (64-bit, prefetchable) [size=8M]
	I/O ports at f000 [size=256]
	Memory at feb00000 (32-bit, non-prefetchable) [size=256K]
	Expansion ROM at 000c0000 [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
	Capabilities: [58] Express Root Complex Integrated Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [270] Secondary PCI Express
	Capabilities: [2b0] Address Translation Service (ATS)
	Capabilities: [2c0] Page Request Interface (PRI)
	Capabilities: [2d0] Process Address Space ID (PASID)
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu
Comment 1 Erhard F. 2021-10-04 23:41:43 UTC
Created attachment 299099 [details]
kernel .config (5.15-rc4, AMD PRO A10-8750B)
Comment 2 Erhard F. 2021-10-20 15:15:25 UTC

*** This bug has been marked as a duplicate of bug 213983 ***
Comment 3 Lang Yu 2021-10-27 07:12:25 UTC
Could you please try to reproduce it on an Ubuntu 20.04 system? I couldn't reproduce it on Ubuntu 20.04.

And could you please capture the output of the following command before rebooting? Thanks!

$ cat /sys/kernel/debug/dma_buf/bufinfo
Comment 4 Erhard F. 2021-10-27 13:50:46 UTC
(In reply to Lang Yu from comment #3)
> Could you please reproduce it on a ubuntu 20.04 system? I didn't reproduce
> it on ubuntu 20.04.
Not on this machine unfortunately. I need it 24/7 as a fileserver and audio server. Building vanilla or -rc kernels and running these on Gentoo is fine but installing another distro, getting configs equal and running Ubuntu instead for testing purposes is too much effort, sorry.

> And could you please get outputs of following command before rebooting?
> Thanks!
> 
> $ cat /sys/kernel/debug/dma_buf/bufinfo
Current kernel is v5.14.14, built with clang 12.0.1 (-Os).

 # cat /sys/kernel/debug/dma_buf/bufinfo

Dma-buf Objects:
size    	flags   	mode    	count   	exp_name	ino     
03047424	00000002	00080007	00000003	drm	00415409	
	Shared fence: drm_sched gfx signalled
	Attached Devices:
Total 0 devices attached

03047424	00000002	00080007	00000003	drm	00414559	
	Shared fence: drm_sched gfx signalled
	Attached Devices:
Total 0 devices attached

00004096	00000002	00080007	00000003	drm	00414537	
	Attached Devices:
Total 0 devices attached

00004096	00000002	00080007	00000003	drm	00415280	
	Shared fence: drm_sched gfx signalled
	Attached Devices:
Total 0 devices attached

07864320	00000002	00080007	00000003	drm	00244896	
	Shared fence: drm_sched gfx signalled
	Attached Devices:
Total 0 devices attached

07864320	00000002	00080007	00000003	drm	00242446	
	Shared fence: drm_sched gfx signalled
	Attached Devices:
Total 0 devices attached

00065536	00000002	00080007	00000003	drm	00058778	
	Attached Devices:
Total 0 devices attached

08355840	00000002	00080007	00000004	drm	00058661	
	Attached Devices:
Total 0 devices attached

08355840	00000002	00080007	00000002	drm	00003715	
	Shared fence: drm_sched gfx signalled
	Attached Devices:
Total 0 devices attached


Total 9 objects, 38608896 bytes
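
For anyone who wants to cross-check a bufinfo dump like the one above, here is a minimal sketch in C that sums the size column. It assumes the layout shown here, where each object line starts with the decimal size and every other line starts with a tab or text. The names bufinfo_totals and bufinfo_sum are made up for this example and are not kernel or library APIs.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type: number of dma-buf objects and their total size. */
struct bufinfo_totals {
	size_t objects;
	unsigned long long bytes;
};

/*
 * Sum the size column of a bufinfo dump held in memory.
 * Assumption: object lines are exactly the lines that begin with a digit.
 */
static struct bufinfo_totals bufinfo_sum(const char *text)
{
	struct bufinfo_totals t = { 0, 0 };
	const char *line = text;

	assert(text != NULL);

	while (*line) {
		const char *end = strchr(line, '\n');

		if (isdigit((unsigned char)*line)) {
			/* Base 10 on purpose: the sizes are zero-padded decimal,
			 * and base 0 would misread them as octal. */
			t.bytes += strtoull(line, NULL, 10);
			t.objects++;
		}
		if (!end)
			break;
		line = end + 1;
	}
	return t;
}
```

Fed the dump above, this should count 9 objects and 38608896 bytes, which agrees with the "Total" line printed by the kernel.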
Comment 5 Lang Yu 2021-11-01 08:46:05 UTC
Created attachment 299383 [details]
fix a potential dma-buf release warning

Please give the attached patch a try. Thanks!
Comment 6 Erhard F. 2021-11-03 18:01:09 UTC
(In reply to Lang Yu from comment #5)
> Created attachment 299383 [details]
> fix a potential dma-buf release warning
> 
> Please have a try with attached patch. Thanks!
Thanks! Applied the patch on top of v5.15 but still get:

[...]
------------[ cut here ]------------
WARNING: CPU: 2 PID: 519 at drivers/gpu/drm/ttm/ttm_bo.c:409 ttm_bo_release+0xb64/0xe40 [ttm]
Modules linked in: rfkill dm_crypt nhpoly1305_sse2 nhpoly1305 chacha_generic chacha_x86_64 libchacha adiantum libpoly1305 algif_skcipher joydev input_leds led_class hid_generic usbhid dm_mod hid ohci_pci raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx evdev f2fs crc32_generic lz4hc_compress lz4_compress lz4_decompress crc32_pclmul amdgpu md_mod aesni_intel libaes crypto_simd cryptd ext4 crc16 mbcache fam15h_power snd_hda_codec_hdmi jbd2 k10temp ehci_pci ohci_hcd snd_hda_intel ehci_hcd snd_intel_dspcfg xhci_pci drm_ttm_helper i2c_piix4 snd_hda_codec ttm mfd_core snd_hwdep snd_hda_core gpu_sched xhci_hcd i2c_algo_bit snd_pcm drm_kms_helper usbcore snd_timer syscopyarea sysfillrect snd sysimgblt usb_common fb_sys_fops soundcore acpi_cpufreq video processor button zram zsmalloc nfsd nct6775 hwmon_vid hwmon auth_rpcgss drm lockd grace fuse drm_panel_orientation_quirks backlight configfs sunrpc efivarfs
CPU: 2 PID: 519 Comm: X Not tainted 5.15.0-bdver3+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./A88M-G/3.1, BIOS P1.40C 11/21/2016
RIP: 0010:ttm_bo_release+0xb64/0xe40 [ttm]
Code: c1 ea 03 80 3c 02 00 0f 85 77 01 00 00 48 8b bb f0 fe ff ff b9 28 23 00 00 31 d2 be 01 00 00 00 e8 81 c9 54 da e9 d3 fe ff ff <0f> 0b e9 1c f5 ff ff 4c 89 e7 e8 4d 5a 54 da e9 26 fc ff ff be 03
RSP: 0018:ffffc900018afb18 EFLAGS: 00010202
RAX: 0000000000000007 RBX: ffff88813d2a7298 RCX: 000000000000001c
RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff88813d2a7298
RBP: ffff88813d2a7000 R08: ffffffffc0b63689 R09: ffff88813d2a729b
R10: ffffed1027a54e53 R11: 0000000000000001 R12: dffffc0000000000
R13: ffff8881748d03a8 R14: ffff8881748d03f0 R15: ffff88810b138b40
FS:  00007fa8bfe7adc0(0000) GS:ffff8883d1700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000562b379a2098 CR3: 0000000142970000 CR4: 00000000000506e0
Call Trace:
 ? fsnotify_grab_connector+0xcc/0x190
 amdgpu_bo_unref+0x2c/0x60 [amdgpu]
 amdgpu_gem_object_free+0xc0/0x100 [amdgpu]
 ? amdgpu_gem_object_mmap+0xe0/0xe0 [amdgpu]
 ? call_rcu+0x37f/0x730
 ? trace_hardirqs_on+0x1c/0x110
 drm_gem_dmabuf_release+0x82/0xb0 [drm]
 dma_buf_release+0x127/0x230
 __dentry_kill+0x376/0x550
 ? dma_buf_file_release+0x177/0x200
 __fput+0x2c0/0x8c0
 task_work_run+0xc5/0x150
 do_exit+0x799/0x20c0
 ? mm_update_next_owner+0x6d0/0x6d0
 do_group_exit+0xe7/0x290
 __x64_sys_exit_group+0x35/0x40
 do_syscall_64+0x66/0x90
 ? do_syscall_64+0xe/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fa8bf6fc2f9
Code: Unable to access opcode bytes at RIP 0x7fa8bf6fc2cf.
RSP: 002b:00007ffc95722778 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 00007fa8bf7e4920 RCX: 00007fa8bf6fc2f9
RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
RBP: 00007fa8bf7e4920 R08: fffffffffffffd40 R09: 000000000098b190
R10: 00007fa8bef086b8 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 000000000000066a R15: 0000000000000000
irq event stamp: 887428545
hardirqs last  enabled at (887428551): [<ffffffff9a1f801c>] vprintk_emit+0x2dc/0x310
hardirqs last disabled at (887428556): [<ffffffff9a1f7fcb>] vprintk_emit+0x28b/0x310
softirqs last  enabled at (887427644): [<ffffffff9a0d0165>] __irq_exit_rcu+0xe5/0x120
softirqs last disabled at (887427625): [<ffffffff9a0d0165>] __irq_exit_rcu+0xe5/0x120
---[ end trace 1b4ae7cf543ff5f4 ]---
[...]

It does not trigger on every reboot, though; the machine needs to have been running for a few hours.
Comment 7 Christian König 2021-11-04 07:19:33 UTC
Yeah, that won't work. As far as I can see the problem is not inside amdgpu, but rather inside the driver which is importing buffers from amdgpu.
Comment 8 Lang Yu 2021-11-04 07:54:00 UTC
(In reply to Christian König from comment #7)
> Yeah, that won't work. As far as I can see the problem is not inside amdgpu,
> but rather inside the driver which is importing buffers from amdgpu.

At the very least, we should call drm_prime_gem_destroy() to detach the dma-buf (if it exists) before WARN_ON_ONCE(bo->pin_count).

And if clients don't unmap/detach the amdgpu dma-buf properly, do you think amdgpu should do that work? Thanks!
Comment 9 Christian König 2021-11-04 08:05:24 UTC
(In reply to Lang Yu from comment #8)
> (In reply to Christian König from comment #7)
> > Yeah, that won't work. As far as I can see the problem is not inside
> amdgpu,
> > but rather inside the driver which is importing buffers from amdgpu.
> 
> At least, we should call drm_prime_gem_destroy() to detach dma-buf(if
> exists) before WARN_ON_ONCE(bo->pin_count).

Nope, that's incorrect. You are mixing things up here.

This is for the case when amdgpu imports a buffer, but the warning happens when amdgpu exports a buffer.

And on import you indeed only want to drop the attachment after the BO is really destroyed, and not when the GEM handle is destroyed. Otherwise you could potentially unmap memory while it is still used by the hardware.

> And do you think if clients don't unmap/detach amdgpu dma-buf properly,
> should amdgpu do that work? Thanks!

No. That rather looks like the importer is messing up some reference count and forgets to destroy the attachment before the dma-buf. There is absolutely nothing the exporter can do in that situation.

There is a slight chance that the bug is indeed somewhere inside amdgpu or the dma-buf framework itself (Michel and I are hunting a similar issue at the moment), but it does work with other driver combinations.
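
To illustrate the importer/exporter argument, here is a toy model in C. It is a deliberate simplification, not kernel code: toy_bo, toy_attach, toy_detach and toy_release_warns are invented names, and the sketch assumes that every attachment takes one pin on the exported buffer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a buffer object exported as a dma-buf. */
struct toy_bo {
	int pin_count;   /* outstanding pins on the buffer */
	int attachments; /* importers currently attached */
};

/* Importer side: attaching pins the buffer (assumption of this sketch). */
static void toy_attach(struct toy_bo *bo)
{
	bo->attachments++;
	bo->pin_count++;
}

/* Importer side: detaching drops the pin taken by toy_attach(). */
static void toy_detach(struct toy_bo *bo)
{
	assert(bo->attachments > 0 && bo->pin_count > 0);
	bo->attachments--;
	bo->pin_count--;
}

/*
 * Exporter side: at release time it can only observe the leftover state.
 * Returns true when a check like WARN_ON_ONCE(bo->pin_count) would fire.
 */
static bool toy_release_warns(const struct toy_bo *bo)
{
	return bo->pin_count != 0 || bo->attachments != 0;
}
```

In this model, an importer that calls toy_attach() and then toy_detach() leaves nothing behind, so the release is silent. One that skips toy_detach() makes toy_release_warns() return true, and the exporter has no way of knowing which pin to drop on the importer's behalf.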
Comment 10 Lang Yu 2021-11-04 10:09:23 UTC
(In reply to Christian König from comment #9)
> (In reply to Lang Yu from comment #8)
> > (In reply to Christian König from comment #7)
> > > Yeah, that won't work. As far as I can see the problem is not inside
> > amdgpu,
> > > but rather inside the driver which is importing buffers from amdgpu.
> > 
> > At least, we should call drm_prime_gem_destroy() to detach dma-buf(if
> > exists) before WARN_ON_ONCE(bo->pin_count).
> 
> Nope, that's incorrect. You are mixing things up here.
> 
> This is for the case when amdgpu imports a buffer, but the warning happens
> when amdgpu exports a buffer.
> 
> And on import you indeed only want to drop the attachment after the BO is
> really destroyed or not when the GEM handle is destroyed. Otherwise you
> could potentially unmap memory while it is still used by the hardware.
> 
> > And do you think if clients don't unmap/detach amdgpu dma-buf properly,
> > should amdgpu do that work? Thanks!
> 
> No. That rather looks like the importer is messing up some reference count
> and forgets to destroy the attachment before the dma-buf. There is
> absolutely nothing the exporter can do in that situation.
> 
> There is the slightly chance that the bug is indeed somewhere inside amdgpu
> or the dma-buf framework itself (Michel and I are huntin a similar issue at
> the moment), but it does work with other driver combinations.

Thanks for your clarification. Seems hard to reproduce the issue.
Comment 11 Christian König 2021-11-04 10:16:16 UTC
Well it's really appreciated that you are looking into this.

One thing we might want to do is to move the warning in dma_buf_release():

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 3f63d58bf68a..6ecc01585cf4 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -75,6 +75,7 @@ static void dma_buf_release(struct dentry *dentry)
         * dma-buf while still having pending operation to the buffer.
         */
        BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
+       WARN_ON(!list_empty(&dmabuf->attachments));
 
        dma_buf_stats_teardown(dmabuf);
        dmabuf->ops->release(dmabuf);
@@ -82,7 +83,6 @@ static void dma_buf_release(struct dentry *dentry)
        if (dmabuf->resv == (struct dma_resv *)&dmabuf[1])
                dma_resv_fini(dmabuf->resv);
 
-       WARN_ON(!list_empty(&dmabuf->attachments));
        module_put(dmabuf->owner);
        kfree(dmabuf->name);
        kfree(dmabuf);

This way users get the dma-buf warning first and are maybe a bit less confused.
Comment 12 Lang Yu 2021-11-04 12:15:48 UTC
(In reply to Christian König from comment #11)
> Well it's really appreciated that you are looking into this.
> 
> One thing we might want to do is to move the warning in dma_buf_release():
> 
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 3f63d58bf68a..6ecc01585cf4 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -75,6 +75,7 @@ static void dma_buf_release(struct dentry *dentry)
>          * dma-buf while still having pending operation to the buffer.
>          */
>         BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
> +       WARN_ON(!list_empty(&dmabuf->attachments));
>  
>         dma_buf_stats_teardown(dmabuf);
>         dmabuf->ops->release(dmabuf);
> @@ -82,7 +83,6 @@ static void dma_buf_release(struct dentry *dentry)
>         if (dmabuf->resv == (struct dma_resv *)&dmabuf[1])
>                 dma_resv_fini(dmabuf->resv);
>  
> -       WARN_ON(!list_empty(&dmabuf->attachments));
>         module_put(dmabuf->owner);
>         kfree(dmabuf->name);
>         kfree(dmabuf);
> 
> This way users get the dma-buf warning first and maybe a bit less confused.

The warning was only merged into mainline 5.15.0 on Tue Nov 2 16:47:49 2021 (commit 56d33754481f). I'm not sure whether Erhard F.'s build contains this warning.

We could also add a debug WARN() to amdgpu_dma_buf_pin() to see who pinned the dma-buf.
Comment 13 Christian König 2021-11-04 12:39:54 UTC
(In reply to Lang Yu from comment #12)
> The warning was just merged into mainline 5.15.0 on Tue Nov 2 16:47:49
> 2021(commit 56d33754481f). Not sure Erhard F.'s build contains this warning. 

Good point.

> And we can also add a debug WARN() into amdgpu_dma_buf_pin() to see who
> pinned dma_buf.

I thought about that as well; the problem is that we call this function very often (e.g. 60 times a second if we play a 60fps video or similar).

That said, I could just as well be misled by the dma_buf_release() function in the call stack. This could potentially also be a bug in DAL/DC where we forget to unpin a BO in some situation.
Comment 14 Erhard F. 2021-11-04 13:02:16 UTC
(In reply to Lang Yu from comment #12)
> The warning was just merged into mainline 5.15.0 on Tue Nov 2 16:47:49
> 2021(commit 56d33754481f). Not sure Erhard F.'s build contains this warning.
I applied your patch on top of v5.15 after its release, which was on 2021-10-31, not on git master.
Comment 15 Lang Yu 2021-11-04 14:13:05 UTC
(In reply to Erhard F. from comment #14)
> (In reply to Lang Yu from comment #12)
> > The warning was just merged into mainline 5.15.0 on Tue Nov 2 16:47:49
> > 2021(commit 56d33754481f). Not sure Erhard F.'s build contains this
> warning.
> I applied your patch on top of v5.15 after its' release which was 2021-10-31
> not on git master.

Many thanks for your help! I made a test patch to find out who pinned the amdgpu dmabuf.
Could you please apply it on the latest mainline 5.15.0 (commit 7ddb58cb0ecae8e8b6181d736a87667cc9ab8389), then reproduce the warning and collect the full dmesg? I still haven't been able to reproduce it on my machine...
Comment 16 Alex Deucher 2021-11-04 14:17:34 UTC
(In reply to Lang Yu from comment #15)
> Many thanks for your help! I made a test patch to find who pinned amdgpu
> dmabuf.

Did you forget to attach it?
Comment 17 Lang Yu 2021-11-04 14:18:40 UTC
Created attachment 299441 [details]
test patch to find who pinned amdgpu dmabuf

The dmesg output may be too large for the default log buffer, so please add log_buf_len=1024M to the kernel cmdline.
Thanks for your help!
Comment 18 Lang Yu 2021-11-12 10:10:01 UTC
Hi all,

I reproduced the issue. Thanks to Erhard F. for the help!

The problem is that the BO pinned by the last call to amdgpu_display_crtc_page_flip_target() was not unpinned properly.


int amdgpu_display_crtc_page_flip_target(struct drm_crtc *crtc,
				struct drm_framebuffer *fb,
				struct drm_pending_vblank_event *event,
				uint32_t page_flip_flags, uint32_t target,
				struct drm_modeset_acquire_ctx *ctx)
{
	struct drm_device *dev = crtc->dev;
	struct amdgpu_device *adev = drm_to_adev(dev);
	struct amdgpu_crtc *amdgpu_crtc = to_amdgpu_crtc(crtc);
	struct drm_gem_object *obj;
	struct amdgpu_flip_work *work;
	struct amdgpu_bo *new_abo;
	unsigned long flags;
	u64 tiling_flags;
	int i, r;

	work = kzalloc(sizeof *work, GFP_KERNEL);
	if (work == NULL)
		return -ENOMEM;

	INIT_DELAYED_WORK(&work->flip_work, amdgpu_display_flip_work_func);
	INIT_WORK(&work->unpin_work, amdgpu_display_unpin_work_func);

	work->event = event;
	work->adev = adev;
	work->crtc_id = amdgpu_crtc->crtc_id;
	work->async = (page_flip_flags & DRM_MODE_PAGE_FLIP_ASYNC) != 0;

	/* schedule unpin of the old buffer */
	obj = crtc->primary->fb->obj[0];

	/* take a reference to the old object */
	work->old_abo = gem_to_amdgpu_bo(obj);
	amdgpu_bo_ref(work->old_abo);

	obj = fb->obj[0];
	new_abo = gem_to_amdgpu_bo(obj);

	/* pin the new buffer */
	r = amdgpu_bo_reserve(new_abo, false);
	if (unlikely(r != 0)) {
		DRM_ERROR("failed to reserve new abo buffer before flip\n");
		goto cleanup;
	}

	if (!adev->enable_virtual_display) {
		r = amdgpu_bo_pin(new_abo,
				  amdgpu_display_supported_domains(adev, new_abo->flags));
		if (unlikely(r != 0)) {
			DRM_ERROR("failed to pin new abo buffer before flip\n");
			goto unreserve;
		}
	}
        
        ......

}
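
The pattern can be sketched with a small userspace model in C. This is an illustration only: toy_fb, toy_crtc, toy_page_flip and toy_crtc_disable are hypothetical names, the deferred unpin work is collapsed into an immediate decrement, and toy_crtc_disable() merely shows where a final unpin would have to happen. It is not the actual fix.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical framebuffer with just a pin counter. */
struct toy_fb {
	int pin_count;
};

/* Hypothetical CRTC that remembers the framebuffer currently scanned out. */
struct toy_crtc {
	struct toy_fb *cur_fb;
};

/* Each flip pins the new buffer and unpins the one it replaces. */
static void toy_page_flip(struct toy_crtc *crtc, struct toy_fb *new_fb)
{
	struct toy_fb *old_fb = crtc->cur_fb;

	assert(new_fb != NULL);

	new_fb->pin_count++;         /* pin the new buffer */
	crtc->cur_fb = new_fb;
	if (old_fb)
		old_fb->pin_count--; /* stands in for the deferred unpin work */
}

/* Hypothetical teardown: drop the pin held by the last flip. */
static void toy_crtc_disable(struct toy_crtc *crtc)
{
	if (crtc->cur_fb) {
		crtc->cur_fb->pin_count--;
		crtc->cur_fb = NULL;
	}
}
```

After any sequence of flips, exactly one buffer (the target of the last flip) still has pin_count == 1. If that buffer is freed without something like toy_crtc_disable() running first, the release path sees a non-zero pin count, which is the condition the warning checks for.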

Regards,
Lang
Comment 19 Erhard F. 2021-11-12 11:08:12 UTC
(In reply to Lang Yu from comment #18)
> Hi all,
> 
> I reproduced the issue. Thanks for Erhard F.'s work!
> 
> The problem is the pinned BO of last call to 
> amdgpu_display_crtc_page_flip_target() was not unpinned properly.
Thanks for your work on it, Lang! I was rather busy and would only have been able to test it out this weekend. But I'm glad you found the cause of the issue anyhow!
Comment 20 Christian König 2021-11-12 12:43:35 UTC
Nice work, Lang. The only question now is how to fix it.

We probably need to assign this bug to Harry and Nicholas.
Comment 21 Kakha 2022-02-17 13:12:31 UTC
On Fedora 35 with GNOME I had the same problem when I logged in with an Xorg session, and there was no problem with Xwayland. Then I rebooted with init 3 and started both sessions without GDM, and the problem was gone! I will be happy if this information helps. Thanks to all the developers for your work!