Bug 214447 - [bisected] Several memory leaks in radeon and ttm
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(Other)
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_video-other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-09-17 01:12 UTC by Erhard F.
Modified: 2021-11-03 09:45 UTC

See Also:
Kernel Version: 5.14.5
Subsystem:
Regression: No
Bisected commit-id:


Attachments
kernel dmesg (kernel 5.14.5, AMD Ryzen 9 5950X) (59.83 KB, text/plain)
2021-09-17 01:12 UTC, Erhard F.
kernel .config (kernel 5.14.5, AMD Ryzen 9 5950X) (114.38 KB, text/plain)
2021-09-17 01:13 UTC, Erhard F.
output of kmemleak (kernel 5.14.5, AMD Ryzen 9 5950X) (1.83 MB, text/plain)
2021-09-17 01:14 UTC, Erhard F.
bisect.log (3.17 KB, text/plain)
2021-09-17 22:38 UTC, Erhard F.
kernel dmesg (kernel 5.14.9, Talos II Secure Workstation) (90.07 KB, text/plain)
2021-10-08 00:15 UTC, Erhard F.
kernel .config (kernel 5.14.9, Talos II Secure Workstation) (107.72 KB, text/plain)
2021-10-08 00:21 UTC, Erhard F.
kernel dmesg (kernel 5.15-rc5, AMD Ryzen 9 5950X) (61.99 KB, text/plain)
2021-10-17 22:39 UTC, Erhard F.
output of kmemleak (kernel 5.15-rc5, AMD Ryzen 9 5950X) (226.68 KB, application/x-xz)
2021-10-17 22:40 UTC, Erhard F.
kernel .config (kernel 5.15-rc5, AMD Ryzen 9 5950X) (113.56 KB, text/plain)
2021-10-17 22:41 UTC, Erhard F.
Potential fix (1.00 KB, application/mbox)
2021-10-20 17:27 UTC, Christian König

Description Erhard F. 2021-09-17 01:12:42 UTC
Created attachment 298853 [details]
kernel dmesg (kernel 5.14.5, AMD Ryzen 9 5950X)

Getting this on kernel 5.14.5 with my Radeon X1550:

unreferenced object 0xffff911cfdbf4e00 (size 128):
  comm "X", pid 570, jiffies 4294879485 (age 3667.907s)
  hex dump (first 32 bytes):
    00 99 b7 fd 1c 91 ff ff 00 f2 70 c0 ff ff ff ff  ..........p.....
    a7 c6 67 be 01 00 00 00 d0 7c 4e 0d e2 a8 ff ff  ..g......|N.....
  backtrace:
    [<ffffffffc06415f0>] radeon_fence_emit+0x20/0xe0 [radeon]
    [<ffffffffc065d966>] r100_copy_blit+0x586/0x650 [radeon]
    [<ffffffffc0642b55>] radeon_bo_move+0x365/0x560 [radeon]
    [<ffffffffc055ae8a>] ttm_bo_handle_move_mem+0x8a/0x180 [ttm]
    [<ffffffffc055c19e>] ttm_bo_validate+0xae/0x180 [ttm]
    [<ffffffffc06442a3>] radeon_bo_fault_reserve_notify+0x113/0x1e0 [radeon]
    [<ffffffffc06557aa>] radeon_gem_fault+0x5a/0xb0 [radeon]
    [<ffffffffb6171dd3>] __do_fault+0x33/0xe0
    [<ffffffffb6178d20>] __handle_mm_fault+0xc90/0x1260
    [<ffffffffb61793a5>] handle_mm_fault+0xb5/0x230
    [<ffffffffb66d6205>] exc_page_fault+0x185/0x5e0
    [<ffffffffb6800b1e>] asm_exc_page_fault+0x1e/0x30
unreferenced object 0xffff911cfdbf7800 (size 128):
  comm "X", pid 570, jiffies 4294879485 (age 3667.907s)
  hex dump (first 32 bytes):
    00 99 b7 fd 1c 91 ff ff 00 f2 70 c0 ff ff ff ff  ..........p.....
    a9 01 6a be 01 00 00 00 d0 7c 4e 0d e2 a8 ff ff  ..j......|N.....
  backtrace:
    [<ffffffffc06415f0>] radeon_fence_emit+0x20/0xe0 [radeon]
    [<ffffffffc065d966>] r100_copy_blit+0x586/0x650 [radeon]
    [<ffffffffc0642b55>] radeon_bo_move+0x365/0x560 [radeon]
    [<ffffffffc055ae8a>] ttm_bo_handle_move_mem+0x8a/0x180 [ttm]
    [<ffffffffc055c19e>] ttm_bo_validate+0xae/0x180 [ttm]
    [<ffffffffc06442a3>] radeon_bo_fault_reserve_notify+0x113/0x1e0 [radeon]
    [<ffffffffc06557aa>] radeon_gem_fault+0x5a/0xb0 [radeon]
    [<ffffffffb6171dd3>] __do_fault+0x33/0xe0
    [<ffffffffb6178d20>] __handle_mm_fault+0xc90/0x1260
    [<ffffffffb61793a5>] handle_mm_fault+0xb5/0x230
    [<ffffffffb66d6205>] exc_page_fault+0x185/0x5e0
    [<ffffffffb6800b1e>] asm_exc_page_fault+0x1e/0x30
[...]


 # lspci 
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Starship/Matisse IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
00:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:05.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 61)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 7
01:00.0 Non-Volatile memory controller: Sandisk Corp WD Blue SN550 NVMe SSD (rev 01)
02:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset USB 3.1 XHCI Controller (rev 01)
02:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller (rev 01)
02:00.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Bridge (rev 01)
03:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)
03:01.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)
03:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
07:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] RV516 [Radeon X1300/X1550 Series]
07:00.1 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] RV516 [Radeon X1300/X1550 Series] (Secondary)
08:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function
09:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP
09:00.1 Encryption controller: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Cryptographic Coprocessor PSPCPP
09:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller


 # lspci -s 07:00.0 -vv
07:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] RV516 [Radeon X1300/X1550 Series] (prog-if 00 [VGA controller])
	Subsystem: PC Partner Limited / Sapphire Technology RV516 [Radeon X1300/X1550 Series]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 57
	IOMMU group: 2
	Region 0: Memory at e0000000 (64-bit, prefetchable) [size=256M]
	Region 2: Memory at fce30000 (64-bit, non-prefetchable) [size=64K]
	Region 4: I/O ports at e000 [size=256]
	Expansion ROM at 000c0000 [disabled] [size=128K]
	Capabilities: [50] Power Management version 2
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [58] Express (v1) Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE- FLReset- SlotPowerLimit 75.000W
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 128 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 2.5GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s (ok), Width x16 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
	Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Kernel driver in use: radeon
	Kernel modules: radeon
Comment 1 Erhard F. 2021-09-17 01:13:20 UTC
Created attachment 298855 [details]
kernel .config (kernel 5.14.5, AMD Ryzen 9 5950X)
Comment 2 Erhard F. 2021-09-17 01:14:17 UTC
Created attachment 298857 [details]
output of kmemleak (kernel 5.14.5, AMD Ryzen 9 5950X)
Comment 3 Christian König 2021-09-17 06:14:54 UTC
Looks like a missing dma_fence_put() somewhere.

Can you bisect this? It should be rather trivial to pinpoint the patch which introduced this.
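To illustrate what a missing put looks like in general, here is a minimal userspace sketch in C. It is not the kernel's dma_fence API and not the actual radeon code path: `toy_fence`, `toy_fence_create()`, `toy_fence_put()`, `toy_move()` and `toy_leaked_after()` are all hypothetical names made up for this example. It only shows why every emitted fence whose last reference is never dropped shows up as one unreferenced object.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted fence; NOT struct dma_fence. */
struct toy_fence {
    int refcount;
};

/* Number of fences that are allocated and not yet freed. */
static int live_fences;

/* Allocate a fence; the caller owns the single initial reference. */
static struct toy_fence *toy_fence_create(void)
{
    struct toy_fence *f = malloc(sizeof(*f));

    if (!f)
        return NULL;
    f->refcount = 1;
    live_fences++;
    return f;
}

/* Drop one reference and free the fence when the last one is gone. */
static void toy_fence_put(struct toy_fence *f)
{
    assert(f && f->refcount > 0);
    if (--f->refcount == 0) {
        free(f);
        live_fences--;
    }
}

/* One "buffer move": emit a fence, then (optionally) forget the put. */
static void toy_move(int forget_put)
{
    struct toy_fence *f = toy_fence_create();

    if (!f)
        return;
    /* ... the copy would be scheduled and waited for here ... */
    if (!forget_put)
        toy_fence_put(f);
    /* With forget_put set, the only pointer to f is lost right here. */
}

/* Returns how many fences are still allocated after n moves. */
int toy_leaked_after(int n, int forget_put)
{
    int before = live_fences;

    for (int i = 0; i < n; i++)
        toy_move(forget_put);
    return live_fences - before;
}
```

In this toy model, `toy_leaked_after(3, 0)` yields 0 and `toy_leaked_after(3, 1)` yields 3: one lost object per move, which is the same shape as the repeated kmemleak entries with an identical backtrace.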
Comment 4 Erhard F. 2021-09-17 10:08:21 UTC
(In reply to Christian König from comment #3)
> Looks like a missing dma_fence_put() somewhere.
> 
> Can you bisect this? Should be rather trivial to pinpoint the patch which
> introduced this.
Ok, I will try to bisect.

Also, the problem must have been introduced recently. I have been running the machine with kernel 5.13.18 for a few hours now and the leak has not shown up.
Comment 5 Erhard F. 2021-09-17 22:38:01 UTC
The bisect pointed to this commit. It does fix the NULL dereference, as I found out while bisecting, but apparently with this memory leak as a side effect.

 # git bisect good
f18f58012ee894039cd59ee8c889bf499d7a3943 is the first bad commit
commit f18f58012ee894039cd59ee8c889bf499d7a3943
Author: Mikel Rychliski <mikel@mikelr.com>
Date:   Thu Jun 24 00:51:20 2021 -0400

    drm/radeon: Fix NULL dereference when updating memory stats
    
    radeon_ttm_bo_destroy() is attempting to access the resource object to
    update memory counters. However, the resource object is already freed when
    ttm calls this function via the destroy callback. This causes an oops when
    a bo is freed:
    
            BUG: kernel NULL pointer dereference, address: 0000000000000010
            RIP: 0010:radeon_ttm_bo_destroy+0x2c/0x100 [radeon]
            Call Trace:
             radeon_bo_unref+0x1a/0x30 [radeon]
             radeon_gem_object_free+0x33/0x50 [radeon]
             drm_gem_object_release_handle+0x69/0x70 [drm]
             drm_gem_handle_delete+0x62/0xa0 [drm]
             ? drm_mode_destroy_dumb+0x40/0x40 [drm]
             drm_ioctl_kernel+0xb2/0xf0 [drm]
             drm_ioctl+0x30a/0x3c0 [drm]
             ? drm_mode_destroy_dumb+0x40/0x40 [drm]
             radeon_drm_ioctl+0x49/0x80 [radeon]
             __x64_sys_ioctl+0x8e/0xd0
    
    Avoid the issue by updating the counters in the delete_mem_notify callback
    instead. Also, fix memory statistic updating in radeon_bo_move() to
    identify the source type correctly. The source type needs to be saved
    before the move, because the moved from object may be altered by the move.
    
    Fixes: bfa3357ef9ab ("drm/ttm: allocate resource object instead of embedding it v2")
    Signed-off-by: Mikel Rychliski <mikel@mikelr.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Christian König <christian.koenig@amd.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20210624045121.15643-1-mikel@mikelr.com

 drivers/gpu/drm/radeon/radeon_object.c | 29 ++++++++++++-----------------
 drivers/gpu/drm/radeon/radeon_object.h |  2 +-
 drivers/gpu/drm/radeon/radeon_ttm.c    | 13 ++++++++++---
 3 files changed, 23 insertions(+), 21 deletions(-)
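The commit message's remark about saving the source type before the move can be sketched in plain C. This is only an illustration of the general pitfall under assumed names, not the radeon implementation: `toy_bo`, `toy_usage`, `toy_do_move()`, `toy_move_buggy()`, `toy_move_fixed()` and `toy_vram_after_move()` are hypothetical.

```c
#include <assert.h>

/* Hypothetical memory domains and buffer object for this sketch. */
enum toy_mem_type { TOY_SYSTEM, TOY_GTT, TOY_VRAM, TOY_NUM_TYPES };

struct toy_bo {
    enum toy_mem_type mem_type;
    long size;
};

/* Per-domain usage counters, standing in for driver memory stats. */
static long toy_usage[TOY_NUM_TYPES];

/* Stand-in for the move itself: it rewrites the object's placement. */
static void toy_do_move(struct toy_bo *bo, enum toy_mem_type new_type)
{
    assert(new_type < TOY_NUM_TYPES);
    bo->mem_type = new_type;
}

/* Buggy: reads the source type after the move has overwritten it. */
void toy_move_buggy(struct toy_bo *bo, enum toy_mem_type new_type)
{
    toy_do_move(bo, new_type);
    toy_usage[bo->mem_type] -= bo->size; /* wrong: already new_type */
    toy_usage[new_type] += bo->size;
}

/* Fixed: remember the source type first, then move, then account. */
void toy_move_fixed(struct toy_bo *bo, enum toy_mem_type new_type)
{
    enum toy_mem_type old_type = bo->mem_type;

    toy_do_move(bo, new_type);
    toy_usage[old_type] -= bo->size;
    toy_usage[new_type] += bo->size;
}

/* Moves a 4096-byte object from VRAM to GTT; returns VRAM usage after. */
long toy_vram_after_move(int use_fixed)
{
    struct toy_bo bo = { .mem_type = TOY_VRAM, .size = 4096 };

    for (int i = 0; i < TOY_NUM_TYPES; i++)
        toy_usage[i] = 0;
    toy_usage[TOY_VRAM] = bo.size;

    if (use_fixed)
        toy_move_fixed(&bo, TOY_GTT);
    else
        toy_move_buggy(&bo, TOY_GTT);
    return toy_usage[TOY_VRAM];
}
```

With the buggy variant the VRAM counter stays at 4096 even though the object has left VRAM, while the fixed variant brings it back to 0. Note that this concerns statistics only; it says nothing about where the fence leak reported here comes from.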
Comment 6 Erhard F. 2021-09-17 22:38:38 UTC
Created attachment 298869 [details]
bisect.log
Comment 7 Christian König 2021-09-20 07:49:48 UTC
Mhm, that bisect result doesn't really make much sense.

Can you try reverting the change to double-check whether that helps or not?
Comment 8 Erhard F. 2021-09-20 18:20:29 UTC
When I tried to revert the change I realized I can no longer reproduce this issue. In the meantime I did nothing special except update some system packages... Anyway, vanilla v5.14.6 and v5.15-rc2 run just fine without showing memory leaks (I left the machine running for >1 hr, which normally produced plenty of leaks).

So I'll mark this as obsolete. Should I see this again, I will re-open the bug and have a closer look at the circumstances under which it appears.
Comment 9 Erhard F. 2021-10-08 00:13:22 UTC
Reopening as I can replicate the bug on my Talos II with another Radeon card on kernel 5.14.9 and latest 5.15-rc4.

On the Talos II I was also able to verify that reverting f18f58012ee894039cd59ee8c889bf499d7a3943 (on top of v5.15-rc4) fixes the memory leak. But of course you get the NULL dereference back.

[..]
unreferenced object 0xc00020000ea7fb00 (size 128):
  comm "X", pid 543, jiffies 4295307011 (age 9605.800s)
  hex dump (first 32 bytes):
    c0 00 00 00 29 86 9a b0 c0 08 00 00 1f 28 81 38  ....)........(.8
    00 00 01 4d 80 92 a3 83 c0 00 20 00 0a 97 77 98  ...M...... ...w.
  backtrace:
    [<c00800001f13f568>] .radeon_fence_emit+0x38/0x130 [radeon]
    [<c00800001f261800>] .evergreen_copy_dma+0x390/0x4a0 [radeon]
    [<c00800001f141a68>] .radeon_bo_move+0x438/0x640 [radeon]
    [<c00800001ed2857c>] .ttm_bo_handle_move_mem+0xbc/0x220 [ttm]
    [<c00800001ed2a254>] .ttm_bo_validate+0xe4/0x1a0 [ttm]
    [<c00800001f143c98>] .radeon_bo_fault_reserve_notify+0x178/0x2c0 [radeon]
    [<c00800001f15fa78>] .radeon_gem_fault+0x98/0x130 [radeon]
    [<c000000000292d58>] .__do_fault+0x58/0x120
    [<c0000000002999cc>] .__handle_mm_fault+0x105c/0x1900
    [<c00000000029a3a8>] .handle_mm_fault+0x138/0x330
    [<c00000000005cd50>] .___do_page_fault+0x5a0/0xa30
    [<c00000000005d20c>] .do_page_fault+0x2c/0xc0
    [<c0000000000088dc>] data_access_common_virt+0x19c/0x1f0
unreferenced object 0xc000200004ba6900 (size 128):
  comm "X", pid 543, jiffies 4296163097 (age 6752.330s)
  hex dump (first 32 bytes):
    c0 00 00 00 29 86 9a b0 c0 08 00 00 1f 28 81 38  ....)........(.8
    00 00 03 e5 e9 9d fd fe c0 00 20 00 0a 97 77 98  .......... ...w.
  backtrace:
    [<c00800001f13f568>] .radeon_fence_emit+0x38/0x130 [radeon]
    [<c00800001f261800>] .evergreen_copy_dma+0x390/0x4a0 [radeon]
    [<c00800001f141a68>] .radeon_bo_move+0x438/0x640 [radeon]
    [<c00800001ed2857c>] .ttm_bo_handle_move_mem+0xbc/0x220 [ttm]
    [<c00800001ed2a254>] .ttm_bo_validate+0xe4/0x1a0 [ttm]
    [<c00800001f143c98>] .radeon_bo_fault_reserve_notify+0x178/0x2c0 [radeon]
    [<c00800001f15fa78>] .radeon_gem_fault+0x98/0x130 [radeon]
    [<c000000000292d58>] .__do_fault+0x58/0x120
    [<c0000000002999cc>] .__handle_mm_fault+0x105c/0x1900
    [<c00000000029a3a8>] .handle_mm_fault+0x138/0x330
    [<c00000000005cd50>] .___do_page_fault+0x5a0/0xa30
    [<c00000000005d20c>] .do_page_fault+0x2c/0xc0
    [<c0000000000088dc>] data_access_common_virt+0x19c/0x1f0
unreferenced object 0xc000000029864e00 (size 128):
  comm "X", pid 543, jiffies 4297018853 (age 3899.817s)
  hex dump (first 32 bytes):
    c0 00 00 00 29 86 9a b0 c0 08 00 00 1f 28 81 38  ....)........(.8
    00 00 06 7e 11 07 45 5e c0 00 20 00 0a 97 77 98  ...~..E^.. ...w.
  backtrace:
    [<c00800001f13f568>] .radeon_fence_emit+0x38/0x130 [radeon]
    [<c00800001f261800>] .evergreen_copy_dma+0x390/0x4a0 [radeon]
    [<c00800001f141a68>] .radeon_bo_move+0x438/0x640 [radeon]
    [<c00800001ed2857c>] .ttm_bo_handle_move_mem+0xbc/0x220 [ttm]
    [<c00800001ed2a254>] .ttm_bo_validate+0xe4/0x1a0 [ttm]
    [<c00800001f143c98>] .radeon_bo_fault_reserve_notify+0x178/0x2c0 [radeon]
    [<c00800001f15fa78>] .radeon_gem_fault+0x98/0x130 [radeon]
    [<c000000000292d58>] .__do_fault+0x58/0x120
    [<c0000000002999cc>] .__handle_mm_fault+0x105c/0x1900
    [<c00000000029a3a8>] .handle_mm_fault+0x138/0x330
    [<c00000000005cd50>] .___do_page_fault+0x5a0/0xa30
    [<c00000000005d20c>] .do_page_fault+0x2c/0xc0
    [<c0000000000088dc>] data_access_common_virt+0x19c/0x1f0
Comment 10 Erhard F. 2021-10-08 00:15:22 UTC
Created attachment 299135 [details]
kernel dmesg (kernel 5.14.9, Talos II Secure Workstation)
Comment 11 Erhard F. 2021-10-08 00:21:57 UTC
Created attachment 299137 [details]
kernel .config (kernel 5.14.9, Talos II Secure Workstation)

The card on the Talos II is a passively cooled HD 6670.

 # lspci -v -s 0000:01:00.0
0000:01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Turks XT [Radeon HD 6670/7670] (prog-if 00 [VGA controller])
	Subsystem: PC Partner Limited / Sapphire Technology Turks XT [Radeon HD 6670/7670]
	Device tree node: /sys/firmware/devicetree/base/pciex@600c3c0000000/pci@0/vga@0
	Flags: bus master, fast devsel, latency 0, IRQ 77, NUMA node 0, IOMMU group 0
	Memory at 6000000000000 (64-bit, prefetchable) [size=256M]
	Memory at 600c000000000 (64-bit, non-prefetchable) [size=128K]
	I/O ports at <unassigned> [disabled]
	Expansion ROM at 600c000020000 [disabled] [size=128K]
	Capabilities: [50] Power Management version 3
	Capabilities: [58] Express Legacy Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150] Advanced Error Reporting
	Kernel driver in use: radeon
	Kernel modules: radeon
Comment 12 Erhard F. 2021-10-17 22:39:29 UTC
Created attachment 299231 [details]
kernel dmesg (kernel 5.15-rc5, AMD Ryzen 9 5950X)

I was also able to replicate the bug on the hardware on which I originally discovered it. I probably mixed something up with the booted kernels at the time I closed the bug... As of v5.15-rc5 it was easy to get the leak, even when no monitor is connected to the card.

And I can confirm that reverting f18f58012ee894039cd59ee8c889bf499d7a3943 (on top of v5.15-rc5) fixes the leak with the rv515 card on the Ryzen 9 too. But you also get the NULL pointer dereference back.
Comment 13 Erhard F. 2021-10-17 22:40:53 UTC
Created attachment 299233 [details]
output of kmemleak (kernel 5.15-rc5, AMD Ryzen 9 5950X)
Comment 14 Erhard F. 2021-10-17 22:41:49 UTC
Created attachment 299235 [details]
kernel .config (kernel 5.15-rc5, AMD Ryzen 9 5950X)
Comment 15 Christian König 2021-10-20 16:52:02 UTC
Thanks for the logs! I think I've found the root cause of this!

It's just that the bisect is unfortunately incorrect. The bug really happens at random during memory eviction and is therefore very hard to reproduce.
Comment 16 Christian König 2021-10-20 17:27:49 UTC
Created attachment 299275 [details]
Potential fix

Please give the attached patch a try.
Comment 17 Erhard F. 2021-10-21 07:36:23 UTC
(In reply to Christian König from comment #16)
> Created attachment 299275 [details]
> Potential fix
> 
> Please give the attached patch a try.
This patch indeed did the trick, many thanks! Uptime with patched v5.15-rc6 is >12 hrs now without any memleak, where the unpatched kernel would show dozens.

The patch also fixed the leak from bug #214029. There, with patched v5.14.13, I also have >12 hrs uptime without any leak so far. The only machine left to test is my Talos II, where I'll apply the patch now.
Comment 18 Erhard F. 2021-11-03 09:45:25 UTC
The fix landed in kernel 5.15, 5.14.16 and affected LTS kernels.

Closing.
