Bug 214447
Summary: | [bisected] Several memory leaks in radeon and ttm | ||
---|---|---|---|
Product: | Drivers | Reporter: | Erhard F. (erhard_f) |
Component: | Video(Other) | Assignee: | drivers_video-other |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | christian.koenig |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 5.14.5 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
kernel dmesg (kernel 5.14.5, AMD Ryzen 9 5950X)
kernel .config (kernel 5.14.5, AMD Ryzen 9 5950X)
output of kmemleak (kernel 5.14.5, AMD Ryzen 9 5950X)
bisect.log
kernel dmesg (kernel 5.14.9, Talos II Secure Workstation)
kernel .config (kernel 5.14.9, Talos II Secure Workstation)
kernel dmesg (kernel 5.15-rc5, AMD Ryzen 9 5950X)
output of kmemleak (kernel 5.15-rc5, AMD Ryzen 9 5950X)
kernel .config (kernel 5.15-rc5, AMD Ryzen 9 5950X)
Potential fix |
Created attachment 298855 [details]
kernel .config (kernel 5.14.5, AMD Ryzen 9 5950X)
Created attachment 298857 [details]
output of kmemleak (kernel 5.14.5, AMD Ryzen 9 5950X)
Looks like a missing dma_fence_put() somewhere.

Can you bisect this? Should be rather trivial to pin point the patch which introduced this.

(In reply to Christian König from comment #3)
> Looks like a missing dma_fence_put() somewhere.
>
> Can you bisect this? Should be rather trivial to pin point the patch which
> introduced this.

Ok, I will try to bisect. Also the problem must have been introduced recently. Have been running the machine with Kernel 5.13.18 for a few hours now and the leak did not show up.

The bisect revealed this commit. The NULL dereference surely got fixed as I found out during bisecting but with the side effect of this memory leak it seems.

# git bisect good
f18f58012ee894039cd59ee8c889bf499d7a3943 is the first bad commit
commit f18f58012ee894039cd59ee8c889bf499d7a3943
Author: Mikel Rychliski <mikel@mikelr.com>
Date:   Thu Jun 24 00:51:20 2021 -0400

    drm/radeon: Fix NULL dereference when updating memory stats

    radeon_ttm_bo_destroy() is attempting to access the resource object to
    update memory counters. However, the resource object is already freed
    when ttm calls this function via the destroy callback. This causes an
    oops when a bo is freed:

        BUG: kernel NULL pointer dereference, address: 0000000000000010
        RIP: 0010:radeon_ttm_bo_destroy+0x2c/0x100 [radeon]
        Call Trace:
         radeon_bo_unref+0x1a/0x30 [radeon]
         radeon_gem_object_free+0x33/0x50 [radeon]
         drm_gem_object_release_handle+0x69/0x70 [drm]
         drm_gem_handle_delete+0x62/0xa0 [drm]
         ? drm_mode_destroy_dumb+0x40/0x40 [drm]
         drm_ioctl_kernel+0xb2/0xf0 [drm]
         drm_ioctl+0x30a/0x3c0 [drm]
         ? drm_mode_destroy_dumb+0x40/0x40 [drm]
         radeon_drm_ioctl+0x49/0x80 [radeon]
         __x64_sys_ioctl+0x8e/0xd0

    Avoid the issue by updating the counters in the delete_mem_notify
    callback instead. Also, fix memory statistic updating in
    radeon_bo_move() to identify the source type correctly. The source type
    needs to be saved before the move, because the moved from object may be
    altered by the move.

    Fixes: bfa3357ef9ab ("drm/ttm: allocate resource object instead of embedding it v2")
    Signed-off-by: Mikel Rychliski <mikel@mikelr.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Christian König <christian.koenig@amd.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20210624045121.15643-1-mikel@mikelr.com

 drivers/gpu/drm/radeon/radeon_object.c | 29 ++++++++++++-----------------
 drivers/gpu/drm/radeon/radeon_object.h |  2 +-
 drivers/gpu/drm/radeon/radeon_ttm.c    | 13 ++++++++++---
 3 files changed, 23 insertions(+), 21 deletions(-)

Created attachment 298869 [details]
bisect.log
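For readers unfamiliar with the workflow requested above, here is a minimal sketch of an automated bisect. It does not touch a kernel tree: it builds a throwaway repository of ten commits in which a made-up regression (the word "leaks" in a file) appears in commit 7, then lets git find it. The function name `bisect_demo`, the file names and the commit subjects are all hypothetical and only serve the illustration.

```shell
#!/bin/sh
# bisect_demo: hypothetical helper, runs entirely inside a temporary directory.
bisect_demo() (
    set -e
    repo=$(mktemp -d)
    cd "$repo"
    git init -q . 2>/dev/null
    # Identity for the toy commits only.
    export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
    export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

    i=1
    while [ "$i" -le 10 ]; do
        # Commits 1..6 are "clean", commits 7..10 carry the made-up regression.
        if [ "$i" -ge 7 ]; then echo leaks; else echo clean; fi > state.txt
        echo "$i" > version.txt
        git add state.txt version.txt
        git commit -q -m "commit $i"
        if [ "$i" -eq 1 ]; then good=$(git rev-parse HEAD); fi
        i=$((i + 1))
    done

    # Mark HEAD as bad and the first commit as good, then let git drive.
    git bisect start HEAD "$good" >/dev/null 2>&1
    # The test command exits 0 for "good" and 1 for "bad".
    git bisect run sh -c '! grep -q leaks state.txt' >/dev/null 2>&1

    # refs/bisect/bad now points at the first bad commit; print its subject.
    git log -1 --format=%s refs/bisect/bad

    cd /
    rm -rf "$repo"
)

bisect_demo
```

The toy test is deterministic, which a leak that only shows up after some uptime is not. On real hardware every "good" verdict needs a soak long enough to be trusted, otherwise the bisect can converge on an unrelated commit.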
Mhm, that bisect result doesn't really make much sense. Can you try to revert the change and so double check if that helps or not?

As I tried to revert the change I realized I can no longer reproduce this issue. In the meantime I did nothing special except update some system packages...

Anyway, vanilla v5.14.6 and v5.15-rc2 run just fine without showing memory leaks (left the machine for >1 hr which normally delivered plenty of leaks). So I'll mark this as obsolete. In case I should see this again I will re-open the bug and have a closer look at the circumstances under which it appears.

Reopening as I can replicate the bug on my Talos II with another Radeon card on kernel 5.14.9 and latest 5.15-rc4.

On the Talos II I was also able to verify that reverting f18f58012ee894039cd59ee8c889bf499d7a3943 (on top of v5.15-rc4) fixes the memory leak. But of course you get the NULL dereference back.

[..]
unreferenced object 0xc00020000ea7fb00 (size 128):
  comm "X", pid 543, jiffies 4295307011 (age 9605.800s)
  hex dump (first 32 bytes):
    c0 00 00 00 29 86 9a b0 c0 08 00 00 1f 28 81 38 ....)........(.8
    00 00 01 4d 80 92 a3 83 c0 00 20 00 0a 97 77 98 ...M...... ...w.
  backtrace:
    [<c00800001f13f568>] .radeon_fence_emit+0x38/0x130 [radeon]
    [<c00800001f261800>] .evergreen_copy_dma+0x390/0x4a0 [radeon]
    [<c00800001f141a68>] .radeon_bo_move+0x438/0x640 [radeon]
    [<c00800001ed2857c>] .ttm_bo_handle_move_mem+0xbc/0x220 [ttm]
    [<c00800001ed2a254>] .ttm_bo_validate+0xe4/0x1a0 [ttm]
    [<c00800001f143c98>] .radeon_bo_fault_reserve_notify+0x178/0x2c0 [radeon]
    [<c00800001f15fa78>] .radeon_gem_fault+0x98/0x130 [radeon]
    [<c000000000292d58>] .__do_fault+0x58/0x120
    [<c0000000002999cc>] .__handle_mm_fault+0x105c/0x1900
    [<c00000000029a3a8>] .handle_mm_fault+0x138/0x330
    [<c00000000005cd50>] .___do_page_fault+0x5a0/0xa30
    [<c00000000005d20c>] .do_page_fault+0x2c/0xc0
    [<c0000000000088dc>] data_access_common_virt+0x19c/0x1f0
unreferenced object 0xc000200004ba6900 (size 128):
  comm "X", pid 543, jiffies 4296163097 (age 6752.330s)
  hex dump (first 32 bytes):
    c0 00 00 00 29 86 9a b0 c0 08 00 00 1f 28 81 38 ....)........(.8
    00 00 03 e5 e9 9d fd fe c0 00 20 00 0a 97 77 98 .......... ...w.
  backtrace:
    [<c00800001f13f568>] .radeon_fence_emit+0x38/0x130 [radeon]
    [<c00800001f261800>] .evergreen_copy_dma+0x390/0x4a0 [radeon]
    [<c00800001f141a68>] .radeon_bo_move+0x438/0x640 [radeon]
    [<c00800001ed2857c>] .ttm_bo_handle_move_mem+0xbc/0x220 [ttm]
    [<c00800001ed2a254>] .ttm_bo_validate+0xe4/0x1a0 [ttm]
    [<c00800001f143c98>] .radeon_bo_fault_reserve_notify+0x178/0x2c0 [radeon]
    [<c00800001f15fa78>] .radeon_gem_fault+0x98/0x130 [radeon]
    [<c000000000292d58>] .__do_fault+0x58/0x120
    [<c0000000002999cc>] .__handle_mm_fault+0x105c/0x1900
    [<c00000000029a3a8>] .handle_mm_fault+0x138/0x330
    [<c00000000005cd50>] .___do_page_fault+0x5a0/0xa30
    [<c00000000005d20c>] .do_page_fault+0x2c/0xc0
    [<c0000000000088dc>] data_access_common_virt+0x19c/0x1f0
unreferenced object 0xc000000029864e00 (size 128):
  comm "X", pid 543, jiffies 4297018853 (age 3899.817s)
  hex dump (first 32 bytes):
    c0 00 00 00 29 86 9a b0 c0 08 00 00 1f 28 81 38 ....)........(.8
    00 00 06 7e 11 07 45 5e c0 00 20 00 0a 97 77 98 ...~..E^.. ...w.
  backtrace:
    [<c00800001f13f568>] .radeon_fence_emit+0x38/0x130 [radeon]
    [<c00800001f261800>] .evergreen_copy_dma+0x390/0x4a0 [radeon]
    [<c00800001f141a68>] .radeon_bo_move+0x438/0x640 [radeon]
    [<c00800001ed2857c>] .ttm_bo_handle_move_mem+0xbc/0x220 [ttm]
    [<c00800001ed2a254>] .ttm_bo_validate+0xe4/0x1a0 [ttm]
    [<c00800001f143c98>] .radeon_bo_fault_reserve_notify+0x178/0x2c0 [radeon]
    [<c00800001f15fa78>] .radeon_gem_fault+0x98/0x130 [radeon]
    [<c000000000292d58>] .__do_fault+0x58/0x120
    [<c0000000002999cc>] .__handle_mm_fault+0x105c/0x1900
    [<c00000000029a3a8>] .handle_mm_fault+0x138/0x330
    [<c00000000005cd50>] .___do_page_fault+0x5a0/0xa30
    [<c00000000005d20c>] .do_page_fault+0x2c/0xc0
    [<c0000000000088dc>] data_access_common_virt+0x19c/0x1f0

Created attachment 299135 [details]
kernel dmesg (kernel 5.14.9, Talos II Secure Workstation)
Created attachment 299137 [details]
kernel .config (kernel 5.14.9, Talos II Secure Workstation)
Card on the Talos II is a passive cooled HD 6670.
# lspci -v -s 0000:01:00.0
0000:01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Turks XT [Radeon HD 6670/7670] (prog-if 00 [VGA controller])
Subsystem: PC Partner Limited / Sapphire Technology Turks XT [Radeon HD 6670/7670]
Device tree node: /sys/firmware/devicetree/base/pciex@600c3c0000000/pci@0/vga@0
Flags: bus master, fast devsel, latency 0, IRQ 77, NUMA node 0, IOMMU group 0
Memory at 6000000000000 (64-bit, prefetchable) [size=256M]
Memory at 600c000000000 (64-bit, non-prefetchable) [size=128K]
I/O ports at <unassigned> [disabled]
Expansion ROM at 600c000020000 [disabled] [size=128K]
Capabilities: [50] Power Management version 3
Capabilities: [58] Express Legacy Endpoint, MSI 00
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [150] Advanced Error Reporting
Kernel driver in use: radeon
Kernel modules: radeon
Created attachment 299231 [details]
kernel dmesg (kernel 5.15-rc5, AMD Ryzen 9 5950X)
I was also able to replicate the bug on the hardware on which I originally discovered it. Probably I mixed something up with the booted kernels at the time I closed the bug... As of v5.15-rc5 it was easy to get the leak, even when no monitor is connected to the card.
And I can confirm that reverting f18f58012ee894039cd59ee8c889bf499d7a3943 (on top of v5.15-rc5) fixes the leak with the rv515 card on the Ryzen 9 too. But you also get the NULL pointer dereference back.
Created attachment 299233 [details]
output of kmemleak (kernel 5.15-rc5, AMD Ryzen 9 5950X)
Created attachment 299235 [details]
kernel .config (kernel 5.15-rc5, AMD Ryzen 9 5950X)
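The kmemleak reports quoted in this thread all point at the same first backtrace frame. A small sketch for checking that on a longer report follows. `summarize_kmemleak` is a hypothetical helper name, and treating the first backtrace frame as the allocation site is an assumption of this sketch, not something kmemleak itself labels.

```shell
#!/bin/sh
# summarize_kmemleak: hypothetical helper, not part of any kernel tool.
# Reads a kmemleak report on stdin, prints the number of leaked objects and
# then one line per allocation site (assumed: the first backtrace frame).
summarize_kmemleak() {
    awk '
        /^unreferenced object/ { total++; want_frame = 0; next }
        /^[ \t]*backtrace/     { want_frame = 1; next }
        want_frame && /\[<[0-9a-f]+>\]/ {
            sym = $2                  # e.g. radeon_fence_emit+0x20/0xe0
            sub(/^\./, "", sym)       # the Talos II traces show a leading dot
            sub(/\+.*$/, "", sym)     # drop the +offset/size suffix
            sites[sym]++
            want_frame = 0
        }
        END {
            printf "total %d\n", total
            for (s in sites) printf "%s %d\n", s, sites[s]
        }
    '
}

# Abbreviated sample built from two reports quoted in this thread.
summarize_kmemleak <<'EOF'
unreferenced object 0xffff911cfdbf4e00 (size 128):
  comm "X", pid 570, jiffies 4294879485 (age 3667.907s)
  backtrace:
    [<ffffffffc06415f0>] radeon_fence_emit+0x20/0xe0 [radeon]
    [<ffffffffc065d966>] r100_copy_blit+0x586/0x650 [radeon]
unreferenced object 0xc00020000ea7fb00 (size 128):
  comm "X", pid 543, jiffies 4295307011 (age 9605.800s)
  backtrace:
    [<c00800001f13f568>] .radeon_fence_emit+0x38/0x130 [radeon]
    [<c00800001f261800>] .evergreen_copy_dma+0x390/0x4a0 [radeon]
EOF
```

On a kernel built with kmemleak and with debugfs mounted, the same function can be fed from /sys/kernel/debug/kmemleak (as root), after writing "scan" to that file to trigger a fresh scan.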
Thanks for the logs! I think I've found the root cause of this! It's just that the bisect is unfortunately incorrect. The bug happens really at random during memory eviction and so is very hard to reproduce.

Created attachment 299275 [details]
Potential fix
Please give the attached patch a try.

(In reply to Christian König from comment #16)
> Created attachment 299275 [details]
> Potential fix
>
> Please give the attached patch a try.

This patch indeed did the trick, many thanks! Uptime with patched v5.15-rc6 is >12 hrs now without any memleak, where the unpatched kernel would show dozens.

The patch also fixed the leak from bug #214029. There with patched v5.14.13, also >12 hrs uptime without any leak up to now.

Only machine left to test would be my Talos II where I'll apply the patch now.

The fix landed in kernel 5.15, 5.14.16 and affected LTS kernels. Closing. |
Created attachment 298853 [details]
kernel dmesg (kernel 5.14.5, AMD Ryzen 9 5950X)

Getting this on kernel 5.14.5 with my Radeon X1550:

unreferenced object 0xffff911cfdbf4e00 (size 128):
  comm "X", pid 570, jiffies 4294879485 (age 3667.907s)
  hex dump (first 32 bytes):
    00 99 b7 fd 1c 91 ff ff 00 f2 70 c0 ff ff ff ff ..........p.....
    a7 c6 67 be 01 00 00 00 d0 7c 4e 0d e2 a8 ff ff ..g......|N.....
  backtrace:
    [<ffffffffc06415f0>] radeon_fence_emit+0x20/0xe0 [radeon]
    [<ffffffffc065d966>] r100_copy_blit+0x586/0x650 [radeon]
    [<ffffffffc0642b55>] radeon_bo_move+0x365/0x560 [radeon]
    [<ffffffffc055ae8a>] ttm_bo_handle_move_mem+0x8a/0x180 [ttm]
    [<ffffffffc055c19e>] ttm_bo_validate+0xae/0x180 [ttm]
    [<ffffffffc06442a3>] radeon_bo_fault_reserve_notify+0x113/0x1e0 [radeon]
    [<ffffffffc06557aa>] radeon_gem_fault+0x5a/0xb0 [radeon]
    [<ffffffffb6171dd3>] __do_fault+0x33/0xe0
    [<ffffffffb6178d20>] __handle_mm_fault+0xc90/0x1260
    [<ffffffffb61793a5>] handle_mm_fault+0xb5/0x230
    [<ffffffffb66d6205>] exc_page_fault+0x185/0x5e0
    [<ffffffffb6800b1e>] asm_exc_page_fault+0x1e/0x30
unreferenced object 0xffff911cfdbf7800 (size 128):
  comm "X", pid 570, jiffies 4294879485 (age 3667.907s)
  hex dump (first 32 bytes):
    00 99 b7 fd 1c 91 ff ff 00 f2 70 c0 ff ff ff ff ..........p.....
    a9 01 6a be 01 00 00 00 d0 7c 4e 0d e2 a8 ff ff ..j......|N.....
  backtrace:
    [<ffffffffc06415f0>] radeon_fence_emit+0x20/0xe0 [radeon]
    [<ffffffffc065d966>] r100_copy_blit+0x586/0x650 [radeon]
    [<ffffffffc0642b55>] radeon_bo_move+0x365/0x560 [radeon]
    [<ffffffffc055ae8a>] ttm_bo_handle_move_mem+0x8a/0x180 [ttm]
    [<ffffffffc055c19e>] ttm_bo_validate+0xae/0x180 [ttm]
    [<ffffffffc06442a3>] radeon_bo_fault_reserve_notify+0x113/0x1e0 [radeon]
    [<ffffffffc06557aa>] radeon_gem_fault+0x5a/0xb0 [radeon]
    [<ffffffffb6171dd3>] __do_fault+0x33/0xe0
    [<ffffffffb6178d20>] __handle_mm_fault+0xc90/0x1260
    [<ffffffffb61793a5>] handle_mm_fault+0xb5/0x230
    [<ffffffffb66d6205>] exc_page_fault+0x185/0x5e0
    [<ffffffffb6800b1e>] asm_exc_page_fault+0x1e/0x30
[...]

# lspci
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Starship/Matisse IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
00:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:05.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 61)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 7
01:00.0 Non-Volatile memory controller: Sandisk Corp WD Blue SN550 NVMe SSD (rev 01)
02:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset USB 3.1 XHCI Controller (rev 01)
02:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller (rev 01)
02:00.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Bridge (rev 01)
03:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)
03:01.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)
03:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01)
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
07:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] RV516 [Radeon X1300/X1550 Series]
07:00.1 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] RV516 [Radeon X1300/X1550 Series] (Secondary)
08:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function
09:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP
09:00.1 Encryption controller: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Cryptographic Coprocessor PSPCPP
09:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller

# lspci -s 07:00.0 -vv
07:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] RV516 [Radeon X1300/X1550 Series] (prog-if 00 [VGA controller])
    Subsystem: PC Partner Limited / Sapphire Technology RV516 [Radeon X1300/X1550 Series]
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 57
    IOMMU group: 2
    Region 0: Memory at e0000000 (64-bit, prefetchable) [size=256M]
    Region 2: Memory at fce30000 (64-bit, non-prefetchable) [size=64K]
    Region 4: I/O ports at e000 [size=256]
    Expansion ROM at 000c0000 [disabled] [size=128K]
    Capabilities: [50] Power Management version 2
        Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [58] Express (v1) Endpoint, MSI 00
        DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE- FLReset- SlotPowerLimit 75.000W
        DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 128 bytes
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 2.5GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 2.5GT/s (ok), Width x16 (ok)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
    Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit+
        Address: 0000000000000000 Data: 0000
    Kernel driver in use: radeon
    Kernel modules: radeon