Created attachment 293341 [details]
/sys/kernel/debug/kmemleak

It looks like there's a memory leak in drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c:amdgpu_dm_update_connector_after_detect. It appears to call drm_add_edid_modes, which indirectly calls either do_detailed_mode or drm_mode_duplicate. This has caused me to run out of memory a handful of times, which could only be resolved by rebooting. I only experienced this after upgrading to 5.9.1, and it looks like commit b24bdc37d03a0478189e20a50286092840f414fa added the call to drm_add_edid_modes in amdgpu_dm_update_connector_after_detect.
Created attachment 293343 [details]
dmesg with oom-killer invocations

Note that the stack has amdgpu_dm_update_connector_after_detect+0x28d/0x330 > drm_add_edid_modes+0x6e1/0x1860. This dmesg was recorded on Linux 5.9.1, but the kmemleak output was captured on Linux 5.9.2.
It looks like this can be fixed by setting aconnector->num_modes to the return value from drm_add_edid_modes. At least one other place in amdgpu_dm.c sets struct amdgpu_dm_connector.num_modes to the return value of drm_add_edid_modes like this. I'm not familiar enough with AMDGPU or DRM internals to know if this will mess anything up.
Created attachment 293577 [details] proposed patch
I have the same memory leak.

android_x86:/ # echo scan > /sys/kernel/debug/kmemleak
android_x86:/ # cat /sys/kernel/debug/kmemleak
android_x86:/ # echo scan > /sys/kernel/debug/kmemleak
android_x86:/ # cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff8edad8208580 (size 128):
  comm "ueventd", pid 1498, jiffies 4294676333 (age 65.106s)
  hex dump (first 32 bytes):
    22 16 04 00 00 0a 30 0a 50 0a a0 0a 00 00 40 06  ".....0.P.....@.
    43 06 48 06 69 06 00 00 05 00 00 00 00 00 00 00  C.H.i...........
  backtrace:
    [<0000000080ce8e0b>] do_detailed_mode+0x27c/0x520 [drm]
    [<000000000427e646>] drm_for_each_detailed_block.part.0+0x35/0x110 [drm]
    [<00000000566583b3>] drm_add_edid_modes+0x22b/0x1880 [drm]
    [<00000000f63b328b>] amdgpu_dm_update_connector_after_detect+0x385/0x4f0 [amdgpu]
    [<000000009f1bbb4c>] dm_helpers_read_local_edid+0xaa/0x170 [amdgpu]
    [<0000000005f6f065>] dc_link_detect_helper+0x29b/0xd70 [amdgpu]
    [<00000000a096d0f5>] dc_link_detect+0x31/0x50 [amdgpu]
    [<000000009a977098>] amdgpu_dm_init.isra.0.cold+0xf81/0x1297 [amdgpu]
    [<00000000cfd3da50>] dm_hw_init+0xe/0x20 [amdgpu]
    [<00000000128bd3d5>] amdgpu_device_init.cold+0x13c7/0x16b5 [amdgpu]
    [<0000000039b2a07d>] amdgpu_driver_load_kms+0x2b/0x200 [amdgpu]
    [<000000009b370228>] amdgpu_pci_probe+0x129/0x1b0 [amdgpu]
    [<0000000066485d99>] pci_device_probe+0xd2/0x150
    [<00000000c858be29>] really_probe+0x232/0x460
    [<00000000f84cda17>] driver_probe_device+0x5d/0x150
    [<00000000103f2cc3>] device_driver_attach+0xa1/0xb0
unreferenced object 0xffff8edad828f280 (size 128):
  comm "ueventd", pid 1498, jiffies 4294676333 (age 65.107s)
  hex dump (first 32 bytes):
    14 44 02 00 80 07 d8 07 04 08 98 08 00 00 38 04  .D............8.
    3c 04 41 04 65 04 00 00 0a 00 00 00 00 00 00 00  <.A.e...........
  backtrace:
    [<0000000017977f42>] drm_mode_duplicate+0x1f/0x90 [drm]
    [<00000000c4367b7e>] drm_mode_std+0x1fe/0x5e0 [drm]
    [<00000000d7555cdd>] drm_add_edid_modes+0x2c7/0x1880 [drm]
    [<00000000f63b328b>] amdgpu_dm_update_connector_after_detect+0x385/0x4f0 [amdgpu]
    [<000000009f1bbb4c>] dm_helpers_read_local_edid+0xaa/0x170 [amdgpu]
    [<0000000005f6f065>] dc_link_detect_helper+0x29b/0xd70 [amdgpu]
    [<00000000a096d0f5>] dc_link_detect+0x31/0x50 [amdgpu]
    [<000000009a977098>] amdgpu_dm_init.isra.0.cold+0xf81/0x1297 [amdgpu]
    [<00000000cfd3da50>] dm_hw_init+0xe/0x20 [amdgpu]
    [<00000000128bd3d5>] amdgpu_device_init.cold+0x13c7/0x16b5 [amdgpu]
    [<0000000039b2a07d>] amdgpu_driver_load_kms+0x2b/0x200 [amdgpu]
    [<000000009b370228>] amdgpu_pci_probe+0x129/0x1b0 [amdgpu]
    [<0000000066485d99>] pci_device_probe+0xd2/0x150
    [<00000000c858be29>] really_probe+0x232/0x460
    [<00000000f84cda17>] driver_probe_device+0x5d/0x150
    [<00000000103f2cc3>] device_driver_attach+0xa1/0xb0
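For anyone comparing kmemleak dumps before and after a patch, a small script can total the leaked bytes per allocating function and count how many records pass through a given caller. This is only a minimal sketch, not part of any kernel tooling: the helper names (parse_kmemleak, leaked_bytes_by_allocator, records_through) are made up for illustration, the regexes assume the line format seen in this report, and SAMPLE is a shortened excerpt of the two records above.

```python
import re
from collections import Counter

# Header line of each kmemleak record, e.g.
# "unreferenced object 0xffff8edad8208580 (size 128):"
OBJECT_RE = re.compile(r"unreferenced object (0x[0-9a-f]+) \(size (\d+)\):")

# One backtrace frame, e.g.
# "[<0000000080ce8e0b>] do_detailed_mode+0x27c/0x520 [drm]"
# The trailing "[module]" is optional (built-in symbols have none).
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(\w+)\])?"
)

# Shortened excerpt of the report in this comment (assumption: format only).
SAMPLE = """
unreferenced object 0xffff8edad8208580 (size 128):
  comm "ueventd", pid 1498, jiffies 4294676333 (age 65.106s)
  backtrace:
    [<0000000080ce8e0b>] do_detailed_mode+0x27c/0x520 [drm]
    [<00000000566583b3>] drm_add_edid_modes+0x22b/0x1880 [drm]
    [<00000000f63b328b>] amdgpu_dm_update_connector_after_detect+0x385/0x4f0 [amdgpu]
    [<0000000066485d99>] pci_device_probe+0xd2/0x150
unreferenced object 0xffff8edad828f280 (size 128):
  comm "ueventd", pid 1498, jiffies 4294676333 (age 65.107s)
  backtrace:
    [<0000000017977f42>] drm_mode_duplicate+0x1f/0x90 [drm]
    [<00000000d7555cdd>] drm_add_edid_modes+0x2c7/0x1880 [drm]
    [<00000000f63b328b>] amdgpu_dm_update_connector_after_detect+0x385/0x4f0 [amdgpu]
"""


def parse_kmemleak(text):
    """Split a kmemleak dump into records with address, size and frames."""
    records = []
    headers = list(OBJECT_RE.finditer(text))
    for i, header in enumerate(headers):
        # A record runs until the next header (or the end of the dump).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        body = text[header.end():end]
        frames = [(f.group(1), f.group(2)) for f in FRAME_RE.finditer(body)]
        records.append({
            "address": header.group(1),
            "size": int(header.group(2)),
            "frames": frames,  # list of (function, module or None)
        })
    return records


def leaked_bytes_by_allocator(records):
    """Sum leaked bytes keyed by the first (innermost) backtrace frame."""
    totals = Counter()
    for record in records:
        if record["frames"]:
            totals[record["frames"][0][0]] += record["size"]
    return totals


def records_through(records, function):
    """Return the records whose backtrace contains the given function."""
    return [r for r in records
            if any(name == function for name, _ in r["frames"])]


if __name__ == "__main__":
    recs = parse_kmemleak(SAMPLE)
    for name, size in leaked_bytes_by_allocator(recs).items():
        print(name, size)
    hits = records_through(recs, "amdgpu_dm_update_connector_after_detect")
    print(len(hits), "of", len(recs), "records go through the amdgpu_dm call")
```

To use it on a real dump, read the saved kmemleak output from a file and pass the text to parse_kmemleak instead of SAMPLE. Because the regexes search the whole text rather than line by line, it should also cope with a dump whose line breaks were lost when pasting.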
(In reply to Lee Starnes from comment #3)
> Created attachment 293577 [details]
> proposed patch

This patch does not seem to help for me; tested on a Linux 5.10 kernel. Thanks for pointing out the bad commit. I can revert "drm/amd/display: Fix EDID parsing after resume from suspend" to fix the memory leak.
Does this patch work any better? https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg54780.html
(In reply to Alex Deucher from comment #6)
> Does this patch work any better?
> https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg54780.html

Nice! I tested this patch and it fixes my memleak problem.
(In reply to Alex Deucher from comment #6)
> Does this patch work any better?
> https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg54780.html

This looks better than my patch. I've been using it for the last week or so with my RX 480 and it has been working.
This change caused a regression that makes it impossible to light up the display again after powering it off. See:
* https://lore.kernel.org/lkml/e5d9703f-42a4-f154-cf13-55a3eba10859@tomt.net/
* https://bugzilla.kernel.org/show_bug.cgi?id=211033
* https://bugs.archlinux.org/task/69202
I can't stand the memory leak, so I will revert "Revert "drm/amd/display: Fix memory leaks in S3 resume"" (that is, revert 5efc1f4b454c6179d35e7b0c3eda0ad5763a00fc) in today's Linux 5.11-rc3. I use an rc kernel every week.