Bug 209987

Summary: Memory leak in amdgpu_dm_update_connector_after_detect
Product: Drivers
Reporter: Lee Starnes (lstarnes1024)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: NEW
Severity: normal
CC: alexdeucher, oleksandr, sh200105, youling257
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 5.9.1
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
/sys/kernel/debug/kmemleak
dmesg with oom-killer invocations
proposed patch

Description Lee Starnes 2020-11-01 06:15:59 UTC
Created attachment 293341 [details]
/sys/kernel/debug/kmemleak

It looks like there's a memory leak in drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c:amdgpu_dm_update_connector_after_detect. It appears to be calling drm_add_edid_modes, which indirectly calls either do_detailed_mode or drm_mode_duplicate.

This has caused me to run out of memory a handful of times, which could only be resolved by rebooting.

I only experienced this after upgrading to 5.9.1, and it looks like commit b24bdc37d03a0478189e20a50286092840f414fa added the call to drm_add_edid_modes in amdgpu_dm_update_connector_after_detect.
Comment 1 Lee Starnes 2020-11-01 06:19:11 UTC
Created attachment 293343 [details]
dmesg with oom-killer invocations

Note that the stack has amdgpu_dm_update_connector_after_detect+0x28d/0x330 > drm_add_edid_modes+0x6e1/0x1860. This was recorded on Linux 5.9.1, but the kmemleak output was captured on Linux 5.9.2.
Comment 2 Lee Starnes 2020-11-09 04:24:50 UTC
It looks like this can be fixed by setting aconnector->num_modes to the return value from drm_add_edid_modes. At least one other place in amdgpu_dm.c sets struct amdgpu_dm_connector.num_modes to the return value of drm_add_edid_modes like this. I'm not familiar enough with AMDGPU or DRM internals to know if this will mess anything up.
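
The idea above amounts to keeping the return value of drm_add_edid_modes instead of discarding it, roughly aconnector->num_modes = drm_add_edid_modes(connector, aconnector->edid), assuming the call takes the connector and its EDID. The following is only a user-space sketch of that pattern, not kernel code: mock_connector, mock_add_edid_modes, mock_update_after_detect and mock_demo are hypothetical stand-ins invented for illustration, and the sketch does not show whether this change actually stops the leak.

```c
#include <assert.h>

/* Hypothetical stand-in for the connector; NOT the real DRM/amdgpu structs. */
struct mock_connector {
    int probed_count; /* modes appended to the (mock) probed list */
    int num_modes;    /* cached count, analogous to amdgpu_dm_connector.num_modes */
};

/* Stand-in for drm_add_edid_modes(): appends modes and returns how many were added. */
static int mock_add_edid_modes(struct mock_connector *c, int modes_in_edid)
{
    c->probed_count += modes_in_edid;
    return modes_in_edid;
}

/* Pattern from this comment: store the return value rather than dropping it. */
static void mock_update_after_detect(struct mock_connector *c, int modes_in_edid)
{
    c->num_modes = mock_add_edid_modes(c, modes_in_edid);
}

/* Small usage check: after one detect, the cached count matches what was added. */
static void mock_demo(void)
{
    struct mock_connector c = {0, 0};

    mock_update_after_detect(&c, 3);
    assert(c.num_modes == 3);
    assert(c.probed_count == 3);
}
```

In the real driver the equivalent change would be a one-line assignment inside amdgpu_dm_update_connector_after_detect; the mock only shows the bookkeeping, not the list handling in DRM.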
Comment 3 Lee Starnes 2020-11-09 06:14:02 UTC
Created attachment 293577 [details]
proposed patch
Comment 4 youling257 2020-12-21 15:57:53 UTC
I have the same memory leak.

android_x86:/ # echo scan >  /sys/kernel/debug/kmemleak
android_x86:/ # cat /sys/kernel/debug/kmemleak
android_x86:/ # echo scan >  /sys/kernel/debug/kmemleak
android_x86:/ # cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff8edad8208580 (size 128):
  comm "ueventd", pid 1498, jiffies 4294676333 (age 65.106s)
  hex dump (first 32 bytes):
    22 16 04 00 00 0a 30 0a 50 0a a0 0a 00 00 40 06  ".....0.P.....@.
    43 06 48 06 69 06 00 00 05 00 00 00 00 00 00 00  C.H.i...........
  backtrace:
    [<0000000080ce8e0b>] do_detailed_mode+0x27c/0x520 [drm]
    [<000000000427e646>] drm_for_each_detailed_block.part.0+0x35/0x110 [drm]
    [<00000000566583b3>] drm_add_edid_modes+0x22b/0x1880 [drm]
    [<00000000f63b328b>] amdgpu_dm_update_connector_after_detect+0x385/0x4f0 [amdgpu]
    [<000000009f1bbb4c>] dm_helpers_read_local_edid+0xaa/0x170 [amdgpu]
    [<0000000005f6f065>] dc_link_detect_helper+0x29b/0xd70 [amdgpu]
    [<00000000a096d0f5>] dc_link_detect+0x31/0x50 [amdgpu]
    [<000000009a977098>] amdgpu_dm_init.isra.0.cold+0xf81/0x1297 [amdgpu]
    [<00000000cfd3da50>] dm_hw_init+0xe/0x20 [amdgpu]
    [<00000000128bd3d5>] amdgpu_device_init.cold+0x13c7/0x16b5 [amdgpu]
    [<0000000039b2a07d>] amdgpu_driver_load_kms+0x2b/0x200 [amdgpu]
    [<000000009b370228>] amdgpu_pci_probe+0x129/0x1b0 [amdgpu]
    [<0000000066485d99>] pci_device_probe+0xd2/0x150
    [<00000000c858be29>] really_probe+0x232/0x460
    [<00000000f84cda17>] driver_probe_device+0x5d/0x150
    [<00000000103f2cc3>] device_driver_attach+0xa1/0xb0
unreferenced object 0xffff8edad828f280 (size 128):
  comm "ueventd", pid 1498, jiffies 4294676333 (age 65.107s)
  hex dump (first 32 bytes):
    14 44 02 00 80 07 d8 07 04 08 98 08 00 00 38 04  .D............8.
    3c 04 41 04 65 04 00 00 0a 00 00 00 00 00 00 00  <.A.e...........
  backtrace:
    [<0000000017977f42>] drm_mode_duplicate+0x1f/0x90 [drm]
    [<00000000c4367b7e>] drm_mode_std+0x1fe/0x5e0 [drm]
    [<00000000d7555cdd>] drm_add_edid_modes+0x2c7/0x1880 [drm]
    [<00000000f63b328b>] amdgpu_dm_update_connector_after_detect+0x385/0x4f0 [amdgpu]
    [<000000009f1bbb4c>] dm_helpers_read_local_edid+0xaa/0x170 [amdgpu]
    [<0000000005f6f065>] dc_link_detect_helper+0x29b/0xd70 [amdgpu]
    [<00000000a096d0f5>] dc_link_detect+0x31/0x50 [amdgpu]
    [<000000009a977098>] amdgpu_dm_init.isra.0.cold+0xf81/0x1297 [amdgpu]
    [<00000000cfd3da50>] dm_hw_init+0xe/0x20 [amdgpu]
    [<00000000128bd3d5>] amdgpu_device_init.cold+0x13c7/0x16b5 [amdgpu]
    [<0000000039b2a07d>] amdgpu_driver_load_kms+0x2b/0x200 [amdgpu]
    [<000000009b370228>] amdgpu_pci_probe+0x129/0x1b0 [amdgpu]
    [<0000000066485d99>] pci_device_probe+0xd2/0x150
    [<00000000c858be29>] really_probe+0x232/0x460
    [<00000000f84cda17>] driver_probe_device+0x5d/0x150
    [<00000000103f2cc3>] device_driver_attach+0xa1/0xb0
Comment 5 youling257 2020-12-21 16:29:28 UTC
(In reply to Lee Starnes from comment #3)
> Created attachment 293577 [details]
> proposed patch

This patch doesn't seem to help for me; I tested it on the Linux 5.10 kernel.
Thanks for pointing out the bad commit.
I can revert "drm/amd/display: Fix EDID parsing after resume from suspend" to fix the memory leak.
Comment 6 Alex Deucher 2020-12-21 16:47:29 UTC
Does this patch work any better?
https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg54780.html
Comment 7 youling257 2020-12-21 17:17:05 UTC
(In reply to Alex Deucher from comment #6)
> Does this patch work any better?
> https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg54780.html

Nice! I tested this patch and it fixes my memleak problem.
Comment 8 Lee Starnes 2020-12-24 15:01:19 UTC
(In reply to Alex Deucher from comment #6)
> Does this patch work any better?
> https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg54780.html

This looks better than my patch. I've been using it for the last week or so with my RX 480 and it has been working.
Comment 9 Oleksandr Natalenko 2021-01-05 14:14:29 UTC
This change caused a regression that leads to an inability to light up the display after powering it off.

See:

* https://lore.kernel.org/lkml/e5d9703f-42a4-f154-cf13-55a3eba10859@tomt.net/
* https://bugzilla.kernel.org/show_bug.cgi?id=211033
* https://bugs.archlinux.org/task/69202
Comment 10 youling257 2021-01-10 19:47:03 UTC
I can't stand the memory leak, so I will revert "Revert "drm/amd/display: Fix memory leaks in S3 resume"".

That means reverting 5efc1f4b454c6179d35e7b0c3eda0ad5763a00fc in today's Linux 5.11-rc3.
I use an rc kernel every week.