Bug 202537
Summary: | amdgpu/DC failed to reserve new abo buffer before flip | ||
---|---|---|---|
Product: | Drivers | Reporter: | Bernd Steinhauser (linux) |
Component: | Video(DRI - non Intel) | Assignee: | drivers_video-dri |
Status: | NEW --- | ||
Severity: | normal | CC: | christian.koenig, harry.wentland, nicholas.kazlauskas, pmenzel+bugzilla.kernel.org |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 4.20 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
kernel messages
kmemleak output with 4.20.6
kmemleak output with 5.0-rc6
kmemleak output with 4.19.0-rc1 5d35ed4832da |
Description
Bernd Steinhauser
2019-02-09 14:17:05 UTC
Created attachment 281077 [details]
kernel messages
Yeah, looks like a memory leak. Please bisect and/or provide kmemleak output, otherwise it might be difficult to make progress on this issue.

Sure, I can try to bisect it, but it would help if I could narrow the amount of commits down, because usually the problem doesn't come right away, so it would take some time to find out. e.g. restricting to commits made in drivers/gpu/drm/amd would result in about 8 steps instead of 13. It would really help if I could narrow it even down further, like a subset of files?

drivers/gpu/drm/amd/display/ seems likely. Even if the result from that doesn't make sense, it should at least narrow down the other commits you need to test. Or maybe just start with kmemleak?

I'll have a look at kmemleak, but I've never worked with it, so it would be nice to have a backup in case I don't get along with it. ;)

Created attachment 281105 [details]
kmemleak output with 4.20.6
So I let kmemleak do a scan and this is the output.
In case it matters, I let mpv render a video with hwdec vaapi, since that is how I first noticed that something's going wrong.
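For readers who, like the reporter, have not used kmemleak before, the scan described above can be sketched as a few shell helpers. This is only a sketch: it assumes a kernel built with CONFIG_DEBUG_KMEMLEAK, debugfs mounted at /sys/kernel/debug, and root privileges. The function names and the KMEMLEAK variable are made up for this illustration and are not part of any tool.

```shell
#!/bin/sh
# Path of the kmemleak control file. Overridable so the helpers can be
# tried against an ordinary file (assumption: debugfs is mounted at
# /sys/kernel/debug).
KMEMLEAK="${KMEMLEAK:-/sys/kernel/debug/kmemleak}"

# Forget all leaks reported so far, so that only new ones show up.
kmemleak_clear() {
    echo clear > "$KMEMLEAK"
}

# Trigger an immediate memory scan instead of waiting for the periodic one.
kmemleak_scan() {
    echo scan > "$KMEMLEAK"
}

# Count reported leaks: each record starts with "unreferenced object".
kmemleak_count() {
    grep -c '^unreferenced object' "$KMEMLEAK"
}
```

A test run would then be: call `kmemleak_clear`, put some load on the GPU (for example play a video), call `kmemleak_scan`, and look at `kmemleak_count` or at the file itself. Repeating the scan a few times before trusting a clean result seems a reasonable precaution.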
kmemleak is claiming there are leaks all over the place. That's weird, since other people (including myself) aren't seeing any such leaks, also with 4.20 based kernels.

So, I'm afraid this indicates some lower level issue, and you'll have to bisect without making any assumptions about where the problem lies.

I hit kmemleak problems, and reported those at freedesktop.org [1]. Unfortunately, I have not had access to the system after the report, and won’t have until the end of next week.

[1]: https://bugs.freedesktop.org/show_bug.cgi?id=109389 "[Bug 109389] memory leak in `amdgpu_bo_create()`"

(In reply to Michel Dänzer from comment #7)
> kmemleak is claiming there are leaks all over the place. That's weird, since
> other people (including myself) aren't seeing any such leaks, also with 4.20
> based kernels.
>
> So, I'm afraid this indicates some lower level issue, and you'll have to
> bisect without making any assumptions about where the problem lies.
Well, after my test above I crosschecked this on 4.19.20 and I definitely don't see any memleaks there. So now that I know how to perform a quick test for each kernel version, bisecting this shouldn't be a big deal anymore and I'll try to do that later on.

Unfortunately this turns out to be much harder than expected, because about 1/3 of the revs to test just won't boot at all (like instant kernel panic and not responding). This problem was fixed somewhere in the release candidates of 4.19, but I first need to track down the fix so I can properly continue with the bisect. Of the rest, another 1/3 of the revs do boot, but only with a black screen. While I can ssh into the system and check for memleaks, I don't think it's a proper test, because it seems to me as if amdgpu failed to initialize properly. So I need to track down the fix for this (again somewhere in the release candidates of 4.19) as well.
So far, all I can be sure of is that the responsible commit was before v4.19-rc5 was backmerged into drm-next and drm-misc-next (7b76d0588477d4b6097a9048b42835a45caf5c48). But that still leaves quite a few commits to test.

Ok, so finally I think I've been able to track this down. Not 100% sure, because for the final test versions I had to apply a few patches to fix bugs that otherwise would've prevented tests. In any case, this was the first version that showed this massive amount of memleaks, before there were only 6 (of which 4 were related to HID and ACPI).

5d35ed4832dab334e076a24c18a52776c2f24911 is the first bad commit
commit 5d35ed4832dab334e076a24c18a52776c2f24911
Author: Christian König <christian.koenig@amd.com>
Date: Fri Aug 31 11:08:06 2018 +0200

    drm/amdgpu: fix idle state and bulk_moveable flag

    Add BOs to the idle state again and correctly clear the flag when new BOs are added.

    Signed-off-by: Christian König <christian.koenig@amd.com>
    Tested-by: Michel Dänzer <michel.daenzer@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

:040000 040000 28e778e55b368e605e6f2df4efea4be5f324d4ae 371220da179e31b7d2c97741dd984cb896fcb4c4 M drivers

Well that was a known issue, but it should be fixed with 4.20. Sorry to note that, but you most likely have a bisect result of a patch causing a memory leak which is already fixed.

meh … well, I just tested 4.20.10 and I do still see a lot of memory leaks there.
kmemleak: 93 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
And it looks an awful lot the same as when I checked the commit id above. So if you say it was fixed, it might be helpful to point me to the commit id when it was fixed, so I can check that and use it as a starting/reference point.

Hi. Testing Linux 5.0-rc6 for some minutes, I am not seeing these kmemleak messages anymore on the MSI Mortar B350M. Bernd, could you test this?
Unfortunately, the commit supposedly fixing the leak introduced by the commit you bisected does not have a Fixes tag. At least the command below returns nothing.

    git log --grep "5d35ed4832d" origin/master

Sure, can test that one.

Created attachment 281177 [details]
kmemleak output with 5.0-rc6
Nope sorry, I see the same with kernel 5.0-rc6.
btw, those first two leaks which start with acpi functions, I've seen those in every version I tested, including the later 4.19 versions.
Don't know if I should open a bug report about that one (or maybe there is already one).
I'll wait with further testing for the commit id of the fix.
We completely disabled the feature added in "5d35ed4832d" for upstreaming later on.

Can you guys please test amd-staging-drm-next as well and check if the problem occurs there as well. If not then please bisect what fixed it.

(In reply to Christian König from comment #17)
> We completely disabled the feature added in "5d35ed4832d" for upstreaming
> later on.
Sorry, I do not understand your reply at all. Could you please rephrase? What commit does that, what you describe?
> Can you guys please test amd-staging-drm-next as well and check if the
> problem occurs there as well. If not then please bisect what fixed it.
Bernd and I seem to have different problems – or I updated user space not triggering the problematic path anymore or did not do the steps to reproduce it (although starting GDM should have been enough).
Anyway, why should the fix be bisected? To apply it to stable?
Bernd, if you have time, it’d be great, if you listed the commits here, which you needed to apply on top to fix the other regressions.

(In reply to Paul Menzel from comment #18)
> (In reply to Christian König from comment #17)
> > We completely disabled the feature added in "5d35ed4832d" for upstreaming
> > later on.
>
> Sorry, I do not understand your reply at all. Could you please rephrase?
> What commit does that, what you describe?
Commit 5d35ed4832d is a bug fix for bulk moves, which is a feature which should be completely disabled in 4.20. So your bisecting is most likely incorrect.
> > Can you guys please test amd-staging-drm-next as well and check if the
> > problem occurs there as well. If not then please bisect what fixed it.
>
> Bernd and I seem to have different problems – or I updated user space not
> triggering the problematic path anymore or did not do the steps to reproduce
> it (although starting GDM should have been enough).
>
> Anyway, why should the fix be bisected? To apply it to stable?
Yes, exactly.
It looks like 4.20 is either using bulk moves (which it shouldn't) or we have introduced another problem which also caused memory leaks.

(In reply to Christian König from comment #17)
> We completely disabled the feature added in "5d35ed4832d" for upstreaming
> later on.
>
> Can you guys please test amd-staging-drm-next as well and check if the
> problem occurs there as well. If not then please bisect what fixed it.
Would've been nice to point me to the corresponding repo as well. Don't worry, I've figured it out, but still would've been nice. In any case, current HEAD of amd-staging-drm-next looks good to me, I can't reproduce the memleaks with that one. I'll try to find the fix, but that'll take me 2-3 days.
(In reply to Paul Menzel from comment #18)
> Bernd, if you have time, it’d be great, if you listed the commits here,
> which you needed to apply on top to fix the other regressions.
Most importantly 9d27e39d309c93025ae6aa97236af15bef2a5f1f, which says it's for Carrizo, but it seems to affect my Kaveri as well, which wouldn't be surprising since the two are related. But on your Ryzen(?) system, this one might not be necessary. I also applied 03651735fbded39f608163718f816ab9cf14fba7 on top for a wider range of commits after 972a21f94631642d6714bb2a1983b7b15a77526d since otherwise the system would freeze very quickly. But even with that one applied the mentioned id above is very unstable and I have only about 1min or so to do my tests. Still that was enough time to do the tests at least twice and show that there is the same flood of memory leaks with pretty much the same function sequences.
(In reply to Christian König from comment #19)
>
> Commit 5d35ed4832d is a bug fix for bulk moves, which is a feature which
> should be completely disabled in 4.20. So your bisecting is most likely
> incorrect.
>
Well, as I said, I'm not 100% sure, because I had to apply two patches to be even able to test.
But I've repeated my tests with those two versions earlier on and came to the same result. b995795bf09b6bb7847a2a9fc8e6b5b4ab0ce20c does show exactly 6 memleaks to me and those are the 2 acpi ones I mentioned above and 4 showing hid function sequences, but nothing with drm or similar. One commit later (5d35ed4832d) with the same two patches applied it's a different story and I get 60 or more memleaks listed, which you have to admit look an awful lot similar to what I've posted for 5.0-rc6 above (I'll upload the log in a minute). Now that could be pure coincidence, but I would be surprised if it was.

Created attachment 281201 [details]
kmemleak output with 4.19.0-rc1 5d35ed4832da
Ok, being back at the system after some days, I see the kmemleaks are still present with Linux 5.0-rc6+. Bernd, what triggers this on your system? What is your test case? Start some program?

(In reply to Paul Menzel from comment #22)
> Bernd, what triggers this on your system? What is your test case? Start some
> program?
Basically start the system, log in, ensuring that /sys/kernel/debug/kmemleak is empty, then initiating the scan and waiting for the result. I found that testing without the login (starting sddm in my case) can be enough to spot the memleaks, but you can't be sure. Also, I think that putting some more work there for the gpu (e.g. playing a video) helps to spot more memleaks quicker, thus getting a more reliable result quicker, but it doesn't seem necessary. In case I don't find memleaks, I still repeat the scan routine a few times, do something else in the meantime (like preparing the next test version) and then before rebooting do the scan once more, just to be sure. So in total – on my rather slow system – every version is tested for about 30min, although in case of a bad version about 5min is enough.

Anyway, back to the original topic. Bisecting this time went much more smoothly and much quicker than before and I can actually present the result already, see below. I tried to apply the fix on top of 4.20.10, but that doesn't compile as it most likely depends on other commits. So unfortunately I can't cross-check this at the moment.
@Paul: might be a good idea if you check this as well, meaning to test b61857b5e and its parent.
git bisect start '--term-old' 'unfixed' '--term-new' 'fixed'
# unfixed: [8fe28cb58bcb235034b64cbbb7550a8a43fd88be] Linux 4.20
git bisect unfixed 8fe28cb58bcb235034b64cbbb7550a8a43fd88be
# fixed: [256445aee13f4de36cb47c13a9560b5d74faacd2] drm/amdgpu: remove some old unused dpm helpers
git bisect fixed 256445aee13f4de36cb47c13a9560b5d74faacd2
# unfixed: [e0c38a4d1f196a4b17d2eba36afff8f656a4f1de] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
git bisect unfixed e0c38a4d1f196a4b17d2eba36afff8f656a4f1de
# unfixed: [9ef10340749e1da0c7fde609cedd5360f8484a0b] Merge tag 'xtensa-20181228' of git://github.com/jcmvbkbc/linux-xtensa
git bisect unfixed 9ef10340749e1da0c7fde609cedd5360f8484a0b
# unfixed: [fcf010449ebe1db0cb68b2c6410972a782f2bd14] Merge tag 'kgdb-4.21-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux
git bisect unfixed fcf010449ebe1db0cb68b2c6410972a782f2bd14
# unfixed: [9b286efeb5eb5aaa2712873fc1f928b2f879dbde] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
git bisect unfixed 9b286efeb5eb5aaa2712873fc1f928b2f879dbde
# unfixed: [ac5eed2b41776b05cf03aac761d3bb5e64eea24c] Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect unfixed ac5eed2b41776b05cf03aac761d3bb5e64eea24c
# unfixed: [5dc3fc5a7835f6b98184d2b8df909c5230c37a2c] drm/amd/display: Check if registers are available before accessing
git bisect unfixed 5dc3fc5a7835f6b98184d2b8df909c5230c37a2c
# fixed: [87076c8829465b8ae71225f7e639e0e28ab4b4a2] drm/amd/display: Re-enable CRC capture following modeset
git bisect fixed 87076c8829465b8ae71225f7e639e0e28ab4b4a2
# fixed: [84d3245599f527138c4d4b87deed14a7e85cd81b] drm/amdgpu: Add missing power attribute to APU check
git bisect fixed 84d3245599f527138c4d4b87deed14a7e85cd81b
# unfixed: [ae6d343541bb75958e9535d056adaf4ff6a66d6a] drm/ttm: add lru notify to bo driver v2
git bisect unfixed ae6d343541bb75958e9535d056adaf4ff6a66d6a
# fixed: [5d50fcbda7b0acd301bb1fc3d828df0aa29237b8] drm/ttm: stop always moving BOs on the LRU on page fault
git bisect fixed 5d50fcbda7b0acd301bb1fc3d828df0aa29237b8
# fixed: [d7337ca2640cde21ff178bd78f01d94cd5ea2e08] drm/amd/powerplay: support retrieving and adjusting SOC clock power levels V2
git bisect fixed d7337ca2640cde21ff178bd78f01d94cd5ea2e08
# fixed: [b61857b5e365889d67a6296c413df396032d374d] drm/amdgpu: set bulk_moveable to false when lru changed v2
git bisect fixed b61857b5e365889d67a6296c413df396032d374d
# first fixed commit: [b61857b5e365889d67a6296c413df396032d374d] drm/amdgpu: set bulk_moveable to false when lru changed v2

commit b61857b5e365889d67a6296c413df396032d374d
Author: Chunming Zhou <david1.zhou@amd.com>
Date: Thu Jan 10 15:49:54 2019 +0800

    drm/amdgpu: set bulk_moveable to false when lru changed v2

    if lru is changed, we cannot do bulk moving.
    v2: root bo isn't in bulk moving, skip its change.

    Signed-off-by: Chunming Zhou <david1.zhou@amd.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

:040000 040000 3544338af6c797a518386198369dc4766961d151 392a4c14309bd108b20046609138f7bc2859f3f7 M drivers
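
The workflow in the log above, bisecting for the commit that fixed a problem rather than the one that introduced it, can be sketched as follows. The custom terms and the two end points are taken from the log. The path restriction is the option discussed earlier in the thread and was not necessarily used for this run. `bisect_steps` is a hypothetical helper that only estimates how many builds a bisect over N candidate commits needs.

```shell
#!/bin/sh
# Estimate the number of bisect steps for N candidate commits: each step
# roughly halves the range, so about ceil(log2(N)) builds are needed.
bisect_steps() {
    n=$1
    steps=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))   # halve, rounding up
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}

# Commands to run inside a kernel tree (shown as comments, not executed):
#   git bisect start --term-old unfixed --term-new fixed
#   git bisect unfixed 8fe28cb58bcb   # Linux 4.20, leaks still present
#   git bisect fixed   256445aee13f   # first "fixed" end point in the log
# Optionally limit the candidates to one directory:
#   git bisect start --term-old unfixed --term-new fixed -- drivers/gpu/drm/amd
```

With this helper, a range of 8192 commits gives 13 steps and a range of 256 commits gives 8. That would match the "8 steps instead of 13" estimate from the start of the thread if the two ranges are of roughly those sizes, which is an assumption, since the actual commit counts are not stated.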