Bug 202537 - amdgpu/DC failed to reserve new abo buffer before flip
Summary: amdgpu/DC failed to reserve new abo buffer before flip
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-02-09 14:17 UTC by Bernd Steinhauser
Modified: 2019-02-19 21:00 UTC
CC: 4 users

See Also:
Kernel Version: 4.20
Subsystem:
Regression: No
Bisected commit-id:


Attachments
kernel messages (83.56 KB, text/plain)
2019-02-09 14:27 UTC, Bernd Steinhauser
kmemleak output with 4.20.6 (118.91 KB, text/plain)
2019-02-11 18:59 UTC, Bernd Steinhauser
kmemleak output with 5.0-rc6 (61.30 KB, text/plain)
2019-02-17 10:00 UTC, Bernd Steinhauser
kmemleak output with 4.19.0-rc1 5d35ed4832da (60.25 KB, text/plain)
2019-02-18 22:28 UTC, Bernd Steinhauser

Description Bernd Steinhauser 2019-02-09 14:17:05 UTC
I've been using amdgpu on my Kaveri (A10-7800) for a long time now and it works fine.
In recent kernel versions (since 4.15, I think), I've been trying it with DC activated, and apart from some initial issues with the HDMI connection it works fine; on 4.19 it's rock-stable.

However, when I tried 4.20 (all versions from 4.20.1 to 4.20.6), I experienced a regression.
Initially everything works fine, but at some point video-related things in particular stop working properly.
vaapi seems more affected than vdpau, but at some point they both fail to set up hw decoding.
(btw, vdpau for some reason can only handle one video at a time, while vaapi can do multiple, but that's an unrelated issue; it used to behave differently a year or more ago.)

I think the problems start when I see a lot of messages like this:
[drm:amdgpu_display_crtc_page_flip_target] *ERROR* failed to reserve new abo buffer before flip

However, after that I can continue for a bit, possibly restricted to vdpau instead of vaapi.
At some point the system fails due to a memory leak; at least, the OOM killer starts killing processes until it ends up killing the window manager and X11. Before that I get these messages:
[drm:amdgpu_cs_ioctl] *ERROR* amdgpu_vm_validate_pt_bos() failed.
[drm:amdgpu_cs_ioctl] *ERROR* Not enough memory for command submission!
[drm:amdgpu_cs_ioctl] *ERROR* amdgpu_cs_list_validate(validated) failed.
[drm:amdgpu_cs_ioctl] *ERROR* Not enough memory for command submission!
[drm:amdgpu_cs_ioctl] *ERROR* amdgpu_vm_validate_pt_bos() failed.
[drm:amdgpu_cs_ioctl] *ERROR* Not enough memory for command submission!

--- snip --- (lots of oom activity)

and finally:
[TTM] Out of kernel memory
[TTM] Out of kernel memory
[TTM] Out of kernel memory
[TTM] Out of kernel memory
[TTM] Out of kernel memory
[TTM] Out of kernel memory
[TTM] Out of kernel memory
amdgpu 0000:00:01.0: (-12) failed to allocate kernel bo
[drm:amdgpu_uvd_free_handles] *ERROR* Error destroying UVD -12!

The last message is – I think – actually the resolution of the OOM problem, but I'm certainly not an expert.

CPU/GPU is:
vendor_id       : AuthenticAMD
cpu family      : 21
model           : 48
model name      : AMD A10-7800 Radeon R7, 12 Compute Cores 4C+8G
stepping        : 1
microcode       : 0x6003106
cpu MHz         : 1592.730
cache size      : 2048 KB

Back to 4.19 for now since that runs beautifully.
Comment 1 Bernd Steinhauser 2019-02-09 14:27:48 UTC
Created attachment 281077 [details]
kernel messages
Comment 2 Michel Dänzer 2019-02-11 10:47:50 UTC
Yeah, looks like a memory leak.

Please bisect and/or provide kmemleak output, otherwise it might be difficult to make progress on this issue.
Comment 3 Bernd Steinhauser 2019-02-11 17:09:22 UTC
Sure, I can try to bisect it, but it would help if I could narrow down the number of commits, because the problem usually doesn't show up right away, so each step takes a while.
E.g. restricting the bisect to commits touching drivers/gpu/drm/amd would mean about 8 steps instead of 13.
It would help even more if I could narrow it down further, e.g. to a subset of files.
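The path restriction asked about here is built into git bisect via a pathspec. A self-contained sketch in a throwaway repo (all file names and commit messages are invented for illustration) shows that only commits touching the given path are visited:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git config user.email demo@example.com && git config user.name demo

# Build a tiny history: two commits touch the "amd" path, two don't.
mkdir -p drivers/gpu/drm/amd other
echo a > drivers/gpu/drm/amd/f.c
echo a > other/f.c
git add -A && git commit -qm "amd: initial (good)"
echo b > other/f.c                && git commit -qam "other: unrelated change"
echo b > drivers/gpu/drm/amd/f.c  && git commit -qam "amd: introduce leak (bad)"
echo c > other/f.c                && git commit -qam "other: another change"

# Limit the bisect to commits touching drivers/gpu/drm/amd; the two
# "other:" commits are never offered for testing.
git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)" -- drivers/gpu/drm/amd
result=$(git bisect bad)   # the only remaining candidate is the bad amd commit
echo "$result"
```

Applied to the kernel tree, something like `git bisect start v4.20 v4.19 -- drivers/gpu/drm/amd` is what would cut the step count from ~13 to ~8.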
Comment 4 Michel Dänzer 2019-02-11 17:20:44 UTC
drivers/gpu/drm/amd/display/ seems likely. Even if the result from that doesn't make sense, it should at least narrow down the other commits you need to test.

Or maybe just start with kmemleak?
Comment 5 Bernd Steinhauser 2019-02-11 17:50:26 UTC
I'll have a look at kmemleak, but I've never worked with it, so it would be nice to have a backup in case I don't get along with it. ;)
Comment 6 Bernd Steinhauser 2019-02-11 18:59:43 UTC
Created attachment 281105 [details]
kmemleak output with 4.20.6

So I let kmemleak do a scan and this is the output.
In case it matters, I let mpv render a video with hwdec vaapi, since that is how I first noticed that something's going wrong.
Comment 7 Michel Dänzer 2019-02-12 09:07:56 UTC
kmemleak is claiming there are leaks all over the place. That's weird, since other people (including myself) aren't seeing any such leaks, including with 4.20-based kernels.

So, I'm afraid this indicates some lower level issue, and you'll have to bisect without making any assumptions about where the problem lies.
Comment 8 Paul Menzel 2019-02-12 15:15:45 UTC
I hit kmemleak problems too, and reported them at freedesktop.org [1]. Unfortunately, I have not had access to the system since the report, and won't have until the end of next week.

[1]: https://bugs.freedesktop.org/show_bug.cgi?id=109389
     "[Bug 109389] memory leak in `amdgpu_bo_create()`"
Comment 9 Bernd Steinhauser 2019-02-12 17:09:06 UTC
(In reply to Michel Dänzer from comment #7)
> kmemleak is claiming there are leaks all over the place. That's weird, since
> other people (including myself) aren't seeing any such leaks, also with 4.20
> based kernels.
> 
> So, I'm afraid this indicates some lower level issue, and you'll have to
> bisect without making any assumptions about where the problem lies.

Well, after my test above I crosschecked this on 4.19.20 and I definitely don't see any memleaks there.

So now that I know how to perform a quick test for each kernel version, bisecting this shouldn't be a big deal anymore and I'll try to do that later on.
Comment 10 Bernd Steinhauser 2019-02-14 07:34:37 UTC
Unfortunately this is turning out to be much harder than expected, because about 1/3 of the revs to test just won't boot at all (instant kernel panic, unresponsive system).
This problem was fixed somewhere in the 4.19 release candidates, but I first need to track down the fix so I can properly continue the bisect.

Of the rest, another 1/3 of the revs do boot, but only to a black screen.
While I can ssh into the system and check for memleaks, I don't think that's a proper test, because it looks to me as if amdgpu failed to initialize properly.
So I need to track down the fix for this as well (again somewhere in the 4.19 release candidates).

So far, all I can be sure of is that the responsible commit was before v4.19-rc5 was backmerged into drm-next and drm-misc-next (7b76d0588477d4b6097a9048b42835a45caf5c48).
But that still leaves quite a few commits to test.
Comment 11 Bernd Steinhauser 2019-02-15 22:37:09 UTC
Ok, I think I've finally been able to track this down.
Not 100% sure, because for the final test versions I had to apply a few patches to fix bugs that would otherwise have prevented testing.
In any case, this was the first version that showed this massive number of memleaks; before it there were only 6 (4 of which were related to HID and ACPI).
5d35ed4832dab334e076a24c18a52776c2f24911 is the first bad commit
commit 5d35ed4832dab334e076a24c18a52776c2f24911
Author: Christian König <christian.koenig@amd.com>
Date:   Fri Aug 31 11:08:06 2018 +0200

    drm/amdgpu: fix idle state and bulk_moveable flag

    Add BOs to the idle state again and correctly clear the flag when
    new BOs are added.

    Signed-off-by: Christian König <christian.koenig@amd.com>
    Tested-by: Michel Dänzer <michel.daenzer@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

:040000 040000 28e778e55b368e605e6f2df4efea4be5f324d4ae 371220da179e31b7d2c97741dd984cb896fcb4c4 M      drivers
Comment 12 Christian König 2019-02-16 19:17:05 UTC
Well, that was a known issue, but it should be fixed in 4.20.

Sorry to say it, but you have most likely bisected to a patch that caused a memory leak which has already been fixed.
Comment 13 Bernd Steinhauser 2019-02-17 08:07:54 UTC
meh …

well, I just tested 4.20.10 and I still see a lot of memory leaks there.
kmemleak: 93 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

And it looks an awful lot like what I saw when I checked the commit id above.
So if you say it was fixed, it would help to point me to the commit id of the fix, so I can check it and use it as a starting/reference point.
Comment 14 Paul Menzel 2019-02-17 08:36:29 UTC
Hi. After testing Linux 5.0-rc6 for some minutes, I am no longer seeing these kmemleak messages on the MSI Mortar B350M. Bernd, could you test this?

Unfortunately, the commit that supposedly fixes the leak introduced by the commit you bisected does not have a Fixes tag. At least the command below returns nothing.
    git log --grep "5d35ed4832d" origin/master
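For context, the tag being referred to is a one-line trailer in the fixing commit's message. Had it been present, it would have looked something like the following (hypothetical, since the actual fix lacks it), and it is exactly the string the `git log --grep` search keys off:

```shell
# Trailer the fixing commit could have carried (hypothetical example):
#   Fixes: 5d35ed4832da ("drm/amdgpu: fix idle state and bulk_moveable flag")
# With such a trailer present, this would locate the fix in a kernel checkout:
git log --oneline --grep 'Fixes: 5d35ed4832d' origin/master
```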
Comment 15 Bernd Steinhauser 2019-02-17 08:57:26 UTC
Sure, can test that one.
Comment 16 Bernd Steinhauser 2019-02-17 10:00:01 UTC
Created attachment 281177 [details]
kmemleak output with 5.0-rc6

Nope sorry, I see the same with kernel 5.0-rc6.

btw, those first two leaks that start with acpi functions – I've seen those in every version I tested, including the later 4.19 versions.
I don't know if I should open a bug report about that one (or maybe there already is one).

I'll hold off on further testing until I have the commit id of the fix.
Comment 17 Christian König 2019-02-18 07:45:48 UTC
We completely disabled the feature added in "5d35ed4832d", to be upstreamed again later on.

Can you guys please test amd-staging-drm-next as well and check whether the problem occurs there too? If not, please bisect what fixed it.
Comment 18 Paul Menzel 2019-02-18 08:08:41 UTC
(In reply to Christian König from comment #17)
> We completely disabled the feature added in "5d35ed4832d" for upstreaming
> later on.

Sorry, I do not understand your reply at all. Could you please rephrase? What commit does that, what you describe?

> Can you guys please test amd-staging-drm-next as well and check if the
> problem occurs there as well. If not then please bisect what fixed it.

Bernd and I seem to have different problems – or I updated user space so that it no longer triggers the problematic path, or I did not perform the steps to reproduce it (although starting GDM should have been enough).

Anyway, why should the fix be bisected? To apply it to stable?

Bernd, if you have time, it’d be great, if you listed the commits here, which you needed to apply on top to fix the other regressions.
Comment 19 Christian König 2019-02-18 09:02:41 UTC
(In reply to Paul Menzel from comment #18)
> (In reply to Christian König from comment #17)
> > We completely disabled the feature added in "5d35ed4832d" for upstreaming
> > later on.
> 
> Sorry, I do not understand your reply at all. Could you please rephrase?
> What commit does that, what you describe?

Commit 5d35ed4832d is a bug fix for bulk moves, a feature which should be completely disabled in 4.20. So your bisect result is most likely incorrect.

> > Can you guys please test amd-staging-drm-next as well and check if the
> > problem occurs there as well. If not then please bisect what fixed it.
> 
> Bernd and I seem to have different problems – or I updated user space not
> triggering the problematic path anymore or did not do the steps to reproduce
> it (although starting GDM should have been enough).
> 
> Anyway, why should the fix be bisected? To apply it to stable?

Yes, exactly.

It looks like 4.20 is either using bulk moves (which it shouldn't) or we have introduced another problem which also causes memory leaks.
Comment 20 Bernd Steinhauser 2019-02-18 22:26:42 UTC
(In reply to Christian König from comment #17)
> We completely disabled the feature added in "5d35ed4832d" for upstreaming
> later on.
> 
> Can you guys please test amd-staging-drm-next as well and check if the
> problem occurs there as well. If not then please bisect what fixed it.
It would've been nice to point me to the corresponding repo as well.
Don't worry, I figured it out, but it still would've been nice.
In any case, the current HEAD of amd-staging-drm-next looks good to me; I can't reproduce the memleaks with it.

I'll try to find the fix, but that'll take me 2-3 days.

(In reply to Paul Menzel from comment #18)
> Bernd, if you have time, it’d be great, if you listed the commits here,
> which you needed to apply on top to fix the other regressions.
Most importantly 9d27e39d309c93025ae6aa97236af15bef2a5f1f, which says it's for Carrizo, but it seems to affect my Kaveri as well, which wouldn't be surprising since the two are related.
On your Ryzen(?) system, though, this one might not be necessary.
For the range of commits after 972a21f94631642d6714bb2a1983b7b15a77526d I also applied 03651735fbded39f608163718f816ab9cf14fba7 on top, since otherwise the system would freeze very quickly.
But even with that applied, the commit mentioned above is very unstable and I have only about 1 min or so for my tests.
Still, that was enough time to run the tests at least twice and show the same flood of memory leaks with pretty much the same function sequences.

(In reply to Christian König from comment #19)
> 
> Commit 5d35ed4832d is a bug fix for bulk moves, which is a feature which
> should be completely disabled in 4.20. So your bisecting is most likely
> incorrect.
> 
Well, as I said, I'm not 100% sure, because I had to apply two patches to be able to test at all.
But I've repeated my tests with those two versions and came to the same result.
b995795bf09b6bb7847a2a9fc8e6b5b4ab0ce20c shows exactly 6 memleaks for me: the 2 acpi ones I mentioned above and 4 showing hid function sequences, but nothing with drm or similar.
One commit later (5d35ed4832d), with the same two patches applied, it's a different story and I get 60 or more memleaks listed, which – you have to admit – look an awful lot like what I posted for 5.0-rc6 above (I'll upload the log in a minute).
That could be pure coincidence, but I'd be surprised if it was.
Comment 21 Bernd Steinhauser 2019-02-18 22:28:07 UTC
Created attachment 281201 [details]
kmemleak output with 4.19.0-rc1 5d35ed4832da
Comment 22 Paul Menzel 2019-02-19 19:27:41 UTC
Ok, back at the system after some days, I see the kmemleaks are still present with Linux 5.0-rc6+.

Bernd, what triggers this on your system? What is your test case? Starting some program?
Comment 23 Bernd Steinhauser 2019-02-19 21:00:27 UTC
(In reply to Paul Menzel from comment #22)
> Bernd, what triggers this on your system? What is your test case? Start some
> program?
basically: start the system, log in, ensure that /sys/kernel/debug/kmemleak is empty, then initiate the scan and wait for the result.
I found that testing without logging in (just starting sddm in my case) can be enough to spot the memleaks, but you can't be sure.
Also, I think that putting some more load on the gpu (e.g. playing a video) helps to spot more memleaks quicker, thus getting a reliable result sooner, but it doesn't seem necessary.

In case I don't find memleaks, I repeat the scan routine a few times, do something else in the meantime (like preparing the next test version), and then do the scan once more before rebooting, just to be sure.
So in total – on my rather slow system – every version is tested for about 30 min, although for a bad version about 5 min is enough.
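For anyone reproducing this, the routine described above maps onto kmemleak's debugfs interface roughly like this (a sketch; requires CONFIG_DEBUG_KMEMLEAK=y, a mounted debugfs, and root):

```shell
KMEMLEAK=/sys/kernel/debug/kmemleak

cat "$KMEMLEAK"            # should print nothing on a clean, leak-free boot
# ... exercise the GPU here, e.g. play a video with vaapi hw decoding ...
echo scan > "$KMEMLEAK"    # trigger an immediate memory scan
sleep 10                   # give the scan time to complete
cat "$KMEMLEAK"            # any output = suspected leaks, one backtrace each
echo clear > "$KMEMLEAK"   # forget current suspects before the next round
```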

Anyway, back to the original topic. Bisecting went much more smoothly and quickly this time, and I can already present the result, see below.
I tried to apply the fix on top of 4.20.10, but that doesn't compile, as it most likely depends on other commits.
So unfortunately I can't cross-check this at the moment.
@Paul: it might be a good idea if you check this as well, i.e. test b61857b5e and its parent.

git bisect start '--term-old' 'unfixed' '--term-new' 'fixed'
# unfixed: [8fe28cb58bcb235034b64cbbb7550a8a43fd88be] Linux 4.20
git bisect unfixed 8fe28cb58bcb235034b64cbbb7550a8a43fd88be
# fixed: [256445aee13f4de36cb47c13a9560b5d74faacd2] drm/amdgpu: remove some old unused dpm helpers
git bisect fixed 256445aee13f4de36cb47c13a9560b5d74faacd2
# unfixed: [e0c38a4d1f196a4b17d2eba36afff8f656a4f1de] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
git bisect unfixed e0c38a4d1f196a4b17d2eba36afff8f656a4f1de
# unfixed: [9ef10340749e1da0c7fde609cedd5360f8484a0b] Merge tag 'xtensa-20181228' of git://github.com/jcmvbkbc/linux-xtensa
git bisect unfixed 9ef10340749e1da0c7fde609cedd5360f8484a0b
# unfixed: [fcf010449ebe1db0cb68b2c6410972a782f2bd14] Merge tag 'kgdb-4.21-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux
git bisect unfixed fcf010449ebe1db0cb68b2c6410972a782f2bd14
# unfixed: [9b286efeb5eb5aaa2712873fc1f928b2f879dbde] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
git bisect unfixed 9b286efeb5eb5aaa2712873fc1f928b2f879dbde
# unfixed: [ac5eed2b41776b05cf03aac761d3bb5e64eea24c] Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect unfixed ac5eed2b41776b05cf03aac761d3bb5e64eea24c
# unfixed: [5dc3fc5a7835f6b98184d2b8df909c5230c37a2c] drm/amd/display: Check if registers are available before accessing
git bisect unfixed 5dc3fc5a7835f6b98184d2b8df909c5230c37a2c
# fixed: [87076c8829465b8ae71225f7e639e0e28ab4b4a2] drm/amd/display: Re-enable CRC capture following modeset
git bisect fixed 87076c8829465b8ae71225f7e639e0e28ab4b4a2
# fixed: [84d3245599f527138c4d4b87deed14a7e85cd81b] drm/amdgpu: Add missing power attribute to APU check
git bisect fixed 84d3245599f527138c4d4b87deed14a7e85cd81b
# unfixed: [ae6d343541bb75958e9535d056adaf4ff6a66d6a] drm/ttm: add lru notify to bo driver v2
git bisect unfixed ae6d343541bb75958e9535d056adaf4ff6a66d6a
# fixed: [5d50fcbda7b0acd301bb1fc3d828df0aa29237b8] drm/ttm: stop always moving BOs on the LRU on page fault
git bisect fixed 5d50fcbda7b0acd301bb1fc3d828df0aa29237b8
# fixed: [d7337ca2640cde21ff178bd78f01d94cd5ea2e08] drm/amd/powerplay: support retrieving and adjusting SOC clock power levels V2
git bisect fixed d7337ca2640cde21ff178bd78f01d94cd5ea2e08
# fixed: [b61857b5e365889d67a6296c413df396032d374d] drm/amdgpu: set bulk_moveable to false when lru changed v2
git bisect fixed b61857b5e365889d67a6296c413df396032d374d
# first fixed commit: [b61857b5e365889d67a6296c413df396032d374d] drm/amdgpu: set bulk_moveable to false when lru changed v2
commit b61857b5e365889d67a6296c413df396032d374d
Author: Chunming Zhou <david1.zhou@amd.com>
Date:   Thu Jan 10 15:49:54 2019 +0800

    drm/amdgpu: set bulk_moveable to false when lru changed v2

    if lru is changed, we cannot do bulk moving.
    v2:
    root bo isn't in bulk moving, skip its change.

    Signed-off-by: Chunming Zhou <david1.zhou@amd.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

:040000 040000 3544338af6c797a518386198369dc4766961d151 392a4c14309bd108b20046609138f7bc2859f3f7 M      drivers
