Bug 214001 - [bisected][regression] After commit "drm/ttm: Initialize debugfs from ttm_global_init()" kernels without debugfs explicitly set to 'allow all' fail to boot
Summary: [bisected][regression] After commit "drm/ttm: Initialize debugfs from ttm_glo...
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(Other) (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-08-08 19:09 UTC by Linux_Chemist
Modified: 2021-08-22 22:19 UTC (History)
2 users (show)

See Also:
Kernel Version: 5.14-rc4
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments

Description Linux_Chemist 2021-08-08 19:09:15 UTC
So this is an interesting one.

Problem: System hangs indefinitely/refuses to boot up. 5.14rc3 was totally fine but rc4 has the problem and I've bisected the commit to: 

69de4421bb4c103ef42a32bafc596e23918c106f is the first bad commit
commit 69de4421bb4c103ef42a32bafc596e23918c106f
Author: Jason Ekstrand <jason@jlekstrand.net>
Date:   Wed Jul 21 10:23:57 2021 -0500

    drm/ttm: Initialize debugfs from ttm_global_init()
    
    We create a bunch of debugfs entries as a side-effect of
    ttm_global_init() and then never clean them up.  This isn't usually a
    problem because we free the whole debugfs directory on module unload.
    However, if the global reference count ever goes to zero and then
    ttm_global_init() is called again, we'll re-create those debugfs entries
    and debugfs will complain in dmesg that we're creating entries that
    already exist.  This patch fixes this problem by changing the lifetime
    of the whole TTM debugfs directory to match that of the TTM global
    state.
    
    Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
    Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Link: https://patchwork.freedesktop.org/patch/msgid/20210721152358.2893314-6-jason@jlekstrand.net

I then tried loading an ubuntu mainline kernel for 5.14-rc4 and that was fine too, which meant my .config was to blame in conjunction with the change.
The specific issue narrowed down to not having debug_fs enabled in my kernel (CONFIG_DEBUG_FS is not set)

Now I've not had debugfs enabled for many, many years (is this even necessary on a kernel on which the user makes no use of the information it provides?) and now I see the option CONFIG_DEBUG_FS=y allows for one of three exclusive options. (CONFIG_DEBUG_FS_ALLOW_ALL; CONFIG_DEBUG_FS_DISALLOW_MOUNT and CONFIG_DEBUG_FS_ALLOW_NONE)

(*Moving forward, is debug_fs now a critical component of the linux kernel and required to be enabled (CONFIG_DEBUG_FS=Y) with a minimum of the 3rd option of 'allow none' given that so many things want to make use of it? Is debugfs 'expected' to be there to make reference to in driver code from now on?)

At any rate, I tested each of the 3 options and can confirm that since the commit in question, the system will _only_ boot now if: 

CONFIG_DEBUG_FS_ALLOW_ALL=y

I suspect that the commit did not account for kernel compilers who don't have debugfs at all - however, it even causes boot issues if debugfs is present but minimalised because neither:

CONFIG_DEBUG_FS_DISALLOW_MOUNT
"The API is open but filesystem is not loaded. Clients can still do
their work and read with debug tools that do not need debugfs filesystem." 

nor CONFIG_DEBUG_FS_ALLOW_NONE:
"Access is off. Clients get -PERM when trying to create nodes in
debugfs tree and debugfs is not registered as a filesystem.
Client can then back-off or continue without debugfs access."

are sufficient to get a successful boot after this commit.
Comment 1 Linux_Chemist 2021-08-08 19:18:02 UTC
As an addendum, I suppose a slight source of confusion is the info for 

CONFIG_DEBUG_FS which reads:

"debugfs is a virtual file system that kernel developers use to put
debugging files into. Enable this option to be able to read and
write to these files.

For detailed documentation on the debugfs API, see
Documentation/filesystems/.

If unsure, say N."

which implies: a) that it isn't strictly necessary to have enabled in order to boot/run normally (highlighting this bug) and b) that you would have zero need for it if you weren't reading/writing to these debugging files. 


To then have the option to enable debugfs but only run minimally with CONFIG_DEBUG_FS_ALLOW_NONE:
"Access is off. Clients get -PERM when trying to create nodes in
debugfs tree and debugfs is not registered as a filesystem.
Client can then back-off or continue without debugfs access."

leaves the question of 'why have it on and set to "allow none" rather than off completely?'
Comment 2 Duncan 2021-08-13 17:50:29 UTC
This has been reported (by someone else) on the dri-devel list (with the main kernel list and the devs CCed) as well, with me confirming it there.  No answer from the devs there either.  The reporter and I followed reporting instructions to take it to the list, and no hint it was even seen, despite the release getting closer and closer.

So I was going to try bugzilla (despite instructions to take it to the list), to see if I could raise the profile a bit, and find this bug.

Anyway, it's on both channels now.  FWIW:

https://lore.kernel.org/dri-devel/?q=5.14.0-rc4+broke+drm%2Fttm

Tho FWIW your symptoms are a bit different than those of the OP there and I.  We were able to boot, but only to legacy low-res VGA mode.  He has a boot-splash enabled and the screen blanked from early boot when the drm-framebuffer would normally take over until the login prompt, which appeared in vga mode.  I prefer to see the boot messages so no splash, and didn't have it blank, the screen just never left the legacy vga mode it normally uses for early boot.  We're both on Radeons; he's on the old radeon driver while I'm on amdgpu (polaris-11, rx460).

I wonder if you don't have the legacy vgacon (CONFIG_VGA_CONSOLE) enabled as a fall-back, as that would explain an apparent hang due to no valid graphics (tho the system may have booted, just without graphics).  Alternatively, I don't know what the behavior of non-radeon/amdgpu drm-framebuffer drivers is, maybe whatever you're running either does hang or simply doesn't fall back to vgacon as our radeon and amdgpu drivers did?

But in both his case and mine it bisected to the same commit, 69de4421bb, and reverting it against current gave both of us working systems again, so it's the same bug.
Comment 3 Linux_Chemist 2021-08-13 18:24:12 UTC
Thanks for your comment, Duncan!
Yes, I'm on a customised kernel that has a lot removed (including debugfs as you can tell) and also amdgpu (RX 5700). 
There's usually a bug in a testing RC every few releases, I just report them here after bisecting; seems the right place for it even if it's not lol  
Caught a nice bug last release cycle with the memory reservation for the bios (https://www.phoronix.com/scan.php?page=news_item&px=Linux-Always-Reserve-1MB)

(I wasn't sure to file this one under an AMD ("non-intel") specific 'video' bug but the commit was for 'drivers/gpu/drm/ttm' which I assume is agnostic. I don't know what it's for or whether only amdgpu/radeon makes use of it to say but it is interesting that the 3 of us have similar hardware.)
I can confirm all my .configs have had CONFIG_VGA_CONSOLE=y in it (though a lot of fallback stuff pulled out that probably stops me getting the legacy low-res VGA mode you mention, c'est la vie)

But anyways as you say, the ability to create a bootable kernel only becomes an issue from the commit in question when not having CONFIG_DEBUG_FS=y (and CONFIG_DEBUG_FS_ALLOW_ALL=y along with that)

Don't get me wrong, it's not a showstopper 'massive bug' because you can always put debugfs + 'allow all' into your kernel, I did so and am happily on rc5 now, but that's why I'd like a consensus to be known or shared (i.e. change the wording for the kconfig options) about whether a lot of things are expecting debugfs to be there in some form now - is it now an 'essential' part of the kernel? Or should things that rely on it fail gracefully if they don't find it?
Either it's essential and this isn't a bug and there needs to be clarification that debugfs should always be there in some form, or this is a bug and the commit needs tweaked to account for debugfs not being there or there in a diminished capacity.    

It is a bit silly that even CONFIG_DEBUG_FS_ALLOW_NONE wouldn't work for this bug because that seems like it should be providing that 'fail gracefully' mechanism to debugfs being 'there' but 'don't bother with it'.
Comment 4 Duncan 2021-08-13 19:59:06 UTC
(In reply to Linux_Chemist from comment #3)
> Thanks for your comment, Duncan!
> Yes, I'm on a customised kernel that has a lot removed (including debugfs as
> you can tell) and also amdgpu (RX 5700).

So amdgpu too, but a lot newer than my rx460.
 
> There's usually a bug in a testing RC every few releases, I just report them
> here after bisecting; seems the right place for it even if it's not lol  

That's what I had done previously.  But it looks like the kernel bugzilla folks updated the kernel-specific instructions recently, and I decided to try to follow them.  Not that it seems to have done a lot of good.  At least in bugzilla it's easier for other users (not on whatever list) to see it, which makes a difference if it's something that can't be fixed immediately, due either to no response (as here) or to complications like bisect difficulty or inability to directly revert due to commit dependencies, both of which complicated the last amdgpu bug I filed (against 5.7, bug #207383).

> (https://www.phoronix.com/scan.php?page=news_item&px=Linux-Always-Reserve-
> 1MB)

I remember reading about that and agreeing with the 1MiB reserved idea in general, tho the bug triggering the proposal didn't directly affect me.

> 
> (I wasn't sure to file this one under an AMD ("non-intel") specific 'video'
> bug but the commit was for 'drivers/gpu/drm/ttm' which I assume is agnostic.
> I don't know what it's for or whether only amdgpu/radeon makes use of it to
> say but it is interesting that the 3 of us have similar hardware.)

All three of us users seemed to consider it generic drm, as I know ttm in general is, but you're correct, the fact that we're all on some form of radeon looks like it could be more than coincidence.  The on-list reporter and I, was interesting, but now it's three, and getting more than interesting.  

FWIW the #207383 bug I mentioned bisected to a commit in Andrew Morton's mm tree, but only affected amdgpu (and radeon? IDR) because of something only amdgpu was doing.  I filed that against amdgpu but only because I hadn't bisected yet when I first filed it, and had no clue it was going to bisect to an mm tree commit.  And *OH* *MAN* was that thing hard to bisect, in part because I could count on my bisect-bad results but could never be entirely sure about bisect-good.  By the time I finally finished the bisect, a number of others were CCed as they were seeing the bug too, some reproducing a lot more reliably than I was, but no one was really helping with the bisecting!  OTOH, all the others /were/ an encouragement to keep going, a good thing really, because as hard as that thing was to bisect, if it hadn't been for them I'd have been *seriously* tempted to simply buy a new graphics card and be done with it!  I seriously hope I never have any any bugs so difficult to bisect ever again!

> I can confirm all my .configs have had CONFIG_VGA_CONSOLE=y in it

Hmm.  I really suspected that was the problem.  I guess not.  And amdgpu as well.

New theory would be that the behavior's different on your much newer hardware.  Not that it's likely to help with the bug but I'd be interested in whether it works with /only/ vgacon, no drm configured at all.  Of course that'd be a CLI-only test, booting to text login, but if you've never tried it on that hardware it might be useful to know if vgacon works at all for you.

(Similarly here, the last time I had really tested vgacon was before the switch from ums to kms.  I knew it was /supposed/ to be a fallback, but for the last year or so the question had been nagging at me as to whether the fallback would actually work if the graphics card failed and I had to switch, since I'm doing a monolithic kernel with only the specific firmware for this card builtin, so if the vgacon fallback didn't work I'd be in trouble as in that case I couldn't get a text login to configure the new drivers and firmware and rebuild.  So this failure was actually a relief for me, as it demonstrated that the fallback /does/ still work, should I ever need it.)

> But anyways as you say, the ability to create a bootable kernel only becomes
> an issue from the commit in question when not having CONFIG_DEBUG_FS=y (and
> CONFIG_DEBUG_FS_ALLOW_ALL=y along with that)

I think you said that.  I was at least getting the fallback, and I guess like you before this bug, I didn't even have a clue that the three secondary choices once CONFIG_DEBUG_FS was enabled were there. Reading it here was new to me!  =:^)

> Don't get me wrong, it's not a showstopper 'massive bug' because you can
> always put debugfs + 'allow all' into your kernel, I did so and am happily
> on rc5 now

I, and I think the guy who reported it to the list, reverted the commit in question, instead.  Here, I do that by doing a git show --reverse redirected to a patch-file, then drop that patch-file in a particular directory where it gets auto-applied to further kernel updates by my kernel management scripts.  (FWIW being on gentoo, auto-applying patches to package updates works similarly, and I just used the same idea for my kernel scripts.)

If the bug goes long-term and the patch ultimately quits applying due to further updates, /that's/ when I'll switch to the debugfs option.  But while I'm beginning to suspect they may not get it fixed for 5.14 release, I suspect if it hits release and then stable they'll have enough bug reports to raise the profile, and we'll have a fix for 5.15.

> but that's why I'd like a consensus to be known or shared (i.e.
> change the wording for the kconfig options) about whether a lot of things
> are expecting debugfs to be there in some form now - is it now an
> 'essential' part of the kernel? Or should things that rely on it fail
> gracefully if they don't find it?

I believe debugfs is still clearly, at least modulo bugs, optional, and we'll eventually have a fix for this bug.  Even were someone to argue that had changed, I expect such an argument would generate a big enough discussion that it'd hit the usual Linux news sites, LWN, phoronix, lxer, etc.  Since kconfig still says "N" if not sure and I've seen no such discussion, that would appear not to have happened yet, and even if it were to happen, I'd expect the new policy to apply to some later kernel, meaning they'd still need a fix for the current bug.

> It is a bit silly that even CONFIG_DEBUG_FS_ALLOW_NONE wouldn't work for
> this bug because that seems like it should be providing that 'fail
> gracefully' mechanism to debugfs being 'there' but 'don't bother with it'.

To me, that's the more demonstration that it's truly a bug, for all the reasons you say.
Comment 5 Linux_Chemist 2021-08-22 22:10:07 UTC
I can sense you're a smart cookie, Duncan, I've enjoyed this little tete a tete. 

I think this bug has been addressed, it's just not been mentioned yet (see the following into mainline):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/gpu?id=958f44255058338f4b370d8e4100e1e7d72db0cc

" This changes it so that if creation of TTM's debugfs root directory fails, then no biggie: keep calm and carry on."

Will test it out as soon as I can and comment/adjust the bug report.
Comment 6 Linux_Chemist 2021-08-22 22:19:47 UTC
All good! CONFIG_DEBUG_FS is not set and can boot again :) Marking as fixed.

Note You need to log in before you can comment on or make changes to this bug.