Bug 207635 - EXT4-fs error (device sda3): ext4_lookup:1701: inode #...: comm find: casefold flag without casefold feature; EXT4-fs (sda3): Remounting filesystem read-only
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: ext4
Hardware: x86-64 Linux
Importance: P1 blocking
Assignee: fs_ext4@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-05-09 09:07 UTC by Joerg M. Sigle
Modified: 2020-05-12 00:55 UTC

See Also:
Kernel Version: 5.4.20, 5.5.10, 5.5.11, 5.6.11
Subsystem:
Regression: No
Bisected commit-id:



Description Joerg M. Sigle 2020-05-09 09:07:16 UTC
Hello.

Since upgrading to Linux 5.5.11 on a Devuan system, a previously completely stable machine has repeatedly remounted its root file system read-only, effectively requiring a reboot and fsck.

This problem appeared at about the same time when I read about the casefold feature being added; I did not willingly activate that feature nor do I want to use it. I suspect however, that the new feature might have come with a new bug.

This is an ext2 filesystem using the ext4 kernel subsystem.

The error occurs maybe once every few days; i.e. I'm practically unable to reproduce it willingly, but it's often enough to recognize it as a truly existing problem. It's definitely a show-stopper, but I don't know yet whether it causes data loss as well.

Going back to the previously used Linux 5.3.15 quite probably removed the problem. And going forward to 5.5.11 made the problem re-appear.

These are self-compiled kernels; quite possibly my .config is not the best it could be. I can provide the .config files for both the 5.3.15 and the 5.5.11 kernels I have used.

$ cat /proc/version
Linux version 5.5.11-i7 (root@think3) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PREEMPT Wed Mar 25 04:44:37 CET 2020

# dmesg | tail
EXT4-fs error (device sda3): ext4_lookup:1701: inode #4253945: comm find: casefold flag without casefold feature
EXT4-fs (sda3): Remounting filesystem read-only

This was after startup, a little bit of web browsing, and getting and reading my e-mail.

And upon restart:

/dev/sda3 contains a file system with errors, check forced.
/dev/sda3: Deleted inode 14729236 has zero dtime.  FIXED.
/dev/sda3: Deleted inode 14729243 has zero dtime.  FIXED.
/dev/sda3: Deleted inode 14738534 has zero dtime.  FIXED.
/dev/sda3: Deleted inode 14738583 has zero dtime.  FIXED.
/dev/sda3: Deleted inode 14738843 has zero dtime.  FIXED.

Thanks a lot for looking into this; I can provide additional info if you tell me what you need.

Kind regards - j.
Comment 1 Joerg M. Sigle 2020-05-09 09:13:57 UTC
N.B.: Reviewing my kernel history, I found: The problem already happened with 5.5.10 and made me try 5.5.11 hoping it might have been fixed.
Comment 2 Joerg M. Sigle 2020-05-10 20:37:42 UTC
After upgrading to 5.6.11, the root filesystem once again became read-only.

I'm now going back to 5.3.15 which may have been the last unproblematic version I tried - just to verify once more that the problem doesn't occur with that older version.

N.B.: Googling "casefold flag without casefold feature" returned a patch - and most probably, the same problem for someone else in Kernel 5.4 vs. 5.3:

https://patchwork.ozlabs.org/project/linux-ext4/patch/20190903054324.20072-1-tytso@mit.edu/
> ext4: fix kernel oops caused by spurious casefold flag

https://forum.armbian.com/topic/13111-solved-5420-vs-539-arbian-bionic/
> 5.4.20: EXT4-fs error (device mmcblk0p1): ext4_lookup:1700: inode #4950: comm
> rsync: casefold flag without casefold feature
> 5.3.9: without error 


So the problem may have been addressed - but that was in 2019; I'm wondering how it could still be there in May 2020 kernels?
Comment 3 Eric Biggers 2020-05-10 22:06:23 UTC
It sounds like the problem is that one of the inodes on your filesystem was previously corrupted in such a way that it had unknown flags set.

Then one of these flags was assigned a meaning, causing things to break.

Running e2fsck v1.45.4 or later on the filesystem should fix this by clearing the casefold flag.  Can you try that?

I'm not sure there's anything else to do here, unless we were to make the kernel ignore unexpected flags.  Ted, have you considered that?  And is it intentional that e2fsck ignores unknown flags?
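
A minimal sketch of the condition described above, assuming the constant values from the kernel's fs/ext4/ext4.h as best recalled (verify them against your own tree). The helper name spurious_casefold is hypothetical; it only models the comparison between an inode's flags and the superblock's incompat feature mask and is not the kernel's code.

```python
# Constants believed to match fs/ext4/ext4.h; treat them as assumptions.
EXT4_ENCRYPT_FL = 0x00000800               # inode flag: encrypted file
EXT4_CASEFOLD_FL = 0x40000000              # inode flag: casefolded directory
EXT4_FEATURE_INCOMPAT_CASEFOLD = 0x20000   # superblock incompat feature bit


def spurious_casefold(i_flags: int, s_feature_incompat: int) -> bool:
    """Return True when an inode carries the casefold flag although the
    filesystem never enabled the casefold feature (hypothetical helper)."""
    has_flag = bool(i_flags & EXT4_CASEFOLD_FL)
    has_feature = bool(s_feature_incompat & EXT4_FEATURE_INCOMPAT_CASEFOLD)
    return has_flag and not has_feature
```

An inode with stray high bits set on a filesystem without the feature makes this return True, which corresponds to the "casefold flag without casefold feature" error in the report.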
Comment 4 Joerg M. Sigle 2020-05-11 16:24:49 UTC
Hi Eric - thanks for your quick & helpful response.

Your explanation sounds plausible; now I also understand why the problem was persistent:

The e2fsck of my distribution was 1.44. If it is too old for these problems, setting the fs to ro and marking it for an fsck on next reboot simply could not achieve anything.
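
The version mismatch described here can be checked mechanically. The sketch below is an illustration only: the function names are hypothetical, the 1.45.4 threshold is taken from comment 3, and a pre-release banner such as "1.46-WIP" is simply treated as 1.46.0.

```python
import re

# First e2fsck version said in this thread to clear the spurious flag.
MIN_FIXED = (1, 45, 4)


def parse_e2fsck_version(banner: str) -> tuple:
    """Extract the numeric version from a banner such as
    'e2fsck 1.46-WIP (20-Mar-2020)'; missing components count as 0."""
    match = re.search(r"e2fsck\s+(\d+(?:\.\d+)*)", banner)
    if match is None:
        raise ValueError(f"no e2fsck version in {banner!r}")
    parts = [int(p) for p in match.group(1).split(".")]
    parts += [0] * (3 - len(parts))  # pad '1.44' to (1, 44, 0)
    return tuple(parts[:3])


def can_clear_casefold(banner: str) -> bool:
    """True if this e2fsck is at least the version named in comment 3."""
    return parse_e2fsck_version(banner) >= MIN_FIXED
```

With the distribution's 1.44 this yields False, which matches the observation that the forced fsck on reboot could not repair the inode.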

My manual override of the default kernel version in grub.cfg (to 5.3.15) was also short-lived, because grub.cfg gets rewritten during various apt updates.
This is why the problem kept coming back even when I did not /knowingly/ switch back to a more recent kernel...

Now I got the current master e2fsck from github - thank you tytso.
This may hopefully have found and fixed the problem:

e2fsck 1.46-WIP (20-Mar-2020)
Pass 1: Checking inodes, blocks, and sizes
Inode 4244276 has encrypt flag but no encryption extended attribute.
Clear flag<y>? yes
Inode 4253945 has the casefold flag set but is not a directory. Clear flag<y>? yes
Inode 4253945 has encrypt flag but no encryption extended attribute.
Clear flag<y>? yes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/sda3: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda3: ***** REBOOT SYSTEM *****
...

(The last mentioned inode is the same that caused the error message in the beginning of this thread.)

Kind regards, Joerg
Comment 5 Joerg M. Sigle 2020-05-11 16:30:42 UTC
Eric, re. your other question:

>I'm not sure there's anything else to do here, unless we were to make the
>kernel ignore unexpected flags.
>Ted, have you considered that?  And it is intentional that e2fsck ignores
>unknown flags?

Please allow me some input on this:

Someone might use a kernel with casefold or encryption support on Monday - and even use these features, causing a few of these flags to be set.

The same person might run a kernel with casefold and/or encryption disabled on Tuesday. So, would it really be necessary to set their filesystem to ro - giving them a hard time, just because they like to use different kernels? I think not.

There are many reasons to use different kernels: System-Rescue CD; kernel building experiments etc.

So IMHO, a kernel that doesn't support a certain capability should not do *anything* with the bits used for that capability. It should make no assumptions about them, and at best not even look at them. Just leave them as they are.

At most, it might write a warning to /var/log/messages.

But it should not turn a working machine into a not working one for "reserved" bits being in a "surprising" state. There are other kernels out there, they might have some reason to set them as they are.

(Saying this, I assume *and hope!* that it is generally no problem to use an fs that has these flags set with a kernel not supporting them - apart from the missing extra functionality.)

This is just my naive opinion; I'm writing it however as someone who sees more and more complexity and unforeseen dependencies with bad side effects added to all areas of computing - often by people that were just a little bit too caring, or made too narrow assumptions on other people's usage scenarios.

Thank you very much again, and kind regards, Joerg
Comment 6 Christian Kujau 2020-05-11 16:59:00 UTC
(In reply to Joerg M. Sigle from comment #5)
> The same person might run a kernel with casefold and/or encryption disabled
> on Tuesday. So, would it really be necessary to set their filesystem to ro -

I don't know about the inner workings of this particular issue here, but turning the filesystem to RO when errors are encountered - isn't this what the "error-behavior" flag controls?

$ tune2fs -l /dev/sda1 | grep ^Err
Errors behavior:          Remount read-only

Setting this to "continue" or "panic" should have avoided that whole "RO" situation, I guess.
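
As a small illustration of reading that setting programmatically, the sketch below pulls the "Errors behavior" field out of text in the format printed by tune2fs -l. The function name is hypothetical and the input is assumed to be the listing already captured as a string.

```python
def errors_behavior(tune2fs_listing: str):
    """Return the value of the 'Errors behavior' field from `tune2fs -l`
    style output, or None if the field is absent (hypothetical helper)."""
    for line in tune2fs_listing.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Errors behavior":
            return value.strip()
    return None
```

The setting itself can be changed with tune2fs -e (continue, remount-ro or panic) or overridden per mount with the errors= mount option; continuing on errors leaves a possibly damaged filesystem writable, so that choice is a trade-off rather than a fix.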
Comment 7 Eric Biggers 2020-05-11 18:26:24 UTC
> Someone might use a kernel with casefold or encryption support on Monday -
> and even use these features, causing a few of these flags to be set.
>
> The same person might run a kernel with casefold and/or encryption disabled
> on Tuesday. So, would it really be necessary to set their filesystem to ro -
> giving them a hard time, just because they like to use different kernels? I
> think not.

Casefold is an 'incompat' feature, because it changes the directory format.  So if someone enables it (on-disk, which is different from merely using a kernel that supports it), then old kernels can't use the filesystem at all.  That's working as intended.

But that's *not* actually what this issue is about.  This issue is about how the kernel treats inodes that got corrupted to have the casefold flag set when the user didn't actually enable the casefold feature.  The ext4 feature flags have clear behavior for how unexpected flags are handled, but the inode flags don't.
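
The "clear behavior" for feature flags can be sketched as follows, assuming the usual ext2/3/4 convention: unknown incompat bits make the kernel refuse the mount, unknown ro_compat bits allow only a read-only mount, and unknown compat bits are ignored. The function name and return strings are hypothetical; inode flags have no equivalent rule, which is the gap described above.

```python
def mount_decision(ro_compat: int, incompat: int,
                   supported_ro_compat: int, supported_incompat: int) -> str:
    """Model how a kernel reacts to on-disk feature bits it does not know.
    Compat bits never block a mount, so they are not consulted here."""
    if incompat & ~supported_incompat:
        return "refuse"        # unknown incompat feature: cannot mount at all
    if ro_compat & ~supported_ro_compat:
        return "read-only"     # unknown ro_compat feature: read-only mount only
    return "read-write"
```

Under this model, a filesystem with casefold enabled on disk (an incompat bit) is refused outright by a kernel that does not know the feature, rather than being mounted and later remounted read-only.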
Comment 8 Joerg M. Sigle 2020-05-12 00:55:04 UTC
Thank you all for your explanations, again.

> Casefold is an 'incompat' feature, because it changes the directory format.
> So if someone enables it (on-disk [...]), old kernels can't use the
> filesystem at all.

Might this not warrant a warning in the ext4(5) manpage?
E.g. similar to what is there for the "64bit" feature:

"Note that some older kernels and older versions of e2fsprogs
 will not support file systems with this ext4 feature enabled."

The current text doesn't reveal this:

"[casefold] is name-preserving on the disk, but it
 allows applications to lookup for a file in the file system
 using an encoding equivalent version of the file name."

The required minimum versions of e2fsck, tune2fs etc. might also be mentioned.


> This issue is about how the kernel treats inodes that got corrupted
> to have the casefold flag set when the user didn't actually enable the
> casefold feature. The ext4 feature flags have clear behavior for how
> unexpected flags are handled, but the inode flags don't.

I use ext2 as least common denominator for multiple OS. I use conservative settings. And still "got caught" by a new (!) disabled (!) opt-in (!) feature,
with hits apparently coming out-of-the-blue, and left without (much of) a clue :-)

So IMHO a system not expected to use certain bits should rather not turn the fs to r/o because of them. Or, at least, indicate the minimum e2fsck version needed for fixing.

These bits could, however, still be checked very systematically: by letting tune2fs recommend (or even call?) e2fsck whenever a user is about to enable casefold support on a preexisting fs. Only then do these bits become really meaningful, the admin is ready, and e2fsck is guaranteed to be sufficiently new.

Anytime afterwards, once that feature *is* enabled - any form of checking and reaction by the kernel would be perfectly fine and no bad surprise anymore.

Equally ok: Revelation of the problem by regular scheduled e2fsck runs, once e2fsck has been sufficiently updated over time to recognize and handle the problem.

Thanks again for your consideration & regards! Joerg
