Bug 196405
Summary:            mkdir mishandles st_nlink in ext4 directory with 64997 subdirectories
Product:            File System
Component:          ext4
Status:             NEW
Severity:           normal
Priority:           P1
Hardware:           All
OS:                 Linux
Kernel Version:     4.13-rc1
Regression:         No
Reporter:           Paul Eggert (eggert)
Assignee:           fs_ext4 (fs_ext4)
CC:                 adilger.kernelbugzilla, michelbach94, tytso
Subsystem:
Bisected commit-id:
Attachments:        sample program illustrating how ext4 st_nlink behavior breaks glibc
Description
Paul Eggert
2017-07-17 21:23:44 UTC
On Mon, Jul 17, 2017 at 09:23:44PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> See the following shell transcript, run in an ext4 directory, for an
> illustration of the bug. The first four commands set things up. The 5th
> command 'mkdir d/64998' calls 'mkdir("d/64998", 0777)' and succeeds. The
> 6th command shows that d's resulting st_nlink value is 1, which is
> incorrect; it should be 65000. I discovered this bug in Fedora 26, which
> runs Linux 4.11.10; from inspecting the source it appears the bug is
> still there in 4.13-rc1.

This is actually working as intended. In order to support a large number of directories, when we have a 16-bit link count field, we use a link count of 1 to mean "lots". If you don't want this behavior because strict POSIX compliance is preferable to failing with ENOSPC when you hit that overflow condition, you can create the file system with the dir_nlink feature turned off.

I will note that various userspace utilities that try to optimize directory walking by depending on the directory's link count to understand when they have found all of the subdirectories know that a directory link count of 1 means "lots". This is why a link count of 1 was chosen.

- Ted

Thank you, Ted. Is it necessary for it to "break" (quotes because 1 seems to be a kind-of wildcard) at such a low and weird value?

This behavior doesn't occur with ext4 ram disks. Run the same command, possibly with an even higher number, but within an ext4 ram disk, created via

$ sudo mount -t ramfs -o size=4G ext4 <path>

Example output:

$ ll
total 4
drwxr-xr-x      3 christoph root         0 Jul 18 22:59 ./
drwxrwxrwt     18 root      root      4096 Jul 18 23:00 ../
drwxrwxr-x 446639 christoph christoph    0 Jul 18 23:05 d/

> Is it necessary for it to "break" (quotes because 1 seems to be a kind-of
> wildcard) at such a low and weird value?

The reason for choosing 1 is that there were other file systems, without a '..' entry, where the link count of a directory is 1. In fact, I just checked; I can't find anything in POSIX or the Single Unix Specification which guarantees that '.' and '..' are hard links. It is that way historically, but there are file systems such as VFAT where this is not true.

That's important because there are programs which, if they are doing a directory tree walk and see a link count of 42, will assume that once they have seen 42 directories there are no more subdirectories at that level. I suppose we could have used a link count of 0xFFFFFFFF as the magic "lots" value, but that could potentially trigger overflow bugs in programs. And most of these programs already understand that 1 means "lots", so they can do the right thing when you run "find" on a mounted thumb drive using VFAT.

> This behavior doesn't occur with ext4 ram disks. Run the same command,
> possibly with an even higher number, but within an ext4 ram disk,
> created via
>
> $ sudo mount -t ramfs -o size=4G ext4 <path>

That's not an ext4 ram disk. That's a tmpfs partition (ramfs is an alias for tmpfs). "ext4" in the above invocation can be anything, and it has no meaning. Just as "mount -t proc proc /proc" is equivalent to "mount -t proc xyzzy /proc", the above can also be:

sudo mount -t ramfs -o size=4G xyzzy <path>

and it will be the same as "sudo mount -t ramfs -o size=4G ext4 <path>".

- Ted

The limit is implemented as < 65000, which was somewhat below the hard limit of the ext4_inode.i_links_count __u16 field of 65535.
There is no hard reason for 65000 except to allow some margin for detecting overflow, and reserved values if there was a need. I don't think it makes a big practical difference whether directory link counts are accurate up to 65000 or 65534 subdirectories, since very few systems have this many subdirectories. In theory, we could add an i_links_count_hi field to extend this to a 32-bit value, but it isn't clear whether there is a big benefit from doing this.

FYI, the "link count == 1" meaning is historical for ancient filesystems which didn't track the link count of the directory at all. In that respect, it didn't mean "lots of subdirectories", but rather "I don't know the subdirectory count at all".

(In reply to Andreas Dilger from comment #5)
> the "link count == 1" meaning is historical for ancient filesystems ...
> it didn't mean "lots of subdirectories", but rather "I don't know the
> subdirectory count at all".

This meaning is still true for ext4. For example:

$ mkdir d d/{1..64998}
$ rm -fr d
$ mkdir d
$ ls -ld d
drwxr-xr-x 2 eggert eggert 4096 Jul 18 15:14 d
$ mkdir d/{1..64998}
$ rmdir d/*
$ ls -ld d
drwxr-xr-x 1 eggert eggert 1441792 Jul 18 15:14 d

That last link count of 1 means "I don't know the subdirectory count", even though d has no subdirectories whatsoever.

Created attachment 257593 [details]
sample program illustrating how ext4 st_nlink behavior breaks glibc
Run this program in an empty ext4 directory to illustrate the problem: fts will misbehave and will not find the two files "needle" in the large-directory haystack.
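The attachment itself is not reproduced in this thread. A minimal sketch of a test along the lines described (the directory layout, counts, and fts flags below are illustrative assumptions, not copied from the attachment) might look like the following, using glibc's fts interface:

/* fts-nlink-demo.c: a sketch of a reproduction along the lines of the
 * attached fts-test.c (names, counts, and flags here are illustrative,
 * not copied from the attachment).
 * Build: cc -o fts-nlink-demo fts-nlink-demo.c
 * Run it in an empty directory on the file system under test.
 */
#define _DEFAULT_SOURCE 1
#include <sys/types.h>
#include <sys/stat.h>
#include <fts.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NSUBDIRS 64998          /* enough to overflow the ext4 link count */

static void make_file(const char *path)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(1); }
    fclose(f);
}

int main(void)
{
    char name[64];

    if (mkdir("d", 0777) != 0) { perror("mkdir d"); return 1; }
    for (int i = 1; i <= NSUBDIRS; i++) {
        snprintf(name, sizeof name, "d/%d", i);
        if (mkdir(name, 0777) != 0) { perror(name); return 1; }
    }
    /* Hide two "needle" files inside two of the subdirectories. */
    make_file("d/1/needle");
    make_file("d/64998/needle");

    /* Walk the tree with a physical walk and let fts skip stat() calls
     * where it believes it can; a walker that trusts the parent's link
     * count can fail to find entries when st_nlink is 1. */
    char *paths[] = { "d", NULL };
    FTS *fts = fts_open(paths, FTS_PHYSICAL | FTS_NOSTAT, NULL);
    if (!fts) { perror("fts_open"); return 1; }

    int needles = 0;
    for (FTSENT *e; (e = fts_read(fts)) != NULL; )
        if (strcmp(e->fts_name, "needle") == 0)
            needles++;
    fts_close(fts);

    printf("%d needles found (should be 2)\n", needles);
    return needles == 2 ? 0 : 1;
}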
(In reply to Theodore Tso from comment #1)
> In order to support a large number of directories, when we have a 16-bit
> link count field, we use a link count of 1 to mean "lots".

Where is this behavior documented? I don't see it mentioned in e2fsprogs/doc/libext2fs.texinfo. Where should I look?

> various userspace utilities that try to optimize directory walking by
> depending on the directory's link count to understand when they have
> found all of the subdirectories know that directory link count of 1
> means "lots".

Although I've been contributing to the GNU core utilities for many years, this behavior is news to me. I just checked the GNU coreutils source, and these utilities do not know that 1 means "lots". Although the code happens to work, it is pure luck. GNU findutils is similar.

The GNU C library's fts functions can misbehave on file systems where st_nlink undercounts the number of subdirectories. To reproduce the problem, run the program fts-test.c (attached to this bug report) in a fresh directory. On an ext4 file system, the program outputs "0 needles found (should be 2)" and fails, due to the incompatibility between ext4 st_nlink behavior and what the GNU C library expects.

I took a look at the find code at http://git.savannah.gnu.org/cgit/findutils.git and while the "ftsfind.c" code does not check nlinks directly (presumably that is done in fts), the "oldfind.c::process_dir()" code has a check for st_nlink < 2 to disable the "subdirs" heuristic.

(In reply to Andreas Dilger from comment #9)
> while the "ftsfind.c" code does not check nlinks directly (presumably that
> is done in fts), the "oldfind.c::process_dir()" code has a check for
> st_nlink < 2

Sure, but (as its name suggests) oldfind.c is obsolete and is no longer used. The mainline 'find' code does not have a special case for st_nlink < 2, and works only by accident.

On Wed, Jul 19, 2017 at 08:02:25AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=196405
>
> --- Comment #10 from Paul Eggert (eggert@cs.ucla.edu) ---
> (In reply to Andreas Dilger from comment #9)
> > while the "ftsfind.c" code does not check nlinks directly (presumably
> > that is done in fts), the "oldfind.c::process_dir()" code has a check
> > for st_nlink < 2
>
> Sure, but (as its name suggests) oldfind.c is obsolete and is no longer
> used. The mainline 'find' code does not have a special case for
> st_nlink < 2, and works only by accident.

Well, ext4's dir_nlink feature has been around for the last *decade*:

commit f8628a14a27eb4512a1ede43de1d9db4d9f92bc3
Author: Andreas Dilger <adilger@clusterfs.com>
Date:   Wed Jul 18 08:38:01 2007 -0400

    ext4: Remove 65000 subdirectory limit

    This patch adds support to ext4 for allowing more than 65000
    subdirectories. Currently the maximum number of subdirectories is
    capped at 32000.

    If we exceed 65000 subdirectories in an htree directory it sets the
    inode link count to 1 and no longer counts subdirectories. The
    directory link count is not actually used when determining if a
    directory is empty, as that only counts subdirectories and not
    regular files that might be in there.....

.... and it is based on behavior that has been around for some legacy file systems for much longer than that. If the mainline "find" code has regressed, I suggest you file a bug with the findutils/glibc folks, as that would appear to be their bug, not ext4's.

As I've said, I've looked at the Posix and SUS specs, and '.' and '..'
are specified to be "special filenames" that have to be honored when resolving pathnames. There is no requirement that they have to be implemented as hard links, and therefore there is no guarantee that st_nlink can be used as a hint for the number of subdirectories. And that's good, because there are file systems that don't have hard links at all (NTFS, for example; and there have been versions of Windows that have gotten Posix certification).

- Ted

(In reply to Theodore Tso from comment #1)
> If you don't want this behavior because strict POSIX compliance is
> preferable to failing with ENOSPC when you hit that overflow condition,
> you can create the file system with the dir_nlink feature turned off.

POSIX says that mkdir should fail with EMLINK if the link count in the parent directory would exceed LINK_MAX. If the POSIX-compliant mode is setting errno==ENOSPC then that sounds like a bug; it should fail with errno==EMLINK.

Although the documentation for tune2fs says that dir_nlink can be cleared in an existing ext4 filesystem, this did not work for me on Fedora 26 x86-64. I unmounted the filesystem, ran "tune2fs -O ^dir_nlink /dev/sda2", remounted the filesystem, and verified that dir_nlink was turned off by running "dumpe2fs -h /dev/sda2". The fts-test.c program attached to this bug report still failed in the same way. Either this is a filesystem bug that should be fixed, or the documentation should be changed so that it does not incorrectly say that tune2fs can usefully change the dir_nlink option.

I then created a new ext4 file system with dir_nlink disabled from birth. Here, the test program failed in mkdir("d/32757", 0777) with ENOSPC. Yet there was plenty of space in the file system, and the link count was only 32758, which is way lower than LINK_MAX (which is 65000). Presumably the ext4 code arbitrarily limits the link count to 32758 (i.e., to 2**15 - 10) when dir_nlink is clear. This sounds like a bug; the limit should be 65000.

One more thing. In the typical case where dir_nlink is set, there is an off-by-one error: LINK_MAX is 65000, and in fts-test.c the parent's link count would reach only 65000, so this should work. That is, the link count should wrap around to 1 when it would otherwise reach 65001, not when it would otherwise reach 65000.

(In reply to Theodore Tso from comment #11)
> If the mainline "find" code has regressed

It hasn't regressed; it still works as long as the directory has fewer than 2**32 subdirectories (assuming x86), and as long as the compiler generates code compatible with -fwrapv semantics, and this is good enough in practice. It's still a matter of luck that findutils works, though. And glibc itself does not work, as shown in the fts-test.c program attached to this bug report.

> I've looked at the Posix and SUS specs, and '.' and '..' are specified to
> be "special filenames" that have to be honored when resolving pathnames.
> There is no requirement that they have to be implemented as hard links

Yes, of course. However, 'find', etc. have optimizations for GNU/Linux, e.g., code like this:

#if defined __linux__ && HAVE_FSTATFS && HAVE_STRUCT_STATFS_F_TYPE
  [special code that runs only on GNU/Linux platforms, and that
   significantly improves performance on those platforms]
#else
  [generic code that runs on any POSIX platform, albeit more slowly]
#endif

and the GNU/Linux-specific code is broken on ext4 because the ext4 st_nlink is not a link count.
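For readers unfamiliar with the optimization under discussion, here is a rough sketch of the traditional trick (illustrative only; this is not the findutils, gnulib, or glibc code): on classic Unix file systems a directory's link count is 2 plus its number of subdirectories, so a tree walk can stop calling stat() once that many subdirectories have been found. The comment in the middle marks where an ext4 link count of 1 invalidates the arithmetic.

/* Sketch of the classic st_nlink tree-walk optimization (illustrative;
 * not the findutils, gnulib, or glibc code).  On traditional Unix file
 * systems a directory's link count is 2 + the number of subdirectories,
 * so once that many subdirectories have been found, every remaining
 * entry can be treated as a non-directory without calling stat(). */
#define _DEFAULT_SOURCE 1
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static void walk(const char *dir)
{
    struct stat st;
    if (lstat(dir, &st) != 0 || !S_ISDIR(st.st_mode))
        return;

    /* With ext4's dir_nlink behavior st_nlink may be 1, so this unsigned
     * subtraction wraps to a huge value and the walk silently loses the
     * optimization.  Code that does the arithmetic in a narrower or
     * signed type, or that otherwise trusts a count of 1, can instead
     * conclude there are no subdirectories and skip them entirely,
     * which is the failure mode described in this bug. */
    uintmax_t subdirs_left = (uintmax_t) st.st_nlink - 2;

    DIR *dp = opendir(dir);
    if (!dp)
        return;

    struct dirent *de;
    while ((de = readdir(dp)) != NULL) {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;

        char path[4096];
        snprintf(path, sizeof path, "%s/%s", dir, de->d_name);

        if (subdirs_left == 0) {
            /* The optimization: everything left is assumed non-directory. */
            printf("f %s\n", path);
            continue;
        }
        if (lstat(path, &st) == 0 && S_ISDIR(st.st_mode)) {
            subdirs_left--;
            walk(path);
        } else {
            printf("f %s\n", path);
        }
    }
    closedir(dp);
}

int main(int argc, char **argv)
{
    walk(argc > 1 ? argv[1] : ".");
    return 0;
}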
Obviously we could fix the problem by using the generic code on GNU/Linux too; but this would hurt GNU/Linux performance significantly in some cases.

> there are file systems that don't have hard links at all (NTFS, for
> example; and there have been versions of Windows that have gotten Posix
> certification).

The code in question already deals with both these issues, by avoiding st_nlink-based optimizations on NTFS and other filesystems where st_nlink is unreliable. The ext4 problem, though, is new to me, and evidently to everyone else who's maintained the glibc etc. code, and this is why glibc is currently broken on ext4 directories with 64998 or more subdirectories.

How about this idea for moving forward?

1. Clearly document that setting dir_nlink can break user-mode code, such as glibc's fts functions.
2. Fix the four ext4 bugs that I mentioned in Comment 12.
3. For GNU utilities, override glibc's fts functions to work around the bugs when they operate on ext4 filesystems.
4. File a glibc bug report for the bug exhibited in fts-test.c.
5. Disable dir_nlink in new ext4 filesystems, unless it is specifically requested.

The combination of these changes should fix the problem in the long run. I can volunteer to do (3) and (4). Can you do (1), (2), and (5)?

One unfortunate situation is that "getconf LINK_MAX" is invented by glibc based on the fstype reported by statfs() and not actually extracted from the kernel. That makes LINK_MAX accurate only in a subset of cases, depending on the version of ext2/ext3/ext4 in use and the filesystem features enabled, and it definitely isn't reporting values from the filesystem on a mount-by-mount basis. I ran into this problem in the past when running the POSIX test suite for Lustre, and consider LINK_MAX to be the minimum number of subdirectories that can be created in a directory, rather than the maximum.

Checking the ext4 code, it is returning -EMLINK for ext4_link() on regular files and for ext4_mkdir() for subdirectories, and has been since before it was forked from ext3. I'm not sure where your ENOSPC is coming from. I found an old RHEL6 system that didn't have dir_nlink set to test this, but the feature was enabled once the directory passed ~65000 subdirs, so I didn't get an error. I did try testing on a small newly created ext4 filesystem with 1024-byte blocks (in case the limit was with the 2-level htree), and hit ENOSPC because I ran out of inodes... 32757 has never been a directory limit imposed by ext2/3/4, so I suspect you hit a similar problem (11 inodes are reserved by ext* for internal use).

As for wrap-around at 65000 vs. 65001 links, I can agree that is a minor bug that could be fixed. Strangely, in continued testing on my old RHEL6 box (with a larger filesystem and dir_nlink disabled) I was able to successfully create 64998 subdirectories, and ls -l reported 65000 links on the parent directory, so it may be that the 64998 overflow is a problem that was added after the dir_nlink feature was first created.

The most important issue is that nlinks=1 on the directory causes fts() to miss entries during scanning. It doesn't make sense for it to take nlinks=1, subtract 2 links for "." and "..", and expect to find "-1" subdirectories. It may be that this causes an unsigned underflow, and tools like "find" will not stop scanning until they hit 2^32-1 entries or similar? At least in my tests "find" correctly found the "needle" files even if fts-test.c did not.
Also worthy of note: on my Mac (OSX 10.12.5, HFS+ Journaled fs), running fts-test.c with 65536 subdirectories has "ls -ld d" reporting 0 links, but fts-test.c still passes.

I don't think you need to disable the optimization for all of Linux. All you need to do is to disable the optimization if the link count on the directory is 1. A traditional Unix directory will always have a link count of 2, because if /mnt/dir is a directory, there will be one hard link for "/mnt/dir", and another hard link for "/mnt/dir/.". Hence it should be very simple for glibc to detect the case where the link count is 1 and realize that it shouldn't try to use the optimization.

There are other Linux file systems which use this same convention. For example, directories in /proc:

# stat /proc/sys/kernel/
  File: /proc/sys/kernel/
  Size: 0           Blocks: 0          IO Block: 1024   directory
Device: 4h/4d       Inode: 10298      Links: 1    ...
                                      ^^^^^^^^

The reason why I thought this was a regression in find is because you said that code which understood the n_links=1 convention was in the old find code? Regardless, this behavior has been around for decades. I suspect if I checked a Linux 0.99 kernel, it would show this behavior in procfs.

There are a few things which I think we are getting wrong.

First, the documentation is not quite right. It claims that the limit is 65,000 subdirectories, when in fact what dir_nlink does is to exempt the 65,000 maximum number of hard links limitation from applying to subdirectories in a directory.

Secondly, the ext4 code will silently set the dir_nlink feature flag if there is an attempt to create a subdirectory which exceeds EXT4_LINK_MAX and the directory is using directory indexing. There have been times in the past when ext4 would silently set feature flags, but I believe that's a bad thing to do. Back in 2007 it was apparently still tolerated, but I think we should change things such that if the dir_nlink feature is not enabled, the kernel should return an error if creating a subdirectory would violate EXT4_LINK_MAX, instead of automagically setting the feature flag.

Finally, allowing tune2fs to clear the dir_nlink flag is not a safe thing to do. We could allow it if tune2fs were to scan the whole file system making sure there are no directories with an i_links_count of 1. But it's easier to just disallow clearing the flag.

I disagree that we should disable dir_nlink in the future. Old find utilities apparently had old code that would do the right thing. The fact that it is not in ftw is unfortunate, but I will note that ftw will break for Linux's /proc/sys directories as well, and this behavior has been around for a long, Long, LONG time. The fact that glibc was mistaken in assuming the optimization was always safe for Linux is a glibc bug. I don't understand why you resist the suggestion of disabling the optimization iff st_nlink==1. That is a clearly safe thing to do.

As far as other programs which might make the same mistake glibc did: since Posix does not guarantee that '.' and '..' are implemented as hard links, having an st_nlink of 1 for directories is completely allowed by Posix (i.e., a Posix environment which does this is a conforming environment). Hence, a Strictly Conforming (or Strictly Portable) Posix application should not be making this assumption. The fact that we've gone ten years without anyone noticing or complaining is a pretty strong indicator to me that this isn't a serious portability problem.

In terms of checking the ext4 code, I think you're confused.
It's always done what I've described, although how it does the check is a bit confusing. See the following in ext4.h:

#define is_dx(dir) (ext4_has_feature_dir_index((dir)->i_sb) && \
		    ext4_test_inode_flag((dir), EXT4_INODE_INDEX))
#define EXT4_DIR_LINK_MAX(dir) (!is_dx(dir) && (dir)->i_nlink >= EXT4_LINK_MAX)

Then see the very beginning of ext4_mkdir() and ext4_inc_count() in fs/ext4/namei.c. I believe we should add a check for ext4_has_feature_dir_nlink(), as described above, but the behavior that ext4 has been exhibiting hasn't changed in a very long time. That's why you saw the behavior you did on your old RHEL6 system.

(In reply to Andreas Dilger from comment #14)
> I did try testing on a small newly created ext4 filesystem with 1024-byte
> blocks (in case the limit was with the 2-level htree), and hit ENOSPC
> because I ran out of inodes...

Yes, apparently that was my problem too. Thanks for catching that. I fixed that, and ran into another problem: disabling dir_nlink is ineffective, i.e., mkdir continues to set the parent directory's link count to 1 when it overflows. That is, if I run the following as root:

# fallocate -l 1G ~eggert/junk/image.iso
# mkfs.ext4 -O ^dir_nlink -N 110000 ~eggert/junk/image.iso
# mount ~eggert/junk/image.iso /mnt
# chmod a+rwx /mnt

and then run the test program in the /mnt directory, the test program still fails in the same way, creating a parent directory with st_nlink == 1 in the process. Afterwards, the file system's dir_nlink flag is set even though I did not set it. (Note added later: I see that Theodore Tso also noticed this problem.) So dir_nlink is not really working for ext4, in the sense that st_nlink cannot be made to work in a POSIX-compatible way.

> That makes LINK_MAX accurate only in a subset of cases, depending on the
> version of ext2/ext3/ext4 in use and filesystem features enabled, and it
> definitely isn't reporting values from the filesystem on a mount-by-mount
> basis.

Ouch, I didn't know that. This is another POSIX-compatibility problem, but one thing at a time....

> The most important issue is that nlinks=1 on the directory causes fts() to
> miss entries during scanning. It doesn't make sense for it to take
> nlinks=1, subtract 2 links for "." and "..", and expect to find "-1"
> subdirectories.

No, clearly the glibc code assumes GNU/Linux directories always have a link count of at least 2.

> It may be that this causes an unsigned underflow and tools like "find"
> will not stop scanning until they hit 2^32-1 entries or similar?

I think "find" is OK because it doesn't happen to hit this particular fts bug. I think there may well be similar fts bugs elsewhere, though -- possibly bugs that "find" could hit.

> Also worthy of note, on my Mac (OSX 10.12.5, HFS+ Journaled fs), running
> fts-test.c with 65536 subdirectories has "ls -ld d" reporting 0 links,
> but fts-test.c still passes.

Yes, macOS fts is different. It would not surprise me if it didn't have the bug we're talking about (also, it's probably significantly slower).

(In reply to Theodore Tso from comment #15)
> I don't think you need to disable the optimization for all of Linux. All
> you need to do is to disable the optimization if the link count on the
> directory is 1.

Yes, that makes sense, and I plan to do that: these are steps 3 and 4 in my Comment 13 for this bug. Unfortunately, there is a reasonable amount of code that assumes the traditional Unix behavior (not just in glibc), and I doubt whether I will be able to track it all down.
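A sketch of the kind of guard being discussed (the helper name is hypothetical, and this is not glibc's eventual patch): treat any directory link count below 2 as "unknown" and fall back to the unoptimized path for that directory.

#include <stdbool.h>
#include <sys/stat.h>

/* Hypothetical helper (not glibc's actual fix): decide whether a
 * directory's st_nlink can be trusted as "2 + number of subdirectories".
 * A link count below 2 (ext4 after dir_nlink overflow, /proc, VFAT, ...)
 * means the count is unknown, so the caller must skip the link-count
 * optimization for this directory and examine every entry instead. */
static bool dir_nlink_is_trustworthy(const struct stat *dir_st)
{
    return S_ISDIR(dir_st->st_mode) && dir_st->st_nlink >= 2;
}

In a walker like the earlier sketch, the subdirs_left arithmetic would be used only when this predicate returns true; otherwise every entry is examined.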
> The reason why I thought this was a regression in find is because you
> said that code which understood the n_links=1 convention was in the old
> find code?

Yes it was. The current 'find' code does not know about the convention. Although 'find' happens to work as a matter of luck for this particular test case, I have the sneaking suspicion that there are other test cases where it does not work. The assumption is used in multiple places in 'find' and I have not checked them all. Similarly for 'tar' and other GNU applications.

> allowing tune2fs to clear the dir_nlink flag is not a safe thing to do.

That depends on what the dir_nlink flag is supposed to mean. (Since the flag does not work now, we can define it to mean what we like. :-) If dir_nlink means "set a directory link count to 1 if it would overflow", and if a link count of 1 never changes regardless of what dir_nlink is set to, then why would it be a problem to allow tune2fs to alter the dir_nlink flag? dir_nlink would affect only future calls to mkdir, not past ones.

> ftw will break for Linux's /proc/sys directories as well

Yes. However, ftw is normally applied to user files, so it's significantly more important that ftw work there.

> As far as other programs who might make the same mistake glibc did, since
> Posix does not guarantee that

I'm worried about code intended to run on traditional Unix and GNU/Linux, not about portable POSIX code. There is a reasonable amount of code that uses st_nlink as a way to avoid unnecessary stat calls when traversing a file system. This provides a significant performance boost on traditional Unix and GNU/Linux, and it would be a shame to lose this performance benefit.

> The fact that we've gone ten years without anyone noticing or complaining

More accurately, we've gone ten years before people connected the dots. This time, the original bug report was about 'ls'. This isn't a bug in 'ls', so it got turned into a bug report for 'lstat'. But this isn't about lstat either, so it got turned into a bug report for ext4. I'm sure other people have noticed the problem before; it's just that few people are dogged and expert enough to track the bug down to the actual cause.

From my point of view, the worst thing about all this is that the dir_nlink feature is misdocumented and does not work as intended (i.e., it's a flag that in effect cannot be turned off). Either dir_nlink needs to be documented and fixed; or failing that, the dir_nlink flag should be withdrawn and the ext4 documentation should clearly say that the link count of a directory is permanently set to 1 after it overflows past 64999. If you take the latter approach, you needn't update the ext4 code at all, just the documentation (though the documentation should note that 64999 is off-by-one compared to the 65000 that is nominally the maximum link count).

On Fri, Jul 21, 2017 at 08:22:05AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> I'm worried about code intended to run on traditional Unix and GNU/Linux,
> not about portable POSIX code. There is a reasonable amount of code that
> uses st_nlink as a way to avoid unnecessary stat calls when traversing a
> file system. This provides a significant performance boost on traditional
> Unix and GNU/Linux, and it would be a shame to lose this performance
> benefit.

One of the things which confuses me is why you think there's so much code which tries to use the st_nlink hack. It's ***much*** simpler to just rely on d_type if it exists (and it does on most systems).
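As a concrete illustration of the d_type approach (a sketch under the assumptions discussed here, not code from any particular tool): readdir() fills in d_type on most Linux file systems, so a recursive walk can recurse into DT_DIR entries directly and fall back to lstat() only for entries reported as DT_UNKNOWN.

/* Recursive descent using d_type instead of the st_nlink hack (a sketch;
 * not code from any of the tools discussed).  DT_UNKNOWN means the file
 * system did not fill in the entry type, so fall back to lstat() for
 * that entry only. */
#define _DEFAULT_SOURCE 1       /* for d_type / DT_* on glibc */
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

static void walk(const char *dir)
{
    DIR *dp = opendir(dir);
    if (!dp)
        return;

    struct dirent *de;
    while ((de = readdir(dp)) != NULL) {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;

        char path[4096];
        snprintf(path, sizeof path, "%s/%s", dir, de->d_name);

        int is_dir;
        if (de->d_type != DT_UNKNOWN) {
            is_dir = (de->d_type == DT_DIR);    /* no stat() needed */
        } else {
            struct stat st;                     /* fs without d_type support */
            is_dir = (lstat(path, &st) == 0 && S_ISDIR(st.st_mode));
        }

        if (is_dir)
            walk(path);
        else
            printf("%s\n", path);
    }
    closedir(dp);
}

int main(int argc, char **argv)
{
    walk(argc > 1 ? argv[1] : ".");
    return 0;
}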
If the answer is that d_type might not be supported on all file systems, and a recursive descent might enter a file system which didn't support d_type:

1) The assumption that st_nlink always has the property that it is >2 and can be used to derive the number of subdirectories was never valid across all file system types, so you could descend into a file system type where that wasn't true.

2) If you did descend into a file system which didn't support d_type, d_type would be DT_UNKNOWN instead of DT_REG or DT_DIR.

3) Using DT_DIR means you can skip the stat check for all directory entries. If you are doing a recursive descent where you care about the name, you need to call readdir() on all of the directory entries anyway, so you will have access to d_type. If you are doing a recursive descent where you are checking on file ownership, you are doing the stat(2) anyway, so why not check S_ISDIR(st.st_mode) instead of blindly using the st_nlink hack?

4) Using d_type/DT_DIR is implemented in anything that was BSD derived, and many of the SysV derived systems (to the extent that they imported the BSD Fast Filesystem) would have also had d_type support. So if your argument is "what about legacy Unix code", I would think that most of it would have used the much more performant and simpler-to-use d_type interface.

> > allowing tune2fs to clear the dir_nlink flag is not a safe thing to do.
>
> That depends on what the dir_nlink flag is supposed to mean. (Since the
> flag does not work now, we can define it to mean what we like. :-) If
> dir_nlink means "set a directory link count to 1 if it would overflow",
> and if a link count of 1 never changes regardless of what dir_nlink is
> set to, then why would it be a problem to allow tune2fs to alter the
> dir_nlink flag? dir_nlink would affect only future calls to mkdir, not
> past ones.

Well, it's mostly safe now because ten years have passed and even the most pathetically obsolete Enterprise Distro installation has updated to something more recent. But the reason why dir_nlink was defined as an RO_COMPAT feature (EXT4_FEATURE_RO_COMPAT_DIR_NLINK) was because a dir_nlink-oblivious kernel could get upset when trying to rmdir a directory whose link count was 1.

> > The fact that we've gone ten years without anyone noticing or complaining
>
> More accurately, we've gone ten years before people connected the dots.
> This time, the original bug report was about 'ls'.

Can you give me a pointer to the original bug report? I'm curious how things were misbehaving.

> From my point of view, the worst thing about all this is that the
> dir_nlink feature is misdocumented and does not work as intended (i.e.,
> it's a flag that in effect cannot be turned off). Either dir_nlink needs
> to be documented and fixed; or failing that, the dir_nlink flag should be
> withdrawn and the ext4 documentation should clearly say that the link
> count of a directory is permanently set to 1 after it overflows past
> 64999. If you take the latter approach, you needn't update the ext4 code
> at all, just the documentation (though the documentation should note that
> 64999 is off-by-one compared to the 65000 that is nominally the maximum
> link count).

There was a time when we tried documenting things in terms that users could understand, as opposed to going into gory details that only a fs programmer would love. And in retrospect that was a mistake. We should have done both, with the civilian-friendly bit coming first.
It was also a mistake to have dir_nlink be automatically turned on in the case of ext4 file systems. As I said, we used to do this much more often, and ten years ago we weren't so strict about this rule. The issue is that at this point there are multiple implementations of the ext2/3/4 file system format, and not just Linux, so we can't assume that turning on some feature which is supported by a particular Linux kernel won't break things on some other implementation (e.g., Grub, BSD, etc.). It was a bad idea back then as well because sometimes people want to downgrade to old kernels, so having a new kernel automatically do something which causes the file system to become unmountable by an old kernel is unfriendly. Since dir_nlink was close to one of the first ext4 features that was added, it was perhaps more excusable --- but it was still wrong.

The problem with withdrawing the feature is that it would break a lot of users who want to have more than 65,000 subdirectories. Ext4 has been out there for a long time, and while it's true that many people don't create directory trees which are that bushy, I've gotten people screaming at us for much more minor bugs. So that's why I'm curious to see the original ls bug. (Maybe because most people don't use ls -lR on huge directory trees, because they don't like to grow old waiting for it to complete? :-)

So I think the right thing to do is to fix the documentation. We actually added the feature first, and only added the documentation much later, so it is the documentation which is wrong. So this is not a case of writing the spec first and then implementing to the spec. This is much more like Unix came first, and Posix came later to document how Unix worked. Except we weren't smart enough to add a clause, as Posix did, that anything that was allowed to use the Unix(tm) trademark was automatically spec/documentation compliant. :-)

And even before we added the documentation, it wasn't like it was a secret; it was just not that well known. But there were some blog entries that talked about it[1], and the description of how it worked was in the git commit message.

[1] https://www.whatastruggle.com/maximum-number-of-sub-directories-in-ext4

Ted, I think the right approach is to stop the ext4 kernel code from enabling dir_nlink automatically, but continue to set it by default at format time for new filesystems. I suspect the number of users that want to get an error returned when their directory grows large is very few, but at least they will be able to turn off dir_nlink and/or create the filesystem without this feature in the first place. This will make the dir_nlink feature more consistent with other features as well.

What I also just noticed is that e2fsck does not enable the dir_nlink feature in the superblock in case i_links_count > EXT2_LINK_MAX. I'm just working on a patch for this.

(In reply to Theodore Tso from comment #18)
> > More accurately, we've gone ten years before people connected the dots.
> > This time, the original bug report was about 'ls'.
>
> Can you give me a pointer to the original bug report? I'm curious how
> things were misbehaving.
> [...]
> The problem with withdrawing the feature is that it would break a lot of
> users who want to have more than 65,000 subdirectories. Ext4 has been out
> there for a long time, and while it's true that many people don't create
> directory trees which are that bushy, I've gotten people screaming at us
> for much more minor bugs. So that's why I'm curious to see the original
> ls bug.
Don't get your hopes up. I filed the original bug report https://debbugs.gnu.org/cgi/bugreport.cgi?bug=27739 because I noticed that `ll` (an alias to `ls -alF`) prints something that doesn't make sense. That's also why I filed the bug report on the GNU bug tracker: clearly, I was still able to access my directories, so they definitely still existed, but `ll` printed `1`, which didn't make sense to me, especially as I had never seen a directory with a hard link count lower than 2 before.

After successfully replicating¹ it on different machines in an as-simple-as-possible way, and thereby already having written bash code capable of replicating the bug, I provided them with that simple example which just creates empty folders.

After the GNU guys said it's not their bug, Paul replicated it on his machine (different distro) and filed this bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1471967

After Red Hat said it's not their bug, he filed the one whose discussion you're reading right now.

-----
¹ Well, kind of. I actually tried it on a third system (didn't mention it until now, afaik) which runs FreeBSD, but the code failed because either FreeBSD or its default FS (or whatever FS that machine uses) doesn't support enough subdirectories or something like that (it failed to create the high number of subdirectories for some reason, so I didn't even get to run `ls` under the right conditions) (ergo: Linux > FreeBSD). But it didn't fail in any way related to `ls` and worked on 2 different machines, so I filed the original bug report.

Michael, as discussed here, I don't think the problem is the number of links, but rather that fts is "optimizing" the traversal of a directory with a hard link count of 1 incorrectly. In your debbugs.gnu.org 27739 bug report this problem wasn't mentioned at all, only that the link count was set to 1, which is why I think it was closed as "not a bug". I would suggest re-opening 27739, or filing a new bug that describes the problem as Paul did in the first comment here, namely that fts is incorrectly optimizing the traversal of subdirectories when the directory link count is 1, when it should be treating this as "the subdirectory count is not being tracked".

Right, the fact that the two bugs Michael cited were about st_nlink being "wrong" is in fact a good thing. Paul is asserting that there is a vast amount of breakage because ext4 can return an st_nlink value of 1 on a directory, to the extent that he believes we should withdraw the dir_nlink feature. My argument is that if we've lived with this for ten years without users screaming about things being broken, then we're probably OK and we should fix this by changing the documentation. Paul then responded by pointing out the bugs. But if the bugs are about st_nlink being "wrong" causing users to be confused, and not about real life use cases breaking, then I'm going to be much less concerned.

For the record, the documentation is not wrong; the maximum link count is 65000 --- for files.

# touch 1
# for i in $(seq 2 65000) ; do ln 1 $i ; done
# ls -l 1
-rw-r--r-- 65000 root root 0 Jul 23 12:11 1
# ln 1 65001
ln: failed to create hard link to '1': Too many links

The fact that dir_nlink happens to let st_nlink go from 64999 to 1 is (1) not an overflow, and (2) isn't actually contradicting any documentation, insofar as all we say is that with dir_nlink the number of subdirectories is no longer limited by the EXT4_LINK_MAX count.
The documentation is only subtly wrong when it says:

       dir_nlink
              This ext4 feature allows more than 65000 subdirectories per
              directory.

That is actually a true statement; it *does* allow more than 65000 subdirectories per directory. What is not quite true is the implication that without that feature we are limited to 65000 subdirectories. Today, with ext4 the feature gets silently enabled, and with ext2 the failure happens after the 64998th subdirectory (because the directory's entry in its parent plus its '.' entry also count toward st_nlink).

I'll point out that POSIX/SUSv3 doesn't even guarantee *any* value of st_nlink for directories, since hard links to directories are not allowed, and what st_nlink means in the absence of hard links, or for directories in general, is not specified. So just as GCC developers delight in torturing kernel developers by allowing anything random (including launching nethack) for anything which is not concretely specified by the C standard, technically speaking a random Linux file system could let st_nlink wander around randomly, or increase according to a Fibonacci or geometric series, etc.

More seriously, for a network/clustered file system, if other subdirectories are being created while the file walk is taking place (or even in the case of a local file system), relying on st_nlink for an optimization is going to be dangerous. Using d_type is going to be ***much*** safer.

An update to e2fsprogs to clarify how dir_nlink works in the ext4 man page.

- Ted

commit 7d8f358cdce948df57b1001b9c278f33519afa86
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Sun Jul 23 18:51:22 2017 -0400

    Clarify how the description of the dir_nlink feature in the ext4 man page

    Signed-off-by: Theodore Ts'o <tytso@mit.edu>

diff --git a/misc/ext4.5.in b/misc/ext4.5.in
index c1e67abbc..3511dae62 100644
--- a/misc/ext4.5.in
+++ b/misc/ext4.5.in
@@ -69,7 +69,12 @@ ext2 file systems.
 .TP
 .B dir_nlink
 .br
-This ext4 feature allows more than 65000 subdirectories per directory.
+Normally ext4 allows an inode to have no more than 65,000 hard links.
+This applies to files as well as directories, which means that there can
+be no more than 64,998 subdirectories in a directory (because each of
+the '..' entries counts as a hard link).  This feature lifts this limit
+by causing ext4 to use a links count of 1 to indicate that the number of
+hard links to a directory is not known.
 .TP
 .B encrypt
 .br

(In reply to Theodore Tso from comment #24)

There are a couple of things I think you should change.

• Writing the number as "65'000" (fast to read and unambiguous to everyone) or "65 000" (SI style) would be better.
• The comma after the introductory adverb "Normally" is missing.
• "to regular files as well as directories" instead of "to files as well as directories". Remember: directories are files.
• Thousands separator, again.
• If you choose to mention '..', mention '.', too. Or mention neither, idc.
• It's called a "link count", not a "links count".

(In reply to Theodore Tso from comment #18)
> One of the things which confuses me is why you think there's so much
> code which tries to use the st_nlink hack. It's ***much*** simpler to
> just rely on d_type if it exists (and it does on most systems).

This is true only for one particular optimization; it is not true for others. For example, Gnulib takes advantage of the fact that a directory with st_nlink==2 has no subdirectories, if the directory is in a file system where this optimization is known to work. One can't easily use d_type for this.
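To make the distinction concrete, here is a sketch of the leaf-directory optimization being referred to (illustrative; not the actual Gnulib code): when listing directories in the spirit of 'find . -type d', a directory whose st_nlink is exactly 2 is known to have no subdirectories on file systems where the convention holds, so it does not need to be opened or read at all. d_type cannot provide that shortcut, since it requires reading the directory's entries in the first place.

/* Sketch of a 'find . -type d' style walk using the leaf optimization
 * described above (illustrative; not the Gnulib code).  The trust_nlink
 * flag stands in for a per-file-system check that the "2 + number of
 * subdirectories" convention actually holds; with ext4 dir_nlink
 * overflow st_nlink can be 1, so a count below 2 must never be trusted. */
#define _DEFAULT_SOURCE 1
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static void print_dirs(const char *dir, bool trust_nlink)
{
    struct stat st;
    if (lstat(dir, &st) != 0 || !S_ISDIR(st.st_mode))
        return;

    printf("%s\n", dir);

    /* Leaf optimization: a link count of exactly 2 means no
     * subdirectories, so this directory need not even be opened
     * or read. */
    if (trust_nlink && st.st_nlink == 2)
        return;

    DIR *dp = opendir(dir);
    if (!dp)
        return;

    struct dirent *de;
    while ((de = readdir(dp)) != NULL) {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;

        char path[4096];
        snprintf(path, sizeof path, "%s/%s", dir, de->d_name);
        print_dirs(path, trust_nlink);
    }
    closedir(dp);
}

int main(int argc, char **argv)
{
    print_dirs(argc > 1 ? argv[1] : ".", true);
    return 0;
}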
> 1) The assumption that st_nlink always has the property that it is >2
> and can be used to derive the number of subdirectories was never
> valid across all file system types

Yes, and Gnulib exploits the st_nlink assumption only on file systems where it is useful and/or known to work.

> 2) If you did descend into a file system which didn't support d_type,
> d_type would be DT_UNKNOWN instead of DT_REG or DT_DIR

Yes, and Gnulib doesn't use the optimization if d_type is DT_UNKNOWN.

> 3) Using DT_DIR means you can skip the stat check for all directory
> entries. If you are doing a recursive descent where you care about
> the name, you need to call readdir() on all of the directory entries
> anyway, so you will have access to d_type. If you are doing a recursive
> descent where you are checking on file ownership, you are doing the
> stat(2) anyway, so why not check S_ISDIR(st.st_mode) instead of blindly
> using the st_nlink hack?

No, you can do even better than that in some cases, if st_nlink works. Suppose we are implementing the equivalent of 'find . -type d'. If we come across a directory whose st_nlink == 2, then we don't need to readdir from the directory at all, much less stat its entries.

> 4) ... if your argument is what about legacy Unix code

There is more of that floating around than I'd like, yes. But I'm mostly worried about GNU code.

> Can you give me a pointer to the original bug report? I'm curious how
> things were misbehaving.

https://debbugs.gnu.org/cgi/bugreport.cgi?bug=27739

(In reply to Theodore Tso from comment #22)
> Paul is asserting that there is a vast amount of breakage because ext4
> can return an st_nlink value of 1 on a directory, to the extent that he
> believes we should withdraw the dir_nlink feature.

I think this gets my most recent proposal backwards. At the end of Comment 17, I proposed that the ext4 code act as if dir_nlink is always set. That's what the code has been doing for a decade anyway. All that's missing is documentation which says "the dir_nlink setting is irrelevant, and the file system always acts as if dir_nlink is set", or words to that effect.

Although in hindsight perhaps the dir_nlink flag should have been implemented properly, it wasn't, and there's little point to implementing it properly now: every application using ext4 must work with the current behavior anyway.