Bug 196405
Summary:            mkdir mishandles st_nlink in ext4 directory with 64997 subdirectories
Product:            File System
Component:          ext4
Status:             NEW
Severity:           normal
Priority:           P1
Hardware:           All
OS:                 Linux
Kernel Version:     4.13-rc1
Regression:         No
Reporter:           Paul Eggert (eggert)
Assignee:           fs_ext4 (fs_ext4)
CC:                 adilger.kernelbugzilla, michelbach94, tytso
Subsystem:
Bisected commit-id:
Attachments:        sample program illustrating how ext4 st_nlink behavior breaks glibc
Description
Paul Eggert
2017-07-17 21:23:44 UTC
On Mon, Jul 17, 2017 at 09:23:44PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> See the following shell transcript, run in an ext4 directory, for an
> illustration of the bug. The first four commands set things up. The 5th
> command 'mkdir d/64998' calls 'mkdir("d/64998", 0777)' and succeeds. The
> 6th command shows that d's resulting st_nlink value is 1, which is
> incorrect; it should be 65000. I discovered this bug in Fedora 26, which
> runs Linux 4.11.10; from inspecting the source it appears the bug is
> still there in 4.13-rc1.

This is actually working as intended. In order to support a large number of directories, when we have a 16-bit link count field, we use a link count of 1 to mean "lots". If you don't want this behavior because strict POSIX compliance is preferable to failing with ENOSPC when you hit that overflow condition, you can create the file system with the dir_nlink feature turned off.

I will note that various userspace utilities that try to optimize directory walking by depending on the directory's link count to understand when they have found all of the subdirectories know that a directory link count of 1 means "lots". This is why a link count of 1 was chosen.

- Ted

Thank you, Ted. Is it necessary for it to "break" (quotes because 1 seems to be a kind-of wildcard) at such a low and weird value?

This behavior doesn't occur with ext4 ram disks. Run the same command, possibly with an even higher number, but within an ext4 ram disk, created via

$ sudo mount -t ramfs -o size=4G ext4 <path>

Example output:

$ ll
total 4
drwxr-xr-x      3 christoph root         0 Jul 18 22:59 ./
drwxrwxrwt     18 root      root      4096 Jul 18 23:00 ../
drwxrwxr-x 446639 christoph christoph    0 Jul 18 23:05 d/

> Is it necessary for it to "break" (quotes because 1 seems to be a kind-of
> wildcard) at such a low and weird value?

The reason for choosing 1 is that there were other file systems, without a '..' entry, where the link count of a directory is 1. In fact, I just checked; I can't find anything in POSIX or the Single Unix Specification which guarantees that '.' and '..' are hard links. It is that way historically, but there are file systems such as VFAT where this is not true.

That's important because there are programs which, if they are doing a directory tree walk and see a link count of 42, will assume that once they have seen 42 directories there are no more subdirectories at that level. I suppose we could have used a link count of 0xFFFFFFFF as the magic "lots" value, but that could potentially trigger overflow bugs in programs. And most of these programs already understand that 1 means "lots", so they can do the right thing when you run "find" on a mounted thumb drive using VFAT.

> This behavior doesn't occur with ext4 ram disks. Run the same command,
> possibly with an even higher number, but within an ext4 ram disk,
> created via
>
> $ sudo mount -t ramfs -o size=4G ext4 <path>

That's not an ext4 ram disk. That's a tmpfs partition (ramfs is an alias for tmpfs). "ext4" in the above invocation can be anything, and it has no meaning. Just as "mount -t proc proc /proc" is equivalent to "mount -t proc xyzzy /proc", the above can also be:

sudo mount -t ramfs -o size=4G xyzzy <path>

and it will be the same as "sudo mount -t ramfs -o size=4G ext4 <path>".

- Ted

The limit is implemented as < 65000, which was somewhat below the hard limit of the ext4_inode.i_links_count __u16 field of 65535.
There is no hard reason for 65000 except to allow some margin for detecting overflow, and reserved values if there was a need. I don't think it makes a big practical difference whether directory link counts are accurate up to 65000 or 65534 subdirectories, since very few systems have this many subdirectories. In theory, we could add an i_links_count_hi field to extend this to a 32-bit value, but it isn't clear whether there is a big benefit from doing this.

FYI, the "link count == 1" meaning is historical for ancient filesystems which didn't track the link count of the directory at all. In that respect, it didn't mean "lots of subdirectories", but rather "I don't know the subdirectory count at all".

(In reply to Andreas Dilger from comment #5)
> the "link count == 1" meaning is historical for ancient filesystems ...
> it didn't mean "lots of subdirectories", but rather "I don't know the
> subdirectory count at all".

This meaning is still true for ext4. For example:

$ mkdir d d/{1..64998}
$ rm -fr d
$ mkdir d
$ ls -ld d
drwxr-xr-x 2 eggert eggert 4096 Jul 18 15:14 d
$ mkdir d/{1..64998}
$ rmdir d/*
$ ls -ld d
drwxr-xr-x 1 eggert eggert 1441792 Jul 18 15:14 d

That last link count of 1 means "I don't know the subdirectory count", even though d has no subdirectories whatsoever.

Created attachment 257593 [details]
sample program illustrating how ext4 st_nlink behavior breaks glibc
Run this program in an empty ext4 directory to illustrate the problem: fts will misbehave and will not find the two files "needle" in the large-directory haystack.
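The attachment itself is not reproduced in this thread. A minimal sketch of a test along the lines described (the directory layout, counts, and fts flags below are illustrative assumptions, not copied from the attachment) might look like the following, using glibc's fts interface:

/* fts-nlink-demo.c: a sketch of a reproduction along the lines of the
 * attached fts-test.c (names, counts, and flags here are illustrative,
 * not copied from the attachment).
 * Build: cc -o fts-nlink-demo fts-nlink-demo.c
 * Run it in an empty directory on the file system under test.
 */
#define _DEFAULT_SOURCE 1
#include <sys/types.h>
#include <sys/stat.h>
#include <fts.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NSUBDIRS 64998          /* enough to overflow the ext4 link count */

static void make_file(const char *path)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(1); }
    fclose(f);
}

int main(void)
{
    char name[64];

    if (mkdir("d", 0777) != 0) { perror("mkdir d"); return 1; }
    for (int i = 1; i <= NSUBDIRS; i++) {
        snprintf(name, sizeof name, "d/%d", i);
        if (mkdir(name, 0777) != 0) { perror(name); return 1; }
    }
    /* Hide two "needle" files inside two of the subdirectories. */
    make_file("d/1/needle");
    make_file("d/64998/needle");

    /* Walk the tree with a physical walk and let fts skip stat() calls
     * where it believes it can; a walker that trusts the parent's link
     * count can fail to find entries when st_nlink is 1. */
    char *paths[] = { "d", NULL };
    FTS *fts = fts_open(paths, FTS_PHYSICAL | FTS_NOSTAT, NULL);
    if (!fts) { perror("fts_open"); return 1; }

    int needles = 0;
    for (FTSENT *e; (e = fts_read(fts)) != NULL; )
        if (strcmp(e->fts_name, "needle") == 0)
            needles++;
    fts_close(fts);

    printf("%d needles found (should be 2)\n", needles);
    return needles == 2 ? 0 : 1;
}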
(In reply to Theodore Tso from comment #1)
> In order to support a large number of directories, when we have a 16-bit
> link count field, we use a link count of 1 to mean "lots".

Where is this behavior documented? I don't see it mentioned in e2fsprogs/doc/libext2fs.texinfo. Where should I look?

> various userspace utilities that try to optimize directory walking by
> depending on the directory's link count to understand when they have
> found all of the subdirectories know that directory link count of 1
> means "lots".

Although I've been contributing to the GNU core utilities for many years, this behavior is news to me. I just checked the GNU coreutils source, and these utilities do not know that 1 means "lots". Although the code happens to work, it is pure luck. GNU findutils is similar.

The GNU C library's fts functions can misbehave on file systems where st_nlink undercounts the number of subdirectories. To reproduce the problem, run the program fts-test.c (attached to this bug report) in a fresh directory. On an ext4 file system, the program outputs "0 needles found (should be 2)" and fails, due to the incompatibility between ext4 st_nlink behavior and what the GNU C library expects.

I took a look at the find code at http://git.savannah.gnu.org/cgit/findutils.git and while the "ftsfind.c" code does not check nlinks directly (presumably that is done in fts), the "oldfind.c::process_dir()" code has a check for st_nlink < 2 to disable the "subdirs" heuristic.

(In reply to Andreas Dilger from comment #9)
> while the "ftsfind.c" code does not check nlinks directly (presumably that
> is done in fts), the "oldfind.c::process_dir()" code has a check for
> st_nlink < 2

Sure, but (as its name suggests) oldfind.c is obsolete and is no longer used. The mainline 'find' code does not have a special case for st_nlink < 2, and works only by accident.

On Wed, Jul 19, 2017 at 08:02:25AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=196405
>
> --- Comment #10 from Paul Eggert (eggert@cs.ucla.edu) ---
> (In reply to Andreas Dilger from comment #9)
> > while the "ftsfind.c" code does not check nlinks directly (presumably
> > that is done in fts), the "oldfind.c::process_dir()" code has a check
> > for st_nlink < 2
>
> Sure, but (as its name suggests) oldfind.c is obsolete and is no longer
> used. The mainline 'find' code does not have a special case for
> st_nlink < 2, and works only by accident.

Well, ext4's dir_nlink feature has been around for the last *decade*:

commit f8628a14a27eb4512a1ede43de1d9db4d9f92bc3
Author: Andreas Dilger <adilger@clusterfs.com>
Date:   Wed Jul 18 08:38:01 2007 -0400

    ext4: Remove 65000 subdirectory limit

    This patch adds support to ext4 for allowing more than 65000
    subdirectories. Currently the maximum number of subdirectories is
    capped at 32000.

    If we exceed 65000 subdirectories in an htree directory it sets the
    inode link count to 1 and no longer counts subdirectories. The
    directory link count is not actually used when determining if a
    directory is empty, as that only counts subdirectories and not
    regular files that might be in there.....

.... and it is based on behavior that has been around for some legacy file systems for much longer than that. If the mainline "find" code has regressed, I suggest you file a bug with the findutils/glibc folks, as that would appear to be their bug, not ext4's.

As I've said, I've looked at the Posix and SUS specs, and '.' and '..'
are specified to be "special filenames" that have to be honored when resolving pathnames. There is no requirement that they have to be implemented as hard links, and therefore there is no guarantee that st_nlink can be used as a hint for the number of subdirectories. And that's good, because there are file systems that don't have hard links at all (NTFS, for example; and there have been versions of Windows that have gotten Posix certification).

- Ted

(In reply to Theodore Tso from comment #1)
> If you don't want this behavior because strict POSIX compliance is
> preferable to failing with ENOSPC when you hit that overflow condition,
> you can create the file system with the dir_nlink feature turned off.

POSIX says that mkdir should fail with EMLINK if the link count in the parent directory would exceed LINK_MAX. If the POSIX-compliant mode is setting errno==ENOSPC then that sounds like a bug; it should fail with errno==EMLINK.

Although the documentation for tune2fs says that dir_nlink can be cleared in an existing ext4 filesystem, this did not work for me on Fedora 26 x86-64. I unmounted the filesystem, ran "tune2fs -O ^dir_nlink /dev/sda2", remounted the filesystem, and verified that dir_nlink was turned off by running "dumpe2fs -h /dev/sda2". The fts-test.c program attached to this bug report still failed in the same way. Either this is a filesystem bug that should be fixed, or the documentation should be changed so that it does not incorrectly say that tune2fs can usefully change the dir_nlink option.

I then created a new ext4 file system with dir_nlink disabled from birth. Here, the test program failed in mkdir("d/32757", 0777) with ENOSPC. Yet there was plenty of space in the file system, and the link count was only 32758, which is way lower than LINK_MAX (which is 65000). Presumably the ext4 code arbitrarily limits the link count to 32758 (i.e., to 2**15 - 10) when dir_nlink is clear. This sounds like a bug; the limit should be 65000.

One more thing. In the typical case where dir_nlink is set, there is an off-by-one error: LINK_MAX is 65000, and in fts-test.c the parent's link count would reach only 65000, so this should work. That is, the link count should wrap around to 1 when it would otherwise reach 65001, not when it would otherwise reach 65000.

(In reply to Theodore Tso from comment #11)
> If the mainline "find" code has regressed

It hasn't regressed; it still works as long as the directory has fewer than 2**32 subdirectories (assuming x86), and as long as the compiler generates code compatible with -fwrapv semantics, and this is good enough in practice. It's still a matter of luck that findutils works, though. And glibc itself does not work, as shown in the fts-test.c program attached to this bug report.

> I've looked at the Posix and SUS specs, and '.' and '..' are specified to
> be "special filenames" that have to be honored when resolving pathnames.
> There is no requirement that they have to be implemented as hard links

Yes, of course. However, 'find', etc. have optimizations for GNU/Linux, e.g., code like this:

#if defined __linux__ && HAVE_FSTATFS && HAVE_STRUCT_STATFS_F_TYPE
  [special code that runs only on GNU/Linux platforms, and that
   significantly improves performance on those platforms]
#else
  [generic code that runs on any POSIX platform, albeit more slowly]
#endif

and the GNU/Linux-specific code is broken on ext4 because the ext4 st_nlink is not a link count.
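For readers unfamiliar with the optimization under discussion, here is a rough sketch of the traditional trick (illustrative only; this is not the findutils, gnulib, or glibc code): on classic Unix file systems a directory's link count is 2 plus its number of subdirectories, so a tree walk can stop calling stat() once that many subdirectories have been found. The comment in the middle marks where an ext4 link count of 1 invalidates the arithmetic.

/* Sketch of the classic st_nlink tree-walk optimization (illustrative;
 * not the findutils, gnulib, or glibc code).  On traditional Unix file
 * systems a directory's link count is 2 + the number of subdirectories,
 * so once that many subdirectories have been found, every remaining
 * entry can be treated as a non-directory without calling stat(). */
#define _DEFAULT_SOURCE 1
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static void walk(const char *dir)
{
    struct stat st;
    if (lstat(dir, &st) != 0 || !S_ISDIR(st.st_mode))
        return;

    /* With ext4's dir_nlink behavior st_nlink may be 1, so this unsigned
     * subtraction wraps to a huge value and the walk silently loses the
     * optimization.  Code that does the arithmetic in a narrower or
     * signed type, or that otherwise trusts a count of 1, can instead
     * conclude there are no subdirectories and skip them entirely,
     * which is the failure mode described in this bug. */
    uintmax_t subdirs_left = (uintmax_t) st.st_nlink - 2;

    DIR *dp = opendir(dir);
    if (!dp)
        return;

    struct dirent *de;
    while ((de = readdir(dp)) != NULL) {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;

        char path[4096];
        snprintf(path, sizeof path, "%s/%s", dir, de->d_name);

        if (subdirs_left == 0) {
            /* The optimization: everything left is assumed non-directory. */
            printf("f %s\n", path);
            continue;
        }
        if (lstat(path, &st) == 0 && S_ISDIR(st.st_mode)) {
            subdirs_left--;
            walk(path);
        } else {
            printf("f %s\n", path);
        }
    }
    closedir(dp);
}

int main(int argc, char **argv)
{
    walk(argc > 1 ? argv[1] : ".");
    return 0;
}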
Obviously we could fix the problem by using the generic code on GNU/Linux too; but this would hurt GNU/Linux performance significantly in some cases.

> there are file systems that don't have hard links at all (NTFS, for
> example; and there have been versions of Windows that have gotten Posix
> certification).

The code in question already deals with both these issues, by avoiding st_nlink-based optimizations on NTFS and other filesystems where st_nlink is unreliable. The ext4 problem, though, is new to me, and evidently to everyone else who's maintained the glibc etc. code, and this is why glibc is currently broken on ext4 directories with 64998 or more subdirectories.

How about this idea for moving forward?

1. Clearly document that setting dir_nlink can break user-mode code, such as glibc's fts functions.
2. Fix the four ext4 bugs that I mentioned in Comment 12.
3. For GNU utilities, override glibc's fts functions to work around the bugs when they operate on ext4 filesystems.
4. File a glibc bug report for the bug exhibited in fts-test.c.
5. Disable dir_nlink in new ext4 filesystems, unless it is specifically requested.

The combination of these changes should fix the problem in the long run. I can volunteer to do (3) and (4). Can you do (1), (2), and (5)?

One unfortunate situation is that "getconf LINK_MAX" is invented by glibc based on the fstype reported by statfs() and not actually extracted from the kernel. That makes LINK_MAX accurate only in a subset of cases, depending on the version of ext2/ext3/ext4 in use and the filesystem features enabled, and it definitely isn't reporting values from the filesystem on a mount-by-mount basis. I ran into this problem in the past when running the POSIX test suite for Lustre, and consider LINK_MAX to be the minimum number of subdirectories that can be created in a directory, rather than the maximum.

Checking the ext4 code, it is returning -EMLINK for ext4_link() on regular files and for ext4_mkdir() for subdirectories, and has been since before it was forked from ext3. I'm not sure where your ENOSPC is coming from. I found an old RHEL6 system that didn't have dir_nlink set to test this, but the feature was enabled once the directory passed ~65000 subdirs, so I didn't get an error. I did try testing on a small newly created ext4 filesystem with 1024-byte blocks (in case the limit was with the 2-level htree), and hit ENOSPC because I ran out of inodes... 32757 has never been a directory limit imposed by ext2/3/4, so I suspect you hit a similar problem (11 inodes are reserved by ext* for internal use).

As for wrap-around at 65000 vs. 65001 links, I can agree that is a minor bug that could be fixed. Strangely, in continued testing on my old RHEL6 box (with a larger filesystem and dir_nlink disabled) I was able to successfully create 64998 subdirectories, and ls -l reported 65000 links on the parent directory, so it may be that the 64998 overflow is a problem that was added after the dir_nlink feature was first created.

The most important issue is that nlinks=1 on the directory causes fts() to miss entries during scanning. It doesn't make sense for it to take nlinks=1, subtract 2 links for "." and "..", and expect to find "-1" subdirectories. It may be that this causes an unsigned underflow, and tools like "find" will not stop scanning until they hit 2^32-1 entries or similar? At least in my tests "find" correctly found the "needle" files even if fts-test.c did not.
Also worthy of note: on my Mac (OSX 10.12.5, HFS+ Journaled fs), running fts-test.c with 65536 subdirectories has "ls -ld d" reporting 0 links, but fts-test.c still passes.

I don't think you need to disable the optimization for all of Linux. All you need to do is to disable the optimization if the link count on the directory is 1. A traditional Unix directory will always have a link count of 2, because if /mnt/dir is a directory, there will be one hard link for "/mnt/dir", and another hard link for "/mnt/dir/.". Hence it should be very simple for glibc to detect the case where the link count is 1 and realize that it shouldn't try to use the optimization.

There are other Linux file systems which use this same convention. For example, directories in /proc:

# stat /proc/sys/kernel/
  File: /proc/sys/kernel/
  Size: 0           Blocks: 0          IO Block: 1024   directory
Device: 4h/4d       Inode: 10298      Links: 1    ...
                                      ^^^^^^^^

The reason why I thought this was a regression in find is because you said that code which understood the n_links=1 convention was in the old find code? Regardless, this behavior has been around for decades. I suspect if I checked a Linux 0.99 kernel, it would show this behavior in procfs.

There are a few things which I think we are getting wrong.

First, the documentation is not quite right. It claims that the limit is 65,000 subdirectories, when in fact what dir_nlink does is to exempt the 65,000 maximum number of hard links limitation from applying to subdirectories in a directory.

Secondly, the ext4 code will silently set the dir_nlink feature flag if there is an attempt to create a subdirectory which exceeds EXT4_LINK_MAX and the directory is using directory indexing. There have been times in the past when ext4 would silently set feature flags, but I believe that's a bad thing to do. Back in 2007 it was apparently still tolerated, but I think we should change things such that if the dir_nlink feature is not enabled, the kernel should return an error if creating a subdirectory would violate EXT4_LINK_MAX, instead of automagically setting the feature flag.

Finally, allowing tune2fs to clear the dir_nlink flag is not a safe thing to do. We could allow it if tune2fs were to scan the whole file system making sure there are no directories with an i_links_count of 1. But it's easier to just disallow clearing the flag.

I disagree that we should disable dir_nlink in the future. Old find utilities apparently had old code that would do the right thing. The fact that it is not in ftw is unfortunate, but I will note that ftw will break for Linux's /proc/sys directories as well, and this behavior has been around for a long, Long, LONG time. The fact that glibc was mistaken in assuming the optimization was always safe for Linux is a glibc bug. I don't understand why you resist the suggestion of disabling the optimization iff st_nlink==1. That is a clearly safe thing to do.

As far as other programs which might make the same mistake glibc did: since Posix does not guarantee that '.' and '..' are implemented as hard links, having an st_nlink of 1 for directories is completely allowed by Posix (i.e., a Posix environment which does this is a conforming environment). Hence, a Strictly Conforming (or Strictly Portable) Posix application should not be making this assumption. The fact that we've gone ten years without anyone noticing or complaining is a pretty strong indicator to me that this isn't a serious portability problem.

In terms of checking the ext4 code, I think you're confused.
It's always done what I've described, although how it does the check is a bit confusing. See the following in ext4.h:

#define is_dx(dir) (ext4_has_feature_dir_index((dir)->i_sb) && \
		    ext4_test_inode_flag((dir), EXT4_INODE_INDEX))
#define EXT4_DIR_LINK_MAX(dir) (!is_dx(dir) && (dir)->i_nlink >= EXT4_LINK_MAX)

Then see the very beginning of ext4_mkdir() and ext4_inc_count() in fs/ext4/namei.c. I believe we should add a check for ext4_has_feature_dir_nlink(), as described above, but the behavior that ext4 has been exhibiting hasn't changed in a very long time. That's why you saw the behavior you did on your old RHEL6 system.

(In reply to Andreas Dilger from comment #14)
> I did try testing on a small newly created ext4 filesystem with 1024-byte
> blocks (in case the limit was with the 2-level htree), and hit ENOSPC
> because I ran out of inodes...

Yes, apparently that was my problem too. Thanks for catching that. I fixed that, and ran into another problem: disabling dir_nlink is ineffective, i.e., mkdir continues to set the parent directory's link count to 1 when it overflows. That is, if I run the following as root:

# fallocate -l 1G ~eggert/junk/image.iso
# mkfs.ext4 -O ^dir_nlink -N 110000 ~eggert/junk/image.iso
# mount ~eggert/junk/image.iso /mnt
# chmod a+rwx /mnt

and then run the test program in the /mnt directory, the test program still fails in the same way, creating a parent directory with st_nlink == 1 in the process. Afterwards, the file system's dir_nlink flag is set even though I did not set it. (Note added later: I see that Theodore Tso also noticed this problem.) So dir_nlink is not really working for ext4, in the sense that st_nlink cannot be made to work in a POSIX-compatible way.

> That makes LINK_MAX accurate only in a subset of cases, depending on the
> version of ext2/ext3/ext4 in use and filesystem features enabled, and it
> definitely isn't reporting values from the filesystem on a mount-by-mount
> basis.

Ouch, I didn't know that. This is another POSIX-compatibility problem, but one thing at a time....

> The most important issue is that nlinks=1 on the directory causes fts() to
> miss entries during scanning. It doesn't make sense for it to take
> nlinks=1, subtract 2 links for "." and "..", and expect to find "-1"
> subdirectories.

No, clearly the glibc code assumes GNU/Linux directories always have a link count of at least 2.

> It may be that this causes an unsigned underflow and tools like "find"
> will not stop scanning until they hit 2^32-1 entries or similar?

I think "find" is OK because it doesn't happen to hit this particular fts bug. I think there may well be similar fts bugs elsewhere, though -- possibly bugs that "find" could hit.

> Also worthy of note, on my Mac (OSX 10.12.5, HFS+ Journaled fs), running
> fts-test.c with 65536 subdirectories has "ls -ld d" reporting 0 links,
> but fts-test.c still passes.

Yes, macOS fts is different. It would not surprise me if it didn't have the bug we're talking about (also, it's probably significantly slower).

(In reply to Theodore Tso from comment #15)
> I don't think you need to disable the optimization for all of Linux. All
> you need to do is to disable the optimization if the link count on the
> directory is 1.

Yes, that makes sense, and I plan to do that: these are steps 3 and 4 in my Comment 13 for this bug. Unfortunately, there is a reasonable amount of code that assumes the traditional Unix behavior (not just in glibc), and I doubt whether I will be able to track it all down.
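A sketch of the kind of guard being discussed (the helper name is hypothetical, and this is not glibc's eventual patch): treat any directory link count below 2 as "unknown" and fall back to the unoptimized path for that directory.

#include <stdbool.h>
#include <sys/stat.h>

/* Hypothetical helper (not glibc's actual fix): decide whether a
 * directory's st_nlink can be trusted as "2 + number of subdirectories".
 * A link count below 2 (ext4 after dir_nlink overflow, /proc, VFAT, ...)
 * means the count is unknown, so the caller must skip the link-count
 * optimization for this directory and examine every entry instead. */
static bool dir_nlink_is_trustworthy(const struct stat *dir_st)
{
    return S_ISDIR(dir_st->st_mode) && dir_st->st_nlink >= 2;
}

In a walker like the earlier sketch, the subdirs_left arithmetic would be used only when this predicate returns true; otherwise every entry is examined.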
> The reason why I thought this was a regression in find is because you
> said that code which understood the n_links=1 convention was in the old
> find code?

Yes it was. The current 'find' code does not know about the convention. Although 'find' happens to work as a matter of luck for this particular test case, I have the sneaking suspicion that there are other test cases where it does not work. The assumption is used in multiple places in 'find' and I have not checked them all. Similarly for 'tar' and other GNU applications.

> allowing tune2fs to clear the dir_nlink flag is not a safe thing to do.

That depends on what the dir_nlink flag is supposed to mean. (Since the flag does not work now, we can define it to mean what we like. :-) If dir_nlink means "set a directory link count to 1 if it would overflow", and if a link count of 1 never changes regardless of what dir_nlink is set to, then why would it be a problem to allow tune2fs to alter the dir_nlink flag? dir_nlink would affect only future calls to mkdir, not past ones.

> ftw will break for Linux's /proc/sys directories as well

Yes. However, ftw is normally applied to user files, so it's significantly more important that ftw work there.

> As far as other programs who might make the same mistake glibc did, since
> Posix does not guarantee that

I'm worried about code intended to run on traditional Unix and GNU/Linux, not about portable POSIX code. There is a reasonable amount of code that uses st_nlink as a way to avoid unnecessary stat calls when traversing a file system. This provides a significant performance boost on traditional Unix and GNU/Linux, and it would be a shame to lose this performance benefit.

> The fact that we've gone ten years without anyone noticing or complaining

More accurately, we've gone ten years before people connected the dots. This time, the original bug report was about 'ls'. This isn't a bug in 'ls', so it got turned into a bug report for 'lstat'. But this isn't about lstat either, so it got turned into a bug report for ext4. I'm sure other people have noticed the problem before; it's just that few people are dogged and expert enough to track the bug down to the actual cause.

From my point of view, the worst thing about all this is that the dir_nlink feature is misdocumented and does not work as intended (i.e., it's a flag that in effect cannot be turned off). Either dir_nlink needs to be documented and fixed; or failing that, the dir_nlink flag should be withdrawn and the ext4 documentation should clearly say that the link count of a directory is permanently set to 1 after it overflows past 64999. If you take the latter approach, you needn't update the ext4 code at all, just the documentation (though the documentation should note that 64999 is off-by-one compared to the 65000 that is nominally the maximum link count).

On Fri, Jul 21, 2017 at 08:22:05AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> I'm worried about code intended to run on traditional Unix and GNU/Linux,
> not about portable POSIX code. There is a reasonable amount of code that
> uses st_nlink as a way to avoid unnecessary stat calls when traversing a
> file system. This provides a significant performance boost on traditional
> Unix and GNU/Linux, and it would be a shame to lose this performance
> benefit.

One of the things which confuses me is why you think there's so much code which tries to use the st_nlink hack. It's ***much*** simpler to just rely on d_type if it exists (and it does on most systems).
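As a concrete illustration of the d_type approach (a sketch under the assumptions discussed here, not code from any particular tool): readdir() fills in d_type on most Linux file systems, so a recursive walk can recurse into DT_DIR entries directly and fall back to lstat() only for entries reported as DT_UNKNOWN.

/* Recursive descent using d_type instead of the st_nlink hack (a sketch;
 * not code from any of the tools discussed).  DT_UNKNOWN means the file
 * system did not fill in the entry type, so fall back to lstat() for
 * that entry only. */
#define _DEFAULT_SOURCE 1       /* for d_type / DT_* on glibc */
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

static void walk(const char *dir)
{
    DIR *dp = opendir(dir);
    if (!dp)
        return;

    struct dirent *de;
    while ((de = readdir(dp)) != NULL) {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;

        char path[4096];
        snprintf(path, sizeof path, "%s/%s", dir, de->d_name);

        int is_dir;
        if (de->d_type != DT_UNKNOWN) {
            is_dir = (de->d_type == DT_DIR);    /* no stat() needed */
        } else {
            struct stat st;                     /* fs without d_type support */
            is_dir = (lstat(path, &st) == 0 && S_ISDIR(st.st_mode));
        }

        if (is_dir)
            walk(path);
        else
            printf("%s\n", path);
    }
    closedir(dp);
}

int main(int argc, char **argv)
{
    walk(argc > 1 ? argv[1] : ".");
    return 0;
}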
If the answer is that d_type might not be supported on all file systems, and a recursive descent might enter a file system which didn't support d_type:

1) The assumption that st_nlink always has the property that it is >2 and can be used to derive the number of subdirectories was never valid across all file system types, so you could descend into a file system type where that wasn't true.

2) If you did descend into a file system which didn't support d_type, d_type would be DT_UNKNOWN instead of DT_REG or DT_DIR.

3) Using DT_DIR means you can skip the stat check for all directory entries. If you are doing a recursive descent where you care about the name, you need to call readdir() on all of the directory entries anyway, so you will have access to d_type. If you are doing a recursive descent where you are checking on file ownership, you are doing the stat(2) anyway, so why not check S_ISDIR(st.st_mode) instead of blindly using the st_nlink hack?

4) Using d_type/DT_DIR is implemented in anything that was BSD derived, and many of the SysV derived systems (to the extent that they imported the BSD Fast Filesystem) would have also had d_type support. So if your argument is "what about legacy Unix code", I would think that most of it would have used the much more performant and simpler-to-use d_type interface.

> > allowing tune2fs to clear the dir_nlink flag is not a safe thing to do.
>
> That depends on what the dir_nlink flag is supposed to mean. (Since the
> flag does not work now, we can define it to mean what we like. :-) If
> dir_nlink means "set a directory link count to 1 if it would overflow",
> and if a link count of 1 never changes regardless of what dir_nlink is
> set to, then why would it be a problem to allow tune2fs to alter the
> dir_nlink flag? dir_nlink would affect only future calls to mkdir, not
> past ones.

Well, it's mostly safe now because ten years have passed and even the most pathetically obsolete Enterprise Distro installation has updated to something more recent. But the reason why dir_nlink was defined as an RO_COMPAT feature (EXT4_FEATURE_RO_COMPAT_DIR_NLINK) was because a dir_nlink-oblivious kernel could get upset when trying to rmdir a directory whose link count was 1.

> > The fact that we've gone ten years without anyone noticing or complaining
>
> More accurately, we've gone ten years before people connected the dots.
> This time, the original bug report was about 'ls'.

Can you give me a pointer to the original bug report? I'm curious how things were misbehaving.

> From my point of view, the worst thing about all this is that the
> dir_nlink feature is misdocumented and does not work as intended (i.e.,
> it's a flag that in effect cannot be turned off). Either dir_nlink needs
> to be documented and fixed; or failing that, the dir_nlink flag should be
> withdrawn and the ext4 documentation should clearly say that the link
> count of a directory is permanently set to 1 after it overflows past
> 64999. If you take the latter approach, you needn't update the ext4 code
> at all, just the documentation (though the documentation should note that
> 64999 is off-by-one compared to the 65000 that is nominally the maximum
> link count).

There was a time when we tried documenting things in terms that users could understand, as opposed to going into gory details that only a fs programmer would love. And in retrospect that was a mistake. We should have done both, with the civilian-friendly bit coming first.
It was also a mistake to have dir_nlink be automatically turned on in the case of ext4 file systems. As I said, we used to do this much more often, and ten years ago we weren't so strict about this rule. The issue is that at this point there are multiple implementations of the ext2/3/4 file system format, and not just Linux, so we can't assume that turning on some feature which is supported by a particular Linux kernel won't break things on some other implementation (e.g., Grub, BSD, etc.). It was a bad idea back then as well because sometimes people want to downgrade to old kernels, so having a new kernel automatically do something which causes the file system to become unmountable by an old kernel is unfriendly. Since dir_nlink was close to one of the first ext4 features that was added, it was perhaps more excusable --- but it was still wrong.

The problem with withdrawing the feature is that it would break a lot of users who want to have more than 65,000 subdirectories. Ext4 has been out there for a long time, and while it's true that many people don't create directory trees which are that bushy, I've gotten people screaming at us for much more minor bugs. So that's why I'm curious to see the original ls bug. (Maybe because most people don't use ls -lR on huge directory trees, because they don't like to grow old waiting for it to complete? :-)

So I think the right thing to do is to fix the documentation. We actually added the feature first, and only added the documentation much later, so it is the documentation which is wrong. So this is not a case of writing the spec first and then implementing to the spec. This is much more like Unix came first, and Posix came later to document how Unix worked. Except we weren't smart enough to add a clause, as Posix did, that anything that was allowed to use the Unix(tm) trademark was automatically spec/documentation compliant. :-)

And even before we added the documentation, it wasn't like it was a secret; it was just not that well known. But there were some blog entries that talked about it[1], and the description of how it worked was in the git commit message.

[1] https://www.whatastruggle.com/maximum-number-of-sub-directories-in-ext4

Ted, I think the right approach is to stop the ext4 kernel code from enabling dir_nlink automatically, but continue to set it by default at format time for new filesystems. I suspect the number of users that want to get an error returned when their directory grows large is very few, but at least they will be able to turn off dir_nlink and/or create the filesystem without this feature in the first place. This will make the dir_nlink feature more consistent with other features as well.

What I also just noticed is that e2fsck does not enable the dir_nlink feature in the superblock in case i_links_count > EXT2_LINK_MAX. I'm just working on a patch for this.

(In reply to Theodore Tso from comment #18)
> > More accurately, we've gone ten years before people connected the dots.
> > This time, the original bug report was about 'ls'.
>
> Can you give me a pointer to the original bug report? I'm curious how
> things were misbehaving.
> [...]
> The problem with withdrawing the feature is that it would break a lot of
> users who want to have more than 65,000 subdirectories. Ext4 has been out
> there for a long time, and while it's true that many people don't create
> directory trees which are that bushy, I've gotten people screaming at us
> for much more minor bugs. So that's why I'm curious to see the original
> ls bug.
Don't get your hopes up. I filed the original bug report https://debbugs.gnu.org/cgi/bugreport.cgi?bug=27739 because I noticed that `ll` (an alias to `ls -alF`) prints something that doesn't make sense. That's also why I filed the bug report on the GNU bug tracker: clearly, I was still able to access my directories, so they definitely still existed, but `ll` printed `1`, which didn't make sense to me, especially as I had never seen a directory with a hard link count lower than 2 before.

After successfully replicating¹ it on different machines in an as-simple-as-possible way, and thereby already having written bash code capable of replicating the bug, I provided them with that simple example which just creates empty folders.

After the GNU guys said it's not their bug, Paul replicated it on his machine (different distro) and filed this bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1471967

After Red Hat said it's not their bug, he filed the one whose discussion you're reading right now.

-----
¹ Well, kind of. I actually tried it on a third system (didn't mention it until now, afaik) which runs FreeBSD, but the code failed because either FreeBSD or its default FS (or whatever FS that machine uses) doesn't support enough subdirectories or something like that (it failed to create the high number of subdirectories for some reason, so I didn't even get to run `ls` under the right conditions) (ergo: Linux > FreeBSD). But it didn't fail in any way related to `ls` and worked on 2 different machines, so I filed the original bug report.

Michael, as discussed here, I don't think the problem is the number of links, but rather that fts is "optimizing" the traversal of a directory with a hard link count of 1 incorrectly. In your debbugs.gnu.org 27739 bug report this problem wasn't mentioned at all, only that the link count was set to 1, which is why I think it was closed as "not a bug". I would suggest re-opening 27739, or filing a new bug that describes the problem as Paul did in the first comment here, namely that fts is incorrectly optimizing the traversal of subdirectories when the directory link count is 1, when it should be treating this as "the subdirectory count is not being tracked".

Right, the fact that the two bugs Michael cited were about st_nlink being "wrong" is in fact a good thing. Paul is asserting that there is a vast amount of breakage because ext4 can return an st_nlink value of 1 on a directory, to the extent that he believes we should withdraw the dir_nlink feature. My argument is that if we've lived with this for ten years without users screaming about things being broken, then we're probably OK and we should fix this by changing the documentation. Paul then responded by pointing out the bugs. But if the bugs are about st_nlink being "wrong" causing users to be confused, and not about real life use cases breaking, then I'm going to be much less concerned.

For the record, the documentation is not wrong; the maximum link count is 65000 --- for files.

# touch 1
# for i in $(seq 2 65000) ; do ln 1 $i ; done
# ls -l 1
-rw-r--r-- 65000 root root 0 Jul 23 12:11 1
# ln 1 65001
ln: failed to create hard link to '1': Too many links

The fact that dir_nlink happens to let st_nlink go from 64999 to 1 is (1) not an overflow, and (2) isn't actually contradicting any documentation, insofar as all we say is that with dir_nlink the number of subdirectories is no longer limited by the EXT4_LINK_MAX count.
The documentation is only subtly wrong when it says:

       dir_nlink
              This ext4 feature allows more than 65000 subdirectories per
              directory.

That is actually a true statement; it *does* allow more than 65000 subdirectories per directory. What is not quite true is the implication that without that feature we are limited to 65000 subdirectories. Today, with ext4 the feature gets silently enabled, and with ext2 the failure happens after the 64998th subdirectory (because the directory's entry in its parent plus its '.' entry also count toward st_nlink).

I'll point out that POSIX/SUSv3 doesn't even guarantee *any* value of st_nlink for directories, since hard links to directories are not allowed, and what st_nlink means in the absence of hard links, or for directories in general, is not specified. So just as GCC developers delight in torturing kernel developers by allowing anything random (including launching nethack) for anything which is not concretely specified by the C standard, technically speaking a random Linux file system could let st_nlink wander around randomly, or increase according to a Fibonacci or geometric series, etc.

More seriously, for a network/clustered file system, if other subdirectories are being created while the file walk is taking place (or even in the case of a local file system), relying on st_nlink for an optimization is going to be dangerous. Using d_type is going to be ***much*** safer.

An update to e2fsprogs to clarify how dir_nlink works in the ext4 man page.

- Ted

commit 7d8f358cdce948df57b1001b9c278f33519afa86
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Sun Jul 23 18:51:22 2017 -0400

    Clarify how the description of the dir_nlink feature in the ext4 man page

    Signed-off-by: Theodore Ts'o <tytso@mit.edu>

diff --git a/misc/ext4.5.in b/misc/ext4.5.in
index c1e67abbc..3511dae62 100644
--- a/misc/ext4.5.in
+++ b/misc/ext4.5.in
@@ -69,7 +69,12 @@ ext2 file systems.
 .TP
 .B dir_nlink
 .br
-This ext4 feature allows more than 65000 subdirectories per directory.
+Normally ext4 allows an inode to have no more than 65,000 hard links.
+This applies to files as well as directories, which means that there can
+be no more than 64,998 subdirectories in a directory (because each of
+the '..' entries counts as a hard link).  This feature lifts this limit
+by causing ext4 to use a links count of 1 to indicate that the number of
+hard links to a directory is not known.
 .TP
 .B encrypt
 .br

(In reply to Theodore Tso from comment #24)

There are a couple of things I think you should change.

• Writing the number as "65'000" (fast to read and unambiguous to everyone) or "65 000" (SI style) would be better.
• The comma after the introductory adverb "Normally" is missing.
• "to regular files as well as directories" instead of "to files as well as directories". Remember: directories are files.
• Thousands separator, again.
• If you choose to mention '..', mention '.', too. Or mention neither, idc.
• It's called a "link count", not a "links count".

(In reply to Theodore Tso from comment #18)
> One of the things which confuses me is why you think there's so much
> code which tries to use the st_nlink hack. It's ***much*** simpler to
> just rely on d_type if it exists (and it does on most systems).

This is true only for one particular optimization; it is not true for others. For example, Gnulib takes advantage of the fact that a directory with st_nlink==2 has no subdirectories, if the directory is in a file system where this optimization is known to work. One can't easily use d_type for this.
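To make the distinction concrete, here is a sketch of the leaf-directory optimization being referred to (illustrative; not the actual Gnulib code): when listing directories in the spirit of 'find . -type d', a directory whose st_nlink is exactly 2 is known to have no subdirectories on file systems where the convention holds, so it does not need to be opened or read at all. d_type cannot provide that shortcut, since it requires reading the directory's entries in the first place.

/* Sketch of a 'find . -type d' style walk using the leaf optimization
 * described above (illustrative; not the Gnulib code).  The trust_nlink
 * flag stands in for a per-file-system check that the "2 + number of
 * subdirectories" convention actually holds; with ext4 dir_nlink
 * overflow st_nlink can be 1, so a count below 2 must never be trusted. */
#define _DEFAULT_SOURCE 1
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static void print_dirs(const char *dir, bool trust_nlink)
{
    struct stat st;
    if (lstat(dir, &st) != 0 || !S_ISDIR(st.st_mode))
        return;

    printf("%s\n", dir);

    /* Leaf optimization: a link count of exactly 2 means no
     * subdirectories, so this directory need not even be opened
     * or read. */
    if (trust_nlink && st.st_nlink == 2)
        return;

    DIR *dp = opendir(dir);
    if (!dp)
        return;

    struct dirent *de;
    while ((de = readdir(dp)) != NULL) {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;

        char path[4096];
        snprintf(path, sizeof path, "%s/%s", dir, de->d_name);
        print_dirs(path, trust_nlink);
    }
    closedir(dp);
}

int main(int argc, char **argv)
{
    print_dirs(argc > 1 ? argv[1] : ".", true);
    return 0;
}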
> 1) The assumption that st_nlink always has the property that it is >2
> and can be used to derive the number of subdirectories was never
> valid across all file system types

Yes, and Gnulib exploits the st_nlink assumption only on file systems where it is useful and/or known to work.

> 2) If you did descend into a file system which didn't support d_type,
> d_type would be DT_UNKNOWN instead of DT_REG or DT_DIR

Yes, and Gnulib doesn't use the optimization if d_type is DT_UNKNOWN.

> 3) Using DT_DIR means you can skip the stat check for all directory
> entries. If you are doing a recursive descent where you care about
> the name, you need to call readdir() on all of the directory entries
> anyway, so you will have access to d_type. If you are doing a recursive
> descent where you are checking on file ownership, you are doing the
> stat(2) anyway, so why not check S_ISDIR(st.st_mode) instead of blindly
> using the st_nlink hack?

No, you can do even better than that in some cases, if st_nlink works. Suppose we are implementing the equivalent of 'find . -type d'. If we come across a directory whose st_nlink == 2, then we don't need to readdir from the directory at all, much less stat its entries.

> 4) ... if your argument is what about legacy Unix code

There is more of that floating around than I'd like, yes. But I'm mostly worried about GNU code.

> Can you give me a pointer to the original bug report? I'm curious how
> things were misbehaving.

https://debbugs.gnu.org/cgi/bugreport.cgi?bug=27739

(In reply to Theodore Tso from comment #22)
> Paul is asserting that there is a vast amount of breakage because ext4
> can return an st_nlink value of 1 on a directory, to the extent that he
> believes we should withdraw the dir_nlink feature.

I think this gets my most recent proposal backwards. At the end of Comment 17, I proposed that the ext4 code act as if dir_nlink is always set. That's what the code has been doing for a decade anyway. All that's missing is documentation which says "the dir_nlink setting is irrelevant, and the file system always acts as if dir_nlink is set", or words to that effect.

Although in hindsight perhaps the dir_nlink flag should have been implemented properly, it wasn't, and there's little point to implementing it properly now: every application using ext4 must work with the current behavior anyway.