Bug 205957 - Ext4 64 bit hash breaks 32 bit glibc 2.28+
Summary: Ext4 64 bit hash breaks 32 bit glibc 2.28+
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: ext4 (show other bugs)
Hardware: ARM Linux
: P1 normal
Assignee: fs_ext4@kernel-bugs.osdl.org
URL: https://sourceware.org/bugzilla/show_...
Keywords:
Depends on:
Blocks:
 
Reported: 2019-12-24 14:07 UTC by aladjev.andrew@gmail.com
Modified: 2020-10-02 10:34 UTC (History)
6 users (show)

See Also:
Kernel Version: 5.3.16 (any version)
Tree: Mainline
Regression: No


Attachments
getdents64_x32 kernel patch (15.35 KB, patch)
2019-12-25 22:50 UTC, aladjev.andrew@gmail.com
Details | Diff
getdents64_x32 glibc patch (1.09 KB, patch)
2019-12-25 22:51 UTC, aladjev.andrew@gmail.com
Details | Diff
getdents64_x32 qemu patch (20.77 KB, patch)
2019-12-25 22:51 UTC, aladjev.andrew@gmail.com
Details | Diff
getdents64_x32 libseccomp patch (6.34 KB, patch)
2019-12-25 22:52 UTC, aladjev.andrew@gmail.com
Details | Diff
getdents64_x32 go patch (16.04 KB, patch)
2019-12-25 22:52 UTC, aladjev.andrew@gmail.com
Details | Diff
getdents64_x32 bildah patch (13.74 KB, patch)
2019-12-25 22:52 UTC, aladjev.andrew@gmail.com
Details | Diff
add ioctl(EXT4_IOC_SET_DIRLIMIT) to limit directory cookie size (2.45 KB, patch)
2019-12-28 11:20 UTC, Andreas Dilger
Details | Diff

Description aladjev.andrew@gmail.com 2019-12-24 14:07:36 UTC
Hello. Please see the following glibc issue: https://sourceware.org/bugzilla/show_bug.cgi?id=23960.

I am running arm system using qemu on host system with x86_64 kernel. I've received 64 bits value for "d_off" instead of 32 bits from "getdents64".

Than I've tried i386 instead of arm and it works fine. I've received 32 bits value for "d_off" from "getdents64".

I think that the problem is here
arch/x86/include/generated/uapi/asm/unistd_x32.h:
#define __NR_getdents64 (__X32_SYSCALL_BIT + 217)

arch/x86/include/asm/compat.h:
static inline bool in_x32_syscall(void)
{
#ifdef CONFIG_X86_X32_ABI
        if (task_pt_regs(current)->orig_ax & __X32_SYSCALL_BIT)
                return true;
#endif
        return false;
}

fs/ext4/dir.c:
static inline int is_32bit_api(void)
{
#ifdef CONFIG_COMPAT
        return in_compat_syscall();
#else
        return (BITS_PER_LONG == 32);
#endif
}

static inline loff_t hash2pos(struct file *filp, __u32 major, __u32 minor)
{
        if ((filp->f_mode & FMODE_32BITHASH) ||
            (!(filp->f_mode & FMODE_64BITHASH) && is_32bit_api()))
                return major >> 1;
        else
                return ((__u64)(major >> 1) << 32) | (__u64)minor;
}

I think that i386 makes a special syscall with "__X32_SYSCALL_BIT" enabled. This special bit makes "in_x32_syscall", "in_compat_syscall" and "is_32bit_api" to return true. This thing affects "hash2pos" and it works like "FMODE_32BITHASH" is enabled.

ARM has no such special system, it uses generic "in_compat_syscall".
include/linux/compat.h:
* For most but not all architectures, "am I in a compat syscall?" and
* "am I a compat task?" are the same question.  For architectures on which
* they aren't the same question, arch code can override in_compat_syscall.

I don't know how to fix this issue.

1. Is it possible to implement special arm syscalls in the same way as x86?

2. Is it possible to add special compatibility syscall like "__NR_compat_getdents64" that will make generic "in_compat_syscall" return true?

3. Is it possible to force enable FMODE_32BITHASH for arm syscalls in runtime in another way?

4. We can't shift "d_off >> 32" now, because we will have to restore "<< 32" in all third party applications (not only glibc). Is it possible to replace "((__u64)(major >> 1) << 32) | (__u64)minor" with "((__u64)minor << 32) | (__u64)(major >> 1)"? So we will just ignore minor overflow.

5. Is it possible to refactor ext4 and remove hash from pos completely? It provides so much pain.
Comment 1 dflogeras2 2019-12-24 15:09:25 UTC
I too have been dealing with this situation in a few Qemu+chroot environments I use for 32bit arm on a x86_64 host.  I have asked about it via this Qemu bug:

https://bugs.launchpad.net/qemu/+bug/1805913


as well as the glibc discussion in this bug may provide more insight:

https://sourceware.org/bugzilla/show_bug.cgi?id=23960


and finally this older LKML discussion has some more information:

https://lkml.org/lkml/2018/12/27/155
Comment 2 Andreas Dilger 2019-12-24 18:16:37 UTC
IMHO, it is broken to be calling a 64-bit interface like getdents64() and then be unhappy when it is returning 64-bit values.  As previously stated, it would be possible to add a "32bitapi" mount option to force ext4 to always return 32-bit offset values, as is done with NFS. 

The alternative is for QEMU's telldir() to see that d_off is a 64-bit value, but is exporting a 32-bit interface and downshift the offset to fit into a 32-bit field. It would store a '64BITOFFSET' flag in the file descriptor in this case, and if seekdir() is called on that fd it will upshift the offset to a 64-bit value again.  That will lose the low bits of the offset, but that is unlikely to be noticeable until there are more than 65000 entries in a directory.  Since the hash values are uniformly distributed, they will almost immediately exceed 32 bits, and telldir() is uncommon for use after the first entry is read, so this detection will work reliably.
Comment 3 aladjev.andrew@gmail.com 2019-12-25 22:50:38 UTC
Created attachment 286447 [details]
getdents64_x32 kernel patch
Comment 4 aladjev.andrew@gmail.com 2019-12-25 22:51:04 UTC
Created attachment 286449 [details]
getdents64_x32 glibc patch
Comment 5 aladjev.andrew@gmail.com 2019-12-25 22:51:35 UTC
Created attachment 286451 [details]
getdents64_x32 qemu patch
Comment 6 aladjev.andrew@gmail.com 2019-12-25 22:52:11 UTC
Created attachment 286453 [details]
getdents64_x32 libseccomp patch
Comment 7 aladjev.andrew@gmail.com 2019-12-25 22:52:30 UTC
Created attachment 286455 [details]
getdents64_x32 go patch
Comment 8 aladjev.andrew@gmail.com 2019-12-25 22:52:49 UTC
Created attachment 286457 [details]
getdents64_x32 bildah patch
Comment 9 aladjev.andrew@gmail.com 2019-12-25 22:53:13 UTC
I've created heavy (but easy) bunch of patches that fixes issue. This is the order of right patches applyment (for gentoo):

+----------------------------------------------------------------------------+
|                                                                            |
| amd64 host:            +-------------------------------------------------+ |
|                        |                                                 | |
| linux-headers: kernel  | amd64 container:      +-----------------------+ | |
| gentoo-sources: kernel |                       |                       | | |
| libseccomp             | linux-headers: kernel | arm container (qemu): | | |
| go                     | qemu                  |                       | | |
| buildah                |                       | linux-headers: kernel | | |
|                        |                       | glibc                 | | |
|                        |                       |                       | | |
|                        |                       +-----------------------+ | |
|                        |                                                 | |
|                        +-------------------------------------------------+ |
|                                                                            |
+----------------------------------------------------------------------------+

You can find full example here https://github.com/andrew-aladev/test-images/tree/master/cross/arm-unknown-linux-gnueabi. Now I am able to emerge several application inside arm container (qemu), so glibc with "getdents64_x32" works perfect.

I understand that kernel developers don't like additional syscalls, so they won't accept such kernel patch. But this syscall is the easiest way for me to provide a proof of concept.

How to understand these patches? Please read "glibc.patch". Our amd64 kernel is the common kernel for amd64, i386 and arm (via qemu). amd64 is using "getdents64" and arm (without LFS) is using "getdents64_x32". We need to have 2 syscalls simultaneously: "getdents64" and "getdents64_x32" or maybe single syscall with some "x32" flag. I think this will be the most reliable solution.

I want to ask glibc and kernel developers to cooperate and find the right solution. This is important not only for arm, it looks like this issue affects any x32 abi.

Thank you.
Comment 10 Andreas Dilger 2019-12-28 07:50:36 UTC
IMHO, adding a new syscall for this is a lot of complexity that could be avoided.  Using an ioctl() or fadvise() to set the "32-bitness" of the file descriptor would seem like a simpler implementation that could be handled directly by an ext4 patch rather than having to change every architecture just to pass this flag.

Then, in the cases that QEMU is running in the confusing "I want 32-bit values returned from a 64-bit system call" mode, it would call the ioctl(fd, FS_IOC_32BITHASH) once before calling getdents64(fd, ...).
Comment 11 aladjev.andrew@gmail.com 2019-12-28 09:00:42 UTC
I think ext4 patch for qemu will be too weak solution, kernel wants to be a black box for users and hide x32 implementation details. It will save opportunity to maneuver for ext4 developers in future.
Comment 12 Andreas Dilger 2019-12-28 11:20:10 UTC
Created attachment 286491 [details]
add ioctl(EXT4_IOC_SET_DIRLIMIT) to limit directory cookie size

Totally untested prototype patch to add ioctl(fd, EXT4_IOC_SET_DIRLIMIT, 32) that could be used by glibc/QEMU to force ext4 to return a 32-bit directory offset cookie for the 64-bit getdirent64() syscall on a per-fd basis.

The glibc patch would look something like the following, though this could potentially be done only once per open:

 +  // It is affected by "__USE_FILE_OFFSET64" and "__USE_LARGEFILE64".
 +  if (sizeof (outp->u.d_off) != sizeof (inp->k.d_off))
 +    (void *) ioctl (fd, EXT4_IOC_SET_DIRLIMIT, 32); // ignore error return
 +
    retval = INLINE_SYSCALL_CALL (getdents64, fd, kbuf, kbytes);


If glibc/QEMU want ext4 to be totally transparent w.r.t. the behavior of the syscalls, then glibc/QEMU would need set the kernel task state so that is_32bit_api()->in_compat_syscall() return true and hash2pos() and related functions can determine this directly to return a 32-bit pos in a manner similar to x86.  That is outside my area of expertise, so I can't really suggest how it might be done.  It would also be possible to add some different logic to is_32bit_api() for ARM/etc. so that it is more transparent to userspace, but I don't know what it should check.
Comment 13 aladjev.andrew@gmail.com 2020-01-05 21:47:47 UTC
I am trying to work around mips issues and find this one https://www.linux-mips.org/archives/linux-mips/2012-11/msg00031.html. This issue from 2012 is the complete copy of our current issue. I can say that old issue with mips is still alive: readdir is still broken on mips n32.
Comment 14 John Paul Adrian Glaubitz 2020-01-05 22:08:52 UTC
Thanks for working on this! I think the suffix _x32 might be misleading as this is the name of the 32-bit ABI on x86_64, see: https://en.wikipedia.org/wiki/X32_ABI.
Comment 15 aladjev.andrew@gmail.com 2020-01-08 23:11:27 UTC
I've investigated mips related linux-user qemu and glibc source and mosaic looks complete.

See qemu:
https://github.com/qemu/qemu/blob/master/linux-user/syscall.c#L9465-L9536
emulates getdents via getdents or getdents64, but don't emulate getdents64 using getdents.

Is it possible to emulate getdents64 using getdents? Yes, it was implemented in glibc for mips 64:
https://github.com/bminor/glibc/blob/master/sysdeps/unix/sysv/linux/mips/mips64/getdents64.c#L49-L139
This emulation has been done because getdents64 was not available for some ancient kernels on mips n64.

But this emulation should be a part of kernel. It has almost nothing to do with qemu or glibc. Qemu should just launch compat syscall instead of regular and that's it. Please see the following easy example:
https://github.com/torvalds/linux/blob/master/fs/read_write.c#L322-L332

SYSCALL_DEFINE3(lseek, unsigned int, fd, off_t, offset, unsigned int, whence)
{
	return ksys_lseek(fd, offset, whence);
}

#ifdef CONFIG_COMPAT
COMPAT_SYSCALL_DEFINE3(lseek, unsigned int, fd, compat_off_t, offset, unsigned int, whence)
{
	return ksys_lseek(fd, offset, whence);
}
#endif

We need to declear compat syscall in the same way, like:

COMPAT_SYSCALL_DEFINE3(getdents64, unsigned int, fd, struct compat_linux_dirent64 __user *, dirent, unsigned int, count)

"COMPAT_SYSCALL_DEFINE" will guarantee that "in_compat_syscall" will return true and ext4 will use 32 bit hash.

Than we need to add this syscall to https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L361-L402
Comment 16 aladjev.andrew@gmail.com 2020-01-09 20:02:26 UTC
We don't need to create a patch for kernel about this compat syscall, because it already supports "__X32_SYSCALL_BIT". We need just to enable it using "CONFIG_X86_X32=y" kernel config. Recompiled kernel will expose all x86 related syscalls in x32 mode.

I've tested that "__X32_SYSCALL_BIT + 217" works completely the same as my "SYS_getdents64_x32". Function "in_x32_syscall" works perfect.

We can patch glibc to redirect getdents64 syscall to x32 compatible one. My previous assumption was wrong. Patching qemu will be very bad solution, because it means that large file support will be broken in all applications, not only glibc.

For now only glibc wants to convert dirent64 to dirent just by copying fields. Regular application don't want to emulate getdents using getdents64 and it will not touch dirent64. So we shouldn't interfere in qemu syscalls.
Comment 17 Danny Milosavljevic 2020-10-02 10:03:42 UTC
The right place to fix this is in the distributions enabling LFS support, not in ext4, not in qemu (which is completely blameless and should not be changed) and only marginally in glibc.

See also my added comments on https://sourceware.org/bugzilla/show_bug.cgi?id=23960 and also on https://lists.gnu.org/archive/html/guix-patches/2020-10/msg00059.html , the latter is where the actual testing goes on.

Summary:

* glibc calls getdents64 and then is surprised (and fails when _FILE_OFFSET_BITS < 64) when it gets a 64 bit result back
* The distribution does not enable _FILE_OFFSET_BITS=64
* It is possible for anyone to make it return a 64 bit result by writing a FUSE filesystem, without being root
* readdir is thus unreliable in these environments, and it depends on filesystem internals when the first value > 2**32 is returned, at which point readdir stops reading and sets errno.
* Nobody reads errno in those cases.  Even if they did, what are they supposed to do in those cases?
* X86_32 syscall emulation does not exist on aarch64 and other 64 bit archs--they couldn't use the kernel workaround even where it does exist.

Solution:

* The distribution should globally enable _FILE_OFFSET_BITS=64

Extra:

* Now (after 15 years of 64 bit) glibc should be made to emit a warning or error if users try to use file stuff (like readdir) and _FILE_OFFSET_BITS!=64.

There is no need to fiddle with ext4, qemu or kernel syscalls in order to fix this problem.

If distributions didn't enable LFS (large file support) in the last 15 years, they are weird.  Everyone has drives > 4 GiB, usually > 1000 GiB, nowadays.  But without LFS users can't even create a file > 4 GiB on those.  How are there still distributions which actually have this problem?
Comment 18 John Paul Adrian Glaubitz 2020-10-02 10:34:52 UTC
> The right place to fix this is in the distributions enabling LFS support, not
> in ext4, not in qemu (which is completely blameless and should not be
> changed) and only marginally in glibc.

(...)

> If distributions didn't enable LFS (large file support) in the last 15 years,
> they are weird.  Everyone has drives > 4 GiB, usually > 1000 GiB, nowadays. 
> But without LFS users can't even create a file > 4 GiB on those.  How are
> there still distributions which actually have this problem?

You are absolutely right but the real world isn't perfect, unfortunately and while I have done my best to find packages affected by this and force them to use _FILE_OFFSET_BITS=64, I'm still occasionally running into packages affected by this bug which is why I'm still patching glibc locally.

Note You need to log in before you can comment on or make changes to this bug.