After upgrading from a Linux distro with kernel 4.4 to a distro running kernel 4.9, I have found that I can no longer completely copy files from a UFS filesystem, in this case one created by FreeBSD 11.0. On distros with older kernels I use the following mount command, and UFS file copies work correctly:

sudo mount -t ufs -o ufstype=ufs2 -o ro /dev/sdc2 /mnt/adisk

When I use this mount command on the distro with kernel 4.9, the UFS filesystem mounts fine, but when I use mc, cp, or pcmanfm to copy from the UFS filesystem, any file over 2.1 GB is truncated to 2.1 GB. There are no relevant messages in dmesg, and when files are copied from a terminal, no error message appears when the truncation occurs. The problem appears to be present in kernel 4.10 as well. I have tried other Linux distros with kernel 4.9 and they have the same issue. Thank you.
I can duplicate this bug too, by copying a 2.5 GB file from a FreeBSD 11.0 ufs2 filesystem to a Linux jfs filesystem.

No bug:
  Slackware 14.2     Linux 4.4.70   copy from ufs2 to jfs works fine
  Slackware current  Linux 4.4.70   copy from ufs2 to jfs works fine

Bug:
  Slackware 14.2     Linux 4.9.30   copy from ufs2 to jfs fails
  Slackware current  Linux 4.9.30   copy from ufs2 to jfs fails
  Slackware current  Linux 4.12-rc3 copy from ufs2 to jfs and to ext4 fails

The first thing I wonder is: is the kernel configuration for Slackware current Linux 4.9.30 correct?
To clarify, the bug truncates a large file down to 2147483648 bytes = 2**31 bytes = 0x80000000 bytes.

After further testing I find that Linux 4.7 does not have the bug, but Linux 4.8 does:

No bug:
  Slackware current  Linux 4.7.9   copy from ufs2 to jfs works fine

Bug:
  Slackware current  Linux 4.8.9   copy from ufs2 to jfs fails
The bug began with Linux 4.8.4:

No bug:
  Slackware current  Linux 4.8.3   copy from ufs2 to jfs works fine

Bug:
  Slackware current  Linux 4.8.4   copy from ufs2 to jfs fails
The bug is in this patch to 4.8.4:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-4.8.y&id=3d549dcfbbb0ecdaa571431a27ee5da9f2466716

If I take a 4.8.3 kernel and apply just this single patch, the bug appears; if I then remove the patch, the bug disappears again. Here is a short version of the patch:

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1687,6 +1687,10 @@ static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
 	unsigned int prev_offset;
 	int error = 0;
 
+	if (unlikely(*ppos >= inode->i_sb->s_maxbytes))
+		return -EINVAL;
+	iov_iter_truncate(iter, inode->i_sb->s_maxbytes);
+
 	index = *ppos >> PAGE_SHIFT;
 	prev_index = ra->prev_pos >> PAGE_SHIFT;
 	prev_offset = ra->prev_pos & (PAGE_SIZE-1);
To test, I create a file of more than 2.1 GB on FreeBSD, then reboot into Linux and copy the file to the /tmp/ directory. The error message from the cp command on Linux 4.8.4 is:

cp: error reading 'usr-local.tar': Invalid argument
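For anyone else reproducing this, here is a small user-space checker I put together (just a sketch of mine, not from the kernel tree or any test suite) that copies a file with plain read()/write() and compares the byte count against st_size, so a silent truncation shows up explicitly:

---------------------------------------------------------------------------
/* copycheck.c - hypothetical reproducer sketch.  Copies SRC to DST with
 * read()/write() and reports whether the number of bytes copied matches
 * st_size of the source. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
        char buf[1 << 16];
        struct stat st;
        long long copied = 0;
        ssize_t n;
        int in, out;

        if (argc != 3) {
                fprintf(stderr, "usage: %s SRC DST\n", argv[0]);
                return 1;
        }
        in = open(argv[1], O_RDONLY);
        out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (in < 0 || out < 0 || fstat(in, &st) < 0) {
                perror("open/fstat");
                return 1;
        }
        while ((n = read(in, buf, sizeof(buf))) > 0) {
                if (write(out, buf, n) != n) {
                        perror("write");
                        return 1;
                }
                copied += n;
        }
        if (n < 0)
                perror("read");         /* 4.8.4 reports EINVAL here */
        printf("copied %lld of %lld bytes%s\n", copied, (long long)st.st_size,
               copied == st.st_size ? "" : "  <-- TRUNCATED");
        close(in);
        close(out);
        return copied == st.st_size ? 0 : 1;
}
---------------------------------------------------------------------------

On the affected kernels the count stops near 2 GiB for large files copied from ufs2, even though no error is reported.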
More clarification. I am testing with 64-bit Linux kernels. The troublesome patch was written for a 32-bit kernel bug.
Linus Torvalds changed the patch in December 2016:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/mm/filemap.c?h=linux-4.8.y&id=23d179acb363b0670ac698c342ec9a18b3d8759a

So now, when testing with Linux 4.9.30 and later kernels, the copy command fails quietly: there is no "Invalid argument" error message and no error message/condition at all.
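For reference, the essence of that change (as I read it from comparing the 4.8.4 hunk above with the current code, so treat the exact lines as my paraphrase rather than the literal commit) is that the early check now reports EOF instead of an error, which is why the failure became silent:

 	if (unlikely(*ppos >= inode->i_sb->s_maxbytes))
-		return -EINVAL;
+		return 0;
 	iov_iter_truncate(iter, inode->i_sb->s_maxbytes);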
Thanks for all of the footwork, Richard. Since this bug causes corrupted data from the truncated copy operation, it seems to be a fairly important issue. Hopefully it will get some attention soon.
Besides removing the problem patch, another way of fixing the problem is to set a higher default value for the maximum number of bytes per file in the ufs super block. The current default is 2 GB (MAX_NON_LFS), the same value as the truncated file size. Many filesystems use MAX_LFS_FILESIZE, and if I also use it for the ufs filesystem, the bug goes away:

---------------------------------------------------------------------------
--- a/fs/ufs/super.c.orig	2017-05-28 17:20:53.000000000 -0700
+++ b/fs/ufs/super.c	2017-06-03 17:08:33.340100262 -0700
@@ -812,9 +812,8 @@ static int ufs_fill_super(struct super_b
 	uspi->s_dirblksize = UFS_SECTOR_SIZE;
 	super_block_offset=UFS_SBLOCK;
 
-	/* Keep 2Gig file limit. Some UFS variants need to override
-	   this but as I don't know which I'll let those in the know loosen
-	   the rules */
+	sb->s_maxbytes = MAX_LFS_FILESIZE;
+
 	switch (sbi->s_mount_opt & UFS_MOUNT_UFSTYPE) {
 	case UFS_MOUNT_UFSTYPE_44BSD:
 		UFSD("ufstype=44bsd\n");
---------------------------------------------------------------------------
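For context, here is my understanding of the two limits on a 64-bit build, as a tiny user-space printout. The definitions are my reading of include/linux/fs.h for kernels of this era, so treat them as an assumption and double-check against your tree:

---------------------------------------------------------------------------
/* Prints the two file-size limits as I believe they are defined for 64-bit
 * kernels of this era (assumption - verify against include/linux/fs.h). */
#include <stdio.h>

#define MAX_NON_LFS       ((1UL << 31) - 1)          /* old ufs default, ~2 GiB   */
#define MAX_LFS_FILESIZE  0x7fffffffffffffffLL       /* limit most filesystems use */

int main(void)
{
        printf("MAX_NON_LFS      = %lu bytes\n", (unsigned long)MAX_NON_LFS);
        printf("MAX_LFS_FILESIZE = %lld bytes\n", (long long)MAX_LFS_FILESIZE);
        return 0;
}
---------------------------------------------------------------------------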
Created attachment 256853 [details]
patch to fs/ufs/super.c for maximum bytes per file

The patch changes the default maximum file size in a ufs filesystem.
Wow, thanks to Linus Torvalds! The suggested patch is now in Linux 4.12-rc4:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=239e250e4acbc0104d514307029c0839e834a51a

It may not be the best patch, but for now it is better than no patch.
I just verified that the bug exists for NetBSD 7.1 ufs2, OpenBSD 6.0 ufs1, and Bitrig 1.0 ufs2 filesystems, and the patch fixes them for now.

I would really like to see the original Linux 4.8.4 patch, "vfs,mm: fix return value of read() at s_maxbytes", removed and replaced with a better patch. It seems bizarre to me that a file read function, do_generic_file_read(), is limited by a maximum file size limit even before a read is attempted.
On Mon, Jun 5, 2017 at 11:00 AM, <bugzilla-daemon@bugzilla.kernel.org> wrote:
>
> It seems bizarre to me that a file read function, do_generic_file_read(), is
> limited by a maximum file size limit even before a read is attempted.

We do need to protect against the page cache index wrapping case.

Now, we could use other methods than s_maxbytes for this code. For
example, we could just use the MAX_LFS_FILESIZE constant here, because
what we're protecting against isn't really a filesystem limit, but a
VFS limit (well, not even a real VFS limit, it's literally just a page
cache indexing limit on 32-bit hosts - you could have filesystem
methods that use offsets past that, and in fact /dev/mem does, but
doesn't use the page cache).

At the same time, s_maxbytes really _is_ supposed to describe the
filesystem limits, so this really was a UFS bug, not a misfeature of
do_generic_file_read() itself. Of course the generic routines will do
the wrong things if the filesystems don't fill in the data structures
that describe them correctly.

We arguably should also check the inode size before even trying to
populate the page cache, because as it is now we're just pointlessly
generating zero pages for the "read past the end of a file", but
that's a separate issue from the index wrap-around (we'd also have to
check the inode size *again* after getting the page, in order to
avoid a race with truncate).

So this code could easily merit some more changes, but I think the UFS
bug was the real issue.

               Linus
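To make that alternative concrete, the idea would be to keep the wrap-around guard but bound it with the page-cache limit instead of the per-filesystem limit. This is purely an illustration of the suggestion above, not a merged patch; the lines are my hypothetical variant of the 4.8.4 hunk:

 	if (unlikely(*ppos >= MAX_LFS_FILESIZE))
 		return 0;
 	iov_iter_truncate(iter, MAX_LFS_FILESIZE);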
On Mon, Jun 5, 2017 at 11:44 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Mon, Jun 05, 2017 at 11:19:05AM -0700, Linus Torvalds wrote:
>
>> We arguably should also check the inode size before even trying to
>> populate the page cache, because as it is now we're just pointlessly
>> generating zero pages for the "read past the end of a file", but
>> that's a separate issue from the index wrap-around (we'd also have to
>> check the inode size *again* after getting the page, in order to
>> avoid a race with truncate).
>
> Unless you are willing to issue ->getattr() before reading a page,
> you can't do that - remote filesystems would have obvious problems.
> Inode size checks like that belong in the filesystem, not in
> do_generic_file_read().

We already check for the inode size, just look a few lines lower down
in "page_ok".

We do *not* try to give any remote size consistency guarantees in
plain read/write. The normal remote filesystem consistency guarantees
are purely open/close consistency.

Filesystems can add their own additional consistency guarantees in
their read routines (before/after calling the generic routines, or by
bypassing the generic ones entirely).

But no, we're not doing extra callbacks down to filesystems when we
have pages cached. Not here in do_generic_file_read(), and not in the
mmap path.

If somebody has some insane "total consistency" model, they can do
their own thing entirely. And nobody will use that shit-for-brains
filesystem because it's going to perform horribly badly.

               Linus
On Mon, Jun 5, 2017 at 12:05 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> We already check for the inode size, just look a few lines lower down
>> in "page_ok".
>
> Umm... Wasn't that only for uptodate pages? Had been a while since
> I looked at that code...

Right. But it doesn't really change the synchronization (or lack
thereof). The inode size is definitely meaningful by the time we hit
do_generic_file_read().

For an example of the same issue, see filemap_fault(), and notice how
we check the size of the inode *before* reading the page (and return
SIGBUS if it's past the size). We then _re-check_ the inode size
afterwards, as protection against races with truncate().

Although I wonder how relevant that is any more - I think the mapping
check should be sufficient, so we're doing a bit of a double check
there.

               Linus
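As a side illustration of the filemap_fault() behaviour mentioned above, here is a small user-space sketch of mine (not kernel code): touching a mapped page that lies entirely past the end of the file raises SIGBUS, while a byte inside the file is readable. The file path is just a placeholder I chose for the demo:

---------------------------------------------------------------------------
/* sigbus_demo.c - sketch: a fault on a mapped page wholly beyond EOF
 * delivers SIGBUS, matching the filemap_fault() size check discussed. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <signal.h>
#include <setjmp.h>
#include <sys/mman.h>

static sigjmp_buf env;

static void on_sigbus(int sig)
{
        (void)sig;
        siglongjmp(env, 1);
}

int main(void)
{
        long page = sysconf(_SC_PAGESIZE);
        int fd = open("/tmp/sigbus_demo", O_RDWR | O_CREAT | O_TRUNC, 0644);
        char *p;

        if (fd < 0 || write(fd, "x", 1) != 1) {
                perror("setup");
                return 1;
        }
        p = mmap(NULL, 2 * page, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        signal(SIGBUS, on_sigbus);

        printf("first byte: %c\n", p[0]);       /* inside the file: fine */
        if (sigsetjmp(env, 1) == 0)
                printf("byte on 2nd page: %c\n", p[page]);  /* past EOF */
        else
                printf("got SIGBUS touching the page past EOF\n");

        munmap(p, 2 * page);
        close(fd);
        unlink("/tmp/sigbus_demo");
        return 0;
}
---------------------------------------------------------------------------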
On Mon, Jun 05, 2017 at 11:50:55AM -0700, Linus Torvalds wrote:
> On Mon, Jun 5, 2017 at 11:44 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Mon, Jun 05, 2017 at 11:19:05AM -0700, Linus Torvalds wrote:
> >
> >> We arguably should also check the inode size before even trying to
> >> populate the page cache, because as it is now we're just pointlessly
> >> generating zero pages for the "read past the end of a file", but
> >> that's a separate issue from the index wrap-around (we'd also have to
> >> check the inode size *again* after getting the page, in order to
> >> avoid a race with truncate).
> >
> > Unless you are willing to issue ->getattr() before reading a page,
> > you can't do that - remote filesystems would have obvious problems.
> > Inode size checks like that belong in the filesystem, not in
> > do_generic_file_read().
>
> We already check for the inode size, just look a few lines lower down
> in "page_ok".

Umm... Wasn't that only for uptodate pages? Had been a while since
I looked at that code...
On Mon, Jun 05, 2017 at 11:19:05AM -0700, Linus Torvalds wrote:
> At the same time, s_maxbytes really _is_ supposed to describe the
> filesystem limits, so this really was a UFS bug, not a misfeature of
> do_generic_file_read() itself. Of course the generic routines will do
> the wrong things if the filesystems don't fill in the data structures
> that describe them correctly.
>
> We arguably should also check the inode size before even trying to
> populate the page cache, because as it is now we're just pointlessly
> generating zero pages for the "read past the end of a file", but
> that's a separate issue from the index wrap-around (we'd also have to
> check the inode size *again* after getting the page, in order to
> avoid a race with truncate).

Unless you are willing to issue ->getattr() before reading a page,
you can't do that - remote filesystems would have obvious problems.
Inode size checks like that belong in the filesystem, not in
do_generic_file_read().

> So this code could easily merit some more changes, but I think the UFS
> bug was the real issue.

More than one. I'm digging through the ->i_blocks mess right now; IMO
we should (both for UFS and ext2) go with static tree-imposed limits
and deal with ->i_blocks overflows in balloc. IOW, treat an attempt to
allocate 2Tb of data+metadata the same as we would e.g. hitting a
quota limit. In any case, we need to restore ->i_blocks updates ;-/
(In reply to Linus Torvalds from comment #13)
> On Mon, Jun 5, 2017 at 11:00 AM, <bugzilla-daemon@bugzilla.kernel.org>
> wrote:
> >
> > It seems bizarre to me that a file read function, do_generic_file_read(),
> > is limited by a maximum file size limit even before a read is attempted.
>
> We do need to protect against the page cache index wrapping case.
>
> Now, we could use other methods than s_maxbytes for this code. For
> example, we could just use the MAX_LFS_FILESIZE constant here, because
> what we're protecting against isn't really a filesystem limit, but a
> VFS limit (well, not even a real VFS limit, it's literally just a page
> cache indexing limit on 32-bit hosts - you could have filesystem
> methods that use offsets past that, and in fact /dev/mem does, but
> doesn't use the page cache).
>
> At the same time, s_maxbytes really _is_ supposed to describe the
> filesystem limits, so this really was a UFS bug, not a misfeature of
> do_generic_file_read() itself. Of course the generic routines will do
> the wrong things if the filesystems don't fill in the data structures
> that describe them correctly.
>
> We arguably should also check the inode size before even trying to
> populate the page cache, because as it is now we're just pointlessly
> generating zero pages for the "read past the end of a file", but
> that's a separate issue from the index wrap-around (we'd also have to
> check the inode size *again* after getting the page, in order to
> avoid a race with truncate).
>
> So this code could easily merit some more changes, but I think the UFS
> bug was the real issue.

Thanks for this, it makes more sense now.

I noticed during testing with Linux 4.8.4 and the original "vfs,mm: fix return value of read() at s_maxbytes" patch that the cp command got the EINVAL return code (Invalid argument) and exited with a nice error message, so as a user I knew that the copy had failed. When testing with 4.9.30, EINVAL was no longer returned and the copy did not appear to fail, but it did fail quietly.

Maybe the return EINVAL could be kept if the original test were changed to a greater-than instead of a greater-than-or-equal?

--- filemap.c.orig	2017-06-04 16:47:43.000000000 -0700
+++ filemap.c	2017-06-06 07:34:46.131824320 -0700
@@ -1785,8 +1785,8 @@ static ssize_t do_generic_file_read(stru
 	unsigned int prev_offset;
 	int error = 0;
 
-	if (unlikely(*ppos >= inode->i_sb->s_maxbytes))
-		return 0;
+	if (unlikely(*ppos > inode->i_sb->s_maxbytes))
+		return -EINVAL;
 	iov_iter_truncate(iter, inode->i_sb->s_maxbytes);
 
 	index = *ppos >> PAGE_SHIFT;

This would handle the case of *ppos being equal to the maximum FAT file size. (I tested this patch, it seems to work: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1649342) If a user tries to read past the maximum FAT file size, he gets an "Invalid argument" error. And if any other current or future filesystem has a bad s_maxbytes, then it will be seen.
On Tue, Jun 6, 2017 at 8:59 AM, <bugzilla-daemon@bugzilla.kernel.org> wrote:
>
> Maybe the return EINVAL could be kept if the original test were changed to a
> greater-than instead of a greater-than-or-equal?

No. The POSIX semantics are very clear: a read past the end of a file
returns EOF (ie zero), not an error.

               Linus
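A tiny user-space illustration of that rule (my own sketch, with a placeholder file path): read() at or past the end of a file returns 0 rather than setting an error.

---------------------------------------------------------------------------
/* read_eof.c - sketch showing the POSIX rule cited above: read() at or past
 * end-of-file returns 0 (EOF), it does not fail with an error. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        char buf[16];
        int fd = open("/tmp/eof_demo", O_RDWR | O_CREAT | O_TRUNC, 0644);
        ssize_t n;

        if (fd < 0 || write(fd, "hello", 5) != 5) {
                perror("setup");
                return 1;
        }
        /* Seek far past the 5-byte end of the file and read. */
        lseek(fd, 1024 * 1024, SEEK_SET);
        n = read(fd, buf, sizeof(buf));
        printf("read past EOF returned %zd (expected 0, no error)\n", n);

        close(fd);
        unlink("/tmp/eof_demo");
        return n == 0 ? 0 : 1;
}
---------------------------------------------------------------------------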