Bug 195721

Summary: [UFS] Large files are truncated when copying from UFS filesystem
Product: File System
Reporter: Will B (will.brokenbourgh2877)
Component: Other
Assignee: fs_other
Status: NEW
Severity: high
CC: richard, torvalds, will.brokenbourgh2877
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 4.9, 4.10
Subsystem:
Regression: No
Bisected commit-id:
Attachments: patch to fs/ufs/super.c for maximum bytes per file

Description Will B 2017-05-11 16:19:50 UTC
After upgrading from a Linux distro with kernel 4.4 to a distro running kernel 4.9, I have found that I can no longer completely copy files from a UFS filesystem, in this case one created by FreeBSD 11.0.

I use the following mount command on distros with older kernels, and UFS file copies work correctly:

    sudo mount -t ufs -o ufstype=ufs2 -o ro /dev/sdc2 /mnt/adisk

When I use this mount command on the distro with kernel 4.9, the UFS filesystem mounts fine, but when I use mc, cp or pcmanfm to copy from it, any file over 2.1 GB is truncated to 2.1 GB.

There are no relevant messages in dmesg, and when files are copied from a terminal, no error message appears when the truncation occurs.

This problem appears to be present in kernel 4.10, as well.  I have tried other Linux distros with kernel 4.9 and they also have this issue.

Thank you.
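
An easy way to confirm the truncation is to read the file back and compare the byte count against st_size; the helper below is a minimal sketch of that (hypothetical code added for illustration, not something from the original report). On an affected kernel, a file larger than about 2 GiB on the UFS mount reads short even though stat() reports the full size.

---------------------------------------------------------------------------
/* check_truncation.c -- hypothetical helper: read a file to EOF and
 * compare the byte count against st_size. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        struct stat st;
        if (fstat(fd, &st) < 0) {
                perror("fstat");
                return 1;
        }

        static char buf[1 << 20];
        long long total = 0;
        ssize_t n;

        while ((n = read(fd, buf, sizeof(buf))) > 0)
                total += n;
        if (n < 0)
                perror("read");        /* 4.8.4 reports EINVAL; 4.9+ just stops */

        printf("st_size = %lld, bytes read = %lld%s\n",
               (long long)st.st_size, total,
               total < st.st_size ? "  <-- truncated" : "");
        close(fd);
        return 0;
}
---------------------------------------------------------------------------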
Comment 1 Richard Narron 2017-05-31 19:16:45 UTC
I can duplicate this bug too by copying a 2.5 GB file from a FreeBSD 11.0 ufs2 filesystem to a Linux jfs filesystem.

No bug:
   Slackware 14.2    Linux 4.4.70 copy ufs2 works fine to jfs
   Slackware current Linux 4.4.70 copy ufs2 works fine to jfs

Bug:
   Slackware 14.2    Linux 4.9.30   copy ufs2 fails to jfs
   Slackware current Linux 4.9.30   copy ufs2 fails to jfs
   Slackware current Linux 4.12-rc3 copy ufs2 fails to jfs and to ext4

The first thing I wonder is: 

  Is the kernel configuration for Slackware current Linux-4.9.30 correct?
Comment 2 Richard Narron 2017-06-01 05:11:30 UTC
To clarify, the bug truncates a large file down to 2147483648 bytes = 2**31 bytes = x'80000000' bytes.

After further testing I find that Linux 4.7 does not have the bug, but Linux 4.8 does:

No Bug:
  Slackware current Linux 4.7.9 copy ufs2 works fine to jfs

Bug:
  Slackware current Linux 4.8.9 copy ufs2 fails to jfs
Comment 3 Richard Narron 2017-06-01 20:08:03 UTC
The bug began with Linux 4.8.4

No Bug:
  Slackware current Linux 4.8.3 copy ufs2 works fine to jfs

Bug:
  Slackware current Linux 4.8.4 copy ufs2 fails to jfs
Comment 4 Richard Narron 2017-06-02 00:38:44 UTC
The bug is in this patch to 4.8.4:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-4.8.y&id=3d549dcfbbb0ecdaa571431a27ee5da9f2466716

If I take a 4.8.3 kernel and just apply the single patch, then the bug appears.

If I remove the patch then the bug disappears from 4.8.3.

Here is a short version of the patch:

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1687,6 +1687,10 @@ static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
        unsigned int prev_offset;
        int error = 0;

+       if (unlikely(*ppos >= inode->i_sb->s_maxbytes))
+               return -EINVAL;
+       iov_iter_truncate(iter, inode->i_sb->s_maxbytes);
+
        index = *ppos >> PAGE_SHIFT;
        prev_index = ra->prev_pos >> PAGE_SHIFT;
        prev_offset = ra->prev_pos & (PAGE_SIZE-1);
--
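
With UFS still advertising the old ~2 GiB limit in sb->s_maxbytes (see comment 9 below), this check stops a copy of a bigger file at the 2**31 bytes measured in comment 2: with -EINVAL in 4.8.4, silently in 4.9 and later. A rough user-space model of the arithmetic (hypothetical code, only for illustration; MAX_NON_LFS is the kernel's 2 GiB - 1 constant that UFS was using as its default s_maxbytes):

---------------------------------------------------------------------------
#include <stdio.h>

#define MAX_NON_LFS  ((1UL << 31) - 1)   /* the old ~2 GiB default s_maxbytes */

/* Models one read() call of 'want' bytes at offset 'pos'. */
static long long model_read(long long pos, long long want, long long s_maxbytes)
{
        if (pos >= s_maxbytes)
                return 0;   /* 4.9+ behaves like EOF here; 4.8.4 returned -EINVAL */
        /* (iov_iter_truncate(iter, s_maxbytes) additionally caps a single
         *  call's count at s_maxbytes, which never matters for cp-sized reads.) */
        return want;
}

int main(void)
{
        long long file_size = 2500LL * 1000 * 1000;   /* a 2.5 GB test file */
        long long pos = 0, n;

        while (pos < file_size &&
               (n = model_read(pos, 1 << 20, MAX_NON_LFS)) > 0)
                pos += n;

        printf("copy stops at %lld of %lld bytes\n", pos, file_size);
        return 0;
}
---------------------------------------------------------------------------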
Comment 5 Richard Narron 2017-06-02 17:06:24 UTC
To test, I create a file of more than 2.1 GB on FreeBSD, then reboot into Linux and copy the file to the /tmp/ directory.

The error message from the cp command on Linux 4.8.4 is this:

cp: error reading 'usr-local.tar': Invalid argument
Comment 6 Richard Narron 2017-06-02 23:17:29 UTC
More clarification.  I am testing with 64-bit Linux kernels.  The troublesome patch was written for a 32-bit kernel bug.
Comment 7 Richard Narron 2017-06-04 01:02:33 UTC
Linus Torvalds changed the patch in December 2016:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/mm/filemap.c?h=linux-4.8.y&id=23d179acb363b0670ac698c342ec9a18b3d8759a

So now, when testing with Linux 4.9.30 and later kernels, the copy command fails silently, with no "Invalid argument" error message and no error condition at all.
Comment 8 Will B 2017-06-04 04:32:41 UTC
Thanks for all of the footwork, Richard.

Since this bug produces corrupted data (the copy is silently truncated), it seems to be a fairly important issue.  Hopefully it will get some attention soon.
Comment 9 Richard Narron 2017-06-04 06:21:06 UTC
Besides removing the problem patch, another way of fixing the problem is to set a higher default value for the maximum number of bytes per file in the ufs super block.  

The current default is 2GB (MAX_NON_LFS), the same value as the truncated file size.

Many filesystems use MAX_LFS_FILESIZE, and if I also use it for the ufs filesystem, the bug goes away:

---------------------------------------------------------------------------
--- a/fs/ufs/super.c.orig       2017-05-28 17:20:53.000000000 -0700
+++ b/fs/ufs/super.c    2017-06-03 17:08:33.340100262 -0700
@@ -812,9 +812,8 @@ static int ufs_fill_super(struct super_b
        uspi->s_dirblksize = UFS_SECTOR_SIZE;
        super_block_offset=UFS_SBLOCK;
 
-       /* Keep 2Gig file limit. Some UFS variants need to override 
-          this but as I don't know which I'll let those in the know loosen
-          the rules */
+        sb->s_maxbytes = MAX_LFS_FILESIZE;
+
        switch (sbi->s_mount_opt & UFS_MOUNT_UFSTYPE) {
        case UFS_MOUNT_UFSTYPE_44BSD:
                UFSD("ufstype=44bsd\n");
----------------------------------------------------------------------------
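
For reference, the two constants involved, shown with their 64-bit values as I understand them for kernels of this era (the definitions below are an approximation of include/linux/fs.h, not a verbatim copy):

---------------------------------------------------------------------------
#include <stdio.h>
#include <limits.h>

#define MAX_NON_LFS       ((1UL << 31) - 1)   /* old "no LFS" limit, 2 GiB - 1 */
#define MAX_LFS_FILESIZE  LLONG_MAX           /* 64-bit VFS limit, ~8 EiB      */

int main(void)
{
        printf("MAX_NON_LFS      = %lu bytes\n", MAX_NON_LFS);
        printf("MAX_LFS_FILESIZE = %lld bytes\n", (long long)MAX_LFS_FILESIZE);
        return 0;
}
---------------------------------------------------------------------------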
Comment 10 Richard Narron 2017-06-04 06:25:12 UTC
Created attachment 256853 [details]
patch to fs/ufs/super.c for maximum bytes per file

The patch changes the default maximum filesize in a ufs filesystem.
Comment 11 Richard Narron 2017-06-05 02:07:45 UTC
Wow, thanks to Linus Torvalds!

The suggested patch is now in Linux 4.12-rc4:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=239e250e4acbc0104d514307029c0839e834a51a

It may not be the best patch but for now it is better than no patch.
Comment 12 Richard Narron 2017-06-05 18:00:11 UTC
I just verified that the bug exists for NetBSD 7.1 ufs2, OpenBSD 6.0 ufs-1, and Bitrig 1.0 ufs-2 filesystems.

And the patch fixes them for now. 

I would really like to see the original Linux 4.8.4 patch, "vfs,mm: fix return value of read() at s_maxbytes", removed and replaced with a better patch.

It seems bizarre to me that a file read function, do_generic_file_read(), is limited by a maximum file size limit even before a read is attempted.
Comment 13 Linus Torvalds 2017-06-05 18:19:09 UTC
On Mon, Jun 5, 2017 at 11:00 AM,  <bugzilla-daemon@bugzilla.kernel.org> wrote:
>
> It seems bizarre to me that a file read function, do_generic_file_read(), is
> limited by a maximum file size limit even before a read is attempted.

We do need to protect against the page cache index wrapping case.

Now, we could use other methods than s_maxbytes for this code. For
example, we could just use the MAX_LFS_FILESIZE constant here, because
what we're protecting against isn't really a filesystem limit, but a
VFS limit (well, not even a real VFS limit, it's literally just a page
cache indexing limit on 32-bit hosts - you could have filesystem
methods that use offsets past that, and in fact /dev/mem does, but
doesn't use the page cache).

At the same time, s_maxbytes really _is_ supposed to describe the
filesystem limits, so this really was a UFS bug, not a misfeature of
do_generic_file_read() itself. Of course the generic routines will do
the wrong things if the filesystems don't fill in the data structures
that describe them correctly.

We arguably should also check the inode size before even trying to
populate the page cache, because as it is now we're just pointlessly
generating pages of zeroes for the "read past the end of a file" case, but
that's a separate issue from the index wrap-around (we'd also have to
check the inode size *again* after getting the page, in order to
avoid a race with truncate).

So this code could easily merit some more changes, but I think the UFS
bug was the real issue.

                 Linus
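
A minimal sketch of the wrap-around being guarded against (hypothetical user-space arithmetic only, nothing from the kernel; it assumes a 32-bit host where pgoff_t is a 32-bit unsigned long and PAGE_SHIFT is 12):

---------------------------------------------------------------------------
#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12

int main(void)
{
        long long pos_low  = 0x1000;                /* offset 4096       */
        long long pos_huge = (1LL << 44) + 0x1000;  /* 16 TiB + 4096     */

        /* On a 32-bit host the page cache index is a 32-bit unsigned long. */
        uint32_t idx_low  = (uint32_t)(pos_low  >> PAGE_SHIFT);  /* 1          */
        uint32_t idx_huge = (uint32_t)(pos_huge >> PAGE_SHIFT);  /* wraps to 1 */

        printf("index(%lld) = %u\n", pos_low,  idx_low);
        printf("index(%lld) = %u\n", pos_huge, idx_huge);
        /* Both offsets map to page-cache index 1, so without a cutoff such as
         * s_maxbytes or MAX_LFS_FILESIZE, a read far past 16 TiB would hit the
         * cached page that belongs to offset 4096. */
        return 0;
}
---------------------------------------------------------------------------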
Comment 14 Linus Torvalds 2017-06-05 18:50:58 UTC
On Mon, Jun 5, 2017 at 11:44 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Mon, Jun 05, 2017 at 11:19:05AM -0700, Linus Torvalds wrote:
>
>> We arguably should also check the inode size before even trying to
>> populate the page cache, because as it is now we're just pointlessly
>> generating pages of zeroes for the "read past the end of a file" case, but
>> that's a separate issue from the index wrap-around (we'd also have to
>> check the inode size *again* after getting the page, in order to
>> avoid a race with truncate).
>
> Unless you are willing to issue ->getattr() before reading a page,
> you can't do that - remote filesystems would have obvious problems.
> Inode size checks like that belong in the filesystem, not in
> do_generic_file_read().

We already check for the inode size, just look a few lines lower down
in "page_ok".

We do *not* try to give any remote size consistency guarantees in
plain read/write. The normal remote filesystem consistency guarantees
are purely open/close consistency.

Filesystems can add their own additional consistency guarantees in
their read routines (before/after calling the generic routines, or by
bypassing the generic ones entirely).

But no, we're not doing extra callbacks down to filesystems when we
have pages cached. Not here in do_generic_file_read(), and not in the
mmap path.

If somebody has some insane "total consistency" model, they can do
their own thing entirely. And nobody will use that shit-for-brains
filesystem because it's going to perform horribly badly.

                      Linus
Comment 15 Linus Torvalds 2017-06-05 19:13:51 UTC
On Mon, Jun 5, 2017 at 12:05 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> We already check for the inode size, just look a few lines lower down
>> in "page_ok".
>
> Umm...  Wasn't that only for uptodate pages?  Had been a while since
> I looked at that code...

Right. But it doesn't really change the synchronization (or lack thereof).

The inode size is definitely meaningful by the time we hit
"do_generic_file_read()".

For an example of the same issue, see filemap_fault(), and notice how
we check the size of the inode *before* reading the page (and return
SIGBUS if it's past the size).

We then _re-check_ the inode size afterwards, as protection against
races with truncate(). Although I wonder how relevant that is any
more, I think the mapping check should be sufficient, so I think we're
doing a bit of a double check there.

               Linus
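
A minimal user-space sketch of that check / re-check pattern (hypothetical code; fault_in() and file_size are made-up names for illustration, not the kernel's): validate the size before the expensive work, then validate it again afterwards so a concurrent truncate cannot go unnoticed.

---------------------------------------------------------------------------
#include <stdatomic.h>
#include <stdio.h>

static _Atomic long long file_size = 4 * 4096;   /* stand-in for the inode size */

/* Returns 0 if the page at byte 'offset' is within the file, -1 otherwise
 * (the SIGBUS case). */
static int fault_in(long long offset)
{
        if (offset >= atomic_load(&file_size))
                return -1;              /* cheap reject before any work      */

        /* ... expensive part: allocate the page, read it from disk ...      */

        if (offset >= atomic_load(&file_size))
                return -1;              /* re-check: truncate raced with us  */
        return 0;
}

int main(void)
{
        printf("%d\n", fault_in(2 * 4096));   /* 0: inside the file          */
        atomic_store(&file_size, 4096);       /* a "truncate" shrinks it     */
        printf("%d\n", fault_in(2 * 4096));   /* -1: now past the new size   */
        return 0;
}
---------------------------------------------------------------------------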
Comment 16 viro 2017-06-05 20:02:26 UTC
On Mon, Jun 05, 2017 at 11:50:55AM -0700, Linus Torvalds wrote:
> On Mon, Jun 5, 2017 at 11:44 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Mon, Jun 05, 2017 at 11:19:05AM -0700, Linus Torvalds wrote:
> >
> >> We arguably should also check the inode size before even trying to
> >> populate the page cache, because as it is now we're just pointlessly
> >> generating pages of zeroes for the "read past the end of a file" case, but
> >> that's a separate issue from the index wrap-around (we'd also have to
> >> check the inode size *again* after getting the page, in order to
> >> avoid a race with truncate).
> >
> > Unless you are willing to issue ->getattr() before reading a page,
> > you can't do that - remote filesystems would have obvious problems.
> > Inode size checks like that belong in the filesystem, not in
> > do_generic_file_read().
> 
> We already check for the inode size, just look a few lines lower down
> in "page_ok".

Umm...  Wasn't that only for uptodate pages?  Had been a while since
I looked at that code...
Comment 17 viro 2017-06-05 20:02:29 UTC
On Mon, Jun 05, 2017 at 11:19:05AM -0700, Linus Torvalds wrote:

> At the same time, s_maxbytes really _is_ supposed to describe the
> filesystem limits, so this really was a UFS bug, not a misfeature of
> do_generic_file_read() itself. Of course the generic routines will do
> the wrong things if the filesystems don't fill in the data structures
> that describe them correctly.
> 
> We arguably should also check the inode size before even trying to
> populate the page cache, because as it is now we're just pointlessly
> generating pages of zeroes for the "read past the end of a file" case, but
> that's a separate issue from the index wrap-around (we'd also have to
> check the inode size *again* after getting the page, in order to
> avoid a race with truncate).

Unless you are willing to issue ->getattr() before reading a page,
you can't do that - remote filesystems would have obvious problems.
Inode size checks like that belong in the filesystem, not in do_generic_file_read().

> So this code could easily merit some more changes, but I think the UFS
> bug was the real issue.

More than one.  I'm digging through the ->i_blocks mess right now; IMO
we should (both for UFS and ext2) go with static tree-imposed limits
and deal with ->i_blocks overflows in balloc.  IOW, treat an attempt
to allocate 2Tb of data+metadata same as we would e.g. hitting
a quota limit.  In any case, we need to restore ->i_blocks updates ;-/
Comment 18 Richard Narron 2017-06-06 15:59:35 UTC
(In reply to Linus Torvalds from comment #13)
> On Mon, Jun 5, 2017 at 11:00 AM,  <bugzilla-daemon@bugzilla.kernel.org>
> wrote:
> >
> > It seems bizarre to me that a file read function, do_generic_file_read(), is
> > limited by a maximum file size limit even before a read is attempted.
> 
> We do need to protect against the page cache index wrapping case.
> 
> Now, we could use other methods than s_maxbytes for this code. For
> example, we could just use the MAX_LFS_FILESIZE constant here, because
> what we're protecting against isn't really a filesystem limit, but a
> VFS limit (well, not even a real VFS limit, it's literally just a page
> cache indexing limit on 32-bit hosts - you could have filesystem
> methods that use offsets past that, and in fact /dev/mem does, but
> doesn't use the page cache).
> 
> At the same time, s_maxbytes really _is_ supposed to describe the
> filesystem limits, so this really was a UFS bug, not a misfeature of
> do_generic_file_read() itself. Of course the generic routines will do
> the wrong things if the filesystems don't fill in the data structures
> that describe them correctly.
> 
> We arguably should also check the inode size before even trying to
> populate the page cache, because as it is now we're just pointlessly
> generating pages of zeroes for the "read past the end of a file" case, but
> that's a separate issue from the index wrap-around (we'd also have to
> check the inode size *again* after getting the page, in order to
> avoid a race with truncate).
> 
> So this code could easily merit some more changes, but I think the UFS
> bug was the real issue.


Thanks for this, it makes more sense now.

I noticed during testing with Linux 4.8.4, which carries the original
"vfs,mm: fix return value of read() at s_maxbytes" patch, that the cp command got the EINVAL return code (Invalid argument) and exited with a clear error message.  So as a user I knew that the copy had failed.

When testing with 4.9.30, EINVAL was no longer returned and the copy did not appear to fail, but it did fail quietly.

Maybe the return EINVAL could be kept if the original test were changed to a greater-than instead of a greater-than-equal?

--- filemap.c.orig      2017-06-04 16:47:43.000000000 -0700
+++ filemap.c   2017-06-06 07:34:46.131824320 -0700
@@ -1785,8 +1785,8 @@ static ssize_t do_generic_file_read(stru
        unsigned int prev_offset;
        int error = 0;

-       if (unlikely(*ppos >= inode->i_sb->s_maxbytes))
-               return 0;
+       if (unlikely(*ppos > inode->i_sb->s_maxbytes))
+               return -EINVAL;
        iov_iter_truncate(iter, inode->i_sb->s_maxbytes);

        index = *ppos >> PAGE_SHIFT;


This could fix the case where *ppos is equal to the maximum FAT file size.
(I tested this patch and it seems to work: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1649342)

And if a user tries to read past the maximum FAT file size, they get an "Invalid argument" error. And if any other current or future filesystem has a bad s_maxbytes, it will be noticed.
Comment 19 Linus Torvalds 2017-06-06 16:28:39 UTC
On Tue, Jun 6, 2017 at 8:59 AM,  <bugzilla-daemon@bugzilla.kernel.org> wrote:
>
> Maybe the return EINVAL could be kept if the original test were changed to a
> greater-than instead of a greater-than-equal?

No. The POSIX semantics are very clear: a read past the end of a file
returns EOF (i.e. zero), not an error.

               Linus
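
A tiny demonstration of that rule (hypothetical example, not from the thread): a read at an offset past end-of-file succeeds and returns 0 rather than failing with an error.

---------------------------------------------------------------------------
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        int fd = open("demo.txt", O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        write(fd, "hello", 5);                          /* the file is 5 bytes  */

        char buf[16];
        ssize_t n = pread(fd, buf, sizeof(buf), 1024);  /* read far past EOF    */
        printf("pread past EOF returned %zd\n", n);     /* prints 0, no error   */

        close(fd);
        unlink("demo.txt");
        return 0;
}
---------------------------------------------------------------------------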