Bug 55651 - Non-atomic short write problem with O_APPEND
Summary: Non-atomic short write problem with O_APPEND
Status: RESOLVED INVALID
Alias: None
Product: File System
Classification: Unclassified
Component: VFS
Hardware: All Linux
Importance: P1 normal
Assignee: Jan Kara
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-03-23 01:19 UTC by Quentin Barnes
Modified: 2013-04-02 00:26 UTC

See Also:
Kernel Version: 3.9-rc4
Subsystem:
Regression: No
Bisected commit-id:


Attachments
C program to demonstrate problem with non-atomic short writes. (897 bytes, text/plain)
2013-03-23 01:19 UTC, Quentin Barnes
Fully automatic C test for demonstrating problem (1.46 KB, text/plain)
2013-03-27 18:40 UTC, Quentin Barnes

Description Quentin Barnes 2013-03-23 01:19:23 UTC
Created attachment 96041
C program to demonstrate problem with non-atomic short writes.

With recent kernels, short writes to a regular file opened with
O_APPEND can be non-atomic when receiving a signal, and the write
syscall is still returning bytes written == nbytes.

I tried this with v3.8.4 (latest stable) and as far back as v3.2.41,
with a smattering in between.  All show this problem.  However, it
does not reproduce on v3.0.70 or earlier kernels.  All testing was
done on x86_64 SMP systems with an ext3 or ext4 file system.
(I also tried running with only a single CPU enabled.  It made no
difference.)

This is probably not a problem specific to the ext file systems,
but that seemed the best Bugzilla match I saw based on the
information gathered so far.

The problem did not reproduce on an OpenSUSE 12.1 system
(3.1.10-1.19-default).  In the world of RHEL kernels, this problem
does not appear with RHEL5.7 or earlier RHEL kernels I tried.
However, it does appear starting with RHEL5.8 and 5.9 and also on
RHEL6 kernels that I tried (6.2 & 6.4).

I can't find anything wrong with the attached demo program, or find
anything in the POSIX standard or with the Linux open(2) man pages
that might allow for this apparent erroneous behavior.

To reproduce the problem, compile and run the attached program.  After
it has been running for about 4-6 seconds, press ^C.  Notice that the
program has produced no output (every write call returned success
with scnt==wcnt).  Now run "grep '^P.*P' test.log".  If there is no
problem, grep prints nothing.  If you see output, it shows lines that
were stepped on or otherwise not fully written out.
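
For readers without access to the attachment, here is a minimal sketch of the kind of writer the steps above describe. It is not the attached program: the record layout, the 100-byte record length, and the names append_record and open_log are assumptions made for illustration. Each record starts with 'P', contains no other 'P', and ends with a newline, so a line matching '^P.*P' can only be a cut-off record followed by another one.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical record length. It deliberately does not divide a 4096-byte
 * page evenly, so some records end up straddling a page boundary. */
enum { RECORD_LEN = 100 };

/* Open the log the way the report describes: a regular file with O_APPEND. */
static int open_log(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
}

/* Append one fixed-size record with a single write() call.
 * Layout (assumed): 'P', pid, '-', sequence number, '-', 'x' padding, '\n'.
 * Returns 0 if write() reported the full length, -1 otherwise. */
static int append_record(int fd, pid_t pid, unsigned long seq)
{
    char buf[RECORD_LEN + 1];
    int n = snprintf(buf, sizeof buf, "P%ld-%lu-", (long)pid, seq);
    if (n < 0 || n >= RECORD_LEN)
        return -1;

    /* Pad with lowercase 'x' so that no second 'P' appears in the record. */
    memset(buf + n, 'x', (size_t)(RECORD_LEN - 1 - n));
    buf[RECORD_LEN - 1] = '\n';

    ssize_t wcnt = write(fd, buf, RECORD_LEN);
    return wcnt == RECORD_LEN ? 0 : -1;
}
```

A driver for this sketch would fork a few processes, have each one loop on append_record(fd, getpid(), seq++), and print a message whenever it returns -1. Interrupting the run with ^C and then running the grep above would be the check.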
Comment 1 Quentin Barnes 2013-03-25 20:59:31 UTC
I have a small mistake in the above report.  The problem does not reproduce on RHEL6.2 (or earlier RHEL6 releases).  Only on RHEL6.3 and later.

So it looks like the bug came in between: v3.1 and v3.2, RHEL5.7 and RHEL5.8, and RHEL6.2 and RHEL6.3.

I'm still poking about to see if I can figure out which patch it is, so I can make a better guess at which subsystem is involved.
Comment 2 Quentin Barnes 2013-03-27 18:40:35 UTC
Created attachment 96411
Fully automatic C test for demonstrating problem
Comment 3 Quentin Barnes 2013-03-27 18:46:35 UTC
I tracked the problem down to the specific patch a50527b1.

Since now I know the problem is not specific to ext[34], I'm
updating the ticket to be the VFS layer.  (HOWEVER, Bugzilla
won't let me change the assignee to fs_vfs@kernel-bugs.osdl.org.
Can someone do that for me?)

With the patch in place (even up to and including 3.9-rc4), the
attached test (mlt-auto.c) fails.

If I git revert a50527b1 out of 3.9-rc4, the test passes.
Comment 4 Theodore Tso 2013-03-27 18:57:24 UTC
The commit in question (a50527b1) was authored by Jan Kara so I've changed the assignee of this bug to him.
Comment 5 Jan Kara 2013-03-29 16:45:42 UTC
Thanks for the report! So first let me explain what is happening: a process is writing messages to the file. Since a write to the page cache happens page by page, a write straddling a page boundary can be split in two. If the signal arrives after the first part of the split write has happened and before the second part happens, the write is interrupted. You never observe wcnt != scnt because as soon as we return from the syscall we terminate the process, so the program has no chance to print its error message.
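
The page-boundary arithmetic behind that explanation can be sketched in user space. This is only a model of where a split could fall, not kernel code. The helper names are hypothetical, and the 4096-byte page size in the usage example is an assumption (it is the usual value on x86_64).

```c
#include <stddef.h>

/* Number of bytes of a write of `len` bytes at file offset `off` that land
 * in the page containing `off`. */
static size_t first_page_part(unsigned long long off, size_t len,
                              size_t page_size)
{
    size_t room = page_size - (size_t)(off % page_size);
    return len < room ? len : room;
}

/* Nonzero if the write crosses into a following page, i.e. it can be
 * handled as more than one per-page copy. */
static int straddles_page(unsigned long long off, size_t len,
                          size_t page_size)
{
    return len > first_page_part(off, len, page_size);
}
```

For example, with 4096-byte pages a 10-byte append at offset 4090 puts 6 bytes in the first page and 4 in the next, so an interruption between the two parts would leave a 6-byte fragment in the file. The same 10 bytes at offset 4096 fit in a single page.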

Now this is perfectly valid behavior as far as spec (POSIX, SUS,...) is concerned (please correct me if I'm missing something). So I'd say the program is incorrect. But OTOH I agree that this was not possible before a50527b1 and we don't want to break userspace. I'd hate to revert that commit since it allows us to interrupt processes doing large writes (especially when something goes wrong) but if you explain to us why this behavior is a problem for you then I guess I'll have to revert it.
Comment 6 Quentin Barnes 2013-04-02 00:26:47 UTC
I see your point.

The problem as originally reported to me seemed to indicate a race
with overwriting of data that was occurring during signal or exit
processing.  That's just not the case.  The problem has nothing to
do with races of multiple writers to the same file.  Having the
multiple writers just increased the odds of seeing the truncated
write transaction happen.  A single writer can certainly see this
issue.

As I understand the issue now, it is simply that a pending
write transaction is being terminated by a killing signal while
in progress with the effect of a partial write.  The old kernel
behavior had the write syscall atomically complete under a pending
signal.

Unfortunately, as far as I can determine, in the POSIX and Linux
standards there is no way to work around this new behavior.
There's no way to ensure that some amount of data, no matter how small,
even just two bytes, is written out to a file as an atomic transfer
(either aborted with no bytes written, or completely written out).

Fortunately though, at least for the case where this came up, we
can work around not having a way to write to an ordinary file
atomically, so please don't back the patch out on our account.
Since what we experienced though is not an obscure or rare situation,
I suspect other people are going to bump into it as well.
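
One reader-side way to cope with cut-off records, which is not necessarily the workaround referred to above, is to have whatever consumes the log detect them. The sketch below assumes the same hypothetical record convention as the grep check in the description: each record starts with 'P', contains no other 'P', and ends with a newline. The function names are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* A record (newline excluded, n bytes long) is intact if it starts with
 * 'P' and no second 'P' appears later in it. */
static int record_is_intact(const char *line, size_t n)
{
    return n > 0 && line[0] == 'P' &&
           memchr(line + 1, 'P', n - 1) == NULL;
}

/* Count damaged lines in a buffer holding the log contents. A trailing
 * fragment without a newline is counted as damaged too. */
static size_t count_damaged(const char *buf, size_t len)
{
    size_t damaged = 0;
    size_t start = 0;

    while (start < len) {
        const char *nl = memchr(buf + start, '\n', len - start);
        if (nl == NULL) {           /* unterminated tail */
            damaged++;
            break;
        }
        size_t n = (size_t)(nl - (buf + start));
        if (!record_is_intact(buf + start, n))
            damaged++;
        start += n + 1;             /* skip past the newline */
    }
    return damaged;
}
```

A consumer built this way can skip or report the damaged lines instead of assuming every line is a complete record.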
