Created attachment 96041 [details] C program to demonstrate problem with non-atomic short writes. With recent kernels, short writes to a regular file opened with O_APPEND can be non-atomic when a signal is received, yet the write syscall still returns bytes written == nbytes. I tried this with v3.8.4 (latest stable) and as far back as v3.2.41, plus a smattering in between. All show this problem. However, it does not reproduce on v3.0.70 or earlier kernels. All testing was done on x86_64 SMP systems with an ext3 or ext4 file system. (I also tried running with only a single CPU enabled. It made no difference.) This is likely not a problem specific to the ext file systems, but that seemed the best Bugzilla match based on the information gathered so far. The problem did not reproduce on an OpenSUSE 12.1 system (3.1.10-1.19-default). In the world of RHEL kernels, this problem does not appear with RHEL5.7 or the earlier RHEL kernels I tried. However, it does appear starting with RHEL5.8 and 5.9 and also on the RHEL6 kernels that I tried (6.2 & 6.4). I can't find anything wrong with the attached demo program, or anything in the POSIX standard or the Linux open(2) man page that would allow this apparently erroneous behavior. To reproduce the problem, compile and run the attached program. After it has been running for, say, 4-6 seconds, press ^C. Notice there has been no output from the program (every write call made returned success with scnt==wcnt). Now run "grep '^P.*P' test.log". If there is no problem, there will be no output. If you see output, it will be lines that were stepped on or otherwise not fully written out.
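The attached program is not reproduced in this report. As a rough illustration of what the grep check looks for, here is a minimal C sketch of the same test applied to a single line. It assumes, as the pattern '^P.*P' implies, that every intact record starts with 'P' and contains no other 'P'. The name line_is_damaged is hypothetical and is not taken from the attachment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the attached test program.
 * Mirrors grep '^P.*P': a line is suspect when it starts with 'P'
 * and a second 'P' appears later on the same line, which suggests
 * that one record was cut short and another began on the same line.
 * Assumes an intact record holds exactly one 'P', at its start.
 */
static int line_is_damaged(const char *line)
{
    assert(line != NULL);
    return line[0] == 'P' && strchr(line + 1, 'P') != NULL;
}
```

A line such as "P0001 some dataP0002 more data" would be flagged, while "P0001 some data" would not.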
I made a small mistake in the above report. The problem does not reproduce on RHEL6.2 (or earlier RHEL6 releases), only on RHEL6.3 and later. So it looks like the bug came in between v3.1 and v3.2, between RHEL5.7 and RHEL5.8, and between RHEL6.2 and RHEL6.3. I'm still poking about to see if I can figure out which patch is responsible, so I can make a better guess at which subsystem is involved.
Created attachment 96411 [details] Fully automatic C test for demonstrating problem
I tracked the problem down to the specific patch a50527b1. Since I now know the problem is not specific to ext[34], I'm updating the ticket to point at the VFS layer. (HOWEVER, Bugzilla won't let me change the assignee to fs_vfs@kernel-bugs.osdl.org. Can someone do that for me?) With the patch in place (even up to and including 3.9-rc4), the attached test (mlt-auto.c) fails. If I git revert a50527b1 out of 3.9-rc4, the test passes.
The commit in question (a50527b1) was authored by Jan Kara so I've changed the assignee of this bug to him.
Thanks for the report! So first let me explain what is happening: A process is writing messages to the file. Since writes to the page cache happen page by page, a write straddling a page boundary can be split in two. So if the signal arrives after the first part of the split write has happened and before the second part, the write is interrupted. You never observe wcnt != scnt because as soon as we return from the syscall we terminate the process, so you don't have a chance to see the error message. Now this is perfectly valid behavior as far as the specs (POSIX, SUS,...) are concerned (please correct me if I'm missing something). So I'd say the program is incorrect. But OTOH I agree that this was not possible before a50527b1 and we don't want to break userspace. I'd hate to revert that commit since it allows us to interrupt processes doing large writes (especially when something goes wrong), but if you explain to us why this behavior is a problem for you then I guess I'll have to revert it.
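To make the page-boundary condition concrete, here is a small C sketch of the arithmetic only. It is not kernel code and says nothing about how the page cache is implemented; the function name write_straddles_page is hypothetical, and the 4096-byte page size used in the example is an assumption (common on x86_64, but not stated in this report).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative only: returns 1 if a write of len bytes starting at
 * file offset off touches more than one page of size page_size.
 * page_size is a parameter here; 4096 is a common value on x86_64
 * but is an assumption, not something stated in this report.
 */
static int write_straddles_page(unsigned long long off, size_t len,
                                size_t page_size)
{
    assert(page_size > 0);
    if (len == 0)
        return 0;
    unsigned long long first_page = off / page_size;
    unsigned long long last_page = (off + len - 1) / page_size;
    return first_page != last_page;
}
```

With 4096-byte pages, a 100-byte record appended at offset 4050 covers bytes 4050 through 4149 and so crosses the boundary at 4096, while the same record at offset 4096 stays within one page. Only the first kind of write could be split in the way described above.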
I see your point. The problem as originally reported to me seemed to indicate a race in which data was being overwritten during signal or exit processing. That's just not the case. The problem has nothing to do with races between multiple writers to the same file. Having multiple writers just increased the odds of seeing the truncated write happen. A single writer can certainly see this issue. As I understand the issue now, it is simply that a write in progress is being terminated by a killing signal, which leaves a partial write. The old kernel behavior had the write syscall complete atomically under a pending signal. Unfortunately, as far as I can determine, the POSIX and Linux standards offer no way to work around this new behavior. There's no way to ensure that some amount of data, no matter how small, even just two bytes, is written out to a file as an atomic transfer (either aborted with no bytes written, or completely written out). Fortunately though, at least for the case where this came up, we can work around not having a way to write to an ordinary file atomically, so please don't back the patch out on our account. Since what we experienced is not an obscure or rare situation, though, I suspect other people are going to bump into it as well.
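For signals that are caught rather than fatal, a short write is at least visible to the caller. The sketch below is a conventional retry loop under the hypothetical name write_all. It is not taken from the attached test, and it does not restore atomicity: with O_APPEND and several writers the remainder may land after another writer's data, and nothing in user space helps if the process is killed mid-write.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: keep calling write(2) until all len bytes are
 * written or a real error occurs. Returns len on success, -1 on error
 * (errno set by write). Retries when write is interrupted before any
 * data was transferred (EINTR) and continues after a short write.
 */
static ssize_t write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t done = 0;

    assert(buf != NULL || len == 0);
    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before writing anything */
            return -1;
        }
        done += (size_t)n;  /* short write: loop to send the rest */
    }
    return (ssize_t)done;
}
```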