Most recent kernel where this bug did not occur:
Distribution: gentoo, RHAS4.3, FC5
Hardware Environment: various
Software Environment: libaio
Problem Description: after io_getevents() reports that a write/append has completed, the data in the file is still inaccessible (neither reported by fstat() nor readable via io_submit()).
Steps to reproduce:
gcc aiotest-wait.c -laio -lpthread
Run the attached test on SMP a few times. For UP, edit the source and initialize thread_write with fun_write1().
NOTE: if synchronization is introduced (not calling fstat() until the write io_submit() returns), all is fine. To test that, compile with:
gcc aiotest-wait.c -DDO_SYNCH -laio -lpthread
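The property the report relies on can be written down as a small check: once a write of `len` bytes at offset `off` is reported complete, fstat() should report a size of at least `off + len`. The sketch below is only an illustration of that invariant, not the attached aiotest-wait.c; the names `size_covers_write` and `demo_size_check` are hypothetical, and the demo uses a synchronous pwrite() on a temporary file rather than libaio.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if fstat() reports a size that covers
 * the byte range [off, off + len), 0 if it does not, -1 if fstat() fails. */
int size_covers_write(int fd, off_t off, size_t len)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    return st.st_size >= off + (off_t)len ? 1 : 0;
}

/* Hypothetical self-check: write synchronously to a temporary file and
 * confirm the invariant holds for the written range and fails beyond it.
 * Returns 0 on success, 1 on a failed check, -1 on setup errors. */
int demo_size_check(void)
{
    static const char data[] = "0123456789abcdef";
    FILE *tmp = tmpfile();
    int fd, ok;

    if (tmp == NULL)
        return -1;
    fd = fileno(tmp);
    if (pwrite(fd, data, sizeof data, 0) != (ssize_t)sizeof data) {
        fclose(tmp);
        return -1;
    }
    ok = size_covers_write(fd, 0, sizeof data) == 1 &&
         size_covers_write(fd, 0, sizeof data + 1) == 0;
    fclose(tmp);
    return ok ? 0 : 1;
}
```

In a reproducer, the same check would be called by the thread that receives the completion event from io_getevents(), using the offset and length of the write it was told had finished.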
Created attachment 8547 [details] testcase showing the race
> after io_getevents reports that write/appen was done, the data in file is
> still unaccessible (neither reported by fstat() nor readable via io_submit())
Yeah, this looks like a real bug. The path under generic_file_direct_IO() is calling aio_complete() before its caller updates i_size. I think this test is seeing it because it has one thread waiting in io_getevents() while another thread is submitting the extending write. Usually the completion isn't seen until io_submit() returns after having updated i_size.
I won't have time to craft a patch today, I don't think, but I'll try during OLS next week if no one beats me to it.
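To make the ordering problem concrete, here is a single-threaded toy model of the two orderings. It is not kernel code: `struct model_inode` and the function names are hypothetical, and the "waiter" is simulated by sampling the size at the exact point where the completion would become visible.

```c
#include <assert.h>

/* Toy stand-in for an inode; only the size matters in this model. */
struct model_inode {
    long i_size;
};

/* What a waiter would see if it called fstat() at the instant the
 * completion event became visible to io_getevents(). */
static long size_seen_at_completion(const struct model_inode *inode)
{
    return inode->i_size;
}

/* Ordering described above: the completion is signalled first and the
 * size is updated afterwards, so the waiter can observe the old size. */
long extend_complete_then_update(struct model_inode *inode, long nbytes)
{
    long seen = size_seen_at_completion(inode); /* event visible here */

    inode->i_size += nbytes;                    /* too late for the waiter */
    return seen;
}

/* Ordering in which the size is updated before the completion is
 * signalled, so the waiter always observes the extended size. */
long extend_update_then_complete(struct model_inode *inode, long nbytes)
{
    inode->i_size += nbytes;
    return size_seen_at_completion(inode);
}
```

Starting from a 4096-byte file and extending by 512 bytes, the first function returns 4096 (the stale size) and the second returns 4608. With a single submitting thread the stale window is normally invisible, because the caller does not look at the file until io_submit() has returned.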
I reproduced the failure under 2.6.18-rc1-mm2 on a dual athlon. (b.k.o yelled at me when I tried to add it to the kernel version field.)
Zach, I'm not sure I understand you, as you only watch this thread.
> I won't have time to craft a patch today,
Do you mean you are going to create the patch?
> You mean You going to create the patch?
Yeah, I sent it off to the kernel mailing list but haven't gotten any response.
http://lkml.org/lkml/2006/7/26/257
We need to make sure that this fix doesn't break other uses of O_DIRECT and aio before we can merge it into the kernel.
Zach, I'm guessing that this patch is obsoleted by the patch series you sent out on September 5th (which cleans up the error handling in the dio code). Is that right?
Yeah, the 'dio: clean up completion phase' patch series addresses this problem. It's the final patch in the series that does it:
http://lkml.org/lkml/2006/9/5/268
The hunks that remove the EIOCBQUEUED translation from dio's callers are the indication. We're still working on testing the patch series. We had an unrelated hardware failure that is slowing things down :/. It's been solid enough so far that I might throw it in -mm once .19 opens.
- z
Zach, for the past few days I've been struggling with strange AIO behaviour. I have an HDD with some bad blocks (or rather more than some ;). In the general case AIO reads seem to operate OK (they return EIO), but very rarely a read returns an unmodified read buffer and no error. The kernel is 2.6.9-34.ELsmp. I suspect an AIO bug, but I have no clear evidence. Do you happen to know of any such bug?
> but very rarely read returns unmodified
> read-buffer and no error. The kernel is 2.6.9-34.ELsmp. I suspect aio bug - but
> no clear evidence. Do You happen to know anything of such possible bug?
Hmm, I could imagine cases where it might lose an error that was generated from block lookups but I don't know of any bugs which would specifically do this. Is it possible that file system corruption has caused the file size to shrink and for reads to be issued past the end of the file? That would lead to reads that succeed with 0 bytes read.
Are you in a position where you can try the patches to the kernel that were referred to in comment 7? Can you try the
> Is it possible that file system corruption has caused the file size to shrink
> and for reads to be issued past the end of the file? That would lead to reads
> that succeed with 0 bytes read.
I'm reading a raw device in my test. A read of 0 bytes with an unmodified buffer would be OK for me, but what I'm getting is a read that reports 4K of data with an unmodified 4K buffer and no error.
> Are you in a position where you can try the patches to the kernel that were
> referred to in comment 7?
I may try ;) That's because I'm not allowed to use a kernel other than the official RH one and obviously can't run the test on another machine ;) It's going to take a while.
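One way to tell the two cases apart is to poison the buffer before each read and compare afterwards. The sketch below is an assumption about how such a check could look, not code from the reporter's test: the names `buffer_untouched`, `read_is_suspicious` and `demo_read_check` and the poison value 0xA5 are hypothetical, and it uses synchronous pread() on a temporary file instead of AIO on a raw device.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define POISON 0xA5 /* arbitrary fill byte, chosen for this sketch */

/* Returns 1 if every byte of buf still holds the poison value. */
int buffer_untouched(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        if (buf[i] != POISON)
            return 0;
    return 1;
}

/* Poisons buf, reads up to len bytes at offset off, stores the pread()
 * result in *nread, and returns 1 only for the suspicious case: bytes
 * were reported as read but the buffer was not modified. */
int read_is_suspicious(int fd, unsigned char *buf, size_t len, off_t off,
                       ssize_t *nread)
{
    memset(buf, POISON, len);
    *nread = pread(fd, buf, len, off);
    return *nread > 0 && buffer_untouched(buf, (size_t)*nread);
}

/* Hypothetical self-check on a temporary file.
 * Returns 0 on success, 1 on a failed check, -1 on setup errors. */
int demo_read_check(void)
{
    unsigned char buf[64];
    ssize_t n;
    FILE *tmp = tmpfile();
    int fd, ok;

    if (tmp == NULL)
        return -1;
    fd = fileno(tmp);
    if (pwrite(fd, "hello", 5, 0) != 5) {
        fclose(tmp);
        return -1;
    }
    /* Read inside the file: data arrives and the buffer is modified. */
    ok = !read_is_suspicious(fd, buf, sizeof buf, 0, &n) && n == 5;
    /* Read past EOF: 0 bytes, buffer untouched, and not an error. */
    ok = ok && !read_is_suspicious(fd, buf, sizeof buf, 4096, &n) &&
         n == 0 && buffer_untouched(buf, sizeof buf);
    fclose(tmp);
    return ok ? 0 : 1;
}
```

The check has an obvious limit: a block on disk that really consists only of the poison byte would be flagged as suspicious, so a flagged read is a hint to investigate rather than proof of a lost error.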
The first race that this bug was filed for was fixed in mainline in commit 8459d86aff04fa53c2ab6a6b9f355b3063cc8014 about a year ago. We have a test which makes sure this bug doesn't regress:
http://git.kernel.org/?p=linux/kernel/git/zab/aio-dio-regress.git;a=blob;f=c/aio-dio-extend-stat.c
So this bug can be closed. Any further bugs with distribution kernels should be taken up with the distribution provider.