Bug 50981
| Summary: | generic_file_aio_read?: No locking means DATA CORRUPTION read and write on same 4096 page range | | |
|---|---|---|---|
| Product: | Memory Management | Reporter: | meetmehiro |
| Component: | Other | Assignee: | Andrew Morton (akpm) |
| Status: | NEW | | |
| Severity: | high | CC: | akpm, alan, gnehzuil.liu, lczerner, tm, tytso |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 3.7-rc | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description
meetmehiro
2012-11-25 16:31:58 UTC
Kernel version and distribution:

2.6.32, Red Hat Enterprise Linux Workstation release 6.2 (Santiago)

I have a scientific application running on ext4. After three to four days I
started to see that its data is not correct, so I tried to find a simple
reproducer. Please find below a reproducer that hits the issue easily.

Logs:

```
./sim ext4_write
creating the threads
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
reader thread has opened the file, fd : 4
reader..checking data integrity
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
reader..checking data integrity
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
reader..checking data integrity
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
reader..checking data integrity
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
reader..checking data integrity
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
reader..checking data integrity
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
reader..checking data integrity
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
reader..checking data integrity
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
reader..checking data integrity
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
reader..checking data integrity
writing B at offset 0...
writing A at offset 0...
reader..checking data integrity
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
reader..checking data integrity
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
reader..checking data integrity
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
reader..checking data integrity
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
reader..checking data integrity
writing A at offset 0...
reader.. corrupted data, at offset : 511
writing B at offset 0...
```

C program:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <pthread.h>
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <sys/mman.h>

pthread_t reader_id, writer_id;

char write_buf[4096] = {0,};
char read_buf[4096] = {0,};

char datafile[256] = {0,};
int stop_write = 0;

void *writer(void *arg)
{
    int fd = -1;
    int count = 0;

    /* this thread opens the file and alternately writes 4 KiB of two
     * different data patterns to the same offset */
    assert((fd = open(datafile, O_CREAT|O_RDWR, 0777)) != -1);

    while (1) {
        if (stop_write) {
            close(fd);
            printf("write stopped\n");
            return NULL;
        }
        if (count % 2 == 0)
            memset(write_buf, 'A', sizeof(write_buf));
        else
            memset(write_buf, 'B', sizeof(write_buf));

        assert(lseek(fd, 0, SEEK_SET) == 0);
        printf("writing %c at offset 0...\n", write_buf[0]);
        assert(write(fd, write_buf, 4096) == 4096);
        count++;
    }
    //pthread_exit((void *)0);
    close(fd);
    return NULL;
}

void *reader(void *arg)
{
    int fd = -1;
    int count = 0;

    /* this thread opens the file and repeatedly reads 4 KiB from offset 0,
     * checking that the buffer contains a single data pattern */
    assert((fd = open(datafile, O_CREAT|O_RDWR, 0777)) != -1);
    printf("reader thread has opened the file, fd : %d\n", fd);

    while (1) {
        memset(read_buf, '\0', 4096);
        assert(lseek(fd, 0, SEEK_SET) == 0);
        assert(read(fd, read_buf, 4096) == 4096);
        count++;
        printf("reader..checking data integrity\n");
        int i = 0;
        for (i = 0; i < 4095; i++) {
            int prev_char = read_buf[i];
            int next_char = read_buf[i+1];
            if (prev_char != next_char) {
                printf("reader.. corrupted data, at offset : %d\n", i);
                stop_write = 1;
                int output_fd = open("res.txt", O_CREAT|O_RDWR, 0777);

                assert(write(output_fd, read_buf, 4096) == 4096);
                close(output_fd);
                close(fd);
                return NULL;
            }
        }
    }
    //pthread_exit((void *)0);
    close(fd);
    return NULL;
}

int main(int argc, char *argv[])
{
    pthread_attr_t attr;
    void *arg = NULL;
    int *status;
    int err = 0;

    if (argc != 2) {
        printf("usage : ./reader_writer <file pathname>\n");
        exit(1);
    }

    memset(datafile, '\0', sizeof(datafile));
    strcpy(datafile, argv[1]);

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    /* create the reader and writer thread */
    printf("creating the threads\n");
    if ((err = pthread_create(&writer_id, &attr, writer, (void *)arg)) != 0)
        printf("pthread_create failed err no :%d\n", err);
    if ((err = pthread_create(&reader_id, &attr, reader, (void *)arg)) != 0)
        printf("pthread_create failed err no :%d\n", err);
    pthread_attr_destroy(&attr);

    /* Wait on the other threads */
    assert(pthread_join(writer_id, (void *)&status) == 0);
    assert(pthread_join(reader_id, (void *)&status) == 0);

    pthread_exit(NULL);
}
```

How to run:

```
gcc program.c -o ext4_program -lpthread
./ext4_program ext4_write
```

which block size does the filesystem use, 1K or 4K?

Yongqiang

On Mon, Nov 26, 2012 at 12:31 AM, <bugzilla-daemon@bugzilla.kernel.org> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=50981
>
> Summary: ext4 : DATA CORRUPTION read and write on same 4096 page range
> Kernel Version: 2.6.32, Red Hat Enterprise Linux Workstation release 6.2 (Santiago)
> Component: ext4
> ReportedBy: meetmehiro@gmail.com

This is present upstream as well (3.7.0-rc7). In fact it is not even ext4
specific: I can easily reproduce this on btrfs, ext3, and vfat. I suspect
that it is reproducible on every file system which does not implement its
own read/write exclusion the way XFS does.

MM people should really take a look at this as well, since we expect
reads/writes to/from a single page to be atomic, but it looks like that
might not be true.

Thanks!
-Lukas

IMHO, the reason seems to be that no locking at all is taken by
generic_file_aio_read(). When ext4 does a buffered write, it takes i_mutex
to ensure that there is no other writer. But ext4 (like ext3, btrfs, and
vfat) uses generic_file_aio_read() for buffered reads, and that path does
not take i_mutex, so it is easy to read inconsistent data while a write is
in progress. I am not very familiar with XFS, but there a rwlock is taken
around both reads and writes, which is why XFS does not have this problem.

Regards,
Zheng

Yes, that's what I said: XFS has its own read/write exclusion. However,
we're reading/writing a single page here, and a page read/write should be
atomic, so we should _not_ see a page with mixed data. Basically, the page
should be locked while we copy data from/to it.

-Lukas

(In reply to comment #4)
> Yes, that's what I said: XFS has its own read/write exclusion. However,
> we're reading/writing a single page here, and a page read/write should be
> atomic, so we should _not_ see a page with mixed data. Basically, the page
> should be locked while we copy data from/to it.

Sorry, but as I understand it, a page won't be locked when it is accessed
if it has already been marked uptodate. If we write to a page that has been
written recently, the uptodate flag won't be cleared, so the reader won't
try to lock the page, and maybe that is why the access is not atomic. Am I
missing something?

Regards,
Zheng
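Zheng's observation can be illustrated with a small userspace analogy (this is not kernel code; the "page", its lock, and the helper behaviour are stand-ins). The writer fills the page only while holding the page lock, roughly as ->write_begin/->write_end do, while the reader copies the page out with no page lock and no i_mutex, roughly as the uptodate fast path of do_generic_file_read() does, so the reader can observe a mix of old and new bytes:

```c
/* Userspace analogy of a page-locked writer vs. an unlocked reader.
 * This is NOT kernel code: "page", page_lock and the thread bodies are
 * hypothetical stand-ins for the page-cache behaviour being discussed. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

static char page[PAGE_SIZE];                 /* stand-in for a page-cache page */
static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;

static void *writer(void *arg)
{
	for (int n = 0; ; n++) {
		pthread_mutex_lock(&page_lock);   /* writer copies under the page lock */
		memset(page, n % 2 ? 'B' : 'A', PAGE_SIZE);
		pthread_mutex_unlock(&page_lock);
	}
	return NULL;
}

static void *reader(void *arg)
{
	char copy[PAGE_SIZE];

	for (;;) {
		/* reader copies WITHOUT the page lock, like the uptodate fast path */
		memcpy(copy, page, PAGE_SIZE);
		for (int i = 1; i < PAGE_SIZE; i++) {
			if (copy[i] != copy[0]) {
				printf("mixed data at offset %d (%c vs %c)\n",
				       i, copy[0], copy[i]);
				return NULL;
			}
		}
	}
	return NULL;
}

int main(void)
{
	pthread_t w, r;

	pthread_create(&w, NULL, writer, NULL);
	pthread_create(&r, NULL, reader, NULL);
	pthread_join(r, NULL);
	return 0;
}
```

On most machines this reports mixed data almost immediately, which mirrors what the reproducer above observes through read()/write().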
It might be; as I said, the problem appears to be in mm code, most likely a
problem with page locking. However, I am not familiar enough with the mm
code to know that for sure.

(In reply to comment #6)
> It might be; as I said, the problem appears to be in mm code, most likely
> a problem with page locking. However, I am not familiar enough with the mm
> code to know that for sure.

Understood, thanks. :-)

Regards,
Zheng

(retagged for 3.7-rc as per Lukas's email; otherwise my scripts will keep
wanting to close it as RHEL and not for upstream)

Alan, could you please also change the component and maybe even the subject
of the bug, since it is not ext4 specific? Thanks!

-Lukas

it's 4K

(In reply to comment #1)
> which block size does the filesystem use, 1K or 4K?

(In reply to comment #4)
> Yes, that's what I said: XFS has its own read/write exclusion. However,
> we're reading/writing a single page here, and a page read/write should be
> atomic, so we should _not_ see a page with mixed data. Basically, the page
> should be locked while we copy data from/to it.
>
> -Lukas

Hello Lukas,

as this is working properly with XFS, so in ext4/ext3...etc also we shouldn't
require synchronization at the Application level, FS should take care of
locking... will we expecting the fix for the same ???

On Mon, Nov 26, 2012 at 04:33:28PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=50981
>
> as this is working properly with XFS, so in ext4/ext3...etc also we shouldn't
> require synchronization at the Application level, FS should take care of
> locking... will we expecting the fix for the same ???
Meetmehiro,

At this point, there seems to be consensus that the kernel should take
care of the locking, and that this is not something that needs be a
worry for the application. Whether this should be done in the file
system layer or in the mm layer is the current question at hand ---
since this is a bug that also affects btrfs and other non-XFS file
systems.

So the question is whether every file system which supports AIO should
add its own locking, or whether it should be done at the mm layer, and
at which point the lock in the XFS layer could be removed as no longer
necessary.

I've added linux-mm and linux-fsdevel to make sure all of the relevant
kernel developers are aware of this question/issue.

Regards,

- Ted

Thanks a lot, Theodore Ts'o... and thanks to all the (kernel, ext4, etc.)
developers for the quick debugging and response.

On Mon, 26 Nov 2012, Theodore Ts'o wrote:
> On Mon, Nov 26, 2012 at 04:33:28PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=50981
> >
> > as this is working properly with XFS, so in ext4/ext3...etc also we shouldn't
> > require synchronization at the Application level, FS should take care of
> > locking... will we expecting the fix for the same ???
>
> Meetmehiro,
>
> At this point, there seems to be consensus that the kernel should take
> care of the locking, and that this is not something that needs be a
> worry for the application.

Gosh, that's a very sudden new consensus. The consensus over the past
ten or twenty years has been that the Linux kernel enforce locking for
consistent atomic writes, but skip that overhead on reads - hasn't it?

> Whether this should be done in the file
> system layer or in the mm layer is the current question at hand ---
> since this is a bug that also affects btrfs and other non-XFS file
> systems.
>
> So the question is whether every file system which supports AIO should
> add its own locking, or whether it should be done at the mm layer, and
> at which point the lock in the XFS layer could be removed as no longer
> necessary.
>
> I've added linux-mm and linux-fsdevel to make sure all of the relevant
> kernel developers are aware of this question/issue.
Thanks, that's helpful; but I think linux-mm people would want to defer
to linux-fsdevel maintainers on this: mm/filemap.c happens to be in mm/,
but a fundamental change to VFS locking philosophy is not mm's call.

I don't see that page locking would have anything to do with it: if we
are going to start guaranteeing reads atomic against concurrent writes,
then surely it's the size requested by the user to be guaranteed,
spanning however many pages and fs-blocks: i_mutex, or a more
efficiently crafted alternative.

Hugh

Reply-To: hch@infradead.org

On Mon, Nov 26, 2012 at 12:05:57PM -0800, Hugh Dickins wrote:
> Gosh, that's a very sudden new consensus. The consensus over the past
> ten or twenty years has been that the Linux kernel enforce locking for
> consistent atomic writes, but skip that overhead on reads - hasn't it?

I'm not sure there was much of a consensus ever. We XFS people always
tried to push everyone down the strict rule, but there was enough
pushback that it didn't actually happen.

> Thanks, that's helpful; but I think linux-mm people would want to defer
> to linux-fsdevel maintainers on this: mm/filemap.c happens to be in mm/,
> but a fundamental change to VFS locking philosophy is not mm's call.
>
> I don't see that page locking would have anything to do with it: if we
> are going to start guaranteeing reads atomic against concurrent writes,
> then surely it's the size requested by the user to be guaranteed,
> spanning however many pages and fs-blocks: i_mutex, or a more
> efficiently crafted alternative.

What XFS does is simply replace (or rather augment currently) i_mutex
with a rw_semaphore (i_iolock in XFS) which is used the following way:

exclusive:
 - buffer writes
 - pagecache flushing before direct I/O (then downgraded)
 - appending direct I/O writes
 - less than blocksize granularity direct I/O

shared:
 - everything else (buffered reads, "normal" direct I/O)

Doing this in the highest levels of the generic_file_ code would be
trivial, and would allow us to get rid of a fair chunk of wrappers in
XFS.

Note that we've been thinking about replacing this lock with a range
lock, but this will require more research.

Reply-To: zab@zabbo.net

On Mon, Nov 26, 2012 at 12:05:57PM -0800, Hugh Dickins wrote:
> On Mon, 26 Nov 2012, Theodore Ts'o wrote:
> > On Mon, Nov 26, 2012 at 04:33:28PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> > > https://bugzilla.kernel.org/show_bug.cgi?id=50981
> > >
> > > as this is working properly with XFS, so in ext4/ext3...etc also we shouldn't
> > > require synchronization at the Application level, FS should take care of
> > > locking... will we expecting the fix for the same ???
> >
> > Meetmehiro,
> >
> > At this point, there seems to be consensus that the kernel should take
> > care of the locking, and that this is not something that needs be a
> > worry for the application.
>
> Gosh, that's a very sudden new consensus. The consensus over the past
> ten or twenty years has been that the Linux kernel enforce locking for
> consistent atomic writes, but skip that overhead on reads - hasn't it?

I was wondering exactly the same thing.

> > So the question is whether every file system which supports AIO should
> > add its own locking, or whether it should be done at the mm layer, and
> > at which point the lock in the XFS layer could be removed as no longer
> > necessary.

(This has nothing to do with AIO. Buffered reads have been copied from
unlocked pages.. basically forever, right?)
> Thanks, that's helpful; but I think linux-mm people would want to defer
> to linux-fsdevel maintainers on this: mm/filemap.c happens to be in mm/,
> but a fundamental change to VFS locking philosophy is not mm's call.
>
> I don't see that page locking would have anything to do with it: if we
> are going to start guaranteeing reads atomic against concurrent writes,
> then surely it's the size requested by the user to be guaranteed,
> spanning however many pages and fs-blocks: i_mutex, or a more
> efficiently crafted alternative.

Agreed. While this little racing test might be fixed, those baked-in
page_size == 4k == atomic granularity assumptions are pretty sketchy.

So we're talking about holding multiple page locks? Or i_mutex? Or some
fancy range locking? There's consensus on serializing overlapping
buffered reads and writes?

- z

*readying the read(, mmap(), ) fault deadlock toy*

On Mon, Nov 26, 2012 at 03:13:08PM -0500, Christoph Hellwig wrote:
> On Mon, Nov 26, 2012 at 12:05:57PM -0800, Hugh Dickins wrote:
> > Gosh, that's a very sudden new consensus. The consensus over the past
> > ten or twenty years has been that the Linux kernel enforce locking for
> > consistent atomic writes, but skip that overhead on reads - hasn't it?
>
> I'm not sure there was much of a consensus ever. We XFS people always
> tried to push everyone down the strict rule, but there was enough
> pushback that it didn't actually happen.
>
> > Thanks, that's helpful; but I think linux-mm people would want to defer
> > to linux-fsdevel maintainers on this: mm/filemap.c happens to be in mm/,
> > but a fundamental change to VFS locking philosophy is not mm's call.
> >
> > I don't see that page locking would have anything to do with it: if we
> > are going to start guaranteeing reads atomic against concurrent writes,
> > then surely it's the size requested by the user to be guaranteed,
> > spanning however many pages and fs-blocks: i_mutex, or a more
> > efficiently crafted alternative.
>
> What XFS does is simply replace (or rather augment currently) i_mutex
> with a rw_semaphore (i_iolock in XFS) which is used the following way:
>
> exclusive:
>  - buffer writes
>  - pagecache flushing before direct I/O (then downgraded)
>  - appending direct I/O writes
>  - less than blocksize granularity direct I/O

 - splice write

Also, direct extent manipulations that are outside the IO path, such as:
 - truncate
 - preallocation
 - hole punching

use the XFS_IOLOCK_EXCL to provide exclusion against new IO starting
while such an operation is in progress.

> shared:
>  - everything else (buffered reads, "normal" direct I/O)
>
> Doing this in the highest levels of the generic_file_ code would be
> trivial, and would allow us to get rid of a fair chunk of wrappers in
> XFS.

We still need the iolock deep in the guts of the filesystem, though.

I suspect that if we are going to change the VFS locking, then we
should seriously consider allowing the filesystem to provide its
own locking implementation and the VFS just pass the type of lock
required. Otherwise we are still going to need all the locking
within the filesystem to serialise all the core pieces that the VFS
locking doesn't serialise (e.g. EOF truncation on close/evict, extent
swaps for online defrag, etc).

> Note that we've been thinking about replacing this lock with a range
> lock, but this will require more research.

I'd say we need a working implementation in a filesystem before even
considering a VFS implementation...

Cheers,

Dave.
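As a concrete illustration of the shared/exclusive model Christoph and Dave describe, here is a minimal userspace sketch using a pthread_rwlock_t in place of XFS's i_iolock. The demo_inode type and the wrapper names are hypothetical; this is an analogy of the locking rule (shared for buffered reads, exclusive for buffered writes), not kernel code or a proposed VFS interface:

```c
/* Userspace analogy of the shared/exclusive "i_iolock" model: readers take
 * the lock shared, writers take it exclusive, so a read sees either all of
 * an overlapping write or none of it.  All names here are illustrative. */
#define _XOPEN_SOURCE 700
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

struct demo_inode {
	int fd;
	pthread_rwlock_t iolock;   /* stand-in for XFS's i_iolock rw_semaphore */
};

#define DEMO_INODE_INIT(fd) { (fd), PTHREAD_RWLOCK_INITIALIZER }

/* "buffered read": shared, so concurrent readers do not serialise each other */
ssize_t demo_read(struct demo_inode *ino, void *buf, size_t len, off_t pos)
{
	ssize_t ret;

	pthread_rwlock_rdlock(&ino->iolock);
	ret = pread(ino->fd, buf, len, pos);
	pthread_rwlock_unlock(&ino->iolock);
	return ret;
}

/* "buffered write": exclusive, excluding both other writers and all readers */
ssize_t demo_write(struct demo_inode *ino, const void *buf, size_t len, off_t pos)
{
	ssize_t ret;

	pthread_rwlock_wrlock(&ino->iolock);
	ret = pwrite(ino->fd, buf, len, pos);
	pthread_rwlock_unlock(&ino->iolock);
	return ret;
}
```

Routing the reproducer's read() and write() calls through wrappers like these makes the torn reads disappear; that is essentially the exclusion XFS already provides in the kernel, and which this thread argues the application should not have to provide itself.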
Reply-To: hch@infradead.org

On Tue, Nov 27, 2012 at 08:28:45AM +1100, Dave Chinner wrote:
> We still need the iolock deep in the guts of the filesystem, though.

I don't think we do. The only thing that comes close to it is
xfs_swap_extents passing the XFS_IOLOCK_EXCL to xfs_trans_ijoin so that
the transaction commit automatically unlocks it, but that can be
trivially replaced with a manual unlock.

> I suspect that if we are going to change the VFS locking, then we
> should seriously consider allowing the filesystem to provide its
> own locking implementation and the VFS just pass the type of lock
> required. Otherwise we are still going to need all the locking
> within the filesystem to serialise all the core pieces that the VFS
> locking doesn't serialise (e.g. EOF truncation on close/evict,
> extent swaps for online defrag, etc).

The VFS currently doesn't hardcode i_mutex for any data plane
operations, only a few generic helpers do it, most notably
generic_file_aio_write (which can be bypassed by using a slightly
lower level variant) and __blockdev_direct_IO when used in
DIO_LOCKING mode.

On Mon, Nov 26, 2012 at 03:13:08PM -0500, Christoph Hellwig wrote:
> On Mon, Nov 26, 2012 at 12:05:57PM -0800, Hugh Dickins wrote:
> > Gosh, that's a very sudden new consensus. The consensus over the past
> > ten or twenty years has been that the Linux kernel enforce locking for
> > consistent atomic writes, but skip that overhead on reads - hasn't it?
>
> I'm not sure there was much of a consensus ever. We XFS people always
> > tried to push everyone down the strict rule, but there was enough
> pushback that it didn't actually happen.
Christoph, can you give some kind of estimate for the overhead that
adding this locking in XFS actually costs in practice? And does XFS
provide any kind of consistency guarantees if the reads/write overlap
spans multiple pages? I assume the answer to that is no, correct?
Thanks,
- Ted
Reply-To: hch@infradead.org

On Mon, Nov 26, 2012 at 04:49:37PM -0500, Theodore Ts'o wrote:
> Christoph, can you give some kind of estimate for the overhead that
> adding this locking in XFS actually costs in practice?

I don't know any real life measurements, but in terms of implementation
the overhead is:

a) taking the rw_semaphore in shared mode for every buffered read
b) taking the slightly slower exclusive rw_semaphore for buffered writes
instead of the plain mutex

On the other hand it significantly simplifies the locking for direct
I/O and allows parallel direct I/O writers.

> And does XFS
> provide any kind of consistency guarantees if the reads/write overlap
> spans multiple pages? I assume the answer to that is no, correct?

The answer is yes, as the lock is taken globally on the inode.

On Mon, Nov 26, 2012 at 04:49:37PM -0500, Theodore Ts'o wrote:
> On Mon, Nov 26, 2012 at 03:13:08PM -0500, Christoph Hellwig wrote:
> > On Mon, Nov 26, 2012 at 12:05:57PM -0800, Hugh Dickins wrote:
> > > Gosh, that's a very sudden new consensus. The consensus over the past
> > > ten or twenty years has been that the Linux kernel enforce locking for
> > > consistent atomic writes, but skip that overhead on reads - hasn't it?
> >
> > I'm not sure there was much of a consensus ever. We XFS people always
> > tried to push everyone down the strict rule, but there was enough
> > pushback that it didn't actually happen.
>
> Christoph, can you give some kind of estimate for the overhead that
> adding this locking in XFS actually costs in practice?

It doesn't show up any significant numbers in profiles, if that is what
you are asking. I've tested random 4k reads and writes at over 2 million
IOPS to a single file using concurrent direct IO, so the non-exclusive
locking overhead is pretty minimal.

If the workload is modified slightly to use buffered writes instead of
direct IO writes, and so trigger shared/exclusive lock contention, then
the same workload tends to get limited at around 250,000 IOPS per file.
That's a direct result of the exclusive locking limiting the workload
to what a single CPU can sustain (i.e. the difference between 8p @
250-300k IOPS vs 1p @ 250k IOPS on the exclusive locking workload).

> And does XFS
> provide any kind of consistency guarantees if the reads/write overlap
> spans multiple pages? I assume the answer to that is no, correct?

A buffered write is locked exclusive for the entire write. That includes
multi-page writes, as the locking is outside of the begin_write/end_write
per-page iteration. Hence the atomicity of the entire buffered write
against both buffered read and direct IO is guaranteed.

Cheers,

Dave.

On Mon, Nov 26, 2012 at 05:09:08PM -0500, Christoph Hellwig wrote:
> On Mon, Nov 26, 2012 at 04:49:37PM -0500, Theodore Ts'o wrote:
> > Christoph, can you give some kind of estimate for the overhead that
> > adding this locking in XFS actually costs in practice?
>
> I don't know any real life measurements, but in terms of implementation
> the overhead is:
>
> a) taking the rw_semaphore in shared mode for every buffered read
> b) taking the slightly slower exclusive rw_semaphore for buffered writes
> instead of the plain mutex
>
> On the other hand it significantly simplifies the locking for direct
> I/O and allows parallel direct I/O writers.
I should probably just look at the XFS code, but.... if you're taking
an exclusive lock for buffered writes, won't this impact the
performance of buffered writes happening in parallel on different
CPU's?
- Ted
On Mon, Nov 26, 2012 at 08:32:54PM -0500, Theodore Ts'o wrote:
> On Mon, Nov 26, 2012 at 05:09:08PM -0500, Christoph Hellwig wrote:
> > On Mon, Nov 26, 2012 at 04:49:37PM -0500, Theodore Ts'o wrote:
> > > Christoph, can you give some kind of estimate for the overhead that
> > > adding this locking in XFS actually costs in practice?
> >
> > I don't know any real life measurements, but in terms of implementation
> > the overhead is:
> >
> > a) taking the rw_semaphore in shared mode for every buffered read
> > b) taking the slightly slower exclusive rw_semaphore for buffered writes
> > instead of the plain mutex
> >
> > On the other hand it significantly simplifies the locking for direct
> > I/O and allows parallel direct I/O writers.
>
> I should probably just look at the XFS code, but.... if you're taking
> an exclusive lock for buffered writes, won't this impact the
> performance of buffered writes happening in parallel on different
> CPU's?
Indeed it does - see my previous email. But it's no worse than
generic_file_aio_write() that takes i_mutex across buffered writes,
which is what most filesystems currently do. And FWIW, we also take
the i_mutex outside the i_iolock for the buffered write case because
generic_file_buffered_write() is documented to require it held.
See xfs_rw_ilock() and friends for locking order semantics...
FWIW, this buffered write exclusion is why we have been considering
replacing the rwsem with a shared/exclusive range lock - so we can
do concurrent non-overlapping reads and writes (for both direct IO and
buffered IO) without compromising the POSIX atomic write guarantee
(i.e. that a read will see the entire write or none of it). Range
locking will allow us to do that for both buffered and direct IO...
Cheers,
Dave.
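For illustration, here is a minimal userspace sketch of the shared/exclusive byte-range lock idea Dave mentions: overlapping requests conflict unless both are shared, so non-overlapping reads and writes proceed concurrently while an overlapping read still sees all of a write or none of it. The interface and names are hypothetical, and this says nothing about how an in-kernel range lock would actually be implemented:

```c
/* Minimal userspace sketch of a shared/exclusive byte-range lock of the
 * kind discussed above; purely illustrative, not the XFS/VFS design. */
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>

enum rl_mode { RL_SHARED, RL_EXCL };

struct rl_range {
	off_t start, end;            /* byte range [start, end) */
	enum rl_mode mode;
	struct rl_range *next;
};

struct range_lock {
	pthread_mutex_t lock;
	pthread_cond_t wait;
	struct rl_range *held;       /* list of currently held ranges */
};

#define RANGE_LOCK_INIT { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NULL }

/* a request conflicts with a held range if they overlap and either is exclusive */
static bool rl_conflicts(struct range_lock *rl, off_t start, off_t end,
			 enum rl_mode mode)
{
	for (struct rl_range *r = rl->held; r; r = r->next)
		if (r->start < end && start < r->end &&
		    (mode == RL_EXCL || r->mode == RL_EXCL))
			return true;
	return false;
}

/* block until [start, end) can be held in the requested mode */
struct rl_range *range_lock(struct range_lock *rl, off_t start, off_t end,
			    enum rl_mode mode)
{
	struct rl_range *r = malloc(sizeof(*r));

	r->start = start;
	r->end = end;
	r->mode = mode;

	pthread_mutex_lock(&rl->lock);
	while (rl_conflicts(rl, start, end, mode))
		pthread_cond_wait(&rl->wait, &rl->lock);
	r->next = rl->held;
	rl->held = r;
	pthread_mutex_unlock(&rl->lock);
	return r;
}

void range_unlock(struct range_lock *rl, struct rl_range *r)
{
	pthread_mutex_lock(&rl->lock);
	for (struct rl_range **p = &rl->held; *p; p = &(*p)->next) {
		if (*p == r) {
			*p = r->next;
			break;
		}
	}
	pthread_cond_broadcast(&rl->wait);
	pthread_mutex_unlock(&rl->lock);
	free(r);
}
```

In this sketch a buffered write would take its byte range with RL_EXCL and a buffered read with RL_SHARED, so only overlapping I/O serialises, which is the property an inode-wide rwsem cannot provide.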
(In reply to comment #23)
> I should probably just look at the XFS code, but.... if you're taking
> an exclusive lock for buffered writes, won't this impact the
> performance of buffered writes happening in parallel on different
> CPU's?

Hi Ted,

it seems I've already tried to implement this locking before; see
http://patchwork.ozlabs.org/patch/91834/

It's far from perfect, but you can measure the performance impact with
this simple implementation.

-Lukas