Bug 50981 - generic_file_aio_read: No locking means DATA CORRUPTION for read and write on the same 4096-byte page range
Summary: generic_file_aio_read: No locking means DATA CORRUPTION for read and write on s...
Status: NEW
Alias: None
Product: Memory Management
Classification: Unclassified
Component: Other
Hardware: All
OS: Linux
Importance: P1 high
Assignee: Andrew Morton
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-11-25 16:31 UTC by meetmehiro
Modified: 2012-11-27 07:13 UTC
CC List: 6 users

See Also:
Kernel Version: 3.7-rc
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description meetmehiro 2012-11-25 16:31:58 UTC
Kernel version and distribution:

2.6.32, Red Hat Enterprise Linux Workstation release 6.2 (Santiago)


I have a scientific application running on ext4. After three to four days I started to see that the data was not correct, so I tried to find a simple reproducer.

Please find the reproducer below; the issue hits easily.

Logs : 

./sim ext4_write
creating the threads
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
reader thread has opened the file, fd : 4
reader..checking data integrity
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
reader..checking data integrity
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
reader..checking data integrity
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
reader..checking data integrity
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
reader..checking data integrity
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
reader..checking data integrity
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
reader..checking data integrity
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
reader..checking data integrity
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
reader..checking data integrity
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
reader..checking data integrity
writing B at offset 0...
writing A at offset 0...
reader..checking data integrity
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
reader..checking data integrity
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
reader..checking data integrity
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
reader..checking data integrity
writing A at offset 0...
writing B at offset 0...
writing A at offset 0...
writing B at offset 0...
reader..checking data integrity
writing A at offset 0...
reader.. corrupted data, at offset : 511
writing B at offset 0...



C program :
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <pthread.h>
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <sys/mman.h>

pthread_t reader_id, writer_id;

char write_buf[4096] = {0,};
char read_buf[4096] = {0,};

char datafile[256] = {0,};
volatile int stop_write = 0;    /* set by the reader, polled by the writer */



void * writer (void * arg)
{

        int fd = -1;
        int count = 0;
        /* this thread will open the file and write 4k data to it alternatively
         * of a different data set to the same offset
         */
        /* opening the file */
        assert((fd = open(datafile, O_CREAT|O_RDWR, 0777)) != -1);

        while (1) {

                if (stop_write)
                {
                        close(fd);
                        printf("write stopped\n");
                        return NULL;
                }
                if (count % 2 == 0)
                        memset(write_buf, 'A', sizeof(write_buf));
                else
                        memset(write_buf, 'B', sizeof(write_buf));


                assert(lseek(fd, 0, SEEK_SET) == 0);
                printf("writing %c at offset 0...\n", write_buf[0]);
                assert(write(fd, write_buf, 4096) == 4096);
                count++;

        }
        //pthread_exit((void *)0);
        close(fd);
        return NULL;
}

void * reader(void * arg)
{

        int fd = -1;
        int count = 0;
        /* this thread will open the file and repeatedly read 4k back from
         * offset 0, checking that the buffer contains a single repeated byte
         */
        /* opening the file */
        assert((fd = open(datafile, O_CREAT|O_RDWR, 0777)) != -1);
        printf("reader thread has opened the file, fd : %d\n", fd);
        while (1) {

                memset(read_buf, '\0', 4096);
                assert(lseek(fd, 0, SEEK_SET) == 0);
                assert(read(fd, read_buf, 4096) == 4096);
                count++;
                printf("reader..checking data integrity\n");
                int i=0;
                for (i=0; i < 4095; i++)
                {
                        int prev_char = read_buf[i];
                        int next_char = read_buf[i+1];
                        if (prev_char != next_char)
                        {
                                printf("reader.. corrupted data, at offset : %d\n", i);
                                stop_write = 1;
                                int output_fd = open("res.txt", O_CREAT|O_RDWR, 0777);

                                assert(write(output_fd, read_buf, 4096) == 4096);
                                close(output_fd);
                                close(fd);
                                return NULL;
                        }
                }


        }
        //pthread_exit((void *)0);
        close(fd);
        return NULL;
}
int main (int argc, char *argv[])
{
        pthread_attr_t attr;
        void * arg = NULL;
        int *status;
        int err = 0;

        if (argc != 2) {
                printf("usage : ./reader_writer <file pathname>\n");
                exit(1);
        }


        memset(datafile, '\0', sizeof(datafile));
        strcpy(datafile, argv[1]);

        pthread_attr_init(&attr);
        pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

        /* create the reader and writer thread */
        printf("creating the threads\n");
        if ((err = pthread_create(&writer_id, &attr, writer, (void *)arg)) != 0) {
                printf ("pthread_create failed err no :%d\n", err);
        }
        if ((err = pthread_create(&reader_id, &attr, reader, (void *)arg)) != 0) {
                printf ("pthread_create failed err no :%d\n", err);
        }
        pthread_attr_destroy(&attr);

        /* Wait on the other threads */
        assert(pthread_join(writer_id, (void *)&status) == 0);
        assert(pthread_join(reader_id, (void *)&status) == 0);

        pthread_exit(NULL);
}

How to run
gcc program.c -o ext4_program -lpthread 
./ext4_program ext4_write
Comment 1 Yongqiang Yang 2012-11-26 05:55:56 UTC
which block size does the filesystem use, 1K or 4K?

Yongqiang


Comment 2 Lukas Czerner 2012-11-26 08:34:18 UTC
This is present upstream as well (3.7.0-rc7). In fact it is not even ext4 specific; I can easily reproduce this on btrfs, ext3, and vfat. I suspect it is reproducible on every file system that does not implement its own read/write exclusion the way xfs does.

MM people should really take a look at this as well, since we are expecting reads and writes to/from a single page to be atomic, but it looks like that might not be true.

Thanks!
-Lukas
Comment 3 Zheng Liu 2012-11-26 10:16:19 UTC
IMHO, the reason seems to be that no locking at all is taken by generic_file_aio_read().  When ext4 does a buffered write, it takes i_mutex to ensure that there is no other writer.  But ext4 (and ext3, btrfs, and vfat) uses generic_file_aio_read() for buffered reads, and that path doesn't take i_mutex.  So it is easy to read inconsistent data while a write is in progress.  I am not very familiar with xfs, but there a rwlock is taken when a read and/or a write is done, which is why xfs doesn't have this problem.
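
For illustration, the relevant generic paths look roughly like this (a paraphrased sketch of the 3.x-era mm/filemap.c code with bodies elided; not verbatim kernel source):

ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
                               unsigned long nr_segs, loff_t pos)
{
        struct inode *inode = iocb->ki_filp->f_mapping->host;
        ssize_t ret;

        mutex_lock(&inode->i_mutex);    /* buffered writes are serialized */
        ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
        mutex_unlock(&inode->i_mutex);
        /* ... generic_write_sync() handling elided ... */
        return ret;
}

ssize_t generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
                              unsigned long nr_segs, loff_t pos)
{
        /*
         * No i_mutex (or any other inode-level lock) is taken anywhere on
         * this path: do_generic_file_read() copies data straight out of
         * uptodate pagecache pages, so a concurrent buffered write to the
         * same page can be seen half-applied.
         */
}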
Comment 4 Lukas Czerner 2012-11-26 10:30:13 UTC
Yes, that's what I said: XFS has its own read/write exclusion. However we're reading/writing a single page here, and page read/write should be atomic, so we should _not_ see a page with mixed data. Basically the page should be locked when we copy data from/to it.

-Lukas
Comment 5 Zheng Liu 2012-11-26 11:18:38 UTC
(In reply to comment #4)
> Yes, that's what I said: XFS has its own read/write exclusion. However
> we're reading/writing a single page here, and page read/write should be
> atomic, so we should _not_ see a page with mixed data. Basically the page
> should be locked when we copy data from/to it.

Sorry, as I understand it, a page won't be locked when it is accessed if it has been marked uptodate.  If we write to a page that has been written recently, the uptodate flag won't be cleared.  So the reader won't try to lock the page, and maybe the copy is not atomic.  Am I missing something?
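
Roughly, the fast path I mean in do_generic_file_read() is (a paraphrased sketch, not verbatim kernel source):

        page = find_get_page(mapping, index);
        if (PageUptodate(page)) {
                /*
                 * The page is uptodate, so its contents are copied to
                 * userspace without lock_page().  A buffered write that
                 * overwrites the same page does not clear the uptodate
                 * flag, so this copy can race with the write.
                 */
                ret = actor(desc, page, offset, nr);
        }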

Regards,
Zheng
Comment 6 Lukas Czerner 2012-11-26 11:27:04 UTC
It might be. As I said, the problem appears to be in the mm code; most likely there is a problem with page locking. However, I am not familiar enough with the mm code to know for sure.
Comment 7 Zheng Liu 2012-11-26 11:32:35 UTC
(In reply to comment #6)
> It might be. As I said, the problem appears to be in the mm code; most
> likely there is a problem with page locking. However, I am not familiar
> enough with the mm code to know for sure.

Understood, thanks. :-)

Regards,
Zheng
Comment 8 Alan 2012-11-26 12:17:52 UTC
(retagged for 3.7-rc as per Lukas' email, otherwise my scripts will keep wanting to close it as RHEL and not for upstream)
Comment 9 Lukas Czerner 2012-11-26 12:22:42 UTC
Alan, could you please also change the component and maybe even the subject of the bug, since it is not ext4 specific?

Thanks!
-Lukas
Comment 10 meetmehiro 2012-11-26 12:31:10 UTC
It's 4K.

(In reply to comment #1)
> which block size does the filesystem use, 1K or 4K?
> 
> Yongqiang
Comment 11 meetmehiro 2012-11-26 16:28:19 UTC
(In reply to comment #4)
> Yes, that's what I said: XFS has its own read/write exclusion. However
> we're reading/writing a single page here, and page read/write should be
> atomic, so we should _not_ see a page with mixed data. Basically the page
> should be locked when we copy data from/to it.
> 
> -Lukas


Hello Lukas,

As this works properly with XFS, ext4/ext3 etc. should not require synchronization at the application level either; the FS should take care of the locking. Can we expect a fix for this?
Comment 13 Theodore Tso 2012-11-26 16:45:57 UTC
On Mon, Nov 26, 2012 at 04:33:28PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=50981
>
> As this works properly with XFS, ext4/ext3 etc. should not require
> synchronization at the application level either; the FS should take care of
> the locking. Can we expect a fix for this?

Meetmehiro,

At this point, there seems to be consensus that the kernel should take
care of the locking, and that this is not something that needs to be a
worry for the application.  Whether this should be done in the file
system layer or in the mm layer is the current question at hand ---
since this is a bug that also affects btrfs and other non-XFS file
systems.

So the question is whether every file system which supports AIO should
add its own locking, or whether it should be done at the mm layer, and
at which point the lock in the XFS layer could be removed as no longer
necessary.

I've added linux-mm and linux-fsdevel to make sure all of the relevant
kernel developers are aware of this question/issue.

Regards,

						- Ted
Comment 14 meetmehiro 2012-11-26 18:59:40 UTC
Thanks a lot, Theodore Ts'o.

Thanks to all the (kernel, ext4, etc.) developers for the quick debugging and
response.


On Mon, Nov 26, 2012 at 10:15 PM, Theodore Ts'o <tytso@mit.edu> wrote:

> On Mon, Nov 26, 2012 at 04:33:28PM +0000,
> bugzilla-daemon@bugzilla.kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=50981
> >
> > As this works properly with XFS, ext4/ext3 etc. should not require
> > synchronization at the application level either; the FS should take care
> > of the locking. Can we expect a fix for this?
>
> Meetmehiro,
>
> At this point, there seems to be consensus that the kernel should take
> care of the locking, and that this is not something that needs to be a
> worry for the application.  Whether this should be done in the file
> system layer or in the mm layer is the current question at hand ---
> since this is a bug that also affects btrfs and other non-XFS file
> systems.
>
> So the question is whether every file system which supports AIO should
> add its own locking, or whether it should be done at the mm layer, and
> at which point the lock in the XFS layer could be removed as no longer
> necessary.
>
> I've added linux-mm and linux-fsdevel to make sure all of the relevant
> kernel developers are aware of this question/issue.
>
> Regards,
>
>                                                 - Ted
>
Comment 15 Hugh Dickins 2012-11-26 20:05:59 UTC
On Mon, 26 Nov 2012, Theodore Ts'o wrote:
> On Mon, Nov 26, 2012 at 04:33:28PM +0000, bugzilla-daemon@bugzilla.kernel.org
> wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=50981
> >
> > As this works properly with XFS, ext4/ext3 etc. should not require
> > synchronization at the application level either; the FS should take care
> > of the locking. Can we expect a fix for this?
> 
> Meetmehiro,
> 
> At this point, there seems to be consensus that the kernel should take
> care of the locking, and that this is not something that needs to be a
> worry for the application.

Gosh, that's a very sudden new consensus.  The consensus over the past
ten or twenty years has been that the Linux kernel enforce locking for
consistent atomic writes, but skip that overhead on reads - hasn't it?

> Whether this should be done in the file
> system layer or in the mm layer is the current question at hand ---
> since this is a bug that also affects btrfs and other non-XFS file
> systems.
> 
> So the question is whether every file system which supports AIO should
> add its own locking, or whether it should be done at the mm layer, and
> at which point the lock in the XFS layer could be removed as no longer
> necessary.
> 
> I've added linux-mm and linux-fsdevel to make sure all of the relevant
> kernel developers are aware of this question/issue.

Thanks, that's helpful; but I think linux-mm people would want to defer
to linux-fsdevel maintainers on this: mm/filemap.c happens to be in mm/,
but a fundamental change to VFS locking philosophy is not mm's call.

I don't see that page locking would have anything to do with it: if we
are going to start guaranteeing reads atomic against concurrent writes,
then surely it's the size requested by the user to be guaranteed,
spanning however many pages and fs-blocks: i_mutex, or a more
efficiently crafted alternative.

Hugh
Comment 16 Anonymous Emailer 2012-11-26 20:13:14 UTC
Reply-To: hch@infradead.org

On Mon, Nov 26, 2012 at 12:05:57PM -0800, Hugh Dickins wrote:
> Gosh, that's a very sudden new consensus.  The consensus over the past
> ten or twenty years has been that the Linux kernel enforce locking for
> consistent atomic writes, but skip that overhead on reads - hasn't it?

I'm not sure there was much of a consensus ever.  We XFS people always
tried to push everyone down the strict rule, but there was enough
pushback that it didn't actually happen.

> Thanks, that's helpful; but I think linux-mm people would want to defer
> to linux-fsdevel maintainers on this: mm/filemap.c happens to be in mm/,
> but a fundamental change to VFS locking philosophy is not mm's call.
> 
> I don't see that page locking would have anything to do with it: if we
> are going to start guaranteeing reads atomic against concurrent writes,
> then surely it's the size requested by the user to be guaranteed,
> spanning however many pages and fs-blocks: i_mutex, or a more
> efficiently crafted alternative.

What XFS does is simply replace (or rather augment currently) i_mutex
with a rw_semaphore (i_iolock in XFS) which is used the following way:

exclusive:
 - buffer writes
 - pagecache flushing before direct I/O (then downgraded)
 - appending direct I/O writes
 - less than blocksize granularity direct I/O

shared:
 - everything else (buffered reads, "normal" direct I/O)

Doing this in the highest levels of the generic_file_ code would be
trivial, and would allow us to get rid of a fair chunk of wrappers in
XFS.

Note that we've been thinking about replacing this lock with a range
lock, but this will require more research.
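
As a userspace analogy of the shared/exclusive scheme above (a concept sketch only, with invented helper names; the real fix belongs in the kernel, not in applications), wrapping the reproducer's I/O in a pthread rwlock this way would make the torn reads disappear:

#include <pthread.h>
#include <unistd.h>

/* Userspace analogy of the i_iolock scheme: readers take the lock shared,
 * writers take it exclusive.  Helper names are invented for illustration. */
static pthread_rwlock_t io_lock = PTHREAD_RWLOCK_INITIALIZER;

static ssize_t locked_pread(int fd, void *buf, size_t len, off_t off)
{
        ssize_t ret;

        pthread_rwlock_rdlock(&io_lock);   /* shared: reads run in parallel */
        ret = pread(fd, buf, len, off);
        pthread_rwlock_unlock(&io_lock);
        return ret;
}

static ssize_t locked_pwrite(int fd, const void *buf, size_t len, off_t off)
{
        ssize_t ret;

        pthread_rwlock_wrlock(&io_lock);   /* exclusive: the whole write is
                                            * atomic with respect to readers */
        ret = pwrite(fd, buf, len, off);
        pthread_rwlock_unlock(&io_lock);
        return ret;
}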
Comment 17 Anonymous Emailer 2012-11-26 20:15:07 UTC
Reply-To: zab@zabbo.net

On Mon, Nov 26, 2012 at 12:05:57PM -0800, Hugh Dickins wrote:
> On Mon, 26 Nov 2012, Theodore Ts'o wrote:
> > On Mon, Nov 26, 2012 at 04:33:28PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> > > https://bugzilla.kernel.org/show_bug.cgi?id=50981
> > >
> > > As this works properly with XFS, ext4/ext3 etc. should not require
> > > synchronization at the application level either; the FS should take
> > > care of the locking. Can we expect a fix for this?
> > 
> > Meetmehiro,
> > 
> > At this point, there seems to be consensus that the kernel should take
> > care of the locking, and that this is not something that needs to be a
> > worry for the application.
> 
> Gosh, that's a very sudden new consensus.  The consensus over the past
> ten or twenty years has been that the Linux kernel enforce locking for
> consistent atomic writes, but skip that overhead on reads - hasn't it?

I was wondering exactly the same thing.

> > So the question is whether every file system which supports AIO should
> > add its own locking, or whether it should be done at the mm layer, and
> > at which point the lock in the XFS layer could be removed as no longer
> > necessary.

(This has nothing to do with AIO.  Buffered reads have been copied from
unlocked pages... basically forever, right?)

> Thanks, that's helpful; but I think linux-mm people would want to defer
> to linux-fsdevel maintainers on this: mm/filemap.c happens to be in mm/,
> but a fundamental change to VFS locking philosophy is not mm's call.
> 
> I don't see that page locking would have anything to do with it: if we
> are going to start guaranteeing reads atomic against concurrent writes,
> then surely it's the size requested by the user to be guaranteed,
> spanning however many pages and fs-blocks: i_mutex, or a more
> efficiently crafted alternative.

Agreed.  While this little racing test might be fixed, those baked-in
page_size == 4k == atomic granularity assumptions are pretty sketchy.

So we're talking about holding multiple page locks?  Or i_mutex?  Or
some fancy range locking?

There's consensus on serializing overlapping buffered reads and writes? 

- z
*readying the read(, mmap(), ) fault deadlock toy*
Comment 18 Dave Chinner 2012-11-26 21:28:48 UTC
On Mon, Nov 26, 2012 at 03:13:08PM -0500, Christoph Hellwig wrote:
> On Mon, Nov 26, 2012 at 12:05:57PM -0800, Hugh Dickins wrote:
> > Gosh, that's a very sudden new consensus.  The consensus over the past
> > ten or twenty years has been that the Linux kernel enforce locking for
> > consistent atomic writes, but skip that overhead on reads - hasn't it?
> 
> I'm not sure there was much of a consensus ever.  We XFS people always
> tried to push everyone down the strict rule, but there was enough
> pushback that it didn't actually happen.
> 
> > Thanks, that's helpful; but I think linux-mm people would want to defer
> > to linux-fsdevel maintainers on this: mm/filemap.c happens to be in mm/,
> > but a fundamental change to VFS locking philosophy is not mm's call.
> > 
> > I don't see that page locking would have anything to do with it: if we
> > are going to start guaranteeing reads atomic against concurrent writes,
> > then surely it's the size requested by the user to be guaranteed,
> > spanning however many pages and fs-blocks: i_mutex, or a more
> > efficiently crafted alternative.
> 
> What XFS does is simply replace (or rather augment currently) i_mutex
> with a rw_semaphore (i_iolock in XFS) which is used the following way:
> 
> exclusive:
>  - buffer writes
>  - pagecache flushing before direct I/O (then downgraded)
>  - appending direct I/O writes
>  - less than blocksize granularity direct I/O
   - splice write

Also, direct extent manipulations that are outside the IO path such
as:
   - truncate
   - preallocation
   - hole punching

use the XFS_IOLOCK_EXCL to provide exclusion against new IO starting
while such an operation is in progress.

> shared:
>  - everything else (buffered reads, "normal" direct I/O)
> 
> Doing this in the highest levels of the generic_file_ code would be
> trivial, and would allow us to get rid of a fair chunk of wrappers in
> XFS.

We still need the iolock deep in the guts of the filesystem, though.

I suspect that if we are going to change the VFS locking, then we
should seriously consider allowing the filesystem to provide its
own locking implementation and the VFS just pass the type of lock
required. Otherwise we are still going to need all the locking
within the filesystem to serialise all the core pieces that the VFS
locking doesn't serialise (e.g. EOF truncation on close/evict,
extent swaps for online defrag, etc).

> Note that we've been thinking about replacing this lock with a range
> lock, but this will require more research.

I'd say we need a working implementation in a filesystem before even
considering a VFS implementation...

Cheers,

Dave.
Comment 19 Anonymous Emailer 2012-11-26 21:39:10 UTC
Reply-To: hch@infradead.org

On Tue, Nov 27, 2012 at 08:28:45AM +1100, Dave Chinner wrote:
> We still need the iolock deep in the guts of the filesystem, though.

I don't think we do.  The only thing that comes close to it is
xfs_swap_extents passing the XFS_IOLOCK_EXCL to xfs_trans_ijoin so
that the transaction commit automatically unlocks it, but that can
be trivially replaced with a manual unlock.

> I suspect that if we are going to change the VFS locking, then we
> should seriously consider allowing the filesystem to provide its
> own locking implementation and the VFS just pass the type of lock
> required. Otherwise we are still going to need all the locking
> within the filesystem to serialise all the core pieces that the VFS
> locking doesn't serialise (e.g. EOF truncation on close/evict,
> extent swaps for online defrag, etc).

The VFS currently doesn't hardcode i_mutex for any data plane
operations, only a few generic helpers do it, most notably
generic_file_aio_write (which can be bypassed by using a slightly
lower level variant) and __blockdev_direct_IO when used in DIO_LOCKING
mode.
Comment 20 Theodore Tso 2012-11-26 21:49:42 UTC
On Mon, Nov 26, 2012 at 03:13:08PM -0500, Christoph Hellwig wrote:
> On Mon, Nov 26, 2012 at 12:05:57PM -0800, Hugh Dickins wrote:
> > Gosh, that's a very sudden new consensus.  The consensus over the past
> > ten or twenty years has been that the Linux kernel enforce locking for
> > consistent atomic writes, but skip that overhead on reads - hasn't it?
> 
> I'm not sure there was much of a consensus ever.  We XFS people always
> tried to push everyone down the strict rule, but there was enough
> pushback that it didn't actually happen.

Christoph, can you give some kind of estimate for the overhead that
adding this locking in XFS actually costs in practice?  And does XFS
provide any kind of consistency guarantees if the reads/write overlap
spans multiple pages?  I assume the answer to that is no, correct?

Thanks,

                                              - Ted
Comment 21 Anonymous Emailer 2012-11-26 22:09:13 UTC
Reply-To: hch@infradead.org

On Mon, Nov 26, 2012 at 04:49:37PM -0500, Theodore Ts'o wrote:
> Christoph, can you give some kind of estimate for the overhead that
> adding this locking in XFS actually costs in practice?

I don't know any real life measurements, but in terms of implementation
the overhead is:

 a) taking the rw_semaphore in shared mode for every buffered read
 b) taking the slightly slower exclusive rw_semaphore for buffered writes
    instead of the plain mutex

On the other hand it significantly simplifies the locking for direct
I/O and allows parallel direct I/O writers.

> And does XFS
> provide any kind of consistency guarantees if the reads/write overlap
> spans multiple pages?  I assume the answer to that is no, correct?

The answer is yes as the lock is taken globally on the inode.
Comment 22 Dave Chinner 2012-11-26 22:17:16 UTC
On Mon, Nov 26, 2012 at 04:49:37PM -0500, Theodore Ts'o wrote:
> On Mon, Nov 26, 2012 at 03:13:08PM -0500, Christoph Hellwig wrote:
> > On Mon, Nov 26, 2012 at 12:05:57PM -0800, Hugh Dickins wrote:
> > > Gosh, that's a very sudden new consensus.  The consensus over the past
> > > ten or twenty years has been that the Linux kernel enforce locking for
> > > consistent atomic writes, but skip that overhead on reads - hasn't it?
> > 
> > I'm not sure there was much of a consensus ever.  We XFS people always
> tried to push everyone down the strict rule, but there was enough
> > pushback that it didn't actually happen.
> 
> Christoph, can you give some kind of estimate for the overhead that
> adding this locking in XFS actually costs in practice?

It doesn't show up any significant numbers in profiles, if that is
what you are asking.

I've tested random 4k reads and writes at over 2 million IOPS
to a single file using concurrent direct IO, so the non-exclusive
locking overhead is pretty minimal. If the workload is modified
slightly to used buffered writes instead of direct IO writes and so
triggering shared/exclusive lock contention, then the same workload
tends to get limited at around 250,000 IOPS per file. That's a
direct result of the exclusive locking limiting the workload to what
a single CPU can sustain (i.e. the difference between 8p @ 250-300k iops
vs 1p @ 250k iops on the exclusive locking workload).

> And does XFS
> provide any kind of consistency guarantees if the reads/write overlap
> spans multiple pages?  I assume the answer to that is no, correct?

A buffered write holds the lock exclusive for the entire write.
That includes multiple page writes as the locking is outside of
the begin_write/end_write per-page iteration. Hence the atomicity of
the entire buffered write against both buffered read and direct IO
is guaranteed.

Cheers,

Dave.
Comment 23 Theodore Tso 2012-11-27 01:32:59 UTC
On Mon, Nov 26, 2012 at 05:09:08PM -0500, Christoph Hellwig wrote:
> On Mon, Nov 26, 2012 at 04:49:37PM -0500, Theodore Ts'o wrote:
> > Christoph, can you give some kind of estimate for the overhead that
> > adding this locking in XFS actually costs in practice?
> 
> I don't know any real life measurements, but in terms of implementation
> the overhead is:
> 
>  a) taking the rw_semaphore in shared mode for every buffered read
>  b) taking the slightly slower exclusive rw_semaphore for buffered writes
>     instead of the plain mutex
> 
> On the other hand it significantly simplifies the locking for direct
> I/O and allows parallel direct I/O writers.

I should probably just look at the XFS code, but... if you're taking
an exclusive lock for buffered writes, won't this impact the
performance of buffered writes happening in parallel on different
CPU's?

						- Ted
Comment 24 Dave Chinner 2012-11-27 04:27:15 UTC
On Mon, Nov 26, 2012 at 08:32:54PM -0500, Theodore Ts'o wrote:
> On Mon, Nov 26, 2012 at 05:09:08PM -0500, Christoph Hellwig wrote:
> > On Mon, Nov 26, 2012 at 04:49:37PM -0500, Theodore Ts'o wrote:
> > > Christoph, can you give some kind of estimate for the overhead that
> > > adding this locking in XFS actually costs in practice?
> > 
> > I don't know any real life measurements, but in terms of implementation
> > the overhead is:
> > 
> >  a) taking the rw_semaphore in shared mode for every buffered read
> >  b) taking the slightly slower exclusive rw_semaphore for buffered writes
> >     instead of the plain mutex
> > 
> > On the other hand it significantly simplifies the locking for direct
> > I/O and allows parallel direct I/O writers.
> 
> I should probably just look at the XFS code, but... if you're taking
> an exclusive lock for buffered writes, won't this impact the
> performance of buffered writes happening in parallel on different
> CPU's?

Indeed it does - see my previous email. But it's no worse than
generic_file_aio_write() that takes i_mutex across buffered writes,
which is what most filesystems currently do. And FWIW, we also take
the i_mutex outside the i_iolock for the buffered write case because
generic_file_buffered_write() is documented to require it held.
See xfs_rw_ilock() and friends for locking order semantics...

FWIW, this buffered write exclusion is why we have been considering
replacing the rwsem with a shared/exclusive range lock - so we can
do concurrent non-overlapping reads and writes (for both direct IO and
buffered IO) without compromising the POSIX atomic write guarantee
(i.e. that a read will see the entire write or none of it). Range
locking will allow us to do that for both buffered and direct IO...

Cheers,

Dave.
Comment 25 Lukas Czerner 2012-11-27 07:13:07 UTC
(In reply to comment #23)
> On Mon, Nov 26, 2012 at 05:09:08PM -0500, Christoph Hellwig wrote:
> > On Mon, Nov 26, 2012 at 04:49:37PM -0500, Theodore Ts'o wrote:
> > > Christoph, can you give some kind of estimate for the overhead that
> > > adding this locking in XFS actually costs in practice?
> > 
> > I don't know any real life measurements, but in terms of implementation
> > the overhead is:
> > 
> >  a) taking the rw_semaphore in shared mode for every buffered read
> >  b) taking the slightly slower exclusive rw_semaphore for buffered writes
> >     instead of the plain mutex
> > 
> > On the other hand it significantly simplifies the locking for direct
> > I/O and allows parallel direct I/O writers.
> 
> I should probably just look at the XFS code, but... if you're taking
> an exclusive lock for buffered writes, won't this impact the
> performance of buffered writes happening in parallel on different
> CPU's?
> 
>                         - Ted

Hi Ted,

seems like I've already tried to implement the locking before. See

http://patchwork.ozlabs.org/patch/91834/

it's far from perfect, but you can measure the performance impact with this simple implementation.

-Lukas
