Bug 2645 - msync() does not update the st_mtime and st_ctime fields
Summary: msync() does not update the st_mtime and st_ctime fields
Status: CLOSED CODE_FIX
Alias: None
Product: Memory Management
Classification: Unclassified
Component: Other (show other bugs)
Hardware: i386 Linux
: P2 normal
Assignee: Anton Salikhmetov
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-05-06 01:44 UTC by Jakob
Modified: 2009-10-12 22:39 UTC (History)
5 users (show)

See Also:
Kernel Version: 2.4.X, 2.6.X
Subsystem:
Regression: No
Bisected commit-id:


Attachments
The test program (34 bytes, text/plain)
2008-01-01 14:02 UTC, Anton Salikhmetov
Details
The proposed patch (40 bytes, text/plain)
2008-01-05 23:13 UTC, Anton Salikhmetov
Details
The test program, version 2 (1.42 KB, text/plain)
2008-01-10 06:32 UTC, Anton Salikhmetov
Details
The cleanup patch (3.29 KB, patch)
2008-01-10 17:03 UTC, Anton Salikhmetov
Details | Diff
The functional changes patch (4.30 KB, patch)
2008-01-10 17:04 UTC, Anton Salikhmetov
Details | Diff
The test program, version 3 (2.51 KB, text/x-csrc)
2008-01-12 15:39 UTC, Anton Salikhmetov
Details
The cleanup patch, version 2 (3.74 KB, patch)
2008-01-12 20:51 UTC, Anton Salikhmetov
Details | Diff
The functional changes patch, version 2 (7.22 KB, patch)
2008-01-12 20:53 UTC, Anton Salikhmetov
Details | Diff
The cleanup patch, version 3 (3.75 KB, patch)
2008-01-15 08:31 UTC, Anton Salikhmetov
Details | Diff
The functional changes patch, version 3 (10.64 KB, patch)
2008-01-15 08:32 UTC, Anton Salikhmetov
Details | Diff
The cleanup patch, version 4 (3.82 KB, patch)
2008-01-17 02:58 UTC, Anton Salikhmetov
Details | Diff
The functional changes patch, version 4 (10.84 KB, patch)
2008-01-17 02:58 UTC, Anton Salikhmetov
Details | Diff
The performance test program (1001 bytes, text/x-csrc)
2008-01-17 14:08 UTC, Anton Salikhmetov
Details
The cleanup patch, version 5 (4.21 KB, patch)
2008-01-17 15:21 UTC, Anton Salikhmetov
Details | Diff
The functional changes patch, version 5 (3.44 KB, patch)
2008-01-17 15:22 UTC, Anton Salikhmetov
Details | Diff
The cleanup patch, version 6 (4.32 KB, patch)
2008-01-21 15:20 UTC, Anton Salikhmetov
Details | Diff
The functional changes patch, version 6 (3.78 KB, patch)
2008-01-21 15:20 UTC, Anton Salikhmetov
Details | Diff

Description Jakob 2004-05-06 01:44:18 UTC
Distribution: Debian GNU/Linux 3.0r2 
Hardware Environment: x86 
Software Environment: 
 
Any software updating files solely via. mmap(). In this particular scenario, I 
see the problem with the SysOrb Network Monitoring System 
(http://www.sysorb.com) - it holds three database files; main.odb, main.odbj 
and main.tsdb.  main.odb and main.tsdb are updated with read/write, while 
main.odbj is updated solely by means of an mmap(). 
 
Problem Description: 
 
The file(s) updated solely by means of an mmap() (open/mmap/{modify 
mem}/close) do not get their mtime updated. 
 
This is a serious problem when running backups, as the backup software will 
see such files as "not modified". 
 
I believe that this is a bug; mtime should reflect when file data changes. 
 
Solaris 9 does update the mtime correctly in the scenario that I tested. 
 
Windows 2000 exhibits the same problem as Linux when using the M$ equivalent 
of mmap.  If that's any consolation  ;) 
 
Steps to reproduce: 
 
See the atime/mtime/ctime using 'stat' of a file. 
 
Update the file with a open/mmap/{modify mem}/close 
 
Use 'stat' again to see that the mtime does not reflect the fact that the file 
data was changed.
Comment 1 Badari 2006-01-27 03:54:42 UTC
Hi can you please tell me how to reproduce the bug, i have used a program to
reproduce the bug!. and the time stamps were changed !. (which indicates this is
not a bug).
please go though the attached program.
thanks in advance . any suggestions and corrections are welcomed.
regards badari
mail id : h.badari@gmail.com

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <sys/mman.h>


#define ungraceful_exit          2
#define graceful_exit            0

struct stat time_buffer1, time_buffer2;

void * marea;


static void show_usage(const char *prog_name, const char *mesg ){
          fprintf(stderr, "Error: %s\n"
          "Usage is:\n%s fname\n\tWhere fname is a file name to use\n", mesg,
prog_name);
        }


static void error(char *mesg){
          perror(mesg);
          exit(ungraceful_exit);
}

int main (void) {

int child, status;
pid_t wait_status;
caddr_t mmap_ptr;
char buffer[80];
int fd, i;
char buff[256], ch;


        if ( (fd = open ("test", O_RDWR) ) < 0 )
                        error ("OPEN FAIL \n");


        if ( fstat ( fd, &time_buffer1 ) < 0 )
                        error (" fstat failed \n");

        if ( strcpy (buff,ctime ( & time_buffer1 .  st_mtime) ) == NULL)
                        error ("strcpy error \n");

        buff [ strlen (buff) + 1] = '\n';
        buff [ strlen (buff) + 2] = '\0';

        printf ("\n first mtime stamp = %s \n", buff );

        if ( strcpy (buff,ctime ( & time_buffer1 .  st_atime) ) == NULL)
                        error ("strcpy error \n");

        buff [ strlen (buff) + 1] = '\n';
        buff [ strlen (buff) + 2] = '\0';


        printf ("\n first atime stamp = %s \n", buff );

        if ( strcpy (buff,ctime ( & time_buffer1 .  st_ctime) ) == NULL)
                        error ("strcpy error \n");

        buff [ strlen (buff) + 1] = '\n';
        buff [ strlen (buff) + 2] = '\0';


        printf ("\n first ctime stamp = %s \n", buff );

 status = ftruncate(fd, sizeof(buffer)); /* make the file the buffer size */
  if (status){
      fprintf(stderr,"Could not ftruncate(%d, %d) = %d\n", fd,sizeof(buffer),
status );
      error("Bad Ftruncate");
  }


    mmap_ptr = mmap((caddr_t) 0,   /* Memory Location, 0 means O/S chooses */
                    sizeof(buffer),/* How many bytes to mmap */
                    PROT_READ | PROT_WRITE, /* Read and write permissions */
                    MAP_SHARED,    /* Accessible by another process */
                    fd,           /* which file is associated with mmap */
                    (off_t) 0);    /* Offset in page frame */
    if (mmap_ptr == MAP_FAILED)
              error("Memory Map Failed, QUIT!");

    sprintf(buffer, "This is the secret message!");
    memcpy(mmap_ptr, buffer, sizeof(buffer));

        if ( fstat ( fd, &time_buffer2 ) < 0 )
                        perror (" fstat failed \n");

        if ( strcpy (buff,ctime ( & time_buffer2 .  st_mtime) ) == NULL)
                        perror ("strcpy error \n");

        printf ("\n second mtime stamp = %s \n", ctime (& time_buffer2 .
st_mtime ) );

        if ( strcpy (buff,ctime ( & time_buffer2 .  st_atime) ) == NULL)
                        error ("strcpy error \n");

        buff [ strlen (buff) + 1] = '\n';
        buff [ strlen (buff) + 2] = '\0';


        printf ("\n second atime stamp = %s \n", ctime (& time_buffer2 .
st_atime ) );

        if ( strcpy (buff,ctime ( & time_buffer2 .  st_ctime) ) == NULL)
                        error ("strcpy error \n");

        buff [ strlen (buff) + 1] = '\n';
        buff [ strlen (buff) + 2] = '\0';

        printf ("\n second ctime stamp = %s \n", ctime (& time_buffer2 .
st_ctime ) );


        status = close (fd);
        if (status < 0 )
                error (" close open file failed \n");

        exit ( graceful_exit );
}
Comment 2 Badari 2006-01-27 04:06:47 UTC
And this was tested on three machines with three different os

1. AMD Athlon xp, 2400, 2GHz, 128 mb ram, Redhat linux release 9 2.4.20 kernel

2. Intel pentium 4, 256 mb ram, Fedora core 2 2.6.12- kernel

3. AMD sempron ,256 mb ram, DEBIAN LINUX 2.4.18 - gcov. 

Comment 3 Jakob 2006-01-27 04:44:51 UTC
Your test program modifies the metatadata of the file (by calling ftruncate). 
This causes the stat information to be updated correctly. 
 
Try commenting out the ftruncate call after the first run of the program. 
 
You will then see, that no matter how many times you run the program and 
update the file, only atime is modified - no file modification is apparant 
from viewing the stat data (mtime/ctame remain the same). 
 
Modifying the file data with a write() will cause the correct stat data 
updates. 
 
The missing mmap() updates are a major problem for memory mapped database 
files (a journal file in my usage scenario). 
 
Comment 4 sathya 2006-02-12 20:55:24 UTC
Hello,
         I have tested this program, before and after commenting the ftruncate
call , the results indicated as suspected i.e, only atime was modified ( mtime
and ctime stamps were not modified ), which indicates this is a bug.

Howvever , as of my knowledge goes , database concurrency controll system will
not use time stamps protocols, but strict 2PL protocols. So i believe this will
not create problems. 

But data base prevention schemes of linux uses time stamps, if time stamp does
not modify it might lead to dead locks.
please do correct me if im wrong in my concepts.

                                              Sathya Narayana P
Comment 5 Jakob 2006-02-13 01:46:36 UTC
It is a bug, yes, we agree.   
   
However, your arguments about this not causing problems are wrong. It does   
cause problems. Lots of tools like backup utilities or tmpreaper depend on   
mtime/ctime records to decide whether or not to delete or copy files. I 
wouldn't have files this bug report in the first place, were it not for 
commonly used tools doing the wrong thing (specifically for me, tmpreaper 
deleting database files that were frequently updated - but it could have been 
any other problem... In the worst case, a problem that could have gone 
unnoticed for a long time, like missing backups). 
   
I do not understand your arguments about database "concurrency controll  
systems" or "data base prevention schemes of linux", but I assume that we 
still agree that this is a bug. 
 
Comment 6 Olaf Kirch 2007-03-15 01:36:18 UTC
See http://lkml.org/lkml/2007/1/11/223 for ongoing work. Not sure what
the status of this patch is.
Comment 7 Natalie Protasevich 2007-09-06 20:56:05 UTC
What is the status of the bug, is it still being worked on? 

Thanks.
Comment 8 Jakob 2007-09-06 23:55:31 UTC
As far as I know, it is still not being worked on.
Comment 9 Anton Salikhmetov 2008-01-01 13:36:14 UTC
I changed the test program Badari wrote above. The program will be attached to this bug soon. This program shows that the msync() function has a bug: it does not update the st_mtime and st_ctime fields as the POSIX standard defines.

Now I'm trying to fix the problem and waiting for any comments about my view on this bug.
Comment 10 Anton Salikhmetov 2008-01-01 14:02:03 UTC
Created attachment 14254 [details]
The test program
Comment 11 Anton Salikhmetov 2008-01-01 14:04:17 UTC
The Open Group defines the behavior of the mmap() function as follows.

The st_ctime and st_mtime fields of a file that is mapped with MAP_SHARED and PROT_WRITE shall be marked for update at some point in the interval between a write reference to the mapped region and the next call to msync() with MS_ASYNC or MS_SYNC for that portion of the file by any process. If there is no such call and if the underlying file is modified as a result of a write reference, then these fields shall be marked for update at some time after the write reference.

The above citation was taken from the following link:

http://www.opengroup.org/onlinepubs/009695399/functions/mmap.html

Therefore, the msync() function should be called before verifying the time stamps st_mtime and st_ctime in the test program Badari wrote above. Otherwise, the time stamps may be updated at some unspecified moment.
Comment 12 Natalie Protasevich 2008-01-02 02:01:32 UTC
Anton, et.al, it you feel like fixing this problem, you are more than welcome to propose a patch.
Comment 13 Anton Salikhmetov 2008-01-04 18:29:20 UTC
A patch for this bug was proposed in the following message:

http://lkml.org/lkml/2008/1/4/296
Comment 14 Anton Salikhmetov 2008-01-05 23:13:05 UTC
Created attachment 14312 [details]
The proposed patch
Comment 15 Anton Salikhmetov 2008-01-08 08:38:34 UTC
There have been three solutions proposed to fix this bug:

1) Peter Staubach's one rejected due to the several problems but received no reaction when a slightly modified patch was proposed later:

http://lkml.org/lkml/2006/5/17/138

2) Miklos Szeredi's patch was not accepted as well:

http://lkml.org/lkml/2007/2/27/295

3) my solution was ignored:

http://lkml.org/lkml/2008/1/7/234

It's logical to presume the community has not enough interest in the above solutions, and this bug could be closed already as WONTFIX.

Natalie, what are you thinking about closing the bug?
Comment 16 Jakob 2008-01-09 01:07:19 UTC
Please re-submit to LKML.

This bug causes backup systems to *miss* changed files.

This bug does cause data loss in common real-world deployments (I gave an example with a database when posting the bug, but this affects the data from all mmap using applications with common backup systems).

Silent exclusion from backups is very very nasty.
Comment 17 Anton Salikhmetov 2008-01-09 03:38:31 UTC
(In reply to comment #16)
> Please re-submit to LKML.
> 
> This bug causes backup systems to *miss* changed files.
> 
> This bug does cause data loss in common real-world deployments (I gave an
> example with a database when posting the bug, but this affects the data from
> all mmap using applications with common backup systems).
> 
> Silent exclusion from backups is very very nasty.
> 

OK, I understand and have already replied to my last message in LKML:

http://lkml.org/lkml/2008/1/9/110
Comment 18 Jakob 2008-01-09 03:51:34 UTC
Thank you, I will try and back this up on LKML too.

Considering the fuzz there was about atime updates, I am surprised that no one seems to care about mtime (which, unlike atime, has very serious implications on peoples backups).

Thanks Anton.
Comment 19 Anton Salikhmetov 2008-01-10 06:32:15 UTC
Created attachment 14398 [details]
The test program, version 2

I changed the unit test due to Peter's remark in LKML (http://lkml.org/lkml/2008/1/9/267):

> http://pygx.sourceforge.net/mmap.c
>
> This program showed that the msync() function had a bug:
> it did not update the st_mtime and st_ctime fields.
>
> The program shows the appropriate behavior of the msync()
> function using the kernel with the proposed patch applied.
> Specifically, the ctime and mtime time stamps do change
> when modifying the mapped memory and do not change when
> there have been no write references between the mmap()
> and msync() system calls.
>
> 

Sorry, I don't see where the test program shows that the file
times did not change if there had not been an intervening
modification to the mmap'd region.  It appears to me that it
just shows the file times changing or not when there has been
intervening modification after the mmap call and before the
fstat call.

Or am I looking in the wrong place?  :-)
Comment 20 Anton Salikhmetov 2008-01-10 06:42:19 UTC
Now I'm working on my next solution for this bug as I've already written in LKML (http://lkml.org/lkml/2008/1/9/387).

My first solution is not acceptable because of the intervening sync() case which Peter Staubach has mentioned in LKML (http://lkml.org/lkml/2008/1/9/267):

> The patch adds a call to the file_update_time() function to change
> the file metadata before syncing. The patch also contains
> substantial code cleanup: consolidated error check
> for function parameters, using the PAGE_ALIGN() macro instead of
> "manual" alignment check, improved readability of the loop,
> which traverses the process memory regions, updated comments.
>
> 

These changes catch the simple case, where the file is mmap'd,
modified via the mmap'd region, and then an msync is done,
all on a mostly quiet system.

However, I don't see how they will work if there has been
something like a sync(2) done after the mmap'd region is
modified and the msync call.  When the inode is written out
as part of the sync process, I_DIRTY_PAGES will be cleared,
thus causing a miss in this code.

The I_DIRTY_PAGES check here is good, but I think that there
needs to be some code elsewhere too, to catch the case where
I_DIRTY_PAGES is being cleared, but the time fields still need
to be updated.
Comment 21 Anton Salikhmetov 2008-01-10 17:03:11 UTC
Created attachment 14407 [details]
The cleanup patch

The first part of my second solution was proposed in LKML:

http://lkml.org/lkml/2008/1/10/485
Comment 22 Anton Salikhmetov 2008-01-10 17:04:50 UTC
Created attachment 14408 [details]
The functional changes patch

The second part of my second solution was proposed in LKML:

http://lkml.org/lkml/2008/1/10/486
Comment 23 Anton Salikhmetov 2008-01-12 15:39:19 UTC
Created attachment 14430 [details]
The test program, version 3
Comment 24 Anton Salikhmetov 2008-01-12 20:45:11 UTC
I proposed my third version of the solution for this bug in LKML:

http://lkml.org/lkml/2008/1/12/196
Comment 25 Anton Salikhmetov 2008-01-12 20:51:04 UTC
Created attachment 14432 [details]
The cleanup patch, version 2

The error checks are separated out as suggested by Rik van Riel.
Comment 26 Anton Salikhmetov 2008-01-12 20:53:06 UTC
Created attachment 14433 [details]
The functional changes patch, version 2

New since the previous version:

1) no need to explicitly call msync() to update file times;
2) changing block device data is visible to all device files
   associated with the block device;
3) some small refinements according to the LKML comments.
Comment 27 Anton Salikhmetov 2008-01-12 21:06:41 UTC
The unit test program produces the following output.

Using the latest git version of the vanilla kernel:

>>>

Modifying file...
Flushing data using sync()...
Failure: time not changed.
Not modifying file...
Flushing data using msync()...
Success: time not changed.
Not modifying file...
Flushing data using fsync()...
Success: time not changed.
Modifying file...
Flushing data using msync()...
Failure: time not changed.
Modifying file...
Flushing data using fsync()...
Failure: time not changed.

<<<

With the patch applied:

>>>

Modifying file...
Flushing data using sync()...
Success: time changed.
Not modifying file...
Flushing data using msync()...
Success: time not changed.
Not modifying file...
Flushing data using fsync()...
Success: time not changed.
Modifying file...
Flushing data using msync()...
Success: time changed.
Modifying file...
Flushing data using fsync()...
Success: time changed.

<<<

Similar results were obtained for a block device file.
Comment 28 Anton Salikhmetov 2008-01-15 08:28:37 UTC
I proposed my fourth version of the solution for this bug in LKML:

http://lkml.org/lkml/2008/1/15/202
Comment 29 Anton Salikhmetov 2008-01-15 08:31:44 UTC
Created attachment 14464 [details]
The cleanup patch, version 3
Comment 30 Anton Salikhmetov 2008-01-15 08:32:41 UTC
Created attachment 14465 [details]
The functional changes patch, version 3
Comment 31 Anton Salikhmetov 2008-01-17 02:20:34 UTC
I proposed the fifth version of my solution for this bug in LKML:

http://marc.info/?l=linux-kernel&m=120053180919064
http://marc.info/?l=linux-kernel&m=120053180919074
http://marc.info/?l=linux-kernel&m=120053181019078
Comment 32 Anton Salikhmetov 2008-01-17 02:58:21 UTC
Created attachment 14486 [details]
The cleanup patch, version 4
Comment 33 Anton Salikhmetov 2008-01-17 02:58:55 UTC
Created attachment 14487 [details]
The functional changes patch, version 4
Comment 34 Anton Salikhmetov 2008-01-17 14:08:20 UTC
Created attachment 14493 [details]
The performance test program
Comment 35 Anton Salikhmetov 2008-01-17 15:16:51 UTC
I proposed the sixth version of my solution for this bug in LKML:

http://lkml.org/lkml/2008/1/17/382
Comment 36 Anton Salikhmetov 2008-01-17 15:21:19 UTC
Created attachment 14495 [details]
The cleanup patch, version 5
Comment 37 Anton Salikhmetov 2008-01-17 15:22:17 UTC
Created attachment 14496 [details]
The functional changes patch, version 5
Comment 38 Anton Salikhmetov 2008-01-18 09:50:09 UTC
Unfortunately, I have no more time for working on this bug.
Comment 39 Anton Salikhmetov 2008-01-18 11:40:48 UTC
Almost fixed already, this bug can probably be closed soon, because Linus sent some comments in LKML.
Comment 40 Anton Salikhmetov 2008-01-21 15:18:04 UTC
1. Requirements

1.1) the POSIX standard requires updating ctime and mtime not later
than at the call to msync() with MS_SYNC or MS_ASYNC flags;

1.2) in existing POSIX implementations, ctime and mtime
get updated not later than at the call to fsync();

1.3) in existing POSIX implementation, ctime and mtime
get updated not later than at the call to sync(), the "auto-update" feature;

1.4) the customers require and the common sense suggests that
ctime and mtime should be updated not later than at the call to munmap()
or exit(), the latter function implying an implicit call to munmap();

1.5) the (1.1) item should be satisfied if the file is a block device
special file;

1.6) the (1.1) item should be satisfied for files residing on
memory-backed filesystems such as tmpfs, too.

The following operating systems were used as the reference platforms
and are referred to as the "existing implementations" above:
HP-UX B.11.31 and FreeBSD 6.2-RELEASE.

2. Lazy update

All attempts before version 6 implemented the "lazy update" approach to
satisfying the requirements given above. Within the latter approach, ctime and mtime
get updated at last moment allowable.

Since we don't update the file times immediately, some Flag has to be
used. When up, this Flag means that the file data was modified and
the file times need to be updated as soon as possible.

Any existing "dirty" flag which, when up, mean that a page has been written to,
is not suitable for this purpose. Indeed, msync() called with MS_ASYNC
would have to reset this "dirty" flag after updating ctime and mtime.
The sys_msync() function itself is basically a no-op in the MS_ASYNC case.
Thereby, the synchronization routines relying upon this "dirty" flag
would lose data. Therefore, a new Flag has to be introduced.

The (1.5) item coupled with (1.3) requirement leads to hard work with
the block device inodes. Specifically, during writeback it is impossible to
tell which block device file was originally mapped. Therefore, we need to
traverse the list of "active" devices associated with the block device inode.
This would lead to updating file times for block device files, which were not
taking part in the data transfer.

Also all versions prior to version 6 failed to correctly process ctime and
mtime for files on the memory-backed filesystems such as tmpfs. So the (1.6)
requirement was not satisfied.

If a write reference has occurred between two consecutive calls to msync()
with MS_ASYNC, the second call to the latter function should take into
account the last write reference. The last write reference can not be caught
if no pagefault occurs. Hence a pagefault needs to be forced. This can be done
using two different approaches. The first one is to synchronize data even when
msync() was called with MS_ASYNC. This is not acceptable because the current
design of the sys_msync() routine forbids starting I/O for the MS_ASYNC case.
The second approach is to write protect the page for triggering a pagefault
at the next write reference. Note that the dirty flag for the page should not
be cleared thereby.

In the "lazy update" approach, the requirements (1.1), (1.2), (1.3), and (1.4)
taken together result in adding code at least to the following kernel routines:
sys_msync(), do_fsync(), some routine in the unmap() call path, some routine
in the sync() call path.

Finally, a file_update_time()-like function would have to be created for
processing the inode objects, not file objects. This is due to the fact that
during the sync() operation, the file object may not exist any more, only
the inode is known.

To sum up: this "lazy" approach leads to massive changes, incurs overhead in
the block device case, and requires complicated design decisions.

3. Immediate update

OK, still reading? There's a better way.

In a fashion analogous to what happens at write(2), react to the fact
that the page gets dirtied by updating the file times immediately.
Thereby any page writeback happens when the write reference has already
been accounted for from the view point of file times.

The only problem which remains is to force refreshing file times at the write
reference following a call to msync() with MS_ASYNC. As mentioned above, all
that is needed here is to force a pagefault.

The vma_wrprotect() routine introduced in this patch series is called
from sys_msync() in the MS_ASYNC case. The former routine is essentially
a version of existing page_mkclean_one() function from mm/rmap.c. Unlike
the latter function, the vma_wrprotect() does not touch the dirty bit.
Comment 41 Anton Salikhmetov 2008-01-21 15:20:19 UTC
Created attachment 14519 [details]
The cleanup patch, version 6
Comment 42 Anton Salikhmetov 2008-01-21 15:20:52 UTC
Created attachment 14520 [details]
The functional changes patch, version 6
Comment 43 Anton Salikhmetov 2008-01-21 16:20:23 UTC
1. The cost of updating the file times at pagefault

This section presents the results of measuring the performance impact
of enabling the mtime/ctime update in Linux kernel.

The relevant portions of the test program are:

>>>

#define FILE_SIZE (1024 * 1024 * 256)

p = mmap(0, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

/* Bring the pages in */
for (i = 0; i < FILE_SIZE; i += 4096)
       tmp = p[i];

/* Dirty the pages */
for (i = 0; i < FILE_SIZE; i += 4096)
       p[i] = i;

/* Write-protect the pages */
msync(p, FILE_SIZE, MS_ASYNC);

/* Measure the cost of file_update_time() */
gettimeofday(&tv_start, NULL);
for (i = 0; i < FILE_SIZE; i += 4096)
       p[i] = i;
gettimeofday(&tv_stop, NULL);

<<<

An ext4 partition was used to keep the memory-mapped file. Here is the
relevant part of dmesg output:

>>>

EXT4 FS on hdb, internal journal
EXT4-fs: mounted filesystem with ordered data mode.
EXT4-fs: file extents enabled

<<<

The tests were performed using the following platforms:

1. KVM x86_64 guest OS, current Git kernel. RAM size: 1G. Host system:
Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz.

2. Pentium III-based system, current Git kernel. RAM size: 512M.

The following tables give the time difference between the two calls
to gettimeofday(). The test program was run three times in a raw
with a delay of one second between consecutive runs. The following
commands were issued prior to running the tests:

echo 80 >/proc/sys/vm/dirty_ratio
echo 80 >/proc/sys/vm/dirty_background_ratio
echo 30000 >/proc/sys/vm/dirty_expire_centisecs
sync

Tables 1.1 and 1.2 give the time difference in microseconds between
the two calls to gettimeofday() in the test program.

Table 1.1. The KVM guest system. Memory: 1G. File size: 256M.

---------------------------------------------------
|            | Before the patch | After the patch |
---------------------------------------------------
| First run  |    251898 usec   |   978355 usec   |
---------------------------------------------------
| Second run |    260171 usec   |   975127 usec   |
---------------------------------------------------
| Third run  |    257235 usec   |   983775 usec   |
---------------------------------------------------

Table 1.2. The Pentium III system. Memory: 512M. File size: 256M.

---------------------------------------------------
|            | Before the patch | After the patch |
---------------------------------------------------
| First run  |    10283 usec    |   96564 usec    |
---------------------------------------------------
| Second run |    10898 usec    |   96569 usec    |
---------------------------------------------------
| Third run  |    10107 usec    |   89970 usec    |
---------------------------------------------------

2. The cost of clearing the write flag

Following are the results of the measurements of the performance cost
of clearing the write flag when the msync() function is called with the
MS_ASYNC flag.

Here are the relevant portions of the test program:

>>>

#define FILE_SIZE (1024 * 1024 * 256)

p = mmap(0, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

/* Bring the pages in */
for (i = 0; i < FILE_SIZE; i += 4096)
       tmp = p[i];

/* Dirty the pages */
for (i = 0; i < FILE_SIZE; i += 4096)
       p[i] = i;

/* How long did we spend in msync(MS_ASYNC)? */
gettimeofday(&tv_start, NULL);
msync(p, FILE_SIZE, MS_ASYNC);
gettimeofday(&tv_stop, NULL);

<<<

The tests were performed using the following platforms:

1. KVM x86_64 guest OS, current Git kernel. RAM size: 1G. Host system:
Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz.

2. Pentium III-based system, current Git kernel.  RAM size: 512M.

Tables 2.1 and 2.2 give the time difference between the two calls
to gettimeofday() in the test program. The test program was run three
times in a raw with a delay of one second between consecutive runs.
The following commands were issued prior to running the tests:

echo 80 >/proc/sys/vm/dirty_ratio
echo 80 >/proc/sys/vm/dirty_background_ratio
echo 30000 >/proc/sys/vm/dirty_expire_centisecs
sync

Table 2.1. The KVM guest system. Memory: 1G. File size: 256M.

---------------------------------------------------
|            | Before the patch | After the patch |
---------------------------------------------------
| First run  |    31 usec       |   13578 usec    |
---------------------------------------------------
| Second run |    35 usec       |   13236 usec    |
---------------------------------------------------
| Third run  |    31 usec       |   13301 usec    |
---------------------------------------------------

Table 2.2. The Pentium III system. Memory: 512M. File size: 256M.

---------------------------------------------------
|            | Before the patch | After the patch |
---------------------------------------------------
| First run  |     8 usec       |  14134 usec     |
---------------------------------------------------
| Second run |     8 usec       |  14460 usec     |
---------------------------------------------------
| Third run  |     8 usec       |  14262 usec     |
---------------------------------------------------
Comment 44 Anton Salikhmetov 2008-01-22 15:26:39 UTC
I proposed the eighth version of my solution for this bug in LKML:

http://lkml.org/lkml/2008/1/22/371
Comment 45 Anton Salikhmetov 2008-01-22 15:28:32 UTC
Accidentally changed the bug status. Reopening.
Comment 46 Anton Salikhmetov 2008-01-23 09:17:05 UTC
I don't want to submit my patch series anymore, because Linus appeared to be very tired of my bad code.

Reassigning this bug to default assignee of selected component.
Comment 47 Anton Salikhmetov 2008-01-23 15:00:16 UTC
After reporting the fact that this bug had been reassigned to default assignee in LKML, the following patch was committed to the upstream kernel:

http://lkml.org/lkml/diff/2008/1/22/370/1
Comment 48 Ismail Donmez 2008-01-23 15:02:41 UTC
You can close the bug as fixed now.
Comment 49 Anton Salikhmetov 2008-01-23 15:10:11 UTC
The patch applied to the upstream kernel fixes this bug. Closing.

The remaining problem is to satisfy the paper standard requirement. This requirement is totally ambiguous and requires invasive changes and expensive code. This is the MS_ASYNC case of the msync() function. Googling shows that the msync() function is most probably not called with the MS_ASYNC flag by any popular application. Therefore, it is left unimplemented for now.
Comment 50 Natalie Protasevich 2008-02-02 02:12:08 UTC
Thanks Anton, for the code and analysis. This was a tough one...
Comment 51 Ferenc Wágner 2009-05-30 09:08:21 UTC
Quoting from http://lkml.org/lkml/2009/5/7/634:

Christoph Hellwig <hch@infradead.org> writes:

>> Ferenc Wagner <wferi@niif.hu> writes:
>>
>>> I've noticed that the last modification times of our RRD files got
>>> stuck after upgrading from 2.6.24 to 2.6.26 (Debian Etch -> Lenny; I
>>> also tested with 2.6.30-rc5, they are still stuck). It has some
>>> literature, most notably kernel bug #2645, but that's closed long ago
>>> and the resulting patch http://lkml.org/lkml/2008/1/22/370 is present
>>> in my kernels. Still, the test program (version 3 from the bug report)
>>> gives failures:
>
> The problem is pretty simple.  do_wp_page and __do_fault use
> file_update_time to update ctime and mtime.  But this function is only
> a helper for simply filesystems that have a binary inode dirty/non dirty
> state and keep the m/ctime purely in the Linux inode.  It must not be
> called from generic code as more complex filesystems need a notification
> through ->setattr to update the timestamps.  This will also affect other
> filesystems like ubifs.  I'm not entirely sure why it ever worked
> before, we must have picked up those c/mtime updates by accident
> somehow.
> [...]
> Doing this correctly in the framework of the current code is
> unfortunately not so easy, as calling ->setattr requires taking i_mutex
> which we can't in the pagefaul path.
>
> To fix this properly we need to actually update the timestamps during
> msync and co as done by the patches from Miklos:
>
>       http://lkml.org/lkml/2007/2/28/166
>
> and Peter:
>
>       http://lkml.org/lkml/2006/5/31/176

Did any further progress happen?
Comment 52 Ferenc Wágner 2009-05-30 09:11:50 UTC
Looks like I can't find a way to reopen this bug,
as Christoph Hellwig suggested. Somebody with the power,
please do. Thanks.
Comment 53 Anton Salikhmetov 2009-06-22 11:55:03 UTC
(In reply to comment #52)
> Looks like I can't find a way to reopen this bug,
> as Christoph Hellwig suggested. Somebody with the power,
> please do. Thanks.

As far as I remember, any other generic FS-independent methods except calling file_update_time() function when a page has been just dirtied cannot be used in the current code.

The reason why msync() function itself haven't changed within the last solution accepted by Linus was that the timestamps should be updated sometimes without additional calls to msync(), anyway.

And I agree with Ray Lee (http://lkml.org/lkml/2009/5/12/274) that it must be a filesystem-specific issue, not a generic one since it works for ramfs and ext2 filesystems.

I'm almost sure that it's rather worth filing a new bug against those filesystems forgetting about the ctime/mtime timestamps than putting an FS-specific code into the generic part of Linux or repeating of the previous history quite long, I'd say.

As for reopening and suggesting completed versions of the previous solutions (including both of two mentioned by Christoph), they had been already proposed. As you can see here in this bug, all were rejected by the community due to objective reasons for each.
Comment 54 Ferenc Wágner 2009-09-25 12:40:32 UTC
Sorry for the late reply.  Finally I did some further testing, which showed that under 2.6.30-rc8 the test program reports full failure on XFS and RAMFS (contrary to the above, would you mind rechecking?), full success on EXT3, VFAT and ReiserFS, and mixed results on TMPFS:

$ ./mmap /dev/shm/testfile 
Modifying /dev/shm/testfile...
Flushing data using sync()...
Failure: time not changed.
Not modifying /dev/shm/testfile...
Flushing data using msync()...
Success: time not changed.
Not modifying /dev/shm/testfile...
Flushing data using fsync()...
Success: time not changed.
Modifying /dev/shm/testfile...
Flushing data using msync()...
Failure: time not changed.
Modifying /dev/shm/testfile...
Flushing data using fsync()...
Failure: time not changed.

So this doesn't look like az XFS-specific issue. Do you still recommend opening separate new reports for the affected filesystems?

Thanks,
Feri.
Comment 55 Anton Salikhmetov 2009-09-25 15:08:13 UTC
(In reply to comment #54)
> Sorry for the late reply.  Finally I did some further testing, which showed
> that under 2.6.30-rc8 the test program reports full failure on XFS and RAMFS
> (contrary to the above, would you mind rechecking?), full success on EXT3,
> VFAT
> and ReiserFS, and mixed results on TMPFS:
> 
> $ ./mmap /dev/shm/testfile 
> Modifying /dev/shm/testfile...
> Flushing data using sync()...
> Failure: time not changed.
> Not modifying /dev/shm/testfile...
> Flushing data using msync()...
> Success: time not changed.
> Not modifying /dev/shm/testfile...
> Flushing data using fsync()...
> Success: time not changed.
> Modifying /dev/shm/testfile...
> Flushing data using msync()...
> Failure: time not changed.
> Modifying /dev/shm/testfile...
> Flushing data using fsync()...
> Failure: time not changed.
> 
> So this doesn't look like az XFS-specific issue. Do you still recommend
> opening
> separate new reports for the affected filesystems?
> 
> Thanks,
> Feri.

Unfortunately, I've not been following the changes into the kernel in the parts related to this bug, so it's possible that the fix is no more good for the recent kernels. However, if we suppose that the fix is still there, it's worth filing a bug against at least XFS. Then, according to the results of that new bug, the further development would be much easier.

As far as I vaguely remember some of the file systems such as XFS and probably RAMFS, too, overload the functionality of updating the time stamps by their own code, not using the general facilities.
Comment 56 Ferenc Wágner 2009-10-12 22:39:10 UTC
Denoting subtest success/failure in the mmap test program with +/-, the picture is the following:

            XFS   RAMFS  TMPFS  EXT2  VFAT
2.6.30-rc8 -----  -----  -++-- +++++ +++++
2.6.31     -++--  -++--  -++-- +++++ +++++
2.6.32-rc4 +++++  -++--  -++-- +++++ +++++

Christoph Hellwig fixed XFS some days ago, yay!

On the other hand, the improvement between 2.6.30-rc8 and 2.6.31 probably isn't FS specific, as RAMFS didn't change much during that period. Of course, RAMFS (and thus TMPFS) would be nice to get fully fixed as well.

Note You need to log in before you can comment on or make changes to this bug.