Bug 9546 - Huge latency in concurrent I/O when using data=ordered
Status: CLOSED OBSOLETE
Product: File System
Classification: Unclassified
Component: ext3
Platform: All / OS: Linux
Importance: P1 normal
Assigned To: Jan Kara
Depends on:
Blocks:
Reported: 2007-12-11 11:36 UTC by Christian Casteyde
Modified: 2012-05-17 15:17 UTC
9 users

See Also:
Kernel Version: 2.6.24-rc4
Tree: Mainline
Regression: No


Attachments
Proposed fix to latency in ordered mode (1.58 KB, patch)
2008-01-25 02:31 UTC, Jan Kara
Details | Diff
Patch to limit amount of dirty data and dump more information about commit (3.76 KB, patch)
2008-01-25 05:11 UTC, Jan Kara
Details | Diff
Output of kernel message for the patch (77.00 KB, text/plain)
2008-01-25 13:24 UTC, Christian Casteyde
Details
Full kernel log + kernel mode call stacks of blocked process (64.09 KB, text/plain)
2008-01-28 11:53 UTC, Christian Casteyde
Details

Description Christian Casteyde 2007-12-11 11:36:38 UTC
Most recent kernel where this bug did not occur:
Unknown, certainly not a regression, but something specific to ext3 algorithm

Distribution:
Bluewhite64 12 (Slackware 12 64 bits port) and Slackware 12

Hardware Environment:
Athlon 64 3000+ laptop, IDE 5400rpm 80GB drive, 1.2GB RAM
Athlon 64 X2 4200+, SATA 7200rpm 200GB drive, 1GB RAM
Athlon 2800+, IDE 7200rpm 40GB drive, 512MB RAM

Software Environment:
dd, cp, konqueror/KDE, mount/tune2fs

Problem Description:
When the system does heavy input/output operations on big files, small-file accesses from other applications go unserved for very long periods. This causes huge latencies. The system is really not usable at all, even with all the recent improvements made to increase desktop interactivity.

This behaviour is very visible with the simple following test case:
1. Build a DVD structure from big MPEG+PS files with dvdauthor (it copies the files into the DVD structure, then passes over them to fix VOBUs, but this part is not very long so it is not the main problem).
2. While the computer is doing this, try to open a web browser such as konqueror. Then open a page from a bookmark. Then open a new tab, and open another page from a bookmark. Switch back to the first page.

What I get is:
35 seconds to open Konqueror.
8 seconds to open the "bookmark menu". Incredible.
30 seconds to open the web page (DSL/10GBits).
5 seconds to open the second tab.
6 seconds to reopen the menu.
36 seconds to open the second page.
14 seconds to come back to first tab.
This is unbelievable! The system is completely trashed, with more than 1GB RAM, whatever hardware configuration is used.

Of course, I investigated the problem... First, DMA is OK. Second, I thought caching was making memory swap out, so I used echo 0 > swappiness. Then (of course, the system was not swapping at all) I thought TEXT sections of software were being discarded (that would be simply stupid, but who knows?). I then tried to throttle the writing process with dirty_background_ratio (say 10%) while reserving a greater portion of RAM for the rest of the system with dirty_ratio (say 70%). No way. Then I launched top and looked at the WCHAN to see what the problem was for the frozen process (i.e. konqueror). Then I saw the faulty guy: log_wait_commit!
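
The tuning and inspection steps described above can be reproduced through the standard /proc interfaces; this is just a sketch of what the text describes (run as root, values as in the text):

```shell
# Writeback/swap tuning attempted above (standard /proc/sys/vm tunables):
echo 0  > /proc/sys/vm/swappiness             # avoid swapping out under cache pressure
echo 10 > /proc/sys/vm/dirty_background_ratio # background writeback starts at 10% dirty
echo 70 > /proc/sys/vm/dirty_ratio            # writers throttled only above 70% dirty

# Inspect what a frozen process is sleeping on (the WCHAN column):
ps -o pid,wchan:25,comm -C konqueror
```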

So I concluded there is unfair access to the filesystem journal, and I tried journaling options other than the default "ordered" data mode. The results were really different: "5s, 2s, 4s, etc.", with both journal and writeback modes!

I therefore think there is a big lock, and maybe even a priority inversion, in log_wait_commit of the ext3 filesystem. I think that, even if it is throttled, the writing process always gets access to the journal in ordered mode, simply because it writes many pages at a time and because the ordered mode indeed implies... ordering of requests (as I understand it).

It's sad that the default option is the one that gives the worst interactivity problems. Indeed, I think this undoes all the previous work done to enhance the desktop experience. Too bad!

Btw, I've also seen on the Internet that some people reported that journal data mode gives "better" performance. I think the problem was indeed related to latency rather than throughput (timing the writing process effectively shows an output rate halved with journal data mode, and twice the time to process).

Steps to reproduce:
I did a simple script:
#!/bin/bash

SRC1=src1.bin
SRC2=src2.bin
DEST_DIR=tmpdir
DST1=dst.bin

# First, create the source files:
if [ ! -e $SRC1 ] ; then
        dd if=/dev/zero of=$SRC1 bs=10k count=150000
fi
if [ ! -e $SRC2 ] ; then
        dd if=/dev/zero of=$SRC2 bs=10k count=150000
fi
mkdir $DEST_DIR > /dev/null 2>&1
sync

# Do the test:
echo "Trashing the system..."
rm $DEST_DIR/$DST1 > /dev/null 2>&1
cp $SRC1 $DEST_DIR/$DST1
cat $SRC2 >> $DEST_DIR/$DST1
echo "Done!"

#rm -rf $DEST_DIR $SRC1 $SRC2

While running it, try to use the interactive programs "normally"; konqueror, for instance, has to access files such as cookies and its cache. Then remount/tune the filesystem to use another ext3 data mode.
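
Switching the ext3 data mode can be done either per-mount or persistently in the superblock; a sketch (the mount point and device below are examples):

```shell
# Per-mount: remount an ext3 filesystem with another data journaling mode
# (data=journal and data=ordered work the same way):
mount -o remount,data=writeback /home

# Persistent: record the default mode in the superblock with tune2fs
# (takes effect on the next mount; /dev/sda1 is an example device):
tune2fs -o journal_data_writeback /dev/sda1
```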

I didn't try with other journaling filesystems. I guess ext2 also doesn't exhibit the problem.
Comment 1 Anonymous Emailer 2007-12-11 13:31:33 UTC
Reply-To: akpm@linux-foundation.org


(switching to email - please respond via emailed reply-to-all, not via the
bugzilla web interface)

On Tue, 11 Dec 2007 11:36:39 -0800 (PST)
bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=9546
> 
> [...]

Interesting that data=writeback helped this.  You don't give a lot of
details, but I assume that data=writeback made a large difference here?

Comment 2 Christian Casteyde 2007-12-11 14:43:57 UTC
OK for reply to all. I'm using the first mail received.

> Interesting that data=writeback helped this.  You don't give a lot of
> details, but I assume that data=writeback made a large difference here?

Well, my measurements were taken by counting seconds on the desktop clock,
so the precision is roughly 1s. Here are the effective results from the 3
journaling modes, taken yesterday:

Times in seconds:
ordered mode:    17    5    38    19    33
journal mode:     6    1     5     6     5
writeback mode:   4    1     7     6     8

I do not remember exactly the operation sequence, but the beginning was 
opening Konqueror, opening the bookmarks menu, loading a page, and opening 
another tab.

As you can see, anyway, ordered mode is completely broken: opening the menu
takes near 1s in journal and writeback modes, which is perfectly acceptable
for a user (after all, the computer is loaded, and interactivity is not simply
running interactive tasks at real-time priority, so the user can understand
that it reacts more slowly). Launching Konqueror in 4 to 6 seconds is a good
time, given that the cache is poisoned by the writing process. Opening another
tab or a web page is tolerable within 5-6s, even if this is still a little too
long, especially for opening another tab of an already running process (2-3s
would be a perfect value I think, but getting there is another story).
Certainly 5-6s of the latency is due to the block size used by the writing
process, and tuning the kernel to lower it would be dangerous for general
performance. I could run tests asking dd to write more but smaller blocks to
check this idea.

As far as writeback is concerned, the latency reduction was not clear, as you
can see, at least with my measurement procedure. For the performance of the
**writing process**, it was nearly the same as ordered. I haven't written down
the timing results for the writing process, since I was still investigating at
that time. But I used:

time dowrite.sh

to time it, and I got the real time, as well as dd's output (records written,
MB/s; the journal rate was half that of ordered). So writeback and ordered
seemed the same at this level, and journal mode was effectively twice as slow.
It was clear there were twice as many write operations, which was the expected
behaviour.
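
The timing method above can be tried at a smaller scale; a sketch (dowrite.sh is the reporter's script from the bug description, sizes here are shrunk so it finishes quickly):

```shell
# Time a bulk write the same way as `time dowrite.sh`, at a small scale:
tmp=$(mktemp -d)
time dd if=/dev/zero of="$tmp/blob.bin" bs=10k count=100 2>/dev/null
sync                                # make sure the data actually hits the disk
wc -c "$tmp/blob.bin"               # 100 * 10k = 1024000 bytes written
rm -rf "$tmp"
```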

I could redo the tests with a more formal approach (that is: fixing the order
of the operations done in Konqueror, and noting the times for the writing
process); it's easy. In fact, I'm considering writing another script that
simulates the Konqueror / user activities and times them automatically, just
to get a reproducible benchmark suite.

Please note that not all programs are affected (for instance kmail was quite
quick), so the kernel does a good job of running programs that don't do many
I/Os while the fs is stressed by another process. The difference with
konqueror is that it opens several files (as the block dump shows), so my
guess is that processes doing small I/O operations are not given "journal
access" fairly enough in ordered mode.

Indeed, I imagine (without knowing any internals, of course) that the data
writes are scheduled correctly by the disk scheduler and that the writing
process is throttled, but that the priority information is lost when metadata
are written in ordered mode, because all I/Os are mixed together for metadata.
Therefore, the writing process manages to fill the journal quite easily. With
journal and writeback modes, all data may be written sequentially, in the
context of the issuing process, so priorities may be tracked across the whole
writing process. Then the writing process is preempted correctly to the
benefit of Konqueror. This is of course just a "black box" hypothesis...

CC

Comment 3 Jan Kara 2007-12-12 01:15:31 UTC
> [...]
> Interesting that data=writeback helped this.  You don't give a lot of
> details, but I assume that data=writeback made a large difference here?
  What I think could be happening:
One process does a data-intensive load. Thus in ordered mode the
transaction is tiny but has tons of data buffers attached. When a commit
happens, it takes a long time to sync all the data before the commit
can proceed... In writeback mode we don't wait for data buffers; in
journal mode the amount of data to be written is limited by the
maximum size of a transaction, so we write in much smaller chunks
and better latency is thus ensured.
  I'll try some tests to verify whether my theory is correct :).

								Honza
Comment 4 Jan Kara 2007-12-12 06:43:56 UTC
> > [...]
>   What I think could be happening:
> One process does data-intensive load. Thus in the ordered mode the
> transaction is tiny but has tons of data buffers attached. If commit
> happens, it takes a long time to sync all the data before the commit
> can proceed... In the writeback mode, we don't wait for data buffers, in
> the journal mode amount of data to be written is really limited by the
> maximum size of a transaction and so we write by much smaller chunks
> and better latency is thus ensured.
>   I'll try some tests to verify whether my theory is correct :).
  Hmm, I couldn't reproduce the bad behavior... But anyway, if my theory
is correct, the attached patch should help - can you try it please? It's of
debugging quality only, but it shouldn't do anything bad to your data :)

									Honza
Comment 5 Christian Casteyde 2007-12-12 11:27:07 UTC
Le mercredi 12 d
Comment 6 Christian Casteyde 2007-12-12 11:31:04 UTC
Oops, I forgot an essential point: my CPU is constrained to 1GHz instead of 
2.2GHz on my desktop by default, but we are looking only at disk I/O. Of 
course, starting KMail and Konqueror should be much faster from cache if I 
let the CPU run at full speed.

CC

Comment 7 Christian Casteyde 2007-12-25 01:37:58 UTC
I'm just thinking about it (and about the fact that comment #4 says it is not always reproducible): my partitions are ext3, mounted with noatime (atime updates might throttle the writing process - who knows?).

Here is my /etc/fstab for info:

/dev/hda1        swap             swap        defaults         0   0
/dev/hda2        /                ext3        defaults,noatime 1   1
/dev/sda1        /home            ext3        defaults,data=writeback,noatime 1   2
/dev/cdrom       /mnt/cdrom       auto        noauto,owner,users,ro  0   0
/dev/dvd         /mnt/dvd         auto        noauto,owner,users,ro       0   0
/dev/fd0         /mnt/floppy      auto        noauto,owner     0   0
/dev/sdb1        /mnt/memory      auto        noauto,owner,users,umask=0  0   0
devpts           /dev/pts         devpts      gid=5,mode=620   0   0
proc             /proc            proc        defaults         0   0
tmpfs            /dev/shm         tmpfs       defaults,size=550m 0   0

Happy Christmas and new year :-)
Comment 8 Jan Kara 2008-01-25 02:31:40 UTC
Created attachment 14571 [details]
Proposed fix to latency in ordered mode

Just attaching the patch you've tried, for completeness... Bugzilla seems not to have included it.
Comment 9 Jan Kara 2008-01-25 04:43:16 UTC
Sorry for not responding but I didn't get any notification via email that
something changed.

BTW: I'm now able to reproduce some latency spikes with your test script, but
the times are definitely not nearly as bad as what you observe, even without
my patch (I get stalls of something like 2-4s or so)...

Looking at your results - note that data=ordered mode won't ever be as good as
data=writeback mode; that's the price of the additional data-consistency
guarantee. But hopefully we can squeeze a bit more out of your system :).
I'll write a patch that reports in more detail what is really happening and
attach it here for you to try out.
Comment 10 Jan Kara 2008-01-25 05:11:44 UTC
Created attachment 14574 [details]
Patch to limit amount of dirty data and dump more information about commit

So this patch should do the same thing as the previous one, but additionally print what the commit is doing. Could you please enable "Show timing information on printks" in the kernel hacking section of menuconfig and test a kernel with this patch? Thanks.
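
The menuconfig option referred to is CONFIG_PRINTK_TIME; a quick way to check it, assuming a configured kernel source tree as the working directory:

```shell
# "Show timing information on printks" lives under "Kernel hacking" in
# menuconfig; verify it is set in the build configuration:
grep CONFIG_PRINTK_TIME .config     # expect CONFIG_PRINTK_TIME=y
# When the support is built in, timestamps can also be toggled at boot
# with printk.time=1 on the kernel command line.
```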
Comment 11 Christian Casteyde 2008-01-25 13:21:07 UTC
So here are my results with the latest patch.
I applied it to the freshly released 2.6.24 kernel, and checked beforehand that I could still reproduce the problem. The results were:

With 2.6.24 without patch, writeback mode, full CPU speed (2.2GHz):
Konq Menu Page1 NewTab Menu Page2 AltTab Close-Konq KMail
7s 1s 9s 1s 5s 0s 5s 5s 19s

Same config, CPU at only 1GHz (powersave mode):
6s 1s 7s 1s 7s 0s 3s 6s 20s
(so nearly the same as expected)

With 2.6.24 with the patch, ordered mode, powersave mode :
18s 1s 55s 20s 1s 38s 18s 25s 78s
(KMail still fails to load in a decent time, and opening the first page is also a pain [was linux.org.uk/Portaloo.cs]).

I activated dmesg + printk timing + the patch + echo 1 > block_dump while trying to do the following sequence:
Start Konqueror, open the bookmarks, load the page, close Konqueror, discard the KWin warning about an unresponsive process (Konqi takes too long to close), open KMail.
The sequence starts at 22:06 and ends at 22:10. Roughly, the first minute or so relates to Konqueror, and the last minute to KMail.
The result is appended as an attachment.
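
The block_dump toggle mentioned above is a /proc/sys/vm sysctl; a sketch of the capture procedure (run as root):

```shell
# Log every block I/O (which task dirtied/read which block) to the kernel log:
echo 1 > /proc/sys/vm/block_dump
# ... run the test sequence here ...
dmesg | tail -n 100                 # block_dump lines look like "task(pid): dirtied inode ..."
echo 0 > /proc/sys/vm/block_dump    # switch it off again; it is very verbose
```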
Comment 12 Christian Casteyde 2008-01-25 13:24:58 UTC
Created attachment 14578 [details]
Output of kernel message for the patch

Output of the patch results.
Comment 13 Christian Casteyde 2008-01-25 13:37:50 UTC
I didn't manage to attach the next dmesg output, since the bzip2-compressed form of the syslog output is bigger than what bugzilla can accept.
So you can download it at:
http://casteyde.christian.free.fr/mess2.txt.bz2

I'll keep it there for a couple of months.
This file is a bzip2 excerpt of the syslog info which contains the block_dump output. As we can see, KMail never wins against pdflush, so it gets stuck while dirtying some files (it's not atime, since I use noatime).
Comment 14 Christian Casteyde 2008-01-25 13:46:44 UTC
More info:
1. First, I issued a "tail -f /var/log/messages" while doing the test, hence the "cat" command in the block dump.
2. Second, I've just noticed my Mail dir contains huge mailboxes which I never really cleaned up:

christian@sirius:~$ ls -l Mail
total 248056
-rw------- 1 christian users  48435316 Oct 31 21:33 OE5-Boîte\ de\ réception
drwx------ 5 christian users      4096 Dec 13 19:58 boîte\ de\ petri/
-rw------- 1 christian users      5940 Jan 25 22:15 drafts
-rw------- 1 christian users 124560600 Jan 25 22:15 inbox
-rw------- 1 christian users         0 Jan 25 22:15 outbox
-rw------- 1 christian users         0 May  1  2007 received
-rw------- 1 christian users  80723766 Jan 23 19:23 sent-mail
drwx------ 5 christian users      4096 Jan 25 22:15 templates/
-rw------- 1 christian users         0 Jan 25 22:05 trash

That certainly is the cause of the huge KMail latency observed compared to other softwares, but that doesn't explain why the problem only occurs with ordered data mode.
Comment 15 Jan Kara 2008-01-28 08:51:19 UTC
Thank you for the test but actually, the debug messages I've added are KERN_INFO and they didn't make it into your log... So could you please change configuration of your syslog so that KERN_INFO messages are logged as well (or just change the patch so that messages are printed with KERN_ERR instead of KERN_INFO)?

And one more thing - when konqueror is stuck in log_wait_commit, could you please do "echo w >/proc/sysrq-trigger" (you have to have Magic SysRq enabled in your kernel config for this to work)? Stack traces of blocked processes will then be dumped into the log and one can find out exactly why konqueror is waiting on commit. Thank you, and sorry for bothering you with all the tests.
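The debugging step suggested above can be sketched as a small guarded script (the procfs paths are the standard ones; this is an illustrative sketch, not part of the original instructions - it needs root and CONFIG_MAGIC_SYSRQ, and degrades to a notice otherwise):

```shell
# Dump stack traces of blocked (D-state) tasks via Magic SysRq.
if [ -w /proc/sysrq-trigger ]; then
    echo 1 > /proc/sys/kernel/sysrq    # make sure SysRq functions are enabled
    echo w > /proc/sysrq-trigger       # dump stacks of blocked tasks
    dmesg | tail -n 40                 # the traces land in the kernel ring buffer
else
    echo "need root (and Magic SysRq) to trigger the dump"
fi
```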
Comment 16 Christian Casteyde 2008-01-28 11:53:47 UTC
Created attachment 14628 [details]
Full kernel log + kernel mode call stacks of blocked process

OK, I added a kern.* /var/log/kernel line in syslog.conf, and did what you asked.

I triggered the sysrq trigger twice: once at Konqueror startup while running my script, and once when it was blocked opening the bookmark menu (that lasted something like 10-15s, so I could switch to konsole and run the command once again).

The interesting part of the kernel log is appended. Curiously, konqueror is supposedly blocked in the fsync system call (as is kmail in the first trigger), but I do not understand that at all, since I'm quite convinced it does not call it.

To prove that, I tried to run
strace konqueror 2>&1 | grep fsync
to check it, and indeed konqueror does not seem to call fsync directly. I'm confused; hope it'll help you...

Maybe I could also start it under strace directly and see the last system call actually made? I'll post the result later tonight.
Comment 17 Jan Kara 2008-01-28 13:06:04 UTC
Thanks for the test! So two observations from your logs:
1) Transactions really have lots of ordered data and only a bit of metadata, and that is what makes commit take a while - so this confirms my suspicion. I could further decrease the amount of data that can be queued to a transaction (currently it is 128MB for your system), which would reduce the latency a bit more (but I'm afraid there's a tradeoff: the more you decrease the transaction size, and thus the latency, the more you also decrease throughput).
2) konqueror and kmail both seem to call fsync() (actually fdatasync() when I strace it locally) as you write. And because fsync() forces commit of a transaction, there's not much we can do other than wait for all writes to finish...

Hmm... Well, you can try decreasing the number 32768 in jbd/transaction.c: start_this_handle(). Maybe to something like 8192 - that is the number of data buffers which can be attached to a transaction. You'll see how much the latency improves and how much you lose in the throughput of dd/cp/whatever.

I'll try to think how one can make fsync() in ext3 more clever. I think Andrew had some ideas.
Comment 18 Christian Casteyde 2008-02-18 05:46:32 UTC
I've just done the reverse test:
- all disks on ext3 + ordered mode, CPU with powersave;
- added /lib/nofsync.so with the following code:

/* #define DEBUG_FSYNC */

#ifdef DEBUG_FSYNC
#include <stdio.h>
#include <unistd.h>
#endif

int sync()
{
#ifdef DEBUG_FSYNC
        static int i=0;
        FILE *f = fopen("/tmp/fsync.log", "a+t");
        if (f != NULL)
        {
                fprintf(f, "%d sync from pid %d (ppid %d)\n", ++i, getpid(), getppid());
                fclose(f);
        }
#endif
        return 0;
}

int fsync (int __fd)
{
#ifdef DEBUG_FSYNC
        static int i=0;
        FILE *f = fopen("/tmp/fsync.log", "a+t");
        if (f != NULL)
        {
                fprintf(f, "%d fsync from pid %d (ppid %d)\n", ++i, getpid(), getppid());
                fclose(f);
        }
#endif
        return 0;
}

int fdatasync (int __fd)
{
#ifdef DEBUG_FSYNC
        static int i=0;
        FILE *f = fopen("/tmp/fsync.log", "a+t");
        if (f != NULL)
        {
                fprintf(f, "%d fdatasync from pid %d (ppid %d)\n", ++i, getpid(), getppid());
                fclose(f);
        }
#endif
        return 0;
}

- added /lib/nofsync.so in /etc/ld.so.preload + ldconfig (yeah, that's a great feature :-);
- reran the test:
Konq Menu Page1 NewTab Menu Page2 AltTab Close Konq KMail
6s   0s   3s    0s     1s   3s    3s     0s         12s

which is by far the best result I ever got! Application start-up times apart, the programs are quite usable (3s max of latency on busy disk access without buffering [i.e. KMail is not really a video player], which is quite honest). And I'm running with the default memory tweaks (= I didn't throttle the writing process).

So it is very clear that the main latency problems are due to overuse of fdatasync in stupid programs (I consider fsync stupid on a high-level fs, especially a journaling fs, and stupid also for "regular" programs), and to coarse-grained sync in the VFS/ext3 (as you suspected, the kernel is not clean here either: fdatasync should only block the writing process, or reading processes waiting their turn for the syncer)...

Just for information, is fdatasync redirected to sync in the VFS, or only with ext2/ext3?
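For reference, the preload trick used above can be reproduced with a minimal self-contained sketch (the file name nofsync_demo.c and the /bin/true target are illustrative assumptions, not from this report; a system-wide installation would instead copy the .so to /lib, add it to /etc/ld.so.preload, and run ldconfig):

```shell
# Build a tiny shared object that stubs out fsync()/fdatasync(), then
# inject it into a single process with LD_PRELOAD.
cat > nofsync_demo.c <<'EOF'
int fsync(int fd)     { (void)fd; return 0; }  /* pretend success */
int fdatasync(int fd) { (void)fd; return 0; }
EOF
gcc -shared -fPIC -o nofsync_demo.so nofsync_demo.c
LD_PRELOAD=$PWD/nofsync_demo.so /bin/true && echo "preload ok"
```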
Comment 19 Jan Kara 2008-02-18 10:54:24 UTC
Thanks for verifying this. I've actually spoken to a KDE developer and they're aware they are doing fdatasync() and that it's not the most clever thing to do. They introduced it because in the not too distant past XFS didn't have ordered mode and thus a crash could result in zeroed out cache files which KMail/Konqueror wasn't able to handle.

fdatasync() is redirected to fsync() and journal commit only in ext3 filesystem. Actually, you'd see that ext2 probably does much better than ext3 because it doesn't have the journaling guarantees to obey and other journaling overhead...
Comment 20 Nathan Grennan 2008-04-03 18:03:53 UTC
I see this problem with Firefox 3 beta 3 and later. They disabled async I/O in sqlite in beta 3 because of database corruption problems, so now it is an fsync monster. For each history entry, which gets updated every time you visit a new page, sqlite does six fsyncs. This causes Firefox's UI to freeze for seconds, often 30 seconds or more, when the system is under heavy I/O. All it takes to create heavy I/O is something like "dd if=/dev/zero of=test-file bs=1M count=1024". I also see this with simple things like copying large files such as DVD ISOs.

As the original bug reporter said, this is completely unusable, at least with ordered mode. In addition, even with writeback mode I consider the latency unacceptable. I am not on a low-end machine. I see this with Fedora 8 x86_64 systems running recent 2.6.24.x kernels: the first a 2.4GHz quad core with 4GB memory and 4x500GB software RAID5, the second a 2.4GHz dual core with 2GB memory and 2x300GB software RAID1.

As soon as distribution releases like Fedora 9 go official with Firefox 3 as the default browser, this problem is likely to become very visible.

https://bugzilla.mozilla.org/show_bug.cgi?id=42482


When are we going to see real progress on this?
Comment 21 Jan Kara 2008-04-07 05:58:45 UTC
Three points here:
1) fsync() is expensive by definition (you have to wait for the disk to acknowledge that the data is written), even more so on a journaling filesystem. That is not going to change until we start using solid-state disks. Application designers should keep this in mind when writing their applications... I'd personally prefer to lose my browsing history on a machine crash rather than slow down the operation of my browser (BTW: the pointer to the bug seems to be wrong - that is some unrelated bug).
2) fsync() in ordered mode is more expensive than it possibly needs to be. That is highly non-trivial to fix, but I'm working on it. I already have patches which are a step in the right direction and could improve the situation a bit, but I'd call them alpha quality so far :). But ordered mode is never going to be faster than writeback mode.
3) Your observed high latency in writeback mode is probably caused by the fact that the disk is loaded with IO. Not much we can do here... maybe ionicing the process that loads the disk will help.
Comment 22 Nathan Grennan 2008-04-07 10:15:05 UTC
  Here is the proper link.

https://bugzilla.mozilla.org/show_bug.cgi?id=421482

I definitely agree I wouldn't care if I lost some browsing history during a crash, but they don't think they have any middle ground. They think async means a corrupt database, not just missing data. They have talked about batching the sync inserts into the database every 5-10 seconds, but I don't get the impression that will really help. If the high I/O lasts more than 5-10 seconds, which it is very likely to do, then batching is of no help.

What I think programs like Firefox really need to do is move the writing of the database to another thread, but their programmers seem to be fairly anti-threading.
Comment 23 Christian Casteyde 2008-04-07 10:45:41 UTC
This is going off topic, but sincerely I do not understand why there is file/database corruption in the first place. For KMail, it was said that XFS had a bug. Well, I guess this was critical, and has been fixed since, so they could simply remove the fsync calls now. For Mozilla, it should be the same.

There are two options:
a. the software writes parts of files in an unordered way, and therefore could have inconsistent files whether it calls fsync or not;
b. the software writes its data in such a way that the last thing lost is the last thing not passed to the OS. The file couldn't be corrupted then; it's the job of the journaling filesystem to take care of that. Therefore, there is no need for fsync either!

I have really never hit a case professionally where fsync was needed. The only thing that could be tricky is if the C library (or any other library layer) buffers the application's requests and executes them out of order. In that case, the wisest thing to do is to write directly to the files, or open them with the right options. I highly doubt any library does that.

So fsync may be very costly, but it shouldn't be called in any classic software. Certainly not in something as simple - sorry - as a web browser or a mail client!

Btw, there are effectively many problems with disk access which are simply related to cache management. And indeed this is certainly trickier to get right than OOM or the CPU scheduler! Heavy IO can discard read pages, page out code, and so on, and there are certainly many heuristics to find (for instance, throttling cache usage for heavy-IO processes to prevent cache poisoning for all other processes), and many technical problems (the cache is shared and does not know its usage context, for example).
But this is really off topic, even if Linux would make a giant leap if it could handle those cases. I've never seen any OS with good behaviour here (every version of Windows is a dog, Linux is a dog; I haven't tried others, but I do not expect much either). It seems disks are still as slow as they were 10 years ago, and now this is much more noticeable and falls below the expectations of users.
Comment 24 Nathan Grennan 2008-04-07 11:54:25 UTC
I think the problem is sqlite. I don't think it was designed to be durable in the async state, which could be considered a bug. But then, it is common for databases to use fsync. Personally I think this points out what a poor choice sqlite was.

They also had a secondary issue with async where it was resulting in slow shutdown of Firefox. This was probably because they weren't doing regular writes to disk.
Comment 25 Jan Kara 2008-04-07 13:23:25 UTC
Actually, there may be a reason for fsync(). In ordered mode some reordering of data writes can happen - we guarantee that all the data is written at transaction commit, but before that an arbitrary subset of the data can be written, so in case of a crash you can see only those writes.
Comment 26 Nathan Grennan 2008-04-07 14:12:29 UTC
It doesn't use fsync in async mode. It only does fsyncs with sync mode. I have personally tested this by using beta 2 which was async, and beta 3 and later which is sync.

In addition to using fsync in sync mode, it also does not one, but six fsyncs back to back for each history addition.
Comment 27 Christian Casteyde 2008-04-19 00:10:35 UTC
In response to #25:
Reordering data writes is perfectly OK in the transaction code. However, it is not in the buffering code (I think this was not what you were talking about, however).

The filesystem can do whatever it wants, provided the journaling mechanism ensures that data written by the application won't be lost, or can only be lost in the reverse of the order it was written (say, the app writes A, B, C: the kernel could lose C, or C and B, or all three, but not B only). The ordering is the only requirement of the application: if the kernel ensures there won't be an out-of-order data loss while remaining async when write() returns, that should be sufficient, and the application shouldn't have to call fsync at all.
Comment 28 Christian Casteyde 2008-04-19 00:11:22 UTC
For information, I filed the following KDE bug once upon a time:
http://bugs.kde.org/show_bug.cgi?id=142462

It states that fsync from desktop applications ruins power saving as well. Of course, people seem to be looking more closely at latencies and power savings nowadays, so all these applications will certainly have to be fixed now! That lowers the importance of this bug. Indeed, if the fsync code had a "sleep(5)" to expose the problem, these applications would certainly be fixed! And people would not sync everything anymore...
Comment 29 Jan Kara 2008-04-21 02:54:58 UTC
To comment #27: my point was that in ordered mode it can, under some circumstances, happen (although it's unlikely) that you write to some file A and then to B, and after a crash you'll see only B written. In principle, ordered mode only guarantees that you can't see random garbage instead of your data, and that your data will hit the disk at the latest commit_interval (5s by default) after you wrote it.
Comment 30 Tyson Whitehead 2010-11-19 22:00:59 UTC
We ran into something that sounds very much like this on one of our systems that has a large amount of memory (over 200GB of RAM).  Processes doing large amounts of writing (e.g., an rsync) cause other interactive processes (e.g., saving a file in vi) to hang for upwards of 8 minutes if they call fsync when writing.

It looks very much like the ext3 system in ordered mode is forcing all the dirty blocks accumulated by the IO-intensive process (which can be substantial due to the amount of memory available for caching) to also hit the disk before the interactive process that made the fsync call is allowed to continue.

Is this correct?  Does ordered mode apply across the entire system?  Is it possible to get it to apply on a per-file or per-process basis instead?  This sort of dramatic cross-interference between processes is pretty brutal on a large system like this (it was intended to consolidate several other servers).

Thanks!  -Tyson
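One commonly suggested mitigation for this kind of setup (an aside, not something proposed in this thread) is to cap how much dirty page cache a writer may accumulate, via the vm.dirty_* sysctls - on a 200GB box the percentage-based defaults can let many gigabytes of dirty data queue up behind someone else's fsync():

```shell
# Show the current writeback thresholds.
cat /proc/sys/vm/dirty_ratio             # % of memory before writers block
cat /proc/sys/vm/dirty_background_ratio  # % at which background flushing starts
# As root one might lower them, e.g.:
#   echo 5 > /proc/sys/vm/dirty_ratio
#   echo 1 > /proc/sys/vm/dirty_background_ratio
```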
Comment 31 Christian Casteyde 2010-11-20 00:37:58 UTC
That was my understanding. The sad thing is that the more powerful the system is, the greater the latencies you get.

For me, the system is quite usable with recent kernels (2.6.36); I redid the test and ordered mode doesn't seem to be impacted as it was with 2.6.24. So for me ordered mode has become usable (and has become the default again). I do not need my nofsync hack for now.

However, there are still great latencies (more than 26s to start KMail [please note I upgraded to KDE4 since, so it's not the same user environment either]), and I still consider IO priorities the main defect for desktop interactivity (and I think that even with SSDs there will be some problems). Everybody looks at CPU scheduling for unreal and stupid use cases, but the real problem is IO throttling/fairness, since the disk is the slowest device and will still be for a long time. CFQ is not fair at all, indeed.

I don't know if fdatasync is still implemented as fsync internally, or if the algorithms used in the journaling code lose the granularity of application requests, but I'm still very suspicious. I think there is a lot of work to do in this domain, and I'm not expecting progress anytime soon because of that (it's a difficult problem indeed).

I don't know either if read operations are handled before write operations (they should be: it's useless to block programs that read files just because a single stupid program does an fsync while the system is write-IO loaded by a third process). I hope it's handled that way, even if I grant that's something difficult due to reentrancy in the block layer or locking problems.

Still, I'm wondering if modern complex disk algorithms are not bloatware: the way delayed allocation is done implies out-of-order operations between file operations, which compels userspace applications to use fsync, which ruins all the efforts the kernel is making in the first place to be faster. Let's state it: it failed. It could be done correctly if fdatasync requests tracked the file's blocks down to the block layer; but meanwhile, a dumb block allocator and a dumb filesystem could do a better job with dumb userspace applications that do not use fsync, while keeping data integrity as long as operation order was preserved **even** between different files. There would still be a problem if an fsync were done: everything would have to be written. However, the fsync would not be necessary in the first place... Sometimes, less is better (just my 2 cents of philosophy). A form of BFSQ... (not a flamewar lol).
Comment 32 Tyson Whitehead 2010-11-22 14:44:59 UTC
I should add this is OpenVZ kernel 2.6.18-194.26.1.el5.028stab070.14.
Comment 33 Tyson Whitehead 2010-11-22 16:54:52 UTC
It also happens with the non-OpenVZ 2.6.18-194.26.1.el5 kernel, but seems to be resolved by OpenVZ kernel 2.6.27-levitan.1.

Guess we will just have to use data=writeback or something until RHEL 6.

Cheers!  -Tyson
Comment 34 Alan 2012-05-17 15:17:47 UTC
Closing this bug as obsolete. Some of the problem is still there, but it's well known, and for the current filesystems it's probably unfixable.
