Bug 14830

Summary:	When other IO is running sync times go to 10 to 20 minutes
Product:	File System	Reporter:	Michael Godfrey (godfrey)
Component:	ext4	Assignee:	fs_ext4 (fs_ext4)
Status:	CLOSED OBSOLETE
Severity:	normal	CC:	agk, alan, david, haircut, jack, jmaggard10, linux-kernel-bugs, neil.broomfield, rhuddusa, sandeen, tytso
Priority:	P1
Hardware:	All
OS:	Linux
Kernel Version:	2.6.31.6-166.fc12.x86_64	Subsystem:
Regression:	Yes	Bisected commit-id:
Attachments:	descrition of system and excerps from /var/log/messages Requested iostat.log file messages output with sync running Fix long sync times during heavy writing additional data from kernel 2.6.32.9-67.fc12.x86_64 Patch to stop background writeback when other work is queued for the thread Patch to stop background writeback when other work is queued for the thread Patch to stop background writeback when other work is queued for the thread

Description Michael Godfrey 2009-12-18 08:02:22 UTC

Created attachment 24220 [details]
descrition of system and excerps from /var/log/messages

The sync task runs for a very long time and generates
log messages: INFO: task sync:1996 blocked for more than 120 seconds.
Under similar conditions the same messages appear in /var/log/messages,
but referencing nfsd.  During these periods performance drops
drastically.

See attached.

Comment 1 Alasdair G Kergon 2010-01-13 19:15:12 UTC

filesystem or nfs perhaps?

Comment 2 Jan Kara 2010-01-18 21:50:38 UTC

There are two issues here - the first is that sync takes long - this is at the level of "don't do it when it hurts" kind of thing ;). So when you do heavy writing and call sync, it simply takes long time to flush all the caches to disk. If you think the time is inappropriately long, we can have a look at it but for that we'd need much more details like amount and nature of data writen (many small files vs a few large ones), time it takes sync to complete, speed of disks for sequential IO...

The second issue is that nfsd blocks as well. Partly this might be because sync blocks writers (so that it can get it's work done in a finite time), partly it might be a limitation of ext4 because all metadata writes go through a journal which has a limited size and thus we have to copy data from the journal to final locations on disk once in a while and that usually leads to all writer processes blocking waiting for the journal space to be freed (which can take a longer time when 'sync' process is making disk busy with data writes).

Note in the log file:
This problem prevents production use of systems using this kernel.

evokes a question: Do you have a kernel which behaved better for you? Which one?

Comment 3 Michael Godfrey 2010-01-18 23:58:09 UTC

>This problem prevents production use of systems using this kernel.

>evokes a question: Do you have a kernel which behaved better for you? Which
>one?

Yes.  RHEL5.4 does not show this problem.  It is the production
system that works in this environment.

The response above is disappointing.  Is sync response of 20 minutes,
including several task timeouts to be considered "normal?"

>If you think the time is inappropriately long, we can have a look at it
>but for that we'd need much more details like amount and nature of data writen
>(many small files vs a few large ones), time it takes sync to complete, speed
>of disks for sequential IO...

I am sorry to have to tell you that in this environment we do not
deal in exclusively small or large files, we actually have quite a
few of both. When an rsync which transfers about 50GB of files of
various sizes is running, the hung condition is continuous until the rsync
completes.  This is just a pretty typical load.  You could try it
yourself.  No special sizes of files are required.  I think I
mentioned that the ext4 LVM is a RAID 50 3ware 9650SE-8LPML,
with 8 2T drives.  Its throughput for reading and writing is good
when the system is not locked up.

Comment 4 Anonymous Emailer 2010-01-19 17:24:49 UTC

Reply-To: cslee-list@cybericom.co.uk

bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=14830
>
>
>
>
>
> --- Comment #3 from Michael Godfrey <godfrey@isl.stanford.edu>  2010-01-18
> 23:58:09 ---
>   
>> This problem prevents production use of systems using this kernel.
>>     
>
>   
>> evokes a question: Do you have a kernel which behaved better for you? Which
>> one?
>>     
>
> Yes.  RHEL5.4 does not show this problem.  It is the production
> system that works in this environment.
>
> The response above is disappointing.  Is sync response of 20 minutes,
> including several task timeouts to be considered "normal?"
>
>   
>> If you think the time is inappropriately long, we can have a look at it
>> but for that we'd need much more details like amount and nature of data
>> writen
>> (many small files vs a few large ones), time it takes sync to complete,
>> speed
>> of disks for sequential IO...
>>     
>
> I am sorry to have to tell you that in this environment we do not
> deal in exclusively small or large files, we actually have quite a
> few of both. When an rsync which transfers about 50GB of files of
> various sizes is running, the hung condition is continuous until the rsync
> completes.  This is just a pretty typical load.  You could try it
> yourself.  No special sizes of files are required.  I think I
> mentioned that the ext4 LVM is a RAID 50 3ware 9650SE-8LPML,
> with 8 2T drives.  Its throughput for reading and writing is good
> when the system is not locked up.
>
>   
Is it possible that it is something allong the lines of what is 
described at this link:
http://notemagnet.blogspot.com/2008/08/linux-write-cache-mystery.html

If so a runtime adjustment might help you out.

Chris.

Comment 5 Theodore Tso 2010-01-19 19:38:34 UTC

So the Linux write cache mystery is unlikely to solve the problem since that was talking about backporting some tuning parameters from 2.6.22 to the RHEL/CentOS 2.6.18 kernel, and here the problem seems to be that the sync is taking much longer on a 2.6.31 FC 12 kernel. 

So the first thing I notice is the fact that you have the nodelalloc mount option enabled.   Any particular reason why you did that?   Try removing that; one of the reasons why ext4 is generally described as being much better than ext3 with respect to this problem (of the machine becoming unresponsive during a sync) is because of delayed allocation, and you've turned it off.   So try removing nodelalloc and see if that makes the performance come back.

Another thing that might be worth testing is to see whether an ext3 filesystem on a 2.6.31 FC 12 kernel behaves any differently.   This may be something that is some kind of VM tuning issue between RHEL 5.4 and a modern kernel; I don't know how many people try running Fedora 12 on a system with large amounts of memory and an NFS load, and maybe there is some kind of tuning issue that has been exposed.   So that's a quick experiment that's worth doing just so we can figure out where we need to concentrate our diagnostic efforts.

Another thing to try is to do some instrumentation using iostat to see what the system is doing, before, during, and after the sync command.

Comment 6 Michael Godfrey 2010-01-19 21:56:12 UTC

>So the first thing I notice is the fact that you have the nodelalloc mount
>option enabled.   Any particular reason why you did that?  

This was required due to an error in 2.6.30.  It is possible that
it is no longer needed. I will check.

It is not really feasible to revert to ext3 other than going back to
RHEL5 which is what has been done.  One of the main purposes of using
FC11 was ext4.  fsck under ext3 is about 8 hours, under ext4 about
9 minutes.  And, ext4 has other well-advertised advantages.

>Another thing to try is to do some instrumentation using iostat to see what
>the
>system is doing, before, during, and after the sync command.

I have tried this to some extent.  It is not too easy when response
is extremely slow.

Comment 7 Michael Godfrey 2010-01-25 07:38:52 UTC

I tried the advice above about nodelalloc using a newer kernel.
It had no observable effect, but did not cause problems as it
had before. So, I left it turned off.  And, I installed kernel
2.6.31.12-174.2.3.fc12.x86_64.

None of this helped.  sync times still at 20 minutes during rsync,
and task (sync or rsync) timeouts in /var/log/messages.

I guess that patience is called for here.

Comment 8 Theodore Tso 2010-01-25 08:27:26 UTC

Hmm, can you run "iostat 1 | tee iostat.log &", and while that is running, wait for a 15 seconds or so we can capture what things like in steady state, and then type "sync", and note when the sync command was initiated in the iostat.log file?   It would be useful seeing what this looks like on both your ext3 production server and on the ext4 test server.  Do you know if there are any other differences between the two systems, in terms of the workload seen by your production server versus your test server?

Another thing that would be very useful to do is to enable ftrace, and then cd to /sys/kernel/debug/tracing.  (This assumes that you have debugfs mounted on /sys/kernel/debug.)   Then "echo 1 > events/jbd2/jbd2_run_stats/enable" and then in a similar fashion, do "cat trace_pipe | tee /tmp/trace.output", wait for four or five data samples from your file system of interest, and then issue the sync command, and let's see what is happening.

Comment 9 Michael Godfrey 2010-01-25 11:07:46 UTC

Created attachment 24705 [details]
Requested iostat.log file

As requested by: Theodore Tso <tytso@mit.edu>

Comment 10 Michael Godfrey 2010-01-25 11:16:16 UTC

Created attachment 24706 [details]
messages  output with sync running

Additional information for Theodore Tso <tytso@mit.edu>

Comment 11 Michael Godfrey 2010-01-25 11:19:12 UTC

I have attached the iostat.log and log/messages output.

sync was started after about 4 cycles of iostat.

This was run on the FC12 ext4 system.  No testing can
be done on the production system.

After kill -9 of the sync run it took about 20 minutes before 
it died.

Michael

Comment 12 Theodore Tso 2010-01-28 09:18:21 UTC

On Wed, Jan 27, 2010 at 02:06:25PM +0100, Andre Noll wrote:
> On 11:19, bugzilla-daemon@bugzilla.kernel.org wrote:
> > After kill -9 of the sync run it took about 20 minutes before 
> > it died.
> 
> I was seeing similar behaviour on one of our servers, and changing
> the io scheduler to noop fixed things for me. So it seems to be an
> issue with cfq which is somehow triggered by ext4 but not by ext3.
> 
> To change the IO scheduler, just execute
> 
>       echo noop > /sys/block/sda/queue/scheduler
> 
> (replace sda if necessary).

Andre or Michael.  If switching away from cfq helps, that's
definitely... interesting.  Given that cfq is the default scheduler, I
definitely want to understand what might be going on here.  Are either
if you able to run blktrace so we can get a sense of what is going on
under the cfq and deadline/noop I/O schedulers?

And in both of your cases, were you using a new file system freshly
created using mke2fs -t ext4, or was this a ext2/ext3 filesystem that
was converted for use under ext4?

Thanks,

					- Ted

Comment 13 Anonymous Emailer 2010-01-28 11:23:48 UTC

Reply-To: maan@systemlinux.org

On 02:53, tytso@mit.edu wrote:
> On Wed, Jan 27, 2010 at 02:06:25PM +0100, Andre Noll wrote:
> > On 11:19, bugzilla-daemon@bugzilla.kernel.org wrote:
> > > After kill -9 of the sync run it took about 20 minutes before 
> > > it died.
> > 
> > I was seeing similar behaviour on one of our servers, and changing
> > the io scheduler to noop fixed things for me. So it seems to be an
> > issue with cfq which is somehow triggered by ext4 but not by ext3.
> > 
> > To change the IO scheduler, just execute
> > 
> >     echo noop > /sys/block/sda/queue/scheduler
> > 
> > (replace sda if necessary).
> 
> Andre or Michael.  If switching away from cfq helps, that's
> definitely... interesting.  Given that cfq is the default scheduler, I
> definitely want to understand what might be going on here.  Are either
> if you able to run blktrace so we can get a sense of what is going on
> under the cfq and deadline/noop I/O schedulers?

Yes, I can use that machine freely for testing purposes, including
reboots. It is just our fallback server which creates hardlink-based
snapshots using rsync.

However, I have to recompile the kernel to include debugfs which is
needed by blktrace and I'd like to wait until the currently running
rsync completes before rebooting. Would you like to see the output of

	btrace /dev/mapper/...

or should I use more sophisticated command line options?

> And in both of your cases, were you using a new file system freshly
> created using mke2fs -t ext4, or was this a ext2/ext3 filesystem that
> was converted for use under ext4?

The ext4 file system was created from scratch using -O
dir_index,uninit_bg,extent, a block size of 4096 and 32768 bytes
per inode.

Thanks
Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe

Comment 14 Michael Godfrey 2010-01-28 16:21:57 UTC

On 01/27/2010 11:53 PM, tytso@mit.edu wrote:
> And in both of your cases, were you using a new file system freshly
> created using mke2fs -t ext4, or was this a ext2/ext3 filesystem that
> was converted for use under ext4?
>
> Thanks,
>
>                                       - Ted
>    
In my case it is new ext4.  I will try what tests I can in a day or
two.  The systems are busy right now.

Michael

Comment 15 Theodore Tso 2010-01-29 00:52:01 UTC

> In my case it is new ext4.  I will try what tests I can in a day or
> two.  The systems are busy right now.

Well, I'm flying back from New Zealand so I'll be off the grid mostly
until Sunday or Monday.

Thanks for your efforts!

						- Ted

Comment 16 Michael Godfrey 2010-01-29 06:37:55 UTC

I had a chance to try noop and deadline:

With the amount of testing that I could do they seemed to behave
about the same, and:

1. the task timeouts are gone.

2. sync, which used to take over 20 minutes is consistently
   about 10 minutes.  I did retest by setting scheduler back
   to cfg and the sync time went back to 24 minutes.

3. I also tried 2 variations:
   1. rsync from source to target:  rsync machinea:/aaa bbb
   2. rsync of an NFS-mounted filesystem to local filesystem.
      (i.e the rsync thought it was local:  rsync aaa/  bbb,
      but aaa was NFS-mounted (NFS3).
   These two variation behaved just about identically.

I have left the system set to deadline, as recommended.  With
this setting (and with noop) things like du seem quicker too.

So, this is good, but 10 minutes for sync is still way too long.
With these systems, I cannot compile debug kernels or stuff like
that.  They are in use for backup.

Any other suggestions?

Michael

Comment 17 Michael Godfrey 2010-02-04 04:16:40 UTC

I have now set deadline for all tests, and I had
a chance to try a system just like the one I have been
using, but with an 8.1T ext4 partition instead of the
11T partition before.

I used the same load:  rsync  /aaa /bbb, where /aaa was
NFS-mounted.
The behavior was quite similar, but the longest sync
time was about 4 minutes.

This suggests that the size of the filesystem is an
important factor.  Increasing the filesystem size by 36%
more than doubled the sync time.

Th size of the database in the 2 systems was somewhat larger
in the 11T system, but part of that was unused in the transfer.

Michael

Comment 18 Jan Kara 2010-02-15 14:28:38 UTC

Created attachment 25050 [details]
Fix long sync times during heavy writing

Last week, I've noticed a bug in writeback code causing sync to take longer than necessary. This patch should fix it. Is it possible for you to try it? Thanks.

Comment 19 Michael Godfrey 2010-02-15 18:29:18 UTC

On 02/15/2010 06:28 AM, bugzilla-daemon@bugzilla.kernel.org wrote:
> Last week, I've noticed a bug in writeback code causing sync to take longer
> than necessary. This patch should fix it. Is it possible for you to try it?
> Thanks.
>    
I have been following this.  But right now I am leaving on a trip.  And,
I cannot right now build kernels for the system where we get the
problem.  These systems are in production use for backup.

Sorry that I cannot help more.

Michael

Comment 20 Michael Godfrey 2010-03-10 06:56:33 UTC

Created attachment 25447 [details]
additional data from kernel 2.6.32.9-67.fc12.x86_64

Again, I have had very limited time, but I ran an 
rsync after update to kernel 2.6.32.9-67.fc12.x86_64.

The time for sync to complete was about as before,
(about 10 minutes) and timeouts occurred as shown in the
attached log entries.

I assume that the patches are not included in this
kernel, but I just thought that I should report this test.

Michael

Comment 21 Eric Sandeen 2010-03-10 15:36:20 UTC

Michael, you're right the patch is not in f12 (not in upstream yet, either).

Comment 22 Michael Godfrey 2010-04-02 05:41:26 UTC

I tried another test with 2.6.32.10-90.fc12.x86_64.  I did
not expect an improvement.  But, the results were actually
a lot worse.  After starting an rsync  which transferred a
few 100GB through NFS, I started a sync using time sync.
This caused a number of the usual 2 minute timeout messages.  But, also
it did not close until about 20 minutes after the rsync had
completed.  All together  it ran for several hours. By the
way it was not possible to kill the sync using kill -9.

This is clearly hopeless.

Will anything be done about this in 2.6.33 for fc13?

Will the fact that Google is going with ext4 possibly help?

Michael

Comment 23 Eric Sandeen 2010-04-02 16:41:39 UTC

The patch in comment #18 is still not upstream.  Jan, what's the status of that?

Michael, I did discover one issue upstream related to fsync, see:

http://marc.info/?l=linux-ext4&m=126987658403214&w=2

This was very inefficient scanning of large files for sync.  However, for
sys_sync, I didn't see the problem because the loop was limited in that case, so it may not be related.

(In reply to comment #3)
> >This problem prevents production use of systems using this kernel.
> 
> >evokes a question: Do you have a kernel which behaved better for you? Which
> >one?
> 
> Yes.  RHEL5.4 does not show this problem.  It is the production
> system that works in this environment.

RHEL5.4 on ext3 or ext4?
 
> The response above is disappointing.  Is sync response of 20 minutes,
> including several task timeouts to be considered "normal?"

Probably not, but it really depends.  If you have a system with massive amounts of memory, and a slow path to the disk, then sure, if you have to flush many many gigabytes it will be slow.  But that's extreme, and I don't think you're in that case.  You do have a 12G box though, so that's potentially a lot of memory to flush.  OTOH your storage should probably be reasonably fast.

It does seem like something else is going on here.

(In reply to comment #22)
> I tried another test with 2.6.32.10-90.fc12.x86_64.  I did
> not expect an improvement.  But, the results were actually
> a lot worse.  After starting an rsync  which transferred a
> few 100GB through NFS, I started a sync using time sync.
> This caused a number of the usual 2 minute timeout messages.  But, also
> it did not close until about 20 minutes after the rsync had
> completed.  All together  it ran for several hours. By the
> way it was not possible to kill the sync using kill -9.
> 
> This is clearly hopeless.

Hm, don't give up quite yet ;)

Can you describe this test a little more explicitly; which box was the nfs server vs. client, which boxes were the rsync servers/clients, which box ran sync?  I just don't want to make wrong assumptions in trying to recreate this.

> Will anything be done about this in 2.6.33 for fc13?

we still have to get to the bottom of the problem before we can talk about fixes, I'm afraid.

> Will the fact that Google is going with ext4 possibly help?

I don't think so.

One thing that may be interesting is to run blktrace (or use seekwatcher to do that for you) during the sync call that is stalling out, to get an idea of what is happening at the block layer and when.

Comment 24 Eric Sandeen 2010-04-02 21:23:21 UTC

For what it's worth, assuming I have replicated the behavior properly, the long-running sync doesn't seem unique to ext4 at all.

I can replicate it by running a script which creates 4G files in sequence, putting it in the background, sleeping for a while, and typing "sync" - which never returns.

I see the same behavior on ext4 as well as xfs and ext3.

I applied Jan's patch from comment #18, and the behavior is unchanged.

Comment 25 Jan Kara 2010-04-06 16:46:52 UTC

As I'm reading through wb_writeback, it could happen that the flushing thread gets stuck in background flushing and thus we never gets to processing the work for sync(1) and thus sync(1) never finishes. The attached patch should fix that.
Eric, could you please test whether this patch together with the patch from comment 18 fixes your testcase?

Comment 26 Jan Kara 2010-04-06 16:47:54 UTC

Created attachment 25882 [details]
Patch to stop background writeback when other work is queued for the thread

Comment 27 Eric Sandeen 2010-04-06 16:52:28 UTC

Jan, will do after lunch.  Thanks for looking into this!

-Eric

Comment 28 Jan Kara 2010-04-06 16:55:10 UTC

Created attachment 25883 [details]
Patch to stop background writeback when other work is queued for the thread

Oops, attached a wrong patch. This is the right one.

Comment 29 Eric Sandeen 2010-04-06 22:59:27 UTC

A quick test on ext3 looks good; ext4 still seems to run away on sync :(

But I need to be a bit more methodical, and test a few more filesystems; will let you know.

I'm not familiar enough with all the new writeback code; does this mean that a sync will return as soon as any new IO is queued post-sync?  That seems odd if so - but maybe I misunderstand.

Thanks,
-Eric

Comment 30 Eric Sandeen 2010-04-06 23:05:57 UTC

Hm maybe I spoke too soon, had a couple runs on ext3 that looked good but now it's been syncing for many minutes ...

(this is the test where I create 4G files in a loop, let it go for a while, then time sync - on a 16g box)

Comment 31 Jan Kara 2010-04-07 09:33:00 UTC

(In reply to comment #29)
> I'm not familiar enough with all the new writeback code; does this mean that
> a
> sync will return as soon as any new IO is queued post-sync?  That seems odd
> if
> so - but maybe I misunderstand.
  No. The patch means that writeback thread stops doing pdflush-style writeback when it sees new work queued - work does not mean IO. It means that someone asks writeback thread to do some kind of writeout... Now I'm still not sure my patch is the right approach to the problem but I just wanted to checkout whether it
at least solves the problem of this particular workload.

  If you still see the problem (only less often) even with ext3, then we probably have also some other work that is livelockable and thus we never get to work submitted by sync(1). I guess I'll have to find a machine with enough disks and memory to try this out...

Comment 32 Eric Sandeen 2010-04-07 15:09:56 UTC

I'm doing this on a machine with a single spare disk, dedicated to the test, so disks aren't an issue.   I haven't tried to replicate with less memory.

I'll try to poke at it a bit more today...

-Eric

Comment 33 Neil Broomfield 2010-04-15 08:44:13 UTC

I think I might be seeing the same/similar issue, see "dmesg | tail" below, it seams fairly repeatable and occurs usually when I move a file from my download drive to my raid array 5 (5 disk, 6TB), I'm running Ubuntu 9.10 with all the latest official patches.

[301080.930044] INFO: task kjournald2:1583 blocked for more than 120 seconds.
[301080.930048] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[301080.930051] kjournald2    D ffff88007734e8e0     0  1583      2 0x00000000
[301080.930056]  ffff88006c38dd10 0000000000000046 ffff88005695d850 0000000000015880
[301080.930060]  ffff880070d247c0 0000000000015880 0000000000015880 0000000000015880
[301080.930063]  0000000000015880 ffff880070d247c0 0000000000015880 0000000000015880
[301080.930066] Call Trace:
[301080.930075]  [<ffffffff811f096a>] jbd2_journal_commit_transaction+0x1aa/0x1120
[301080.930080]  [<ffffffff8127c336>] ? rb_erase+0xd6/0x160
[301080.930085]  [<ffffffff81010785>] ? __switch_to+0x1e5/0x370
[301080.930088]  [<ffffffff8104f075>] ? finish_task_switch+0x65/0x120
[301080.930093]  [<ffffffff8152c98a>] ? _spin_lock_irqsave+0x2a/0x40
[301080.930097]  [<ffffffff8106af57>] ? lock_timer_base+0x37/0x70
[301080.930101]  [<ffffffff81078a30>] ? autoremove_wake_function+0x0/0x40
[301080.930105]  [<ffffffff811f6963>] kjournald2+0x103/0x270
[301080.930108]  [<ffffffff81078a30>] ? autoremove_wake_function+0x0/0x40
[301080.930111]  [<ffffffff811f6860>] ? kjournald2+0x0/0x270
[301080.930114]  [<ffffffff81078646>] kthread+0xa6/0xb0
[301080.930117]  [<ffffffff8101316a>] child_rip+0xa/0x20
[301080.930120]  [<ffffffff810785a0>] ? kthread+0x0/0xb0
[301080.930123]  [<ffffffff81013160>] ? child_rip+0x0/0x20
[301080.930145] INFO: task mv:2005 blocked for more than 120 seconds.
[301080.930146] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[301080.930148] mv            D ffff880079705928     0  2005   2464 0x00000000
[301080.930152]  ffff8800210f9868 0000000000000086 ffff8800210f97e8 0000000000015880
[301080.930155]  ffff880003b747c0 0000000000015880 0000000000015880 0000000000015880
[301080.930159]  0000000000015880 ffff880003b747c0 0000000000015880 0000000000015880
[301080.930162] Call Trace:
[301080.930165]  [<ffffffff8152c87d>] __down_read+0x8d/0xc6
[301080.930169]  [<ffffffff8152bc49>] down_read+0x19/0x20
[301080.930172]  [<ffffffff811b5fe2>] ext4_get_blocks+0x52/0x210
[301080.930175]  [<ffffffff811b62e7>] ext4_da_get_block_prep+0x77/0x100
[301080.930179]  [<ffffffff81148663>] __block_prepare_write+0x1c3/0x560
[301080.930182]  [<ffffffff811b6270>] ? ext4_da_get_block_prep+0x0/0x100
[301080.930185]  [<ffffffff81148b9f>] block_write_begin+0x5f/0x100
[301080.930188]  [<ffffffff811b8a0d>] ext4_da_write_begin+0x12d/0x260
[301080.930191]  [<ffffffff811b6270>] ? ext4_da_get_block_prep+0x0/0x100
[301080.930194]  [<ffffffff8104f075>] ? finish_task_switch+0x65/0x120
[301080.930198]  [<ffffffff810da1a2>] generic_perform_write+0xb2/0x1d0
[301080.930202]  [<ffffffff811e4dbb>] ? ext4_xattr_get+0x5b/0x90
[301080.930206]  [<ffffffff810dafd3>] generic_file_buffered_write+0x83/0x140
[301080.930209]  [<ffffffff810dc950>] __generic_file_aio_write_nolock+0x240/0x470
[301080.930213]  [<ffffffff811348f3>] ? touch_atime+0x33/0x150
[301080.930216]  [<ffffffff810dcca0>] generic_file_aio_write+0x70/0xf0
[301080.930221]  [<ffffffff811ae7f9>] ext4_file_write+0x49/0x160
[301080.930225]  [<ffffffff8111f342>] do_sync_write+0xf2/0x130
[301080.930229]  [<ffffffff811511db>] ? fsnotify+0xfb/0x140
[301080.930232]  [<ffffffff81078a30>] ? autoremove_wake_function+0x0/0x40
[301080.930235]  [<ffffffff81133333>] ? dput+0xc3/0x190
[301080.930239]  [<ffffffff81224a11>] ? security_file_permission+0x11/0x20
[301080.930242]  [<ffffffff8111f628>] vfs_write+0xb8/0x1a0
[301080.930245]  [<ffffffff811200dc>] sys_write+0x4c/0x80
[301080.930249]  [<ffffffff81012082>] system_call_fastpath+0x16/0x1b

Let me know if I can be of any further assistance? If you require more info / test running, I'm happy to help.

Neil

Comment 34 Dave Chinner 2010-04-20 02:49:34 UTC

See:

http://marc.info/?l=linux-fsdevel&m=127166071530948&w=2

Sync is acting as designed right now. I agree it's not ideal, but it's now defaulting to slow-but-safe behaviour rather than the previous behaviour of potentially not syncing everything that was dirty at the time of the sync call.

Cheers,

Dave.

Comment 35 Michael Godfrey 2010-04-20 03:46:28 UTC

Dave (above) said:

"Sync is acting as designed right now. I agree it's not ideal, but it's now
defaulting to slow-but-safe behaviour rather than the previous behaviour of
potentially not syncing everything that was dirty at the time of the sync call.

Dave,"

Are you aware that this blocks other IO so that a user who requests
a read of some data may have to wait for something like 20 minutes
before getting a response?  This includes, for instance, just typing
vi xxx.  Take a look at the reports above which show nfsd being effectively
blocked for periods of more than 20 minutes.

For me this is not just "not ideal" but simply useless.  I do not
see how a system with this behavior can be used.

I also do not see why sync completing with dirty data is a problem.
In an active system there will be new dirty data within milliseconds
of sync completion no matter what it does.

I am well-aware that this is not a simple problem.  But, a solution
that is consistent with the usability of the system is necessary.

Michael

Comment 36 Dave Chinner 2010-04-20 05:24:39 UTC

(In reply to comment #35)

> Are you aware that this blocks other IO so that a user who requests
> a read of some data may have to wait for something like 20 minutes
> before getting a response? This includes, for instance, just typing
> vi xxx.

That's not a just a read - vi(m) writes a backup file when you open
it. So it's blocking on writes.

However, that sort of antisocial behaviour under heavy write loads is usually caused by a filesystem concurrency limitation or a IO scheduler problem, not sync. However, unless you can reproduce the read hangs on XFS when sync is running, then I'm not the expert you're looking for to debug them. ;)

But I do know the endless sync problem is filesystem independent and I'm trying to do something about mitigating it's effects:

http://lkml.org/lkml/2010/4/19/410

> Take a look at the reports above which show nfsd being effectively
> blocked for periods of more than 20 minutes.

Can't say I've heard of any such recent problems on XFS....

Cheers,

Dave.

Comment 37 Neil Broomfield 2010-04-20 09:00:28 UTC

"Can't say I've heard of any such recent problems on XFS...."

Are these not effectivly the same issue?:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/276476

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/494476

Neil

Comment 38 Michael Godfrey 2010-04-30 06:02:32 UTC

Just a couple of comments on the posts above:

1. In our environment, home directories are NFS mounted,
   so vi cannot read .vimrc.  It appears that this is where
   it hangs.  The important thing is that nothing
   appears on the user's screen after typing vi. 

2. We do not have an XFS filesystem, so I cannot comment on
   whether these problems exist for XFS.

3. The comments indicate that, as usual, there is more than
   one problem.  I am hopeful that the ext4 problem that Eric
   Sandeen is fixing and proposed fixes for sync will make
   a significant difference.

Comment 39 Justin Maggard 2010-05-20 02:11:11 UTC

(In reply to comment #28)
> Created an attachment (id=25883) [details]
> Patch to stop background writeback when other work is queued for the thread
> 
> Oops, attached a wrong patch. This is the right one.

I ran into this issue recently (extra long sync times) and have been trying the patches attached in this thread.  The above patch specifically causes a pretty significant performance regression for me doing a simple sequential dd write on a dual-core Atom system running 2.6.33.4 x86_64.  This command:
dd bs=1M conv=fsync if=/dev/zero of=test_file count=10000
went from 135 MB/sec to all the way down to 92.9 MB/sec.  Is this expected?

Comment 40 Jan Kara 2010-05-20 11:11:41 UTC

Created attachment 26456 [details]
Patch to stop background writeback when other work is queued for the thread

Ah, there's a bug in the condition in the original patch and thus it stops background writeback when there it *no* other work queued instead when there *is*. Could you please test this fixed version?

Comment 41 Justin Maggard 2010-05-20 17:53:08 UTC

Ah, much better. :)  Performance for me with this version is now just about the same as without the patch.