Bug 12579 - ext4 filesystem hang
Summary: ext4 filesystem hang
Status: CLOSED PATCH_ALREADY_AVAILABLE
Alias: None
Product: File System
Classification: Unclassified
Component: ext4 (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: fs_ext4@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-01-29 21:03 UTC by Eric Sandeen
Modified: 2010-02-27 18:28 UTC (History)
2 users (show)

See Also:
Kernel Version: 2.6.29-rc4
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Tested patch (2.92 KB, patch)
2009-02-13 14:06 UTC, Eric Sandeen
Details | Diff
updated patch w/ my signoffs (3.01 KB, patch)
2009-02-13 14:22 UTC, Eric Sandeen
Details | Diff

Description Eric Sandeen 2009-01-29 21:03:40 UTC
Latest working kernel version:
Earliest failing kernel version: 2.6.29 rc2 git1
Distribution: Rawhide
Hardware Environment: x86_64
Software Environment: Rawhide
Problem Description:

Deadlock/hang:

SysRq : Show Blocked State
  task                        PC stack   pid father
livecd-creato D ffff880001869f30  3096 11953   3276
 ffff880064cc1b18 0000000000000046 0000000000000001 0000000000000002
 ffffffff818fe700 ffffffff818fe700 ffff8800836cf000 ffff880010f28000
 ffff88006389a390 ffff880010f283b8 0000000000000000 0000000000000046
Call Trace:
 [<ffffffff81016ec0>] ? read_tsc+0xe/0x24
 [<ffffffff81066749>] ? getnstimeofday+0x58/0xb4
 [<ffffffff81380f70>] io_schedule+0x63/0xa5
 [<ffffffff810a7029>] sync_page+0x49/0x4d
 [<ffffffff8138136a>] __wait_on_bit_lock+0x45/0x79
 [<ffffffff810a6fe0>] ? sync_page+0x0/0x4d
 [<ffffffff810a6f93>] __lock_page+0x63/0x6a
 [<ffffffff8105f0c8>] ? wake_bit_function+0x0/0x2a
 [<ffffffff810ae25b>] write_cache_pages+0x1fc/0x3a4
 [<ffffffffa0020231>] ? __mpage_da_writepage+0x0/0x163 [ext4]
 [<ffffffffa0005329>] ? jbd2_journal_start+0xef/0x119 [jbd2]
 [<ffffffffa0005329>] ? jbd2_journal_start+0xef/0x119 [jbd2]
 [<ffffffffa001ffe1>] ext4_da_writepages+0x2c0/0x457 [ext4]
 [<ffffffff8138164c>] ? mutex_unlock+0x9/0xb
 [<ffffffff810a8b01>] ? generic_file_aio_write+0x74/0xc5
 [<ffffffff8106ebcb>] ? print_lock_contention_bug+0x1e/0x110
 [<ffffffffa0021310>] ? ext4_da_get_block_write+0x0/0x130 [ext4]
 [<ffffffff810ae456>] do_writepages+0x2b/0x3b
 [<ffffffff810a7b3f>] __filemap_fdatawrite_range+0x51/0x59
 [<ffffffff810a889a>] filemap_fdatawrite+0x1a/0x1c
 [<ffffffff810fbbe5>] vfs_fsync+0x53/0xac
 [<ffffffff810fbc70>] do_fsync+0x32/0x48
 [<ffffffff810fbc94>] sys_fdatasync+0xe/0x12
 [<ffffffff810112ba>] system_call_fastpath+0x16/0x1b
kjournald2    D ffff88000c475e28  5040 11969      2
 ffff88000c475d40 0000000000000046 0000000000000000 0000000000000046
 ffffffff818fe700 ffffffff818fe700 0000000000000000 ffff880014882390
 ffff88007eb88000 ffff880014882748 000000000c475ce0 0000000000000046
Call Trace:
 [<ffffffff8105f215>] ? prepare_to_wait+0x1c/0x5e
 [<ffffffff81383055>] ? _spin_unlock_irqrestore+0x47/0x57
 [<ffffffff8106d609>] ? trace_hardirqs_on_caller+0x1f/0x153
 [<ffffffff8119a55f>] ? _raw_spin_unlock+0x8e/0x93
 [<ffffffffa00057e8>] jbd2_journal_commit_transaction+0x25c/0x159c [jbd2]
 [<ffffffff81054bf5>] ? lock_timer_base+0x26/0x4a
 [<ffffffff81383055>] ? _spin_unlock_irqrestore+0x47/0x57
 [<ffffffff8106d609>] ? trace_hardirqs_on_caller+0x1f/0x153
 [<ffffffff8105f090>] ? autoremove_wake_function+0x0/0x38
 [<ffffffff81054c67>] ? try_to_del_timer_sync+0x4e/0x5c
 [<ffffffffa000a68c>] kjournald2+0x172/0x3a8 [jbd2]
 [<ffffffff8105f090>] ? autoremove_wake_function+0x0/0x38
 [<ffffffffa000a51a>] ? kjournald2+0x0/0x3a8 [jbd2]
 [<ffffffff8105ed18>] kthread+0x49/0x76
 [<ffffffff8101262a>] child_rip+0xa/0x20
 [<ffffffff81383002>] ? _spin_unlock_irq+0x2b/0x37
 [<ffffffff8106d74a>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff81011f3e>] ? restore_args+0x0/0x30
 [<ffffffff8105ecaa>] ? kthreadd+0x176/0x19b
 [<ffffffff8105eccf>] ? kthread+0x0/0x76
 [<ffffffff81012620>] ? child_rip+0x0/0x20
pdflush       D ffff880001878350   992 12227      2
 ffff880050975960 0000000000000046 0000000000000000 ffff8800724d8ea0
 ffffffff818fe700 ffffffff818fe700 ffff8800509758e0 ffff880076c48000
 ffffffff815334f0 ffff880076c483b8 0000000000000046 0000000000000046
Call Trace:
 [<ffffffff8105f173>] ? prepare_to_wait_exclusive+0x1c/0x5e
 [<ffffffff81016ec0>] ? read_tsc+0xe/0x24
 [<ffffffff81066749>] ? getnstimeofday+0x58/0xb4
 [<ffffffff81380f70>] io_schedule+0x63/0xa5
 [<ffffffff810a7029>] sync_page+0x49/0x4d
 [<ffffffff8138136a>] __wait_on_bit_lock+0x45/0x79
 [<ffffffff810a6fe0>] ? sync_page+0x0/0x4d
 [<ffffffff810a6f93>] __lock_page+0x63/0x6a
 [<ffffffff8105f0c8>] ? wake_bit_function+0x0/0x2a
 [<ffffffff810ae25b>] write_cache_pages+0x1fc/0x3a4
 [<ffffffffa0020231>] ? __mpage_da_writepage+0x0/0x163 [ext4]
 [<ffffffffa0005329>] ? jbd2_journal_start+0xef/0x119 [jbd2]
 [<ffffffffa001ffe1>] ext4_da_writepages+0x2c0/0x457 [ext4]
 [<ffffffff810ae375>] ? write_cache_pages+0x316/0x3a4
 [<ffffffff8106ebcb>] ? print_lock_contention_bug+0x1e/0x110
 [<ffffffff8103b23a>] ? account_group_exec_runtime+0x2c/0x44
 [<ffffffff8106ebcb>] ? print_lock_contention_bug+0x1e/0x110
 [<ffffffffa0021310>] ? ext4_da_get_block_write+0x0/0x130 [ext4]
 [<ffffffff810ae456>] do_writepages+0x2b/0x3b
 [<ffffffff810f893e>] __writeback_single_inode+0x187/0x2fc
 [<ffffffff810f8ee0>] ? generic_sync_sb_inodes+0x275/0x362
 [<ffffffff810f8e86>] generic_sync_sb_inodes+0x21b/0x362
 [<ffffffff810f91c3>] writeback_inodes+0xa2/0xfa
 [<ffffffff810aeaaf>] background_writeout+0x92/0xcb
 [<ffffffff810af0eb>] pdflush+0x14e/0x201
 [<ffffffff810aea1d>] ? background_writeout+0x0/0xcb
 [<ffffffff810aef9d>] ? pdflush+0x0/0x201
 [<ffffffff8105ed18>] kthread+0x49/0x76
 [<ffffffff8101262a>] child_rip+0xa/0x20
 [<ffffffff81383002>] ? _spin_unlock_irq+0x2b/0x37
 [<ffffffff8106d74a>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff81011f3e>] ? restore_args+0x0/0x30
 [<ffffffff8105ecaa>] ? kthreadd+0x176/0x19b
 [<ffffffff8105eccf>] ? kthread+0x0/0x76
 [<ffffffff81012620>] ? child_rip+0x0/0x20
Sched Debug Version: v0.08, 2.6.29-0.53.rc2.git1.fc11.x86_64 #1
now at 8999693.829504 msecs
  .sysctl_sched_latency                    : 40.000000
  .sysctl_sched_min_granularity            : 8.000000
  .sysctl_sched_wakeup_granularity         : 10.000000
  .sysctl_sched_child_runs_first           : 0.000001
  .sysctl_sched_features                   : 24191

cpu#0, 2666.818 MHz
  .nr_running                    : 0
  .load                          : 0
  .nr_switches                   : 5337240
  .nr_load_updates               : 3406197
  .nr_uninterruptible            : -5686
  .jiffies                       : 4303666545
  .next_balance                  : 4303.666549
  .curr->pid                     : 0
  .clock                         : 8999757.002901
  .cpu_load[0]                   : 0
  .cpu_load[1]                   : 0
  .cpu_load[2]                   : 0
  .cpu_load[3]                   : 0
  .cpu_load[4]                   : 0
  .yld_exp_empty                 : 0
  .yld_act_empty                 : 0
  .yld_both_empty                : 0
  .yld_count                     : 693
  .sched_switch                  : 0
  .sched_count                   : 5466466
  .sched_goidle                  : 1234594
  .ttwu_count                    : 3148627
  .ttwu_local                    : 2728322
  .bkl_count                     : 50886

cfs_rq[0]:/
  .exec_clock                    : 2295440.884259
  .MIN_vruntime                  : 0.000001
  .min_vruntime                  : 42668359.480395
  .max_vruntime                  : 0.000001
  .spread                        : 0.000000
  .spread0                       : 0.000000
  .nr_running                    : 0
  .load                          : 0
  .nr_spread_over                : 336023
  .shares                        : 0

rt_rq[0]:/
  .rt_nr_running                 : 0
  .rt_throttled                  : 0
  .rt_time                       : 0.000000
  .rt_runtime                    : 950.000000

runnable tasks:
            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------

cpu#1, 2666.818 MHz
  .nr_running                    : 1
  .load                          : 1024
  .nr_switches                   : 4843840
  .nr_load_updates               : 3459888
  .nr_uninterruptible            : 5693
  .jiffies                       : 4303666545
  .next_balance                  : 4303.666801
  .curr->pid                     : 13252
  .clock                         : 8999249.133201
  .cpu_load[0]                   : 1024
  .cpu_load[1]                   : 561
  .cpu_load[2]                   : 448
  .cpu_load[3]                   : 422
  .cpu_load[4]                   : 476
  .yld_exp_empty                 : 0
  .yld_act_empty                 : 0
  .yld_both_empty                : 0
  .yld_count                     : 1603
  .sched_switch                  : 0
  .sched_count                   : 5039690
  .sched_goidle                  : 1429690
  .ttwu_count                    : 2758073
  .ttwu_local                    : 2199222
  .bkl_count                     : 38801

cfs_rq[1]:/
  .exec_clock                    : 1717848.884015
  .MIN_vruntime                  : 0.000001
  .min_vruntime                  : 50983710.080737
  .max_vruntime                  : 0.000001
  .spread                        : 0.000000
  .spread0                       : 8315350.600342
  .nr_running                    : 1
  .load                          : 1024
  .nr_spread_over                : 385373
  .shares                        : 0

rt_rq[1]:/
  .rt_nr_running                 : 0
  .rt_throttled                  : 0
  .rt_time                       : 0.000000
  .rt_runtime                    : 950.000000

runnable tasks:
            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
R           bash 13252  50983710.080737       245   120  50983710.080737       868.862431     66420.517493 /



Steps to reproduce:

Run livecd-creator from rawhide to create a desktop spin:

livecd-creator --config /usr/share/spin-kickstarts/fedora-livecd-desktop-en_US.ks --cache /mnt/test/livecd/cache --tmpdir /mnt/test/livecd/tmp

kickstart file modified to create image on ext4

haven't dug into this yet but wanted to capture it.
Comment 1 Eric Sandeen 2009-02-12 07:24:05 UTC
Finally had some time, and I've tracked this down to the page-writeback.c changes since .28 - backing them all out the testcase runs overnight.  Still working out which one and why.  I suppose it's still possible it's an ext4 bug but if so the page-writeback.c changes exposed it.

It's a classic deadlock; I usually have 2 threads stuck, pdflush vs. the livecd creator doing an fsync.  Each is waiting for a page the other has locked.

-Eric
Comment 2 Aneesh Kumar K.V 2009-02-12 08:37:03 UTC
Two commits which are already in the mainline which fixed some changes
after .28 are

89e1219004b3657cc014521663eeef0744f1c99d
dcf6a79dda5cc2a2bec183e50d829030c0972aaa


Both changes have implication wrt ext4. I would suggest try with the
above two commits and see if the problem still exist.

-aneesh
Comment 3 Eric Sandeen 2009-02-12 08:44:36 UTC
Aneesh, thanks.  I tried both of those last night, and the problem persists.
Comment 4 Theodore Tso 2009-02-12 10:48:52 UTC
>> Finally had some time, and I've tracked this down to the
>> page-writeback.c changes since .28 - backing them all out the
>> testcase runs overnight.  Still working out which one and why.  I
>> suppose it's still possible it's an ext4 bug but if so the
>> page-writeback.c changes exposed it.
>> 
>> It's a classic deadlock; I usually have 2 threads stuck, pdflush
>> vs. the livecd creator doing an fsync.  Each is waiting for a page
>> the other has locked.
>
> Two commits which are already in the mainline which fixed some changes
> after .28 are
> 
> 89e1219004b3657cc014521663eeef0744f1c99d
> dcf6a79dda5cc2a2bec183e50d829030c0972aaa
> 

I read Eric's email as saying he tracked it down to the
page-writeback.c changes *after* 2.6.28.  Eric, is that what you
meant?

	    	      	       	    	     - Ted
Comment 5 Eric Sandeen 2009-02-12 10:55:45 UTC
Yep.  the 2 patches Aneesh refers to are just hours hold, so they were fixes after I originally saw the hang and worth testing, but they don't seem to help.

I'm still trying some bisection of the changes (unfortunately many intermediate points seem nonfunctional) and may have finally found a spot; if this doesn't yield results though it's time for another crash-course in writeback.  ;)

-Eric
Comment 6 Eric Sandeen 2009-02-12 16:42:51 UTC
dJust for the record; here are the paths we're stuck on:

livecd:

sys_fdatasync
	do_fsync
		vfs_fsync
			filemap_fdatawrite
				__filemap_fdatawrite_range
					do_writepage
						ext4_da_writepages
							write_cache_pages
								lock page (currently locked by pdflush)

pdflush:

pdflush
	background_writeout
		writeback_inodes
			generic_sync_sb_inodes
				__writeback_single_inode
					do_writepages
						ext4_da_writepages
							write_cache_pages
								lock page (currently loscked by livecd)

the pdflush wbc looks like:

[1]kdb> wbc 0xffff88007d4abe10
struct writeback_control at  0xffff88007d4abe10
 bdi = (null) sync_mode 0 (NONE)
 nr_to_write = 992 pages_skipped = 0
 range_start 0, range_end 0
 nonblocking 1   congestion 0 kupdate 0   reclaim 0
  writepages 1 range_cyclic 1 more_io 0 no_update 1

and the livecd-creator fsync wbc looks like:

[1]kdb> wbc 0xffff880012821ea8
struct writeback_control at  0xffff880012821ea8
 bdi = (null) sync_mode 1 (ALL)
 nr_to_write = 9223372036854775689 pages_skipped = 0
 range_start 0, range_end 9223372036854775807
 nonblocking 0   congestion 0 kupdate 0   reclaim 0
  writepages 1 range_cyclic 0 more_io 0 no_update 1
Comment 7 Aneesh Kumar K.V 2009-02-13 03:50:47 UTC
How about the below patch ?

From: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Subject: [PATCH] ext4: Don't use the range_cylic mode implemented by write_cache_pages

With delayed allocation we lock the page in write_cache_pages and try to build
an in memory extent of contiguous blocks. This is needed so that we can get
large contiguous blocks request. Now with range_cyclic mode in write_cache_pages
if we have not done an I/O we loop back to 0 index and try to write the page.
That would imply we will attempt to take page lock of lower index page holding
the page lock of higher index page. This can cause a dead lock with other writeback
thread.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

---
 fs/ext4/inode.c |   20 ++++++++++++++++++--
 1 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 61e8fc0..f743524 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2437,6 +2437,7 @@ static int ext4_da_writepages(struct address_space *mapping,
 	int no_nrwrite_index_update;
 	int pages_written = 0;
 	long pages_skipped;
+	int range_cyclic = 0, cycled = 1, io_done = 0;
 	int needed_blocks, ret = 0, nr_to_writebump = 0;
 	struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
 
@@ -2488,9 +2489,14 @@ static int ext4_da_writepages(struct address_space *mapping,
 	if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
 		range_whole = 1;
 
-	if (wbc->range_cyclic)
+	if (wbc->range_cyclic) {
 		index = mapping->writeback_index;
-	else
+		wbc->range_start = index << PAGE_CACHE_SHIFT;
+		wbc->range_end  = LLONG_MAX;
+		wbc->range_cyclic = 0;
+		range_cyclic = 1;
+		cycled = 0;
+	} else
 		index = wbc->range_start >> PAGE_CACHE_SHIFT;
 
 	mpd.wbc = wbc;
@@ -2504,6 +2510,7 @@ static int ext4_da_writepages(struct address_space *mapping,
 	wbc->no_nrwrite_index_update = 1;
 	pages_skipped = wbc->pages_skipped;
 
+retry:
 	while (!ret && wbc->nr_to_write > 0) {
 
 		/*
@@ -2546,6 +2553,7 @@ static int ext4_da_writepages(struct address_space *mapping,
 			pages_written += mpd.pages_written;
 			wbc->pages_skipped = pages_skipped;
 			ret = 0;
+			io_done = 1;
 		} else if (wbc->nr_to_write)
 			/*
 			 * There is no more writeout needed
@@ -2554,6 +2562,13 @@ static int ext4_da_writepages(struct address_space *mapping,
 			 */
 			break;
 	}
+	if (!io_done && !cycled) {
+		cycled = 1;
+		index = 0;
+		wbc->range_start = index << PAGE_CACHE_SHIFT;
+		wbc->range_end  = mapping->writeback_index - 1;
+		goto retry;
+	}
 	if (pages_skipped != wbc->pages_skipped)
 		printk(KERN_EMERG "This should not happen leaving %s "
 				"with nr_to_write = %ld ret = %d\n",
@@ -2561,6 +2576,7 @@ static int ext4_da_writepages(struct address_space *mapping,
 
 	/* Update index */
 	index += pages_written;
+	wbc->range_cyclic = range_cyclic;
 	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
 		/*
 		 * set the writeback_index so that range_cyclic
Comment 8 Eric Sandeen 2009-02-13 14:06:06 UTC
Created attachment 20238 [details]
Tested patch

Patch from Aneesh, un-whitespace-mangled.

Ted, can you push this out?  Works great.  :)  We might want to ask the other reporter of something similar (next-20090206: deadlock on ext4) to test it too.  I'll ping him.
Comment 9 Eric Sandeen 2009-02-13 14:22:33 UTC
Created attachment 20239 [details]
updated patch w/ my signoffs

Updated the patch w/ my tested/reviewed by.

I'd like to think about whether we can make write_cache_pages smarter and not need to sort of recreate bits of it in the ext4 code, but for now this solves my problem, at least.
Comment 10 Theodore Tso 2009-02-13 18:07:28 UTC
> Patch from Aneesh, un-whitespace-mangled.
> 
> Ted, can you push this out?  Works great.  :) We might want to ask
> the other reporter of something similar (next-20090206: deadlock on
> ext4) to test it too.  I'll ping him.

Do we completely understand the root cause, in terms of which commit
broken the mm/page-writeback.c code we were depending on?  And if so,
what of the code in mm/page-writeback.c?  Does anyone else use it?
Can anyone sanely use it?

And am I right in assuming that this only applies to 2.6.29-rcX
kernels, and is not needed for 2.6.28 or earlier kernels?

I hadn't yet pushed it out because I needed time to understand all of
these issues, hence these questions....

						- Ted
Comment 11 Eric Sandeen 2009-02-13 19:52:11 UTC
Ted, all good questions.  :)  If you think we need to digest this a bit that's fine, though I'd rather this go in than not at all, before 2.6.29 gets out the door.  But I think we still have some time.

The problem has not been seen on 2.6.28 kernels; I can't right now point for sure to one of Nick's patch series that changed the behavior; in fact many points in the series seemed to not work at all; even building kernels on other filesystems got me hung up.  Didn't inspire tons of confidence.  :)

You're probably right though, and echoing my concern in the previous comment, that we might need to look at this holistically and decide if this exact change to ext4 is the right path or if the page-writeback infrastructure needs a tweak...

-Eric
Comment 12 Aneesh Kumar K.V 2009-02-14 00:07:33 UTC
On Fri, Feb 13, 2009 at 02:06:06PM -0800, bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12579
> 
> 
> 
> 
> 
> ------- Comment #8 from sandeen@redhat.com  2009-02-13 14:06 -------
> Created an attachment (id=20238)
>  --> (http://bugzilla.kernel.org/attachment.cgi?id=20238&action=view)
> Tested patch
> 
> Patch from Aneesh, un-whitespace-mangled.
> 
> Ted, can you push this out?  Works great.  :)  We might want to ask the other
> reporter of something similar (next-20090206: deadlock on ext4) to test it
> too.
>  I'll ping him.
> 

I think this small changes is also needed.

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f743524..c80e038 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2495,7 +2495,10 @@ static int ext4_da_writepages(struct address_space *mapping,
 		wbc->range_end  = LLONG_MAX;
 		wbc->range_cyclic = 0;
 		range_cyclic = 1;
-		cycled = 0;
+		if (index == 0)
+			cycled = 1;
+		else
+			cycled = 0;
 	} else
 		index = wbc->range_start >> PAGE_CACHE_SHIFT;
 
Comment 13 Aneesh Kumar K.V 2009-02-14 00:40:29 UTC
On Fri, Feb 13, 2009 at 08:50:18PM -0500, Theodore Tso wrote:
> > Patch from Aneesh, un-whitespace-mangled.
> > 
> > Ted, can you push this out?  Works great.  :) We might want to ask
> > the other reporter of something similar (next-20090206: deadlock on
> > ext4) to test it too.  I'll ping him.
> 
> Do we completely understand the root cause, in terms of which commit
> broken the mm/page-writeback.c code we were depending on?  And if so,
> what of the code in mm/page-writeback.c?  Does anyone else use it?
> Can anyone sanely use it?

AFAIU we need the changes even for older kernels. The
reasoning is, with delayed allocation we cannot allow to retry with lower
page index in write_cache_pages. We do retry even in older version of
kernel. What made it so easy to reproduce it on later kernels is that
we were doing a retry even if nr_to_write was zero. This got fixed on
mainline by 3a4c6800f31ea8395628af5e7e490270ee5d0585. So with that
change we are logically back to 2.6.28 state, But still the possibility
of deadlock remain.

> 
> And am I right in assuming that this only applies to 2.6.29-rcX
> kernels, and is not needed for 2.6.28 or earlier kernels?

I guess the hang can happen on 2.6.28 or earlier kernels.

> 
> I hadn't yet pushed it out because I needed time to understand all of
> these issues, hence these questions....

-aneesh
Comment 14 Aneesh Kumar K.V 2009-02-14 01:42:10 UTC
On Sat, Feb 14, 2009 at 02:10:04PM +0530, Aneesh Kumar K.V wrote:
> On Fri, Feb 13, 2009 at 08:50:18PM -0500, Theodore Tso wrote:
> > > Patch from Aneesh, un-whitespace-mangled.
> > > 
> > > Ted, can you push this out?  Works great.  :) We might want to ask
> > > the other reporter of something similar (next-20090206: deadlock on
> > > ext4) to test it too.  I'll ping him.
> > 
> > Do we completely understand the root cause, in terms of which commit
> > broken the mm/page-writeback.c code we were depending on?  And if so,
> > what of the code in mm/page-writeback.c?  Does anyone else use it?
> > Can anyone sanely use it?
> 
> AFAIU we need the changes even for older kernels. The
> reasoning is, with delayed allocation we cannot allow to retry with lower
> page index in write_cache_pages. We do retry even in older version of
> kernel. What made it so easy to reproduce it on later kernels is that
> we were doing a retry even if nr_to_write was zero. This got fixed on
> mainline by 3a4c6800f31ea8395628af5e7e490270ee5d0585. So with that
> change we are logically back to 2.6.28 state, But still the possibility
> of deadlock remain.
> 

I found commit 31a12666d8f0c22235297e1c1575f82061480029 to be the root
cause. The commit is correct in what it does. Ext4 was dependent on the
wrong behaviour. The relevant change is 

@@ -897,7 +903,6 @@ retry:
                                              min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
                unsigned i;
 
-               scanned = 1;
                for (i = 0; i < nr_pages; i++) {


I think that caused us the retry. That would imply we may not need the
patch I did for 2.6.28. But given that Ext4 was dependent on the wrong
behaviour of write_cache_pages i would suggest we still push the patch
to 2.6.28

-aneesh
Comment 15 Federico Cuello 2009-02-19 18:31:48 UTC
There is another bug reported about the deadlock ( http://bugzilla.kernel.org/show_bug.cgi?id=12610 ) happening on 2.6.28.2 and later.
Comment 16 Theodore Tso 2009-02-21 09:35:16 UTC
This patch has been merged into mainline.  I will work on getting this pushed into a stable kernel.
Comment 17 Jayson R. King 2010-02-27 18:28:56 UTC
I sent mail a few hours ago about this bug occurring on a 2.6.27.y kernel. The patch which fixed the bug on mainline also fixes 2.6.27.y, but it depends on several other patches being merged in before it (mostly other ext4 fixes).

Note You need to log in before you can comment on or make changes to this bug.