Latest working kernel version: Earliest failing kernel version: 2.6.29 rc2 git1 Distribution: Rawhide Hardware Environment: x86_64 Software Environment: Rawhide Problem Description: Deadlock/hang: SysRq : Show Blocked State task PC stack pid father livecd-creato D ffff880001869f30 3096 11953 3276 ffff880064cc1b18 0000000000000046 0000000000000001 0000000000000002 ffffffff818fe700 ffffffff818fe700 ffff8800836cf000 ffff880010f28000 ffff88006389a390 ffff880010f283b8 0000000000000000 0000000000000046 Call Trace: [<ffffffff81016ec0>] ? read_tsc+0xe/0x24 [<ffffffff81066749>] ? getnstimeofday+0x58/0xb4 [<ffffffff81380f70>] io_schedule+0x63/0xa5 [<ffffffff810a7029>] sync_page+0x49/0x4d [<ffffffff8138136a>] __wait_on_bit_lock+0x45/0x79 [<ffffffff810a6fe0>] ? sync_page+0x0/0x4d [<ffffffff810a6f93>] __lock_page+0x63/0x6a [<ffffffff8105f0c8>] ? wake_bit_function+0x0/0x2a [<ffffffff810ae25b>] write_cache_pages+0x1fc/0x3a4 [<ffffffffa0020231>] ? __mpage_da_writepage+0x0/0x163 [ext4] [<ffffffffa0005329>] ? jbd2_journal_start+0xef/0x119 [jbd2] [<ffffffffa0005329>] ? jbd2_journal_start+0xef/0x119 [jbd2] [<ffffffffa001ffe1>] ext4_da_writepages+0x2c0/0x457 [ext4] [<ffffffff8138164c>] ? mutex_unlock+0x9/0xb [<ffffffff810a8b01>] ? generic_file_aio_write+0x74/0xc5 [<ffffffff8106ebcb>] ? print_lock_contention_bug+0x1e/0x110 [<ffffffffa0021310>] ? ext4_da_get_block_write+0x0/0x130 [ext4] [<ffffffff810ae456>] do_writepages+0x2b/0x3b [<ffffffff810a7b3f>] __filemap_fdatawrite_range+0x51/0x59 [<ffffffff810a889a>] filemap_fdatawrite+0x1a/0x1c [<ffffffff810fbbe5>] vfs_fsync+0x53/0xac [<ffffffff810fbc70>] do_fsync+0x32/0x48 [<ffffffff810fbc94>] sys_fdatasync+0xe/0x12 [<ffffffff810112ba>] system_call_fastpath+0x16/0x1b kjournald2 D ffff88000c475e28 5040 11969 2 ffff88000c475d40 0000000000000046 0000000000000000 0000000000000046 ffffffff818fe700 ffffffff818fe700 0000000000000000 ffff880014882390 ffff88007eb88000 ffff880014882748 000000000c475ce0 0000000000000046 Call Trace: [<ffffffff8105f215>] ? prepare_to_wait+0x1c/0x5e [<ffffffff81383055>] ? _spin_unlock_irqrestore+0x47/0x57 [<ffffffff8106d609>] ? trace_hardirqs_on_caller+0x1f/0x153 [<ffffffff8119a55f>] ? _raw_spin_unlock+0x8e/0x93 [<ffffffffa00057e8>] jbd2_journal_commit_transaction+0x25c/0x159c [jbd2] [<ffffffff81054bf5>] ? lock_timer_base+0x26/0x4a [<ffffffff81383055>] ? _spin_unlock_irqrestore+0x47/0x57 [<ffffffff8106d609>] ? trace_hardirqs_on_caller+0x1f/0x153 [<ffffffff8105f090>] ? autoremove_wake_function+0x0/0x38 [<ffffffff81054c67>] ? try_to_del_timer_sync+0x4e/0x5c [<ffffffffa000a68c>] kjournald2+0x172/0x3a8 [jbd2] [<ffffffff8105f090>] ? autoremove_wake_function+0x0/0x38 [<ffffffffa000a51a>] ? kjournald2+0x0/0x3a8 [jbd2] [<ffffffff8105ed18>] kthread+0x49/0x76 [<ffffffff8101262a>] child_rip+0xa/0x20 [<ffffffff81383002>] ? _spin_unlock_irq+0x2b/0x37 [<ffffffff8106d74a>] ? trace_hardirqs_on+0xd/0xf [<ffffffff81011f3e>] ? restore_args+0x0/0x30 [<ffffffff8105ecaa>] ? kthreadd+0x176/0x19b [<ffffffff8105eccf>] ? kthread+0x0/0x76 [<ffffffff81012620>] ? child_rip+0x0/0x20 pdflush D ffff880001878350 992 12227 2 ffff880050975960 0000000000000046 0000000000000000 ffff8800724d8ea0 ffffffff818fe700 ffffffff818fe700 ffff8800509758e0 ffff880076c48000 ffffffff815334f0 ffff880076c483b8 0000000000000046 0000000000000046 Call Trace: [<ffffffff8105f173>] ? prepare_to_wait_exclusive+0x1c/0x5e [<ffffffff81016ec0>] ? read_tsc+0xe/0x24 [<ffffffff81066749>] ? 
getnstimeofday+0x58/0xb4 [<ffffffff81380f70>] io_schedule+0x63/0xa5 [<ffffffff810a7029>] sync_page+0x49/0x4d [<ffffffff8138136a>] __wait_on_bit_lock+0x45/0x79 [<ffffffff810a6fe0>] ? sync_page+0x0/0x4d [<ffffffff810a6f93>] __lock_page+0x63/0x6a [<ffffffff8105f0c8>] ? wake_bit_function+0x0/0x2a [<ffffffff810ae25b>] write_cache_pages+0x1fc/0x3a4 [<ffffffffa0020231>] ? __mpage_da_writepage+0x0/0x163 [ext4] [<ffffffffa0005329>] ? jbd2_journal_start+0xef/0x119 [jbd2] [<ffffffffa001ffe1>] ext4_da_writepages+0x2c0/0x457 [ext4] [<ffffffff810ae375>] ? write_cache_pages+0x316/0x3a4 [<ffffffff8106ebcb>] ? print_lock_contention_bug+0x1e/0x110 [<ffffffff8103b23a>] ? account_group_exec_runtime+0x2c/0x44 [<ffffffff8106ebcb>] ? print_lock_contention_bug+0x1e/0x110 [<ffffffffa0021310>] ? ext4_da_get_block_write+0x0/0x130 [ext4] [<ffffffff810ae456>] do_writepages+0x2b/0x3b [<ffffffff810f893e>] __writeback_single_inode+0x187/0x2fc [<ffffffff810f8ee0>] ? generic_sync_sb_inodes+0x275/0x362 [<ffffffff810f8e86>] generic_sync_sb_inodes+0x21b/0x362 [<ffffffff810f91c3>] writeback_inodes+0xa2/0xfa [<ffffffff810aeaaf>] background_writeout+0x92/0xcb [<ffffffff810af0eb>] pdflush+0x14e/0x201 [<ffffffff810aea1d>] ? background_writeout+0x0/0xcb [<ffffffff810aef9d>] ? pdflush+0x0/0x201 [<ffffffff8105ed18>] kthread+0x49/0x76 [<ffffffff8101262a>] child_rip+0xa/0x20 [<ffffffff81383002>] ? _spin_unlock_irq+0x2b/0x37 [<ffffffff8106d74a>] ? trace_hardirqs_on+0xd/0xf [<ffffffff81011f3e>] ? restore_args+0x0/0x30 [<ffffffff8105ecaa>] ? kthreadd+0x176/0x19b [<ffffffff8105eccf>] ? kthread+0x0/0x76 [<ffffffff81012620>] ? child_rip+0x0/0x20 Sched Debug Version: v0.08, 2.6.29-0.53.rc2.git1.fc11.x86_64 #1 now at 8999693.829504 msecs .sysctl_sched_latency : 40.000000 .sysctl_sched_min_granularity : 8.000000 .sysctl_sched_wakeup_granularity : 10.000000 .sysctl_sched_child_runs_first : 0.000001 .sysctl_sched_features : 24191 cpu#0, 2666.818 MHz .nr_running : 0 .load : 0 .nr_switches : 5337240 .nr_load_updates : 3406197 .nr_uninterruptible : -5686 .jiffies : 4303666545 .next_balance : 4303.666549 .curr->pid : 0 .clock : 8999757.002901 .cpu_load[0] : 0 .cpu_load[1] : 0 .cpu_load[2] : 0 .cpu_load[3] : 0 .cpu_load[4] : 0 .yld_exp_empty : 0 .yld_act_empty : 0 .yld_both_empty : 0 .yld_count : 693 .sched_switch : 0 .sched_count : 5466466 .sched_goidle : 1234594 .ttwu_count : 3148627 .ttwu_local : 2728322 .bkl_count : 50886 cfs_rq[0]:/ .exec_clock : 2295440.884259 .MIN_vruntime : 0.000001 .min_vruntime : 42668359.480395 .max_vruntime : 0.000001 .spread : 0.000000 .spread0 : 0.000000 .nr_running : 0 .load : 0 .nr_spread_over : 336023 .shares : 0 rt_rq[0]:/ .rt_nr_running : 0 .rt_throttled : 0 .rt_time : 0.000000 .rt_runtime : 950.000000 runnable tasks: task PID tree-key switches prio exec-runtime sum-exec sum-sleep ---------------------------------------------------------------------------------------------------------- cpu#1, 2666.818 MHz .nr_running : 1 .load : 1024 .nr_switches : 4843840 .nr_load_updates : 3459888 .nr_uninterruptible : 5693 .jiffies : 4303666545 .next_balance : 4303.666801 .curr->pid : 13252 .clock : 8999249.133201 .cpu_load[0] : 1024 .cpu_load[1] : 561 .cpu_load[2] : 448 .cpu_load[3] : 422 .cpu_load[4] : 476 .yld_exp_empty : 0 .yld_act_empty : 0 .yld_both_empty : 0 .yld_count : 1603 .sched_switch : 0 .sched_count : 5039690 .sched_goidle : 1429690 .ttwu_count : 2758073 .ttwu_local : 2199222 .bkl_count : 38801 cfs_rq[1]:/ .exec_clock : 1717848.884015 .MIN_vruntime : 0.000001 .min_vruntime : 50983710.080737 .max_vruntime : 0.000001 
.spread : 0.000000 .spread0 : 8315350.600342 .nr_running : 1 .load : 1024 .nr_spread_over : 385373 .shares : 0 rt_rq[1]:/ .rt_nr_running : 0 .rt_throttled : 0 .rt_time : 0.000000 .rt_runtime : 950.000000
runnable tasks:
 task PID tree-key switches prio exec-runtime sum-exec sum-sleep
----------------------------------------------------------------------------------------------------------
R bash 13252 50983710.080737 245 120 50983710.080737 868.862431 66420.517493 /

Steps to reproduce:
Run livecd-creator from Rawhide to create a desktop spin:

livecd-creator --config /usr/share/spin-kickstarts/fedora-livecd-desktop-en_US.ks --cache /mnt/test/livecd/cache --tmpdir /mnt/test/livecd/tmp

The kickstart file was modified to create the image on ext4. I haven't dug into this yet but wanted to capture it.
Finally had some time, and I've tracked this down to the page-writeback.c changes since .28; after backing them all out, the testcase runs overnight. Still working out which one and why. I suppose it's still possible it's an ext4 bug, but if so, the page-writeback.c changes exposed it. It's a classic deadlock; I usually have two threads stuck, pdflush vs. the livecd creator doing an fsync, each waiting for a page the other has locked. -Eric
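The shape of the hang is the classic ABBA lock inversion. Purely as an illustration (not from the bug report), here is the same pattern as a self-contained userspace sketch, with two pthread mutexes standing in for the two page locks; everything here is made up for the example and is not kernel code:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-ins for the locks of a lower-index and a higher-index page. */
static pthread_mutex_t page_lo = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t page_hi = PTHREAD_MUTEX_INITIALIZER;

/* "pdflush": takes the low page first, then wants the high page. */
static void *flusher(void *arg)
{
	pthread_mutex_lock(&page_lo);
	sleep(1);			/* widen the race window */
	pthread_mutex_lock(&page_hi);	/* blocks forever once fsyncer holds page_hi */
	pthread_mutex_unlock(&page_hi);
	pthread_mutex_unlock(&page_lo);
	return NULL;
}

/* "fsync caller": takes the high page first, then wants the low page. */
static void *fsyncer(void *arg)
{
	pthread_mutex_lock(&page_hi);
	sleep(1);
	pthread_mutex_lock(&page_lo);	/* blocks forever once flusher holds page_lo */
	pthread_mutex_unlock(&page_lo);
	pthread_mutex_unlock(&page_hi);
	return NULL;
}

int main(void)
{
	pthread_t a, b;
	pthread_create(&a, NULL, flusher, NULL);
	pthread_create(&b, NULL, fsyncer, NULL);
	pthread_join(a, NULL);		/* never returns: both threads are stuck */
	pthread_join(b, NULL);
	puts("not reached");
	return 0;
}

Neither thread can make progress because each holds the lock the other needs; that is exactly the pdflush-vs-fsync situation captured in the traces above, just with page locks instead of mutexes.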
Two commits already in mainline fix some of the changes made after .28:

89e1219004b3657cc014521663eeef0744f1c99d
dcf6a79dda5cc2a2bec183e50d829030c0972aaa

Both changes have implications for ext4. I would suggest trying with the above two commits and seeing if the problem still exists. -aneesh
Aneesh, thanks. I tried both of those last night, and the problem persists.
>> Finally had some time, and I've tracked this down to the
>> page-writeback.c changes since .28 - backing them all out the
>> testcase runs overnight. Still working out which one and why. I
>> suppose it's still possible it's an ext4 bug but if so the
>> page-writeback.c changes exposed it.
>>
>> It's a classic deadlock; I usually have 2 threads stuck, pdflush
>> vs. the livecd creator doing an fsync. Each is waiting for a page
>> the other has locked.
>
> Two commits which are already in the mainline which fixed some changes
> after .28 are
>
> 89e1219004b3657cc014521663eeef0744f1c99d
> dcf6a79dda5cc2a2bec183e50d829030c0972aaa
>

I read Eric's email as saying he tracked it down to the page-writeback.c changes *after* 2.6.28. Eric, is that what you meant?

- Ted
Yep. The two patches Aneesh refers to are just hours old, so they were fixes made after I originally saw the hang and worth testing, but they don't seem to help. I'm still trying some bisection of the changes (unfortunately many intermediate points seem nonfunctional) and may have finally found a spot; if this doesn't yield results, though, it's time for another crash course in writeback. ;) -Eric
Just for the record, here are the paths we're stuck on:

livecd:
  sys_fdatasync
  do_fsync
  vfs_fsync
  filemap_fdatawrite
  __filemap_fdatawrite_range
  do_writepages
  ext4_da_writepages
  write_cache_pages
  lock page (currently locked by pdflush)

pdflush:
  pdflush
  background_writeout
  writeback_inodes
  generic_sync_sb_inodes
  __writeback_single_inode
  do_writepages
  ext4_da_writepages
  write_cache_pages
  lock page (currently locked by livecd)

The pdflush wbc looks like:

[1]kdb> wbc 0xffff88007d4abe10
struct writeback_control at 0xffff88007d4abe10
bdi = (null)  sync_mode 0 (NONE)  nr_to_write = 992  pages_skipped = 0
range_start 0, range_end 0
nonblocking 1  congestion 0  kupdate 0  reclaim 0  writepages 1
range_cyclic 1  more_io 0  no_update 1

and the livecd-creator fsync wbc looks like:

[1]kdb> wbc 0xffff880012821ea8
struct writeback_control at 0xffff880012821ea8
bdi = (null)  sync_mode 1 (ALL)  nr_to_write = 9223372036854775689  pages_skipped = 0
range_start 0, range_end 9223372036854775807
nonblocking 0  congestion 0  kupdate 0  reclaim 0  writepages 1
range_cyclic 0  more_io 0  no_update 1
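For context, a rough, C-like sketch of the write_cache_pages() loop both threads are sitting in (simplified and approximate, not the exact 2.6.28/2.6.29 source). With delayed allocation, ext4's writepage callback keeps pages locked while it accumulates a contiguous extent, so the range_cyclic wrap back to index 0 can end up asking for a lower-index page lock while higher-index page locks are already held:

	/* Approximate sketch of write_cache_pages(), range_cyclic path. */
	if (wbc->range_cyclic) {
		index = mapping->writeback_index;	/* resume where the last pass stopped */
		end = -1;				/* ...through the end of the file */
	} else {
		index = wbc->range_start >> PAGE_CACHE_SHIFT;
		end = wbc->range_end >> PAGE_CACHE_SHIFT;
	}
retry:
	while (!done && index <= end) {
		/* look up the next batch of dirty pages at or above index */
		for (i = 0; i < nr_pages; i++) {
			lock_page(page);		/* sleeps if another thread holds it */
			ret = (*writepage)(page, wbc, data);
			/* for ext4 delalloc this is __mpage_da_writepage(),
			 * which gathers locked pages into an in-memory extent */
		}
	}
	if (wbc->range_cyclic && !done) {
		/* wrap back to the start of the file and scan the rest;
		 * a thread already holding higher-index page locks can
		 * now block on a lower-index page lock */
		index = 0;
		goto retry;
	}

The exact condition guarding that wrap is what changed between 2.6.28 and 2.6.29-rc; see the discussion of commit 31a12666 further down.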
How about the patch below?

From: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Subject: [PATCH] ext4: Don't use the range_cyclic mode implemented by write_cache_pages

With delayed allocation we lock the page in write_cache_pages and try
to build an in-memory extent of contiguous blocks. This is needed so
that we can get large contiguous block requests. Now with range_cyclic
mode in write_cache_pages, if we have not done any I/O we loop back to
index 0 and try to write the page. That implies we will attempt to take
the page lock of a lower-index page while holding the page lock of a
higher-index page. This can cause a deadlock with another writeback
thread.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/ext4/inode.c |   20 ++++++++++++++++++--
 1 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 61e8fc0..f743524 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2437,6 +2437,7 @@ static int ext4_da_writepages(struct address_space *mapping,
 	int no_nrwrite_index_update;
 	int pages_written = 0;
 	long pages_skipped;
+	int range_cyclic = 0, cycled = 1, io_done = 0;
 	int needed_blocks, ret = 0, nr_to_writebump = 0;
 	struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
 
@@ -2488,9 +2489,14 @@ static int ext4_da_writepages(struct address_space *mapping,
 	if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
 		range_whole = 1;
 
-	if (wbc->range_cyclic)
+	if (wbc->range_cyclic) {
 		index = mapping->writeback_index;
-	else
+		wbc->range_start = index << PAGE_CACHE_SHIFT;
+		wbc->range_end = LLONG_MAX;
+		wbc->range_cyclic = 0;
+		range_cyclic = 1;
+		cycled = 0;
+	} else
 		index = wbc->range_start >> PAGE_CACHE_SHIFT;
 
 	mpd.wbc = wbc;
@@ -2504,6 +2510,7 @@ static int ext4_da_writepages(struct address_space *mapping,
 	wbc->no_nrwrite_index_update = 1;
 	pages_skipped = wbc->pages_skipped;
 
+retry:
 	while (!ret && wbc->nr_to_write > 0) {
 
 		/*
@@ -2546,6 +2553,7 @@ static int ext4_da_writepages(struct address_space *mapping,
 			pages_written += mpd.pages_written;
 			wbc->pages_skipped = pages_skipped;
 			ret = 0;
+			io_done = 1;
 		} else if (wbc->nr_to_write)
 			/*
 			 * There is no more writeout needed
@@ -2554,6 +2562,13 @@ static int ext4_da_writepages(struct address_space *mapping,
 			 */
 			break;
 	}
+	if (!io_done && !cycled) {
+		cycled = 1;
+		index = 0;
+		wbc->range_start = index << PAGE_CACHE_SHIFT;
+		wbc->range_end = mapping->writeback_index - 1;
+		goto retry;
+	}
 	if (pages_skipped != wbc->pages_skipped)
 		printk(KERN_EMERG "This should not happen leaving %s "
 				"with nr_to_write = %ld ret = %d\n",
@@ -2561,6 +2576,7 @@ static int ext4_da_writepages(struct address_space *mapping,
 
 	/* Update index */
 	index += pages_written;
+	wbc->range_cyclic = range_cyclic;
 	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
 		/*
 		 * set the writeback_index so that range_cyclic
Created attachment 20238 [details]
Tested patch

Patch from Aneesh, un-whitespace-mangled.

Ted, can you push this out? Works great. :) We might want to ask the other reporter of something similar (next-20090206: deadlock on ext4) to test it too. I'll ping him.
Created attachment 20239 [details]
updated patch w/ my signoffs

Updated the patch with my Tested-by/Reviewed-by. I'd like to think about whether we can make write_cache_pages smarter so we don't need to recreate bits of it in the ext4 code, but for now this solves my problem, at least.
> Patch from Aneesh, un-whitespace-mangled.
>
> Ted, can you push this out? Works great. :) We might want to ask
> the other reporter of something similar (next-20090206: deadlock on
> ext4) to test it too. I'll ping him.

Do we completely understand the root cause, in terms of which commit broke the mm/page-writeback.c code we were depending on? And if so, what of the code in mm/page-writeback.c? Does anyone else use it? Can anyone sanely use it?

And am I right in assuming that this only applies to 2.6.29-rcX kernels, and is not needed for 2.6.28 or earlier kernels?

I hadn't yet pushed it out because I needed time to understand all of these issues, hence these questions....

- Ted
Ted, all good questions. :) If you think we need to digest this a bit, that's fine, though I'd rather this go in before 2.6.29 gets out the door than not at all. I think we still have some time. The problem has not been seen on 2.6.28 kernels. I can't right now point for sure to the patch in Nick's series that changed the behavior; in fact, many intermediate points in the series seemed not to work at all, and even building kernels on other filesystems got me hung up. Didn't inspire tons of confidence. :) You're probably right, though, and to echo my concern in the previous comment: we might need to look at this holistically and decide whether this exact change to ext4 is the right path or whether the page-writeback infrastructure needs a tweak... -Eric
On Fri, Feb 13, 2009 at 02:06:06PM -0800, bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12579
>
>
> ------- Comment #8 from sandeen@redhat.com 2009-02-13 14:06 -------
> Created an attachment (id=20238)
> --> (http://bugzilla.kernel.org/attachment.cgi?id=20238&action=view)
> Tested patch
>
> Patch from Aneesh, un-whitespace-mangled.
>
> Ted, can you push this out? Works great. :) We might want to ask the other
> reporter of something similar (next-20090206: deadlock on ext4) to test it
> too.
> I'll ping him.
>

I think this small change is also needed.

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f743524..c80e038 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2495,7 +2495,10 @@ static int ext4_da_writepages(struct address_space *mapping,
 		wbc->range_end = LLONG_MAX;
 		wbc->range_cyclic = 0;
 		range_cyclic = 1;
-		cycled = 0;
+		if (index == 0)
+			cycled = 1;
+		else
+			cycled = 0;
 	} else
 		index = wbc->range_start >> PAGE_CACHE_SHIFT;
On Fri, Feb 13, 2009 at 08:50:18PM -0500, Theodore Tso wrote:
> > Patch from Aneesh, un-whitespace-mangled.
> >
> > Ted, can you push this out? Works great. :) We might want to ask
> > the other reporter of something similar (next-20090206: deadlock on
> > ext4) to test it too. I'll ping him.
>
> Do we completely understand the root cause, in terms of which commit
> broke the mm/page-writeback.c code we were depending on? And if so,
> what of the code in mm/page-writeback.c? Does anyone else use it?
> Can anyone sanely use it?

AFAIU we need the changes even for older kernels. The reasoning is that with delayed allocation we cannot allow a retry with a lower page index in write_cache_pages, and we do retry even in older versions of the kernel. What made it so easy to reproduce on later kernels is that we were doing a retry even if nr_to_write was zero. This got fixed in mainline by 3a4c6800f31ea8395628af5e7e490270ee5d0585. With that change we are logically back to the 2.6.28 state, but the possibility of deadlock remains.

> And am I right in assuming that this only applies to 2.6.29-rcX
> kernels, and is not needed for 2.6.28 or earlier kernels?

I guess the hang can happen on 2.6.28 or earlier kernels.

> I hadn't yet pushed it out because I needed time to understand all of
> these issues, hence these questions....

-aneesh
On Sat, Feb 14, 2009 at 02:10:04PM +0530, Aneesh Kumar K.V wrote:
> On Fri, Feb 13, 2009 at 08:50:18PM -0500, Theodore Tso wrote:
> > > Patch from Aneesh, un-whitespace-mangled.
> > >
> > > Ted, can you push this out? Works great. :) We might want to ask
> > > the other reporter of something similar (next-20090206: deadlock on
> > > ext4) to test it too. I'll ping him.
> >
> > Do we completely understand the root cause, in terms of which commit
> > broke the mm/page-writeback.c code we were depending on? And if so,
> > what of the code in mm/page-writeback.c? Does anyone else use it?
> > Can anyone sanely use it?
>
> AFAIU we need the changes even for older kernels. The reasoning is,
> with delayed allocation we cannot allow to retry with lower page index
> in write_cache_pages. We do retry even in older version of kernel.
> What made it so easy to reproduce it on later kernels is that we were
> doing a retry even if nr_to_write was zero. This got fixed on mainline
> by 3a4c6800f31ea8395628af5e7e490270ee5d0585. So with that change we are
> logically back to 2.6.28 state, But still the possibility of deadlock
> remain.
>

I found commit 31a12666d8f0c22235297e1c1575f82061480029 to be the root cause. The commit is correct in what it does; ext4 was dependent on the wrong behaviour. The relevant change is:

@@ -897,7 +903,6 @@ retry:
 			       min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
 		unsigned i;
 
-		scanned = 1;
 		for (i = 0; i < nr_pages; i++) {

I think that is what caused the retry for us. That would imply we may not need the patch for 2.6.28. But given that ext4 was dependent on the wrong behaviour of write_cache_pages, I would suggest we still push the patch to 2.6.28.

-aneesh
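For reference, the behavioural difference (paraphrased and approximate, not the literal kernel source): before 31a12666 the wrap back to index 0 happened only when the initial scan from writeback_index found no dirty pages at all; after it, a cycled flag drives the wrap, so a cyclic pass that starts at a nonzero index wraps back to 0 whenever work remains, even if it already found and wrote pages. That makes the "lock a lower-index page while holding higher-index pages" pattern in ext4's delalloc path much easier to hit:

	/* Before 31a12666 (approximate sketch): */
	scanned = 0;
retry:
	while (!done && index <= end &&
	       (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
			PAGECACHE_TAG_DIRTY, ...))) {
		scanned = 1;		/* the line the commit removes */
		...
	}
	if (!scanned && !done) {	/* wrap only if nothing was found */
		scanned = 1;
		index = 0;
		goto retry;
	}

	/* After 31a12666 (approximate sketch): */
	if (wbc->range_cyclic) {
		writeback_index = mapping->writeback_index;
		index = writeback_index;
		cycled = (index == 0);
		end = -1;
	}
retry:
	while (!done && index <= end &&
	       (nr_pages = pagevec_lookup_tag(...))) {
		...
	}
	if (!cycled && !done) {		/* wrap whenever the pass did not start at 0 */
		cycled = 1;
		index = 0;
		end = writeback_index - 1;
		goto retry;
	}

This is also why Aneesh's ext4 patch above mirrors the new cycled logic inside ext4_da_writepages itself: each pass walks page indexes in ascending order, the idea being that by the time ext4 retries from index 0 it no longer holds page locks from the first pass.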
There is another report of this deadlock (http://bugzilla.kernel.org/show_bug.cgi?id=12610) happening on 2.6.28.2 and later.
This patch has been merged into mainline. I will work on getting this pushed into a stable kernel.
I sent mail a few hours ago about this bug occurring on a 2.6.27.y kernel. The patch which fixed the bug on mainline also fixes 2.6.27.y, but it depends on several other patches being merged in before it (mostly other ext4 fixes).