Bug 13232
Summary: | ext3/4 with synchronous writes gets wedged by Postfix | ||
---|---|---|---|
Product: | File System | Reporter: | David Watson (kernel-nospam) |
Component: | ext3 | Assignee: | fs_ext3 (fs_ext3) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | fgiunchedi, macli, ole, rjw, tytso |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.29.2 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 12398 | ||
Attachments: | Proposed patch from Al Viro |
Description
David Watson
2009-05-03 19:46:36 UTC
(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface). On Sun, 3 May 2009 19:48:21 GMT bugzilla-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=13232 > > Summary: ext3/4 with synchronous writes gets wedged by Postfix > Product: File System > Version: 2.5 > Kernel Version: 2.6.29.2 > Platform: All > OS/Version: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: ext3 > AssignedTo: fs_ext3@kernel-bugs.osdl.org > ReportedBy: kernel-nospam@dbwatson.ukfsn.org > Regression: Yes > > > Latest working kernel version: 2.6.28.9 > Earliest failing kernel version: 2.6.29 A very very bad regression. > Distribution: Ubuntu 8.04 LTS > Hardware Environment: Intel 875p chipset (ICH5), P4 with HT > Software Environment: Postfix 2.5.1-2ubuntu1.2 > > Problem Description: > > With its queue directory on an ext3 or journalled ext4 file > system, mounted with the 'sync' option (or with the 'S' attribute > on the individual queue directories, which is a config option for > the Ubuntu package), sending more than a few messages in quick > succession through Postfix causes Postfix and any other process > attempting to modify the file system afterwards (e.g. by updating > the atime when listing a directory) to hang in uninterruptible > sleep. sync(1) will also hang afterwards. If the FS is mounted > noatime, it is still possible to read from it, although trying to > list the affected queue directories still causes ls to hang (the > list of affected directories seems to vary, but may include at > least "active" and "maildrop"). > > This happens whether the filesystem is on a hard disk or a RAM > disk (/dev/ramX, not tmpfs). Enabling or disabling barriers, or > changing the journalling mode doesn't help. > > Ext2 seems to be unaffected, as does non-journalled ext4. XFS, > Reiser and JFS are also fine. The hang doesn't occur if the > queue directories have the 'D' attribute (and a non-sync mount) > instead of 'S'. > > Nothing appears in dmesg at the time of the hang, but here's the > output from SysRq-W afterwards (ext3 FS, kernel 2.6.29.2, > non-SMP, PREEMPT_NONE): > > SysRq : Show Blocked State > task PC stack pid father > kjournald D c01384df 0 2525 2 > cfcb5f0c 00000082 de27d500 c01384df cfcb5ef4 c02cb5c0 00000001 de32ca00 > de324814 dd037ebc de324814 de324934 dd037ebc cfcb5f5c cfcb5f90 c01bd4bb > 00000046 c0424110 de324a0c de324814 de324800 00000000 00000002 dd037e80 > Call Trace: > [<c01384df>] ? trace_hardirqs_on+0xb/0xd > [<c01bd4bb>] journal_commit_transaction+0xea/0xeaf > [<c02c534a>] ? _spin_unlock_irqrestore+0x38/0x3f > [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190 > [<c012b6ee>] ? autoremove_wake_function+0x0/0x38 > [<c0122a25>] ? del_timer+0x50/0x59 > [<c01c0c75>] kjournald+0xad/0x1bb > [<c012b6ee>] ? autoremove_wake_function+0x0/0x38 > [<c01c0bc8>] ? kjournald+0x0/0x1bb > [<c012b442>] kthread+0x37/0x59 > [<c012b40b>] ? kthread+0x0/0x59 > [<c0103667>] kernel_thread_helper+0x7/0x10 I assume this is while (commit_transaction->t_updates) { DEFINE_WAIT(wait); prepare_to_wait(&journal->j_wait_updates, &wait, TASK_UNINTERRUPTIBLE); if (commit_transaction->t_updates) { spin_unlock(&commit_transaction->t_handle_lock); spin_unlock(&journal->j_state_lock); schedule(); I'm wondering about : commit e219cca082f52e7dfea41f3be264b7b5eb204227 : Author: Theodore Ts'o <tytso@mit.edu> : AuthorDate: Thu Nov 6 22:37:59 2008 -0500 : Commit: Theodore Ts'o <tytso@mit.edu> : CommitDate: Thu Nov 6 22:37:59 2008 -0500 : : jbd: don't give up looking for space so easily in __log_wait_for_space but that patch was present in 2.6.28. > pickup D c01384df 0 2597 2594 > cfaa9e5c 00000086 df9faa80 c01384df cfaa9e44 00000282 cfaa9e74 de32ca00 > cfaa9e5c c012b8b7 00000002 de324800 0000014f cfaa9e74 cfaa9e94 c01c0539 > 00000000 de3248c8 de324910 de324814 00000000 df9faa80 c012b6ee de3248e4 > Call Trace: > [<c01384df>] ? trace_hardirqs_on+0xb/0xd > [<c012b8b7>] ? prepare_to_wait+0x42/0x4a > [<c01c0539>] log_wait_commit+0x90/0xf7 > [<c012b6ee>] ? autoremove_wake_function+0x0/0x38 > [<c01bba9d>] journal_stop+0x1c8/0x288 > [<c01b4236>] __ext3_journal_stop+0x1c/0x38 > [<c01aeb96>] ext3_delete_inode+0x90/0xc2 > [<c01aeb06>] ? ext3_delete_inode+0x0/0xc2 > [<c017ab82>] generic_delete_inode+0x72/0x100 > [<c02c4ee1>] ? _spin_lock+0x3a/0x40 > [<c017ad4c>] generic_drop_inode+0x13c/0x1da > [<c01d4068>] ? _atomic_dec_and_lock+0x10/0x38 > [<c017a4e7>] iput+0x47/0x4e > [<c0173db0>] do_unlinkat+0xc1/0x137 > [<c0102f87>] ? sysenter_exit+0xf/0x18 > [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190 > [<c0173e36>] sys_unlink+0x10/0x12 > [<c0102f55>] sysenter_do_call+0x12/0x35 > qmgr D c01384df 0 2598 2594 > cfe23e60 00000082 df9fa200 c01384df cfe23e48 c02cb5c0 cfe23e78 de32cf00 > cfe23e60 c012b8b7 00000002 de324800 0000014f cfe23e78 cfe23e98 c01c0539 > 00000000 de3248c8 de324910 de324814 00000000 df9fa200 c012b6ee cfaa9e80 > Call Trace: > [<c01384df>] ? trace_hardirqs_on+0xb/0xd > [<c012b8b7>] ? prepare_to_wait+0x42/0x4a > [<c01c0539>] log_wait_commit+0x90/0xf7 > [<c012b6ee>] ? autoremove_wake_function+0x0/0x38 > [<c01bba9d>] journal_stop+0x1c8/0x288 > [<c01ac11e>] ? ext3_mark_iloc_dirty+0x1d8/0x34b > [<c01b4236>] __ext3_journal_stop+0x1c/0x38 > [<c01b309d>] ext3_unlink+0x70/0x157 > [<c0172344>] ? vfs_unlink+0x45/0xb0 > [<c0172358>] vfs_unlink+0x59/0xb0 > [<c0173e17>] do_unlinkat+0x128/0x137 > [<c0102f87>] ? sysenter_exit+0xf/0x18 > [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190 > [<c0173e36>] sys_unlink+0x10/0x12 > [<c0102f55>] sysenter_do_call+0x12/0x35 > local D c01384df 0 2654 2594 > de369c74 00000096 df9f8880 c01384df de369c5c 00000286 00000000 de183b80 > de324814 de324800 de324814 dd037e80 de324800 de324880 de369cc4 c01bc306 > c0138489 00000246 00000046 00000001 de324a0c de324814 de324814 df59c060 > Call Trace: > [<c01384df>] ? trace_hardirqs_on+0xb/0xd > [<c01bc306>] start_this_handle+0x1c2/0x361 > [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190 > [<c012b6ee>] ? autoremove_wake_function+0x0/0x38 > [<c01bc54a>] journal_start+0xa5/0x100 > [<c01b43ac>] ext3_journal_start_sb+0x48/0x4d > [<c01aec4c>] ext3_write_begin+0x84/0x1c6 > [<c0139fe2>] ? __lock_acquire+0x32b/0xa21 > [<c014749e>] generic_file_buffered_write+0xe1/0x292 > [<c0147a94>] __generic_file_aio_write_nolock+0x25c/0x4ab > [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190 > [<c0147f63>] generic_file_aio_write+0x66/0xda > [<c0138f6b>] ? validate_chain+0x3f5/0x1141 > [<c01aad03>] ext3_file_write+0x27/0xa8 > [<c016a972>] do_sync_write+0xcc/0x102 > [<c012b6ee>] ? autoremove_wake_function+0x0/0x38 > [<c016b16d>] vfs_write+0x8b/0x11f > [<c0102f87>] ? sysenter_exit+0xf/0x18 > [<c016a8a6>] ? do_sync_write+0x0/0x102 > [<c016b577>] sys_write+0x3d/0x64 > [<c0102f55>] sysenter_do_call+0x12/0x35 > postdrop D c01384df 0 2664 2663 > cfcbfd6c 00000086 dd13f700 c01384df cfcbfd54 c02cb5c0 00000001 deedc780 > c03a1690 c1402348 c03a1690 cfcbfd7c c1402348 cfcbfd90 cfcbfd9c c017a55b > dd758c48 00000007 00000000 dd13f700 c012b726 c1402364 c1402364 00152b13 > Call Trace: > [<c01384df>] ? trace_hardirqs_on+0xb/0xd > [<c017a55b>] __wait_on_freeing_inode+0x6d/0x88 > [<c012b726>] ? wake_bit_function+0x0/0x47 > [<c017a5b5>] find_inode_fast+0x3f/0x4a > [<c017ba05>] insert_inode_locked+0x50/0xeb > [<c01ab90b>] ext3_new_inode+0x738/0x88f > [<c01bc550>] ? journal_start+0xab/0x100 > [<c01b259a>] ext3_create+0x59/0xbf > [<c01722c4>] vfs_create+0x75/0xb0 > [<c02c4dda>] ? _spin_unlock+0x1d/0x20 > [<c01b2541>] ? ext3_create+0x0/0xbf > [<c0174bc3>] do_filp_open+0x644/0x713 > [<c02c4dda>] ? _spin_unlock+0x1d/0x20 > [<c01692ce>] do_sys_open+0x45/0xce > [<c0102f87>] ? sysenter_exit+0xf/0x18 > [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190 > [<c01693a3>] sys_open+0x23/0x2b > [<c0102f55>] sysenter_do_call+0x12/0x35 > Sched Debug Version: v0.08, 2.6.29.2-test #1 > now at 166155.025991 msecs > .sysctl_sched_latency : 20.000000 > .sysctl_sched_min_granularity : 4.000000 > .sysctl_sched_wakeup_granularity : 5.000000 > .sysctl_sched_child_runs_first : 0.000001 > .sysctl_sched_features : 24191 > > cpu#0, 2798.181 MHz > .nr_running : 2 > .load : 2048 > .nr_switches : 284843 > .nr_load_updates : 6497 > .nr_uninterruptible : 5 > .jiffies : 4294953911 > .next_balance : 0.000000 > .curr->pid : 2466 > .clock : 166154.370058 > .cpu_load[0] : 0 > .cpu_load[1] : 4 > .cpu_load[2] : 76 > .cpu_load[3] : 160 > .cpu_load[4] : 196 > > cfs_rq[0]: > .exec_clock : 0.000000 > .MIN_vruntime : 24407.116340 > .min_vruntime : 24407.114218 > .max_vruntime : 24407.116340 > .spread : 0.000000 > .spread0 : 0.000000 > .nr_running : 2 > .load : 2048 > .nr_spread_over : 0 > > rt_rq[0]: > .rt_nr_running : 0 > .rt_throttled : 0 > .rt_time : 0.000000 > .rt_runtime : 950.000000 > > runnable tasks: > task PID tree-key switches prio exec-runtime > sum-exec sum-sleep > > ---------------------------------------------------------------------------------------------------------- > Xorg 2163 24407.116340 28791 120 0 > 0 0.000000 0.000000 0.000000 > R bash 2466 24387.418829 274 120 0 > 0 0.000000 0.000000 0.000000 > > > Steps to reproduce: > > # mount -o remount,sync /var > (or wherever queue_directory points to in the Postfix config) > > # while true; do sendmail testaddr@localhost </dev/null; done > Hi, > (switched to email. Please respond via emailed reply-to-all, not via the > bugzilla web interface). > > SysRq : Show Blocked State > > task PC stack pid father > > kjournald D c01384df 0 2525 2 > > cfcb5f0c 00000082 de27d500 c01384df cfcb5ef4 c02cb5c0 00000001 de32ca00 > > de324814 dd037ebc de324814 de324934 dd037ebc cfcb5f5c cfcb5f90 c01bd4bb > > 00000046 c0424110 de324a0c de324814 de324800 00000000 00000002 dd037e80 > > Call Trace: > > [<c01384df>] ? trace_hardirqs_on+0xb/0xd > > [<c01bd4bb>] journal_commit_transaction+0xea/0xeaf > > [<c02c534a>] ? _spin_unlock_irqrestore+0x38/0x3f > > [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190 > > [<c012b6ee>] ? autoremove_wake_function+0x0/0x38 > > [<c0122a25>] ? del_timer+0x50/0x59 > > [<c01c0c75>] kjournald+0xad/0x1bb > > [<c012b6ee>] ? autoremove_wake_function+0x0/0x38 > > [<c01c0bc8>] ? kjournald+0x0/0x1bb > > [<c012b442>] kthread+0x37/0x59 > > [<c012b40b>] ? kthread+0x0/0x59 > > [<c0103667>] kernel_thread_helper+0x7/0x10 > > I assume this is > > while (commit_transaction->t_updates) { > DEFINE_WAIT(wait); > > prepare_to_wait(&journal->j_wait_updates, &wait, > TASK_UNINTERRUPTIBLE); > if (commit_transaction->t_updates) { > spin_unlock(&commit_transaction->t_handle_lock); > spin_unlock(&journal->j_state_lock); > schedule(); Yes. > I'm wondering about > > : commit e219cca082f52e7dfea41f3be264b7b5eb204227 > : Author: Theodore Ts'o <tytso@mit.edu> > : AuthorDate: Thu Nov 6 22:37:59 2008 -0500 > : Commit: Theodore Ts'o <tytso@mit.edu> > : CommitDate: Thu Nov 6 22:37:59 2008 -0500 > : > : jbd: don't give up looking for space so easily in __log_wait_for_space > > but that patch was present in 2.6.28. Hmm, I don't see what made this deadlock happening - as far as I can see it's there for quite some time. See below... > > pickup D c01384df 0 2597 2594 > > cfaa9e5c 00000086 df9faa80 c01384df cfaa9e44 00000282 cfaa9e74 de32ca00 > > cfaa9e5c c012b8b7 00000002 de324800 0000014f cfaa9e74 cfaa9e94 c01c0539 > > 00000000 de3248c8 de324910 de324814 00000000 df9faa80 c012b6ee de3248e4 > > Call Trace: > > [<c01384df>] ? trace_hardirqs_on+0xb/0xd > > [<c012b8b7>] ? prepare_to_wait+0x42/0x4a > > [<c01c0539>] log_wait_commit+0x90/0xf7 > > [<c012b6ee>] ? autoremove_wake_function+0x0/0x38 > > [<c01bba9d>] journal_stop+0x1c8/0x288 > > [<c01b4236>] __ext3_journal_stop+0x1c/0x38 > > [<c01aeb96>] ext3_delete_inode+0x90/0xc2 > > [<c01aeb06>] ? ext3_delete_inode+0x0/0xc2 > > [<c017ab82>] generic_delete_inode+0x72/0x100 > > [<c02c4ee1>] ? _spin_lock+0x3a/0x40 > > [<c017ad4c>] generic_drop_inode+0x13c/0x1da > > [<c01d4068>] ? _atomic_dec_and_lock+0x10/0x38 > > [<c017a4e7>] iput+0x47/0x4e > > [<c0173db0>] do_unlinkat+0xc1/0x137 > > [<c0102f87>] ? sysenter_exit+0xf/0x18 > > [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190 > > [<c0173e36>] sys_unlink+0x10/0x12 > > [<c0102f55>] sysenter_do_call+0x12/0x35 In generic_delete_inode() we mark inode as I_FREEING. Then ext3_delete_inode() is called and because of O_SYNC it starts a transaction commit and waits for it. > > postdrop D c01384df 0 2664 2663 > > cfcbfd6c 00000086 dd13f700 c01384df cfcbfd54 c02cb5c0 00000001 deedc780 > > c03a1690 c1402348 c03a1690 cfcbfd7c c1402348 cfcbfd90 cfcbfd9c c017a55b > > dd758c48 00000007 00000000 dd13f700 c012b726 c1402364 c1402364 00152b13 > > Call Trace: > > [<c01384df>] ? trace_hardirqs_on+0xb/0xd > > [<c017a55b>] __wait_on_freeing_inode+0x6d/0x88 > > [<c012b726>] ? wake_bit_function+0x0/0x47 > > [<c017a5b5>] find_inode_fast+0x3f/0x4a > > [<c017ba05>] insert_inode_locked+0x50/0xeb > > [<c01ab90b>] ext3_new_inode+0x738/0x88f > > [<c01bc550>] ? journal_start+0xab/0x100 > > [<c01b259a>] ext3_create+0x59/0xbf > > [<c01722c4>] vfs_create+0x75/0xb0 > > [<c02c4dda>] ? _spin_unlock+0x1d/0x20 > > [<c01b2541>] ? ext3_create+0x0/0xbf > > [<c0174bc3>] do_filp_open+0x644/0x713 > > [<c02c4dda>] ? _spin_unlock+0x1d/0x20 > > [<c01692ce>] do_sys_open+0x45/0xce > > [<c0102f87>] ? sysenter_exit+0xf/0x18 > > [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190 > > [<c01693a3>] sys_open+0x23/0x2b > > [<c0102f55>] sysenter_do_call+0x12/0x35 Here, we have started a transaction in ext3_create() and then wait in find_inode_fast() for I_FREEING to be cleared (obviously we have reallocated the inode and squeezed the allocation before journal_stop() from the delete was called). Nasty deadlock and I don't see how to fix it now - have to go home for today... Tomorrow I'll have a look what we can do about it. Honza > http://bugzilla.kernel.org/show_bug.cgi?id=13232
>
> --- Comment #2 from Jan Kara <jack@suse.cz> 2009-05-12 16:56:04 ---
> Hi,
>
> > (switched to email. Please respond via emailed reply-to-all, not via the
> > bugzilla web interface).
>
> > > SysRq : Show Blocked State
> > > task PC stack pid father
> > > kjournald D c01384df 0 2525 2
> > > cfcb5f0c 00000082 de27d500 c01384df cfcb5ef4 c02cb5c0 00000001 de32ca00
> > > de324814 dd037ebc de324814 de324934 dd037ebc cfcb5f5c cfcb5f90 c01bd4bb
> > > 00000046 c0424110 de324a0c de324814 de324800 00000000 00000002 dd037e80
> > > Call Trace:
> > > [<c01384df>] ? trace_hardirqs_on+0xb/0xd
> > > [<c01bd4bb>] journal_commit_transaction+0xea/0xeaf
> > > [<c02c534a>] ? _spin_unlock_irqrestore+0x38/0x3f
> > > [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190
> > > [<c012b6ee>] ? autoremove_wake_function+0x0/0x38
> > > [<c0122a25>] ? del_timer+0x50/0x59
> > > [<c01c0c75>] kjournald+0xad/0x1bb
> > > [<c012b6ee>] ? autoremove_wake_function+0x0/0x38
> > > [<c01c0bc8>] ? kjournald+0x0/0x1bb
> > > [<c012b442>] kthread+0x37/0x59
> > > [<c012b40b>] ? kthread+0x0/0x59
> > > [<c0103667>] kernel_thread_helper+0x7/0x10
> >
> > I assume this is
> >
> > while (commit_transaction->t_updates) {
> > DEFINE_WAIT(wait);
> >
> > prepare_to_wait(&journal->j_wait_updates, &wait,
> > TASK_UNINTERRUPTIBLE);
> > if (commit_transaction->t_updates) {
> > spin_unlock(&commit_transaction->t_handle_lock);
> > spin_unlock(&journal->j_state_lock);
> > schedule();
> Yes.
>
> > I'm wondering about
> >
> > : commit e219cca082f52e7dfea41f3be264b7b5eb204227
> > : Author: Theodore Ts'o <tytso@mit.edu>
> > : AuthorDate: Thu Nov 6 22:37:59 2008 -0500
> > : Commit: Theodore Ts'o <tytso@mit.edu>
> > : CommitDate: Thu Nov 6 22:37:59 2008 -0500
> > :
> > : jbd: don't give up looking for space so easily in __log_wait_for_space
> >
> > but that patch was present in 2.6.28.
> Hmm, I don't see what made this deadlock happening - as far as I can
> see it's there for quite some time. See below...
>
> > > pickup D c01384df 0 2597 2594
> > > cfaa9e5c 00000086 df9faa80 c01384df cfaa9e44 00000282 cfaa9e74 de32ca00
> > > cfaa9e5c c012b8b7 00000002 de324800 0000014f cfaa9e74 cfaa9e94 c01c0539
> > > 00000000 de3248c8 de324910 de324814 00000000 df9faa80 c012b6ee de3248e4
> > > Call Trace:
> > > [<c01384df>] ? trace_hardirqs_on+0xb/0xd
> > > [<c012b8b7>] ? prepare_to_wait+0x42/0x4a
> > > [<c01c0539>] log_wait_commit+0x90/0xf7
> > > [<c012b6ee>] ? autoremove_wake_function+0x0/0x38
> > > [<c01bba9d>] journal_stop+0x1c8/0x288
> > > [<c01b4236>] __ext3_journal_stop+0x1c/0x38
> > > [<c01aeb96>] ext3_delete_inode+0x90/0xc2
> > > [<c01aeb06>] ? ext3_delete_inode+0x0/0xc2
> > > [<c017ab82>] generic_delete_inode+0x72/0x100
> > > [<c02c4ee1>] ? _spin_lock+0x3a/0x40
> > > [<c017ad4c>] generic_drop_inode+0x13c/0x1da
> > > [<c01d4068>] ? _atomic_dec_and_lock+0x10/0x38
> > > [<c017a4e7>] iput+0x47/0x4e
> > > [<c0173db0>] do_unlinkat+0xc1/0x137
> > > [<c0102f87>] ? sysenter_exit+0xf/0x18
> > > [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190
> > > [<c0173e36>] sys_unlink+0x10/0x12
> > > [<c0102f55>] sysenter_do_call+0x12/0x35
> In generic_delete_inode() we mark inode as I_FREEING. Then
> ext3_delete_inode() is called and because of O_SYNC it starts a
> transaction commit and waits for it.
>
> > > postdrop D c01384df 0 2664 2663
> > > cfcbfd6c 00000086 dd13f700 c01384df cfcbfd54 c02cb5c0 00000001 deedc780
> > > c03a1690 c1402348 c03a1690 cfcbfd7c c1402348 cfcbfd90 cfcbfd9c c017a55b
> > > dd758c48 00000007 00000000 dd13f700 c012b726 c1402364 c1402364 00152b13
> > > Call Trace:
> > > [<c01384df>] ? trace_hardirqs_on+0xb/0xd
> > > [<c017a55b>] __wait_on_freeing_inode+0x6d/0x88
> > > [<c012b726>] ? wake_bit_function+0x0/0x47
> > > [<c017a5b5>] find_inode_fast+0x3f/0x4a
> > > [<c017ba05>] insert_inode_locked+0x50/0xeb
> > > [<c01ab90b>] ext3_new_inode+0x738/0x88f
> > > [<c01bc550>] ? journal_start+0xab/0x100
> > > [<c01b259a>] ext3_create+0x59/0xbf
> > > [<c01722c4>] vfs_create+0x75/0xb0
> > > [<c02c4dda>] ? _spin_unlock+0x1d/0x20
> > > [<c01b2541>] ? ext3_create+0x0/0xbf
> > > [<c0174bc3>] do_filp_open+0x644/0x713
> > > [<c02c4dda>] ? _spin_unlock+0x1d/0x20
> > > [<c01692ce>] do_sys_open+0x45/0xce
> > > [<c0102f87>] ? sysenter_exit+0xf/0x18
> > > [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190
> > > [<c01693a3>] sys_open+0x23/0x2b
> > > [<c0102f55>] sysenter_do_call+0x12/0x35
> Here, we have started a transaction in ext3_create() and then wait in
> find_inode_fast() for I_FREEING to be cleared (obviously we have
> reallocated the inode and squeezed the allocation before journal_stop()
> from the delete was called).
> Nasty deadlock and I don't see how to fix it now - have to go home for
> today... Tomorrow I'll have a look what we can do about it.
OK, the deadlock has been introduced by ext3 variant of
261bca86ed4f7f391d1938167624e78da61dcc6b (adding Al to CC). The deadlock
is really tough to avoid - we have to first allocate inode on disk so
that we know the inode number. For this we need transaction open but we
cannot afford waiting for old inode with same INO to be freed when we have
transaction open because of the above deadlock. So we'd have to wait for
inode release only after everything is done and we closed the transaction. But
that would mean reordering a lot of code in ext3/namei.c so that all the
dcache handling is done after all the IO is done.
Hmm, maybe we could change the delete side of the deadlock but that's
going to be tricky as well :(.
Al, any idea if we could somehow get away without waiting on
I_FREEING?
Honza
On Wed, May 13, 2009 at 03:48:02PM +0200, Jan Kara wrote: > OK, the deadlock has been introduced by ext3 variant of > 261bca86ed4f7f391d1938167624e78da61dcc6b (adding Al to CC). What do you mean by this? I'm puzzled why we haven't hit this before. This looks like long-standing issue; what unmasked it now? > The deadlock > is really tough to avoid - we have to first allocate inode on disk so > that we know the inode number. Well, the simple thing to do is to have a way of quickly determining that a particular inode number is in the I_FREEING state, and simply try to avoid using that inode number. If there are no inodes available, it can simply close the handle (since nothing else has changed at that point), wait for the current transaction to close, and then try again. That should fix the problem, I think. - Ted I tried to run Postfix on ubuntu 9.04 server version and followed the reporduce command, had the same problem. kernel version: 2.6.30-rc4 Distribution: Ubuntu 9.04 Software Environment: Postfix 2.5.1-2ubuntu1.2 also have following message poping up on control terminal: [844,700070] INFO: task kjournald: 628 blocked for more than 120 seconds, [844,700148] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this meassage. Reply-To: viro@ZenIV.linux.org.uk On Wed, May 13, 2009 at 03:48:02PM +0200, Jan Kara wrote: > > Here, we have started a transaction in ext3_create() and then wait in > > find_inode_fast() for I_FREEING to be cleared (obviously we have > > reallocated the inode and squeezed the allocation before journal_stop() > > from the delete was called). > > Nasty deadlock and I don't see how to fix it now - have to go home for > > today... Tomorrow I'll have a look what we can do about it. > OK, the deadlock has been introduced by ext3 variant of > 261bca86ed4f7f391d1938167624e78da61dcc6b (adding Al to CC). The deadlock > is really tough to avoid - we have to first allocate inode on disk so > that we know the inode number. For this we need transaction open but we > cannot afford waiting for old inode with same INO to be freed when we have > transaction open because of the above deadlock. So we'd have to wait for > inode release only after everything is done and we closed the transaction. > But > that would mean reordering a lot of code in ext3/namei.c so that all the > dcache handling is done after all the IO is done. > Hmm, maybe we could change the delete side of the deadlock but that's > going to be tricky as well :(. > Al, any idea if we could somehow get away without waiting on > I_FREEING? At which point do we actually run into deadlock on delete side? We could, in principle, skip everything like that in insert_inode_locked(), but I would rather avoid the "two inodes in icache at the same time, with the same inumber" situations completely. We might get away with that, since everything else *will* wait, so we can afford a bunch of inodes past the point in foo_delete_inode() that has cleared it in bitmap + new locked one, but if it's at all possible to avoid, I'd rather avoid it. Reply-To: viro@ZenIV.linux.org.uk On Wed, May 13, 2009 at 05:52:54PM +0100, Al Viro wrote: > On Wed, May 13, 2009 at 03:48:02PM +0200, Jan Kara wrote: > > > Here, we have started a transaction in ext3_create() and then wait in > > > find_inode_fast() for I_FREEING to be cleared (obviously we have > > > reallocated the inode and squeezed the allocation before journal_stop() > > > from the delete was called). > > > Nasty deadlock and I don't see how to fix it now - have to go home for > > > today... Tomorrow I'll have a look what we can do about it. > > OK, the deadlock has been introduced by ext3 variant of > > 261bca86ed4f7f391d1938167624e78da61dcc6b (adding Al to CC). The deadlock > > is really tough to avoid - we have to first allocate inode on disk so > > that we know the inode number. For this we need transaction open but we > > cannot afford waiting for old inode with same INO to be freed when we have > > transaction open because of the above deadlock. So we'd have to wait for > > inode release only after everything is done and we closed the transaction. > But > > that would mean reordering a lot of code in ext3/namei.c so that all the > > dcache handling is done after all the IO is done. > > Hmm, maybe we could change the delete side of the deadlock but that's > > going to be tricky as well :(. > > Al, any idea if we could somehow get away without waiting on > > I_FREEING? > > At which point do we actually run into deadlock on delete side? We could, > in principle, skip everything like that in insert_inode_locked(), but > I would rather avoid the "two inodes in icache at the same time, with the > same inumber" situations completely. We might get away with that, since > everything else *will* wait, so we can afford a bunch of inodes past the > point in foo_delete_inode() that has cleared it in bitmap + new locked > one, but if it's at all possible to avoid, I'd rather avoid it. OK, that's probably the easiest way to do that, as much as I don't like it... Since iget() et.al. will not accept I_FREEING (will wait to go away and restart), and since we'd better have serialization between new/free on fs data structures anyway, we can afford simply skipping I_FREEING et.al. in insert_inode_locked(). We do that from new_inode, so it won't race with free_inode in any interesting ways and it won't race with iget (of any origin; nfsd or in case of fs corruption a lookup) since both still will wait for I_LOCK. Tentative patch follow; folks, I would very much like review on that one, since I'm far too low on caffeine and the area is nasty. diff --git a/fs/inode.c b/fs/inode.c index 9d26490..4406952 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1053,13 +1053,22 @@ int insert_inode_locked(struct inode *inode) struct super_block *sb = inode->i_sb; ino_t ino = inode->i_ino; struct hlist_head *head = inode_hashtable + hash(sb, ino); - struct inode *old; inode->i_state |= I_LOCK|I_NEW; while (1) { + struct hlist_node *node; + struct inode *old = NULL; spin_lock(&inode_lock); - old = find_inode_fast(sb, head, ino); - if (likely(!old)) { + hlist_for_each_entry(old, node, head, i_hash) { + if (old->i_ino != ino) + continue; + if (old->i_sb != sb) + continue; + if (old->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) + continue; + break; + } + if (likely(!node)) { hlist_add_head(&inode->i_hash, head); spin_unlock(&inode_lock); return 0; @@ -1081,14 +1090,24 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval, { struct super_block *sb = inode->i_sb; struct hlist_head *head = inode_hashtable + hash(sb, hashval); - struct inode *old; inode->i_state |= I_LOCK|I_NEW; while (1) { + struct hlist_node *node; + struct inode *old = NULL; + spin_lock(&inode_lock); - old = find_inode(sb, head, test, data); - if (likely(!old)) { + hlist_for_each_entry(old, node, head, i_hash) { + if (old->i_sb != sb) + continue; + if (!test(old, data)) + continue; + if (old->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) + continue; + break; + } + if (likely(!node)) { hlist_add_head(&inode->i_hash, head); spin_unlock(&inode_lock); return 0; On Wed 13-05-09 12:07:24, Theodore Tso wrote: > On Wed, May 13, 2009 at 03:48:02PM +0200, Jan Kara wrote: > > OK, the deadlock has been introduced by ext3 variant of > > 261bca86ed4f7f391d1938167624e78da61dcc6b (adding Al to CC). > > What do you mean by this? > > I'm puzzled why we haven't hit this before. This looks like > long-standing issue; what unmasked it now? Unless you mount the fs with 'sync' option, hitting this is much harder (the window is quite small in nosync case). I think that is the main reason why we didn't see this earlier. > > The deadlock > > is really tough to avoid - we have to first allocate inode on disk so > > that we know the inode number. > > Well, the simple thing to do is to have a way of quickly determining > that a particular inode number is in the I_FREEING state, and simply > try to avoid using that inode number. If there are no inodes > available, it can simply close the handle (since nothing else has > changed at that point), wait for the current transaction to close, and > then try again. That should fix the problem, I think. Yes, we could work-around it like that but other filesystems might need similar things and generally it would be nicer if we could avoid using this vfs-internal information in the filesystems. Al seems to have found some other solution without changing filesystems so that would be easier for us... Honza On Wed 13-05-09 17:52:54, Al Viro wrote:
> On Wed, May 13, 2009 at 03:48:02PM +0200, Jan Kara wrote:
> > > Here, we have started a transaction in ext3_create() and then wait in
> > > find_inode_fast() for I_FREEING to be cleared (obviously we have
> > > reallocated the inode and squeezed the allocation before journal_stop()
> > > from the delete was called).
> > > Nasty deadlock and I don't see how to fix it now - have to go home for
> > > today... Tomorrow I'll have a look what we can do about it.
> > OK, the deadlock has been introduced by ext3 variant of
> > 261bca86ed4f7f391d1938167624e78da61dcc6b (adding Al to CC). The deadlock
> > is really tough to avoid - we have to first allocate inode on disk so
> > that we know the inode number. For this we need transaction open but we
> > cannot afford waiting for old inode with same INO to be freed when we have
> > transaction open because of the above deadlock. So we'd have to wait for
> > inode release only after everything is done and we closed the transaction.
> But
> > that would mean reordering a lot of code in ext3/namei.c so that all the
> > dcache handling is done after all the IO is done.
> > Hmm, maybe we could change the delete side of the deadlock but that's
> > going to be tricky as well :(.
> > Al, any idea if we could somehow get away without waiting on
> > I_FREEING?
>
> At which point do we actually run into deadlock on delete side? We could,
> in principle, skip everything like that in insert_inode_locked(), but
> I would rather avoid the "two inodes in icache at the same time, with the
> same inumber" situations completely. We might get away with that, since
> everything else *will* wait, so we can afford a bunch of inodes past the
> point in foo_delete_inode() that has cleared it in bitmap + new locked
> one, but if it's at all possible to avoid, I'd rather avoid it.
The ordering we see on delete when the filesystem is mounted with 'sync'
option is:
DELETE CREATE
generic_delete_inode()
set I_FREEING
ext3_delete_inode
get transaction handle
do work
get transaction handle
ext3_new_inode()
reallocate inode
insert_inode_locked()
stop transaction, wait for it to commit
(waiting for CREATE process to drop its
transaction reference)
Now similar race can happen even without 'sync' mount option but it's
much harder to hit:
DELETE CREATE
generic_delete_inode()
set I_FREEING
ext3_delete_inode
get transaction handle
ext3_new_inode()
reallocate inode
insert_inode_locked()
try to get transaction handle -
- transaction is too big so we send
current transaction to commit which
again waits for CREATE to drop its
reference.
Honza
On Wed, May 13, 2009 at 07:13:40PM +0100, Al Viro wrote: > > OK, that's probably the easiest way to do that, as much as I don't like it... > Since iget() et.al. will not accept I_FREEING (will wait to go away > and restart), and since we'd better have serialization between new/free > on fs data structures anyway, we can afford simply skipping I_FREEING > et.al. in insert_inode_locked(). > > We do that from new_inode, so it won't race with free_inode in any > interesting > ways and it won't race with iget (of any origin; nfsd or in case of fs > corruption a lookup) since both still will wait for I_LOCK. > > Tentative patch follow; folks, I would very much like review on that one, > since I'm far too low on caffeine and the area is nasty. Sorry for not having time to review this until now. This looks good to me. Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> So Bug #13232 is currently marked as a 2.6.28 regression; do we feel confident enough to push this to Linus for 2.6.30? - Ted On Wed 13-05-09 19:13:40, Al Viro wrote: > On Wed, May 13, 2009 at 05:52:54PM +0100, Al Viro wrote: > > On Wed, May 13, 2009 at 03:48:02PM +0200, Jan Kara wrote: > > > > Here, we have started a transaction in ext3_create() and then wait in > > > > find_inode_fast() for I_FREEING to be cleared (obviously we have > > > > reallocated the inode and squeezed the allocation before journal_stop() > > > > from the delete was called). > > > > Nasty deadlock and I don't see how to fix it now - have to go home > for > > > > today... Tomorrow I'll have a look what we can do about it. > > > OK, the deadlock has been introduced by ext3 variant of > > > 261bca86ed4f7f391d1938167624e78da61dcc6b (adding Al to CC). The deadlock > > > is really tough to avoid - we have to first allocate inode on disk so > > > that we know the inode number. For this we need transaction open but we > > > cannot afford waiting for old inode with same INO to be freed when we > have > > > transaction open because of the above deadlock. So we'd have to wait for > > > inode release only after everything is done and we closed the > transaction. But > > > that would mean reordering a lot of code in ext3/namei.c so that all the > > > dcache handling is done after all the IO is done. > > > Hmm, maybe we could change the delete side of the deadlock but that's > > > going to be tricky as well :(. > > > Al, any idea if we could somehow get away without waiting on > > > I_FREEING? > > > > At which point do we actually run into deadlock on delete side? We could, > > in principle, skip everything like that in insert_inode_locked(), but > > I would rather avoid the "two inodes in icache at the same time, with the > > same inumber" situations completely. We might get away with that, since > > everything else *will* wait, so we can afford a bunch of inodes past the > > point in foo_delete_inode() that has cleared it in bitmap + new locked > > one, but if it's at all possible to avoid, I'd rather avoid it. > > OK, that's probably the easiest way to do that, as much as I don't like it... > Since iget() et.al. will not accept I_FREEING (will wait to go away > and restart), and since we'd better have serialization between new/free > on fs data structures anyway, we can afford simply skipping I_FREEING > et.al. in insert_inode_locked(). > > We do that from new_inode, so it won't race with free_inode in any > interesting > ways and it won't race with iget (of any origin; nfsd or in case of fs > corruption a lookup) since both still will wait for I_LOCK. > > Tentative patch follow; folks, I would very much like review on that one, > since I'm far too low on caffeine and the area is nasty. The patch looks fine. Everyone else will either get new inode and wait for I_LOCK or get old inode and wait for I_FREEING so everything should be fine... You can add. Acked-by: Jan Kara <jack@suse.cz> Honza > > diff --git a/fs/inode.c b/fs/inode.c > index 9d26490..4406952 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -1053,13 +1053,22 @@ int insert_inode_locked(struct inode *inode) > struct super_block *sb = inode->i_sb; > ino_t ino = inode->i_ino; > struct hlist_head *head = inode_hashtable + hash(sb, ino); > - struct inode *old; > > inode->i_state |= I_LOCK|I_NEW; > while (1) { > + struct hlist_node *node; > + struct inode *old = NULL; > spin_lock(&inode_lock); > - old = find_inode_fast(sb, head, ino); > - if (likely(!old)) { > + hlist_for_each_entry(old, node, head, i_hash) { > + if (old->i_ino != ino) > + continue; > + if (old->i_sb != sb) > + continue; > + if (old->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) > + continue; > + break; > + } > + if (likely(!node)) { > hlist_add_head(&inode->i_hash, head); > spin_unlock(&inode_lock); > return 0; > @@ -1081,14 +1090,24 @@ int insert_inode_locked4(struct inode *inode, > unsigned long hashval, > { > struct super_block *sb = inode->i_sb; > struct hlist_head *head = inode_hashtable + hash(sb, hashval); > - struct inode *old; > > inode->i_state |= I_LOCK|I_NEW; > > while (1) { > + struct hlist_node *node; > + struct inode *old = NULL; > + > spin_lock(&inode_lock); > - old = find_inode(sb, head, test, data); > - if (likely(!old)) { > + hlist_for_each_entry(old, node, head, i_hash) { > + if (old->i_sb != sb) > + continue; > + if (!test(old, data)) > + continue; > + if (old->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) > + continue; > + break; > + } > + if (likely(!node)) { > hlist_add_head(&inode->i_hash, head); > spin_unlock(&inode_lock); > return 0; Created attachment 21436 [details]
Proposed patch from Al Viro
This patch is now in the upstream kernel, so it will be in 2.6.30. |