Bug 13232

Summary: ext3/4 with synchronous writes gets wedged by Postfix
Product: File System Reporter: David Watson (kernel-nospam)
Component: ext3Assignee: fs_ext3 (fs_ext3)
Status: CLOSED CODE_FIX    
Severity: normal CC: fgiunchedi, macli, ole, rjw, tytso
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.29.2 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 12398    
Attachments: Proposed patch from Al Viro

Description David Watson 2009-05-03 19:46:36 UTC
Latest working kernel version: 2.6.28.9
Earliest failing kernel version: 2.6.29
Distribution: Ubuntu 8.04 LTS
Hardware Environment: Intel 875p chipset (ICH5), P4 with HT
Software Environment: Postfix 2.5.1-2ubuntu1.2

Problem Description:

With its queue directory on an ext3 or journalled ext4 file
system, mounted with the 'sync' option (or with the 'S' attribute
on the individual queue directories, which is a config option for
the Ubuntu package), sending more than a few messages in quick
succession through Postfix causes Postfix and any other process
attempting to modify the file system afterwards (e.g. by updating
the atime when listing a directory) to hang in uninterruptible
sleep.  sync(1) will also hang afterwards.  If the FS is mounted
noatime, it is still possible to read from it, although trying to
list the affected queue directories still causes ls to hang (the
list of affected directories seems to vary, but may include at
least "active" and "maildrop").

This happens whether the filesystem is on a hard disk or a RAM
disk (/dev/ramX, not tmpfs).  Enabling or disabling barriers, or
changing the journalling mode doesn't help.

Ext2 seems to be unaffected, as does non-journalled ext4.  XFS,
Reiser and JFS are also fine.  The hang doesn't occur if the
queue directories have the 'D' attribute (and a non-sync mount)
instead of 'S'.

Nothing appears in dmesg at the time of the hang, but here's the
output from SysRq-W afterwards (ext3 FS, kernel 2.6.29.2,
non-SMP, PREEMPT_NONE):

SysRq : Show Blocked State
  task                PC stack   pid father
kjournald     D c01384df     0  2525      2
 cfcb5f0c 00000082 de27d500 c01384df cfcb5ef4 c02cb5c0 00000001 de32ca00
 de324814 dd037ebc de324814 de324934 dd037ebc cfcb5f5c cfcb5f90 c01bd4bb
 00000046 c0424110 de324a0c de324814 de324800 00000000 00000002 dd037e80
Call Trace:
 [<c01384df>] ? trace_hardirqs_on+0xb/0xd
 [<c01bd4bb>] journal_commit_transaction+0xea/0xeaf
 [<c02c534a>] ? _spin_unlock_irqrestore+0x38/0x3f
 [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190
 [<c012b6ee>] ? autoremove_wake_function+0x0/0x38
 [<c0122a25>] ? del_timer+0x50/0x59
 [<c01c0c75>] kjournald+0xad/0x1bb
 [<c012b6ee>] ? autoremove_wake_function+0x0/0x38
 [<c01c0bc8>] ? kjournald+0x0/0x1bb
 [<c012b442>] kthread+0x37/0x59
 [<c012b40b>] ? kthread+0x0/0x59
 [<c0103667>] kernel_thread_helper+0x7/0x10
pickup        D c01384df     0  2597   2594
 cfaa9e5c 00000086 df9faa80 c01384df cfaa9e44 00000282 cfaa9e74 de32ca00
 cfaa9e5c c012b8b7 00000002 de324800 0000014f cfaa9e74 cfaa9e94 c01c0539
 00000000 de3248c8 de324910 de324814 00000000 df9faa80 c012b6ee de3248e4
Call Trace:
 [<c01384df>] ? trace_hardirqs_on+0xb/0xd
 [<c012b8b7>] ? prepare_to_wait+0x42/0x4a
 [<c01c0539>] log_wait_commit+0x90/0xf7
 [<c012b6ee>] ? autoremove_wake_function+0x0/0x38
 [<c01bba9d>] journal_stop+0x1c8/0x288
 [<c01b4236>] __ext3_journal_stop+0x1c/0x38
 [<c01aeb96>] ext3_delete_inode+0x90/0xc2
 [<c01aeb06>] ? ext3_delete_inode+0x0/0xc2
 [<c017ab82>] generic_delete_inode+0x72/0x100
 [<c02c4ee1>] ? _spin_lock+0x3a/0x40
 [<c017ad4c>] generic_drop_inode+0x13c/0x1da
 [<c01d4068>] ? _atomic_dec_and_lock+0x10/0x38
 [<c017a4e7>] iput+0x47/0x4e
 [<c0173db0>] do_unlinkat+0xc1/0x137
 [<c0102f87>] ? sysenter_exit+0xf/0x18
 [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190
 [<c0173e36>] sys_unlink+0x10/0x12
 [<c0102f55>] sysenter_do_call+0x12/0x35
qmgr          D c01384df     0  2598   2594
 cfe23e60 00000082 df9fa200 c01384df cfe23e48 c02cb5c0 cfe23e78 de32cf00
 cfe23e60 c012b8b7 00000002 de324800 0000014f cfe23e78 cfe23e98 c01c0539
 00000000 de3248c8 de324910 de324814 00000000 df9fa200 c012b6ee cfaa9e80
Call Trace:
 [<c01384df>] ? trace_hardirqs_on+0xb/0xd
 [<c012b8b7>] ? prepare_to_wait+0x42/0x4a
 [<c01c0539>] log_wait_commit+0x90/0xf7
 [<c012b6ee>] ? autoremove_wake_function+0x0/0x38
 [<c01bba9d>] journal_stop+0x1c8/0x288
 [<c01ac11e>] ? ext3_mark_iloc_dirty+0x1d8/0x34b
 [<c01b4236>] __ext3_journal_stop+0x1c/0x38
 [<c01b309d>] ext3_unlink+0x70/0x157
 [<c0172344>] ? vfs_unlink+0x45/0xb0
 [<c0172358>] vfs_unlink+0x59/0xb0
 [<c0173e17>] do_unlinkat+0x128/0x137
 [<c0102f87>] ? sysenter_exit+0xf/0x18
 [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190
 [<c0173e36>] sys_unlink+0x10/0x12
 [<c0102f55>] sysenter_do_call+0x12/0x35
local         D c01384df     0  2654   2594
 de369c74 00000096 df9f8880 c01384df de369c5c 00000286 00000000 de183b80
 de324814 de324800 de324814 dd037e80 de324800 de324880 de369cc4 c01bc306
 c0138489 00000246 00000046 00000001 de324a0c de324814 de324814 df59c060
Call Trace:
 [<c01384df>] ? trace_hardirqs_on+0xb/0xd
 [<c01bc306>] start_this_handle+0x1c2/0x361
 [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190
 [<c012b6ee>] ? autoremove_wake_function+0x0/0x38
 [<c01bc54a>] journal_start+0xa5/0x100
 [<c01b43ac>] ext3_journal_start_sb+0x48/0x4d
 [<c01aec4c>] ext3_write_begin+0x84/0x1c6
 [<c0139fe2>] ? __lock_acquire+0x32b/0xa21
 [<c014749e>] generic_file_buffered_write+0xe1/0x292
 [<c0147a94>] __generic_file_aio_write_nolock+0x25c/0x4ab
 [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190
 [<c0147f63>] generic_file_aio_write+0x66/0xda
 [<c0138f6b>] ? validate_chain+0x3f5/0x1141
 [<c01aad03>] ext3_file_write+0x27/0xa8
 [<c016a972>] do_sync_write+0xcc/0x102
 [<c012b6ee>] ? autoremove_wake_function+0x0/0x38
 [<c016b16d>] vfs_write+0x8b/0x11f
 [<c0102f87>] ? sysenter_exit+0xf/0x18
 [<c016a8a6>] ? do_sync_write+0x0/0x102
 [<c016b577>] sys_write+0x3d/0x64
 [<c0102f55>] sysenter_do_call+0x12/0x35
postdrop      D c01384df     0  2664   2663
 cfcbfd6c 00000086 dd13f700 c01384df cfcbfd54 c02cb5c0 00000001 deedc780
 c03a1690 c1402348 c03a1690 cfcbfd7c c1402348 cfcbfd90 cfcbfd9c c017a55b
 dd758c48 00000007 00000000 dd13f700 c012b726 c1402364 c1402364 00152b13
Call Trace:
 [<c01384df>] ? trace_hardirqs_on+0xb/0xd
 [<c017a55b>] __wait_on_freeing_inode+0x6d/0x88
 [<c012b726>] ? wake_bit_function+0x0/0x47
 [<c017a5b5>] find_inode_fast+0x3f/0x4a
 [<c017ba05>] insert_inode_locked+0x50/0xeb
 [<c01ab90b>] ext3_new_inode+0x738/0x88f
 [<c01bc550>] ? journal_start+0xab/0x100
 [<c01b259a>] ext3_create+0x59/0xbf
 [<c01722c4>] vfs_create+0x75/0xb0
 [<c02c4dda>] ? _spin_unlock+0x1d/0x20
 [<c01b2541>] ? ext3_create+0x0/0xbf
 [<c0174bc3>] do_filp_open+0x644/0x713
 [<c02c4dda>] ? _spin_unlock+0x1d/0x20
 [<c01692ce>] do_sys_open+0x45/0xce
 [<c0102f87>] ? sysenter_exit+0xf/0x18
 [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190
 [<c01693a3>] sys_open+0x23/0x2b
 [<c0102f55>] sysenter_do_call+0x12/0x35
Sched Debug Version: v0.08, 2.6.29.2-test #1
now at 166155.025991 msecs
  .sysctl_sched_latency                    : 20.000000
  .sysctl_sched_min_granularity            : 4.000000
  .sysctl_sched_wakeup_granularity         : 5.000000
  .sysctl_sched_child_runs_first           : 0.000001
  .sysctl_sched_features                   : 24191

cpu#0, 2798.181 MHz
  .nr_running                    : 2
  .load                          : 2048
  .nr_switches                   : 284843
  .nr_load_updates               : 6497
  .nr_uninterruptible            : 5
  .jiffies                       : 4294953911
  .next_balance                  : 0.000000
  .curr->pid                     : 2466
  .clock                         : 166154.370058
  .cpu_load[0]                   : 0
  .cpu_load[1]                   : 4
  .cpu_load[2]                   : 76
  .cpu_load[3]                   : 160
  .cpu_load[4]                   : 196

cfs_rq[0]:
  .exec_clock                    : 0.000000
  .MIN_vruntime                  : 24407.116340
  .min_vruntime                  : 24407.114218
  .max_vruntime                  : 24407.116340
  .spread                        : 0.000000
  .spread0                       : 0.000000
  .nr_running                    : 2
  .load                          : 2048
  .nr_spread_over                : 0

rt_rq[0]:
  .rt_nr_running                 : 0
  .rt_throttled                  : 0
  .rt_time                       : 0.000000
  .rt_runtime                    : 950.000000

runnable tasks:
            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
            Xorg  2163     24407.116340     28791   120               0               0               0.000000               0.000000               0.000000
R           bash  2466     24387.418829       274   120               0               0               0.000000               0.000000               0.000000


Steps to reproduce:

# mount -o remount,sync /var
  (or wherever queue_directory points to in the Postfix config)

# while true; do sendmail testaddr@localhost </dev/null; done
Comment 1 Andrew Morton 2009-05-05 21:58:17 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Sun, 3 May 2009 19:48:21 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=13232
> 
>            Summary: ext3/4 with synchronous writes gets wedged by Postfix
>            Product: File System
>            Version: 2.5
>     Kernel Version: 2.6.29.2
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: ext3
>         AssignedTo: fs_ext3@kernel-bugs.osdl.org
>         ReportedBy: kernel-nospam@dbwatson.ukfsn.org
>         Regression: Yes
> 
> 
> Latest working kernel version: 2.6.28.9
> Earliest failing kernel version: 2.6.29

A very very bad regression.

> Distribution: Ubuntu 8.04 LTS
> Hardware Environment: Intel 875p chipset (ICH5), P4 with HT
> Software Environment: Postfix 2.5.1-2ubuntu1.2
> 
> Problem Description:
> 
> With its queue directory on an ext3 or journalled ext4 file
> system, mounted with the 'sync' option (or with the 'S' attribute
> on the individual queue directories, which is a config option for
> the Ubuntu package), sending more than a few messages in quick
> succession through Postfix causes Postfix and any other process
> attempting to modify the file system afterwards (e.g. by updating
> the atime when listing a directory) to hang in uninterruptible
> sleep.  sync(1) will also hang afterwards.  If the FS is mounted
> noatime, it is still possible to read from it, although trying to
> list the affected queue directories still causes ls to hang (the
> list of affected directories seems to vary, but may include at
> least "active" and "maildrop").
> 
> This happens whether the filesystem is on a hard disk or a RAM
> disk (/dev/ramX, not tmpfs).  Enabling or disabling barriers, or
> changing the journalling mode doesn't help.
> 
> Ext2 seems to be unaffected, as does non-journalled ext4.  XFS,
> Reiser and JFS are also fine.  The hang doesn't occur if the
> queue directories have the 'D' attribute (and a non-sync mount)
> instead of 'S'.
> 
> Nothing appears in dmesg at the time of the hang, but here's the
> output from SysRq-W afterwards (ext3 FS, kernel 2.6.29.2,
> non-SMP, PREEMPT_NONE):
> 
> SysRq : Show Blocked State
>   task                PC stack   pid father
> kjournald     D c01384df     0  2525      2
>  cfcb5f0c 00000082 de27d500 c01384df cfcb5ef4 c02cb5c0 00000001 de32ca00
>  de324814 dd037ebc de324814 de324934 dd037ebc cfcb5f5c cfcb5f90 c01bd4bb
>  00000046 c0424110 de324a0c de324814 de324800 00000000 00000002 dd037e80
> Call Trace:
>  [<c01384df>] ? trace_hardirqs_on+0xb/0xd
>  [<c01bd4bb>] journal_commit_transaction+0xea/0xeaf
>  [<c02c534a>] ? _spin_unlock_irqrestore+0x38/0x3f
>  [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190
>  [<c012b6ee>] ? autoremove_wake_function+0x0/0x38
>  [<c0122a25>] ? del_timer+0x50/0x59
>  [<c01c0c75>] kjournald+0xad/0x1bb
>  [<c012b6ee>] ? autoremove_wake_function+0x0/0x38
>  [<c01c0bc8>] ? kjournald+0x0/0x1bb
>  [<c012b442>] kthread+0x37/0x59
>  [<c012b40b>] ? kthread+0x0/0x59
>  [<c0103667>] kernel_thread_helper+0x7/0x10

I assume this is

	while (commit_transaction->t_updates) {
		DEFINE_WAIT(wait);

		prepare_to_wait(&journal->j_wait_updates, &wait,
					TASK_UNINTERRUPTIBLE);
		if (commit_transaction->t_updates) {
			spin_unlock(&commit_transaction->t_handle_lock);
			spin_unlock(&journal->j_state_lock);
			schedule();

I'm wondering about

: commit e219cca082f52e7dfea41f3be264b7b5eb204227
: Author:     Theodore Ts'o <tytso@mit.edu>
: AuthorDate: Thu Nov 6 22:37:59 2008 -0500
: Commit:     Theodore Ts'o <tytso@mit.edu>
: CommitDate: Thu Nov 6 22:37:59 2008 -0500
:
:    jbd: don't give up looking for space so easily in __log_wait_for_space

but that patch was present in 2.6.28.



> pickup        D c01384df     0  2597   2594
>  cfaa9e5c 00000086 df9faa80 c01384df cfaa9e44 00000282 cfaa9e74 de32ca00
>  cfaa9e5c c012b8b7 00000002 de324800 0000014f cfaa9e74 cfaa9e94 c01c0539
>  00000000 de3248c8 de324910 de324814 00000000 df9faa80 c012b6ee de3248e4
> Call Trace:
>  [<c01384df>] ? trace_hardirqs_on+0xb/0xd
>  [<c012b8b7>] ? prepare_to_wait+0x42/0x4a
>  [<c01c0539>] log_wait_commit+0x90/0xf7
>  [<c012b6ee>] ? autoremove_wake_function+0x0/0x38
>  [<c01bba9d>] journal_stop+0x1c8/0x288
>  [<c01b4236>] __ext3_journal_stop+0x1c/0x38
>  [<c01aeb96>] ext3_delete_inode+0x90/0xc2
>  [<c01aeb06>] ? ext3_delete_inode+0x0/0xc2
>  [<c017ab82>] generic_delete_inode+0x72/0x100
>  [<c02c4ee1>] ? _spin_lock+0x3a/0x40
>  [<c017ad4c>] generic_drop_inode+0x13c/0x1da
>  [<c01d4068>] ? _atomic_dec_and_lock+0x10/0x38
>  [<c017a4e7>] iput+0x47/0x4e
>  [<c0173db0>] do_unlinkat+0xc1/0x137
>  [<c0102f87>] ? sysenter_exit+0xf/0x18
>  [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190
>  [<c0173e36>] sys_unlink+0x10/0x12
>  [<c0102f55>] sysenter_do_call+0x12/0x35
> qmgr          D c01384df     0  2598   2594
>  cfe23e60 00000082 df9fa200 c01384df cfe23e48 c02cb5c0 cfe23e78 de32cf00
>  cfe23e60 c012b8b7 00000002 de324800 0000014f cfe23e78 cfe23e98 c01c0539
>  00000000 de3248c8 de324910 de324814 00000000 df9fa200 c012b6ee cfaa9e80
> Call Trace:
>  [<c01384df>] ? trace_hardirqs_on+0xb/0xd
>  [<c012b8b7>] ? prepare_to_wait+0x42/0x4a
>  [<c01c0539>] log_wait_commit+0x90/0xf7
>  [<c012b6ee>] ? autoremove_wake_function+0x0/0x38
>  [<c01bba9d>] journal_stop+0x1c8/0x288
>  [<c01ac11e>] ? ext3_mark_iloc_dirty+0x1d8/0x34b
>  [<c01b4236>] __ext3_journal_stop+0x1c/0x38
>  [<c01b309d>] ext3_unlink+0x70/0x157
>  [<c0172344>] ? vfs_unlink+0x45/0xb0
>  [<c0172358>] vfs_unlink+0x59/0xb0
>  [<c0173e17>] do_unlinkat+0x128/0x137
>  [<c0102f87>] ? sysenter_exit+0xf/0x18
>  [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190
>  [<c0173e36>] sys_unlink+0x10/0x12
>  [<c0102f55>] sysenter_do_call+0x12/0x35
> local         D c01384df     0  2654   2594
>  de369c74 00000096 df9f8880 c01384df de369c5c 00000286 00000000 de183b80
>  de324814 de324800 de324814 dd037e80 de324800 de324880 de369cc4 c01bc306
>  c0138489 00000246 00000046 00000001 de324a0c de324814 de324814 df59c060
> Call Trace:
>  [<c01384df>] ? trace_hardirqs_on+0xb/0xd
>  [<c01bc306>] start_this_handle+0x1c2/0x361
>  [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190
>  [<c012b6ee>] ? autoremove_wake_function+0x0/0x38
>  [<c01bc54a>] journal_start+0xa5/0x100
>  [<c01b43ac>] ext3_journal_start_sb+0x48/0x4d
>  [<c01aec4c>] ext3_write_begin+0x84/0x1c6
>  [<c0139fe2>] ? __lock_acquire+0x32b/0xa21
>  [<c014749e>] generic_file_buffered_write+0xe1/0x292
>  [<c0147a94>] __generic_file_aio_write_nolock+0x25c/0x4ab
>  [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190
>  [<c0147f63>] generic_file_aio_write+0x66/0xda
>  [<c0138f6b>] ? validate_chain+0x3f5/0x1141
>  [<c01aad03>] ext3_file_write+0x27/0xa8
>  [<c016a972>] do_sync_write+0xcc/0x102
>  [<c012b6ee>] ? autoremove_wake_function+0x0/0x38
>  [<c016b16d>] vfs_write+0x8b/0x11f
>  [<c0102f87>] ? sysenter_exit+0xf/0x18
>  [<c016a8a6>] ? do_sync_write+0x0/0x102
>  [<c016b577>] sys_write+0x3d/0x64
>  [<c0102f55>] sysenter_do_call+0x12/0x35
> postdrop      D c01384df     0  2664   2663
>  cfcbfd6c 00000086 dd13f700 c01384df cfcbfd54 c02cb5c0 00000001 deedc780
>  c03a1690 c1402348 c03a1690 cfcbfd7c c1402348 cfcbfd90 cfcbfd9c c017a55b
>  dd758c48 00000007 00000000 dd13f700 c012b726 c1402364 c1402364 00152b13
> Call Trace:
>  [<c01384df>] ? trace_hardirqs_on+0xb/0xd
>  [<c017a55b>] __wait_on_freeing_inode+0x6d/0x88
>  [<c012b726>] ? wake_bit_function+0x0/0x47
>  [<c017a5b5>] find_inode_fast+0x3f/0x4a
>  [<c017ba05>] insert_inode_locked+0x50/0xeb
>  [<c01ab90b>] ext3_new_inode+0x738/0x88f
>  [<c01bc550>] ? journal_start+0xab/0x100
>  [<c01b259a>] ext3_create+0x59/0xbf
>  [<c01722c4>] vfs_create+0x75/0xb0
>  [<c02c4dda>] ? _spin_unlock+0x1d/0x20
>  [<c01b2541>] ? ext3_create+0x0/0xbf
>  [<c0174bc3>] do_filp_open+0x644/0x713
>  [<c02c4dda>] ? _spin_unlock+0x1d/0x20
>  [<c01692ce>] do_sys_open+0x45/0xce
>  [<c0102f87>] ? sysenter_exit+0xf/0x18
>  [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190
>  [<c01693a3>] sys_open+0x23/0x2b
>  [<c0102f55>] sysenter_do_call+0x12/0x35
> Sched Debug Version: v0.08, 2.6.29.2-test #1
> now at 166155.025991 msecs
>   .sysctl_sched_latency                    : 20.000000
>   .sysctl_sched_min_granularity            : 4.000000
>   .sysctl_sched_wakeup_granularity         : 5.000000
>   .sysctl_sched_child_runs_first           : 0.000001
>   .sysctl_sched_features                   : 24191
> 
> cpu#0, 2798.181 MHz
>   .nr_running                    : 2
>   .load                          : 2048
>   .nr_switches                   : 284843
>   .nr_load_updates               : 6497
>   .nr_uninterruptible            : 5
>   .jiffies                       : 4294953911
>   .next_balance                  : 0.000000
>   .curr->pid                     : 2466
>   .clock                         : 166154.370058
>   .cpu_load[0]                   : 0
>   .cpu_load[1]                   : 4
>   .cpu_load[2]                   : 76
>   .cpu_load[3]                   : 160
>   .cpu_load[4]                   : 196
> 
> cfs_rq[0]:
>   .exec_clock                    : 0.000000
>   .MIN_vruntime                  : 24407.116340
>   .min_vruntime                  : 24407.114218
>   .max_vruntime                  : 24407.116340
>   .spread                        : 0.000000
>   .spread0                       : 0.000000
>   .nr_running                    : 2
>   .load                          : 2048
>   .nr_spread_over                : 0
> 
> rt_rq[0]:
>   .rt_nr_running                 : 0
>   .rt_throttled                  : 0
>   .rt_time                       : 0.000000
>   .rt_runtime                    : 950.000000
> 
> runnable tasks:
>             task   PID         tree-key  switches  prio     exec-runtime      
>  sum-exec        sum-sleep
>
> ----------------------------------------------------------------------------------------------------------
>             Xorg  2163     24407.116340     28791   120               0       
>       0               0.000000               0.000000               0.000000
> R           bash  2466     24387.418829       274   120               0       
>       0               0.000000               0.000000               0.000000
> 
> 
> Steps to reproduce:
> 
> # mount -o remount,sync /var
>   (or wherever queue_directory points to in the Postfix config)
> 
> # while true; do sendmail testaddr@localhost </dev/null; done
>
Comment 2 Jan Kara 2009-05-12 16:56:04 UTC
  Hi,

> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).

> > SysRq : Show Blocked State
> >   task                PC stack   pid father
> > kjournald     D c01384df     0  2525      2
> >  cfcb5f0c 00000082 de27d500 c01384df cfcb5ef4 c02cb5c0 00000001 de32ca00
> >  de324814 dd037ebc de324814 de324934 dd037ebc cfcb5f5c cfcb5f90 c01bd4bb
> >  00000046 c0424110 de324a0c de324814 de324800 00000000 00000002 dd037e80
> > Call Trace:
> >  [<c01384df>] ? trace_hardirqs_on+0xb/0xd
> >  [<c01bd4bb>] journal_commit_transaction+0xea/0xeaf
> >  [<c02c534a>] ? _spin_unlock_irqrestore+0x38/0x3f
> >  [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190
> >  [<c012b6ee>] ? autoremove_wake_function+0x0/0x38
> >  [<c0122a25>] ? del_timer+0x50/0x59
> >  [<c01c0c75>] kjournald+0xad/0x1bb
> >  [<c012b6ee>] ? autoremove_wake_function+0x0/0x38
> >  [<c01c0bc8>] ? kjournald+0x0/0x1bb
> >  [<c012b442>] kthread+0x37/0x59
> >  [<c012b40b>] ? kthread+0x0/0x59
> >  [<c0103667>] kernel_thread_helper+0x7/0x10
> 
> I assume this is
> 
>       while (commit_transaction->t_updates) {
>               DEFINE_WAIT(wait);
> 
>               prepare_to_wait(&journal->j_wait_updates, &wait,
>                                       TASK_UNINTERRUPTIBLE);
>               if (commit_transaction->t_updates) {
>                       spin_unlock(&commit_transaction->t_handle_lock);
>                       spin_unlock(&journal->j_state_lock);
>                       schedule();
  Yes.

> I'm wondering about
> 
> : commit e219cca082f52e7dfea41f3be264b7b5eb204227
> : Author:     Theodore Ts'o <tytso@mit.edu>
> : AuthorDate: Thu Nov 6 22:37:59 2008 -0500
> : Commit:     Theodore Ts'o <tytso@mit.edu>
> : CommitDate: Thu Nov 6 22:37:59 2008 -0500
> :
> :    jbd: don't give up looking for space so easily in __log_wait_for_space
> 
> but that patch was present in 2.6.28.
  Hmm, I don't see what made this deadlock happening - as far as I can
see it's there for quite some time. See below...

> > pickup        D c01384df     0  2597   2594
> >  cfaa9e5c 00000086 df9faa80 c01384df cfaa9e44 00000282 cfaa9e74 de32ca00
> >  cfaa9e5c c012b8b7 00000002 de324800 0000014f cfaa9e74 cfaa9e94 c01c0539
> >  00000000 de3248c8 de324910 de324814 00000000 df9faa80 c012b6ee de3248e4
> > Call Trace:
> >  [<c01384df>] ? trace_hardirqs_on+0xb/0xd
> >  [<c012b8b7>] ? prepare_to_wait+0x42/0x4a
> >  [<c01c0539>] log_wait_commit+0x90/0xf7
> >  [<c012b6ee>] ? autoremove_wake_function+0x0/0x38
> >  [<c01bba9d>] journal_stop+0x1c8/0x288
> >  [<c01b4236>] __ext3_journal_stop+0x1c/0x38
> >  [<c01aeb96>] ext3_delete_inode+0x90/0xc2
> >  [<c01aeb06>] ? ext3_delete_inode+0x0/0xc2
> >  [<c017ab82>] generic_delete_inode+0x72/0x100
> >  [<c02c4ee1>] ? _spin_lock+0x3a/0x40
> >  [<c017ad4c>] generic_drop_inode+0x13c/0x1da
> >  [<c01d4068>] ? _atomic_dec_and_lock+0x10/0x38
> >  [<c017a4e7>] iput+0x47/0x4e
> >  [<c0173db0>] do_unlinkat+0xc1/0x137
> >  [<c0102f87>] ? sysenter_exit+0xf/0x18
> >  [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190
> >  [<c0173e36>] sys_unlink+0x10/0x12
> >  [<c0102f55>] sysenter_do_call+0x12/0x35
  In generic_delete_inode() we mark inode as I_FREEING. Then
ext3_delete_inode() is called and because of O_SYNC it starts a
transaction commit and waits for it.

> > postdrop      D c01384df     0  2664   2663
> >  cfcbfd6c 00000086 dd13f700 c01384df cfcbfd54 c02cb5c0 00000001 deedc780
> >  c03a1690 c1402348 c03a1690 cfcbfd7c c1402348 cfcbfd90 cfcbfd9c c017a55b
> >  dd758c48 00000007 00000000 dd13f700 c012b726 c1402364 c1402364 00152b13
> > Call Trace:
> >  [<c01384df>] ? trace_hardirqs_on+0xb/0xd
> >  [<c017a55b>] __wait_on_freeing_inode+0x6d/0x88
> >  [<c012b726>] ? wake_bit_function+0x0/0x47
> >  [<c017a5b5>] find_inode_fast+0x3f/0x4a
> >  [<c017ba05>] insert_inode_locked+0x50/0xeb
> >  [<c01ab90b>] ext3_new_inode+0x738/0x88f
> >  [<c01bc550>] ? journal_start+0xab/0x100
> >  [<c01b259a>] ext3_create+0x59/0xbf
> >  [<c01722c4>] vfs_create+0x75/0xb0
> >  [<c02c4dda>] ? _spin_unlock+0x1d/0x20
> >  [<c01b2541>] ? ext3_create+0x0/0xbf
> >  [<c0174bc3>] do_filp_open+0x644/0x713
> >  [<c02c4dda>] ? _spin_unlock+0x1d/0x20
> >  [<c01692ce>] do_sys_open+0x45/0xce
> >  [<c0102f87>] ? sysenter_exit+0xf/0x18
> >  [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190
> >  [<c01693a3>] sys_open+0x23/0x2b
> >  [<c0102f55>] sysenter_do_call+0x12/0x35
  Here, we have started a transaction in ext3_create() and then wait in
find_inode_fast() for I_FREEING to be cleared (obviously we have
reallocated the inode and squeezed the allocation before journal_stop()
from the delete was called).
  Nasty deadlock and I don't see how to fix it now - have to go home for
today... Tomorrow I'll have a look what we can do about it.

									Honza
Comment 3 Jan Kara 2009-05-13 13:48:04 UTC
> http://bugzilla.kernel.org/show_bug.cgi?id=13232
> 
> --- Comment #2 from Jan Kara <jack@suse.cz>  2009-05-12 16:56:04 ---
>   Hi,
> 
> > (switched to email.  Please respond via emailed reply-to-all, not via the
> > bugzilla web interface).
> 
> > > SysRq : Show Blocked State
> > >   task                PC stack   pid father
> > > kjournald     D c01384df     0  2525      2
> > >  cfcb5f0c 00000082 de27d500 c01384df cfcb5ef4 c02cb5c0 00000001 de32ca00
> > >  de324814 dd037ebc de324814 de324934 dd037ebc cfcb5f5c cfcb5f90 c01bd4bb
> > >  00000046 c0424110 de324a0c de324814 de324800 00000000 00000002 dd037e80
> > > Call Trace:
> > >  [<c01384df>] ? trace_hardirqs_on+0xb/0xd
> > >  [<c01bd4bb>] journal_commit_transaction+0xea/0xeaf
> > >  [<c02c534a>] ? _spin_unlock_irqrestore+0x38/0x3f
> > >  [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190
> > >  [<c012b6ee>] ? autoremove_wake_function+0x0/0x38
> > >  [<c0122a25>] ? del_timer+0x50/0x59
> > >  [<c01c0c75>] kjournald+0xad/0x1bb
> > >  [<c012b6ee>] ? autoremove_wake_function+0x0/0x38
> > >  [<c01c0bc8>] ? kjournald+0x0/0x1bb
> > >  [<c012b442>] kthread+0x37/0x59
> > >  [<c012b40b>] ? kthread+0x0/0x59
> > >  [<c0103667>] kernel_thread_helper+0x7/0x10
> > 
> > I assume this is
> > 
> >     while (commit_transaction->t_updates) {
> >             DEFINE_WAIT(wait);
> > 
> >             prepare_to_wait(&journal->j_wait_updates, &wait,
> >                                     TASK_UNINTERRUPTIBLE);
> >             if (commit_transaction->t_updates) {
> >                     spin_unlock(&commit_transaction->t_handle_lock);
> >                     spin_unlock(&journal->j_state_lock);
> >                     schedule();
>   Yes.
> 
> > I'm wondering about
> > 
> > : commit e219cca082f52e7dfea41f3be264b7b5eb204227
> > : Author:     Theodore Ts'o <tytso@mit.edu>
> > : AuthorDate: Thu Nov 6 22:37:59 2008 -0500
> > : Commit:     Theodore Ts'o <tytso@mit.edu>
> > : CommitDate: Thu Nov 6 22:37:59 2008 -0500
> > :
> > :    jbd: don't give up looking for space so easily in __log_wait_for_space
> > 
> > but that patch was present in 2.6.28.
>   Hmm, I don't see what made this deadlock happening - as far as I can
> see it's there for quite some time. See below...
> 
> > > pickup        D c01384df     0  2597   2594
> > >  cfaa9e5c 00000086 df9faa80 c01384df cfaa9e44 00000282 cfaa9e74 de32ca00
> > >  cfaa9e5c c012b8b7 00000002 de324800 0000014f cfaa9e74 cfaa9e94 c01c0539
> > >  00000000 de3248c8 de324910 de324814 00000000 df9faa80 c012b6ee de3248e4
> > > Call Trace:
> > >  [<c01384df>] ? trace_hardirqs_on+0xb/0xd
> > >  [<c012b8b7>] ? prepare_to_wait+0x42/0x4a
> > >  [<c01c0539>] log_wait_commit+0x90/0xf7
> > >  [<c012b6ee>] ? autoremove_wake_function+0x0/0x38
> > >  [<c01bba9d>] journal_stop+0x1c8/0x288
> > >  [<c01b4236>] __ext3_journal_stop+0x1c/0x38
> > >  [<c01aeb96>] ext3_delete_inode+0x90/0xc2
> > >  [<c01aeb06>] ? ext3_delete_inode+0x0/0xc2
> > >  [<c017ab82>] generic_delete_inode+0x72/0x100
> > >  [<c02c4ee1>] ? _spin_lock+0x3a/0x40
> > >  [<c017ad4c>] generic_drop_inode+0x13c/0x1da
> > >  [<c01d4068>] ? _atomic_dec_and_lock+0x10/0x38
> > >  [<c017a4e7>] iput+0x47/0x4e
> > >  [<c0173db0>] do_unlinkat+0xc1/0x137
> > >  [<c0102f87>] ? sysenter_exit+0xf/0x18
> > >  [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190
> > >  [<c0173e36>] sys_unlink+0x10/0x12
> > >  [<c0102f55>] sysenter_do_call+0x12/0x35
>   In generic_delete_inode() we mark inode as I_FREEING. Then
> ext3_delete_inode() is called and because of O_SYNC it starts a
> transaction commit and waits for it.
> 
> > > postdrop      D c01384df     0  2664   2663
> > >  cfcbfd6c 00000086 dd13f700 c01384df cfcbfd54 c02cb5c0 00000001 deedc780
> > >  c03a1690 c1402348 c03a1690 cfcbfd7c c1402348 cfcbfd90 cfcbfd9c c017a55b
> > >  dd758c48 00000007 00000000 dd13f700 c012b726 c1402364 c1402364 00152b13
> > > Call Trace:
> > >  [<c01384df>] ? trace_hardirqs_on+0xb/0xd
> > >  [<c017a55b>] __wait_on_freeing_inode+0x6d/0x88
> > >  [<c012b726>] ? wake_bit_function+0x0/0x47
> > >  [<c017a5b5>] find_inode_fast+0x3f/0x4a
> > >  [<c017ba05>] insert_inode_locked+0x50/0xeb
> > >  [<c01ab90b>] ext3_new_inode+0x738/0x88f
> > >  [<c01bc550>] ? journal_start+0xab/0x100
> > >  [<c01b259a>] ext3_create+0x59/0xbf
> > >  [<c01722c4>] vfs_create+0x75/0xb0
> > >  [<c02c4dda>] ? _spin_unlock+0x1d/0x20
> > >  [<c01b2541>] ? ext3_create+0x0/0xbf
> > >  [<c0174bc3>] do_filp_open+0x644/0x713
> > >  [<c02c4dda>] ? _spin_unlock+0x1d/0x20
> > >  [<c01692ce>] do_sys_open+0x45/0xce
> > >  [<c0102f87>] ? sysenter_exit+0xf/0x18
> > >  [<c0138489>] ? trace_hardirqs_on_caller+0x145/0x190
> > >  [<c01693a3>] sys_open+0x23/0x2b
> > >  [<c0102f55>] sysenter_do_call+0x12/0x35
>   Here, we have started a transaction in ext3_create() and then wait in
> find_inode_fast() for I_FREEING to be cleared (obviously we have
> reallocated the inode and squeezed the allocation before journal_stop()
> from the delete was called).
>   Nasty deadlock and I don't see how to fix it now - have to go home for
> today... Tomorrow I'll have a look what we can do about it.
  OK, the deadlock has been introduced by ext3 variant of
261bca86ed4f7f391d1938167624e78da61dcc6b (adding Al to CC). The deadlock
is really tough to avoid - we have to first allocate inode on disk so
that we know the inode number. For this we need transaction open but we
cannot afford waiting for old inode with same INO to be freed when we have
transaction open because of the above deadlock. So we'd have to wait for
inode release only after everything is done and we closed the transaction. But
that would mean reordering a lot of code in ext3/namei.c so that all the
dcache handling is done after all the IO is done.
  Hmm, maybe we could change the delete side of the deadlock but that's
going to be tricky as well :(.
  Al, any idea if we could somehow get away without waiting on
I_FREEING?

									Honza
Comment 4 Theodore Tso 2009-05-13 16:07:33 UTC
On Wed, May 13, 2009 at 03:48:02PM +0200, Jan Kara wrote:
>   OK, the deadlock has been introduced by ext3 variant of
> 261bca86ed4f7f391d1938167624e78da61dcc6b (adding Al to CC).

What do you mean by this?

I'm puzzled why we haven't hit this before.  This looks like
long-standing issue; what unmasked it now?

> The deadlock
> is really tough to avoid - we have to first allocate inode on disk so
> that we know the inode number.

Well, the simple thing to do is to have a way of quickly determining
that a particular inode number is in the I_FREEING state, and simply
try to avoid using that inode number.  If there are no inodes
available, it can simply close the handle (since nothing else has
changed at that point), wait for the current transaction to close, and
then try again.  That should fix the problem, I think.

     	 	      	     	     	      - Ted
Comment 5 Vincent Li 2009-05-13 16:18:57 UTC
I tried to run Postfix on ubuntu 9.04 server version and followed the reporduce
command, had the same problem.

kernel version: 2.6.30-rc4
Distribution: Ubuntu 9.04 
Software Environment: Postfix 2.5.1-2ubuntu1.2

also have following message poping up on control terminal:

[844,700070] INFO: task kjournald: 628 blocked for more than 120 seconds,
[844,700148] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
meassage.
Comment 6 Anonymous Emailer 2009-05-13 16:52:56 UTC
Reply-To: viro@ZenIV.linux.org.uk

On Wed, May 13, 2009 at 03:48:02PM +0200, Jan Kara wrote:
> >   Here, we have started a transaction in ext3_create() and then wait in
> > find_inode_fast() for I_FREEING to be cleared (obviously we have
> > reallocated the inode and squeezed the allocation before journal_stop()
> > from the delete was called).
> >   Nasty deadlock and I don't see how to fix it now - have to go home for
> > today... Tomorrow I'll have a look what we can do about it.
>   OK, the deadlock has been introduced by ext3 variant of
> 261bca86ed4f7f391d1938167624e78da61dcc6b (adding Al to CC). The deadlock
> is really tough to avoid - we have to first allocate inode on disk so
> that we know the inode number. For this we need transaction open but we
> cannot afford waiting for old inode with same INO to be freed when we have
> transaction open because of the above deadlock. So we'd have to wait for
> inode release only after everything is done and we closed the transaction.
> But
> that would mean reordering a lot of code in ext3/namei.c so that all the
> dcache handling is done after all the IO is done.
>   Hmm, maybe we could change the delete side of the deadlock but that's
> going to be tricky as well :(.
>   Al, any idea if we could somehow get away without waiting on
> I_FREEING?

At which point do we actually run into deadlock on delete side?  We could,
in principle, skip everything like that in insert_inode_locked(), but
I would rather avoid the "two inodes in icache at the same time, with the
same inumber" situations completely.  We might get away with that, since
everything else *will* wait, so we can afford a bunch of inodes past the
point in foo_delete_inode() that has cleared it in bitmap + new locked
one, but if it's at all possible to avoid, I'd rather avoid it.
Comment 7 Anonymous Emailer 2009-05-13 18:13:42 UTC
Reply-To: viro@ZenIV.linux.org.uk

On Wed, May 13, 2009 at 05:52:54PM +0100, Al Viro wrote:
> On Wed, May 13, 2009 at 03:48:02PM +0200, Jan Kara wrote:
> > >   Here, we have started a transaction in ext3_create() and then wait in
> > > find_inode_fast() for I_FREEING to be cleared (obviously we have
> > > reallocated the inode and squeezed the allocation before journal_stop()
> > > from the delete was called).
> > >   Nasty deadlock and I don't see how to fix it now - have to go home for
> > > today... Tomorrow I'll have a look what we can do about it.
> >   OK, the deadlock has been introduced by ext3 variant of
> > 261bca86ed4f7f391d1938167624e78da61dcc6b (adding Al to CC). The deadlock
> > is really tough to avoid - we have to first allocate inode on disk so
> > that we know the inode number. For this we need transaction open but we
> > cannot afford waiting for old inode with same INO to be freed when we have
> > transaction open because of the above deadlock. So we'd have to wait for
> > inode release only after everything is done and we closed the transaction.
> But
> > that would mean reordering a lot of code in ext3/namei.c so that all the
> > dcache handling is done after all the IO is done.
> >   Hmm, maybe we could change the delete side of the deadlock but that's
> > going to be tricky as well :(.
> >   Al, any idea if we could somehow get away without waiting on
> > I_FREEING?
> 
> At which point do we actually run into deadlock on delete side?  We could,
> in principle, skip everything like that in insert_inode_locked(), but
> I would rather avoid the "two inodes in icache at the same time, with the
> same inumber" situations completely.  We might get away with that, since
> everything else *will* wait, so we can afford a bunch of inodes past the
> point in foo_delete_inode() that has cleared it in bitmap + new locked
> one, but if it's at all possible to avoid, I'd rather avoid it.

OK, that's probably the easiest way to do that, as much as I don't like it...
Since iget() et.al. will not accept I_FREEING (will wait to go away
and restart), and since we'd better have serialization between new/free
on fs data structures anyway, we can afford simply skipping I_FREEING
et.al. in insert_inode_locked().

We do that from new_inode, so it won't race with free_inode in any interesting
ways and it won't race with iget (of any origin; nfsd or in case of fs
corruption a lookup) since both still will wait for I_LOCK.

Tentative patch follow; folks, I would very much like review on that one,
since I'm far too low on caffeine and the area is nasty.

diff --git a/fs/inode.c b/fs/inode.c
index 9d26490..4406952 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1053,13 +1053,22 @@ int insert_inode_locked(struct inode *inode)
 	struct super_block *sb = inode->i_sb;
 	ino_t ino = inode->i_ino;
 	struct hlist_head *head = inode_hashtable + hash(sb, ino);
-	struct inode *old;
 
 	inode->i_state |= I_LOCK|I_NEW;
 	while (1) {
+		struct hlist_node *node;
+		struct inode *old = NULL;
 		spin_lock(&inode_lock);
-		old = find_inode_fast(sb, head, ino);
-		if (likely(!old)) {
+		hlist_for_each_entry(old, node, head, i_hash) {
+			if (old->i_ino != ino)
+				continue;
+			if (old->i_sb != sb)
+				continue;
+			if (old->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
+				continue;
+			break;
+		}
+		if (likely(!node)) {
 			hlist_add_head(&inode->i_hash, head);
 			spin_unlock(&inode_lock);
 			return 0;
@@ -1081,14 +1090,24 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 {
 	struct super_block *sb = inode->i_sb;
 	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
-	struct inode *old;
 
 	inode->i_state |= I_LOCK|I_NEW;
 
 	while (1) {
+		struct hlist_node *node;
+		struct inode *old = NULL;
+
 		spin_lock(&inode_lock);
-		old = find_inode(sb, head, test, data);
-		if (likely(!old)) {
+		hlist_for_each_entry(old, node, head, i_hash) {
+			if (old->i_sb != sb)
+				continue;
+			if (!test(old, data))
+				continue;
+			if (old->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
+				continue;
+			break;
+		}
+		if (likely(!node)) {
 			hlist_add_head(&inode->i_hash, head);
 			spin_unlock(&inode_lock);
 			return 0;
Comment 8 Jan Kara 2009-05-18 12:45:23 UTC
On Wed 13-05-09 12:07:24, Theodore Tso wrote:
> On Wed, May 13, 2009 at 03:48:02PM +0200, Jan Kara wrote:
> >   OK, the deadlock has been introduced by ext3 variant of
> > 261bca86ed4f7f391d1938167624e78da61dcc6b (adding Al to CC).
> 
> What do you mean by this?
> 
> I'm puzzled why we haven't hit this before.  This looks like
> long-standing issue; what unmasked it now?
  Unless you mount the fs with 'sync' option, hitting this is much harder
(the window is quite small in nosync case). I think that is the main reason
why we didn't see this earlier.

> > The deadlock
> > is really tough to avoid - we have to first allocate inode on disk so
> > that we know the inode number.
> 
> Well, the simple thing to do is to have a way of quickly determining
> that a particular inode number is in the I_FREEING state, and simply
> try to avoid using that inode number.  If there are no inodes
> available, it can simply close the handle (since nothing else has
> changed at that point), wait for the current transaction to close, and
> then try again.  That should fix the problem, I think.
  Yes, we could work-around it like that but other filesystems might need
similar things and generally it would be nicer if we could avoid using this
vfs-internal information in the filesystems. Al seems to have found some
other solution without changing filesystems so that would be easier for us...

									Honza
Comment 9 Jan Kara 2009-05-18 12:54:00 UTC
On Wed 13-05-09 17:52:54, Al Viro wrote:
> On Wed, May 13, 2009 at 03:48:02PM +0200, Jan Kara wrote:
> > >   Here, we have started a transaction in ext3_create() and then wait in
> > > find_inode_fast() for I_FREEING to be cleared (obviously we have
> > > reallocated the inode and squeezed the allocation before journal_stop()
> > > from the delete was called).
> > >   Nasty deadlock and I don't see how to fix it now - have to go home for
> > > today... Tomorrow I'll have a look what we can do about it.
> >   OK, the deadlock has been introduced by ext3 variant of
> > 261bca86ed4f7f391d1938167624e78da61dcc6b (adding Al to CC). The deadlock
> > is really tough to avoid - we have to first allocate inode on disk so
> > that we know the inode number. For this we need transaction open but we
> > cannot afford waiting for old inode with same INO to be freed when we have
> > transaction open because of the above deadlock. So we'd have to wait for
> > inode release only after everything is done and we closed the transaction.
> But
> > that would mean reordering a lot of code in ext3/namei.c so that all the
> > dcache handling is done after all the IO is done.
> >   Hmm, maybe we could change the delete side of the deadlock but that's
> > going to be tricky as well :(.
> >   Al, any idea if we could somehow get away without waiting on
> > I_FREEING?
> 
> At which point do we actually run into deadlock on delete side?  We could,
> in principle, skip everything like that in insert_inode_locked(), but
> I would rather avoid the "two inodes in icache at the same time, with the
> same inumber" situations completely.  We might get away with that, since
> everything else *will* wait, so we can afford a bunch of inodes past the
> point in foo_delete_inode() that has cleared it in bitmap + new locked
> one, but if it's at all possible to avoid, I'd rather avoid it.
  The ordering we see on delete when the filesystem is mounted with 'sync'
option is:
  DELETE					CREATE
generic_delete_inode()
  set I_FREEING
  ext3_delete_inode
    get transaction handle
    do work
						get transaction handle
						ext3_new_inode()
						  reallocate inode
						  insert_inode_locked()
    stop transaction, wait for it to commit
    (waiting for CREATE process to drop its
    transaction reference)

  Now similar race can happen even without 'sync' mount option but it's
much harder to hit:
  DELETE					CREATE
generic_delete_inode()
  set I_FREEING
  ext3_delete_inode
						get transaction handle
						ext3_new_inode()
						  reallocate inode
						  insert_inode_locked()
    try to get transaction handle -
      - transaction is too big so we send
      current transaction to commit which
      again waits for CREATE to drop its
      reference.

								Honza
Comment 10 Theodore Tso 2009-05-18 13:16:04 UTC
On Wed, May 13, 2009 at 07:13:40PM +0100, Al Viro wrote:
> 
> OK, that's probably the easiest way to do that, as much as I don't like it...
> Since iget() et.al. will not accept I_FREEING (will wait to go away
> and restart), and since we'd better have serialization between new/free
> on fs data structures anyway, we can afford simply skipping I_FREEING
> et.al. in insert_inode_locked().
> 
> We do that from new_inode, so it won't race with free_inode in any
> interesting
> ways and it won't race with iget (of any origin; nfsd or in case of fs
> corruption a lookup) since both still will wait for I_LOCK.
> 
> Tentative patch follow; folks, I would very much like review on that one,
> since I'm far too low on caffeine and the area is nasty.

Sorry for not having time to review this until now.  This looks good
to me.

Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>

So Bug #13232 is currently marked as a 2.6.28 regression; do we feel
confident enough to push this to Linus for 2.6.30?

	  	    	      	 - Ted
Comment 11 Jan Kara 2009-05-18 14:10:15 UTC
On Wed 13-05-09 19:13:40, Al Viro wrote:
> On Wed, May 13, 2009 at 05:52:54PM +0100, Al Viro wrote:
> > On Wed, May 13, 2009 at 03:48:02PM +0200, Jan Kara wrote:
> > > >   Here, we have started a transaction in ext3_create() and then wait in
> > > > find_inode_fast() for I_FREEING to be cleared (obviously we have
> > > > reallocated the inode and squeezed the allocation before journal_stop()
> > > > from the delete was called).
> > > >   Nasty deadlock and I don't see how to fix it now - have to go home
> for
> > > > today... Tomorrow I'll have a look what we can do about it.
> > >   OK, the deadlock has been introduced by ext3 variant of
> > > 261bca86ed4f7f391d1938167624e78da61dcc6b (adding Al to CC). The deadlock
> > > is really tough to avoid - we have to first allocate inode on disk so
> > > that we know the inode number. For this we need transaction open but we
> > > cannot afford waiting for old inode with same INO to be freed when we
> have
> > > transaction open because of the above deadlock. So we'd have to wait for
> > > inode release only after everything is done and we closed the
> transaction. But
> > > that would mean reordering a lot of code in ext3/namei.c so that all the
> > > dcache handling is done after all the IO is done.
> > >   Hmm, maybe we could change the delete side of the deadlock but that's
> > > going to be tricky as well :(.
> > >   Al, any idea if we could somehow get away without waiting on
> > > I_FREEING?
> > 
> > At which point do we actually run into deadlock on delete side?  We could,
> > in principle, skip everything like that in insert_inode_locked(), but
> > I would rather avoid the "two inodes in icache at the same time, with the
> > same inumber" situations completely.  We might get away with that, since
> > everything else *will* wait, so we can afford a bunch of inodes past the
> > point in foo_delete_inode() that has cleared it in bitmap + new locked
> > one, but if it's at all possible to avoid, I'd rather avoid it.
> 
> OK, that's probably the easiest way to do that, as much as I don't like it...
> Since iget() et.al. will not accept I_FREEING (will wait to go away
> and restart), and since we'd better have serialization between new/free
> on fs data structures anyway, we can afford simply skipping I_FREEING
> et.al. in insert_inode_locked().
> 
> We do that from new_inode, so it won't race with free_inode in any
> interesting
> ways and it won't race with iget (of any origin; nfsd or in case of fs
> corruption a lookup) since both still will wait for I_LOCK.
> 
> Tentative patch follow; folks, I would very much like review on that one,
> since I'm far too low on caffeine and the area is nasty.
  The patch looks fine. Everyone else will either get new inode and wait
for I_LOCK or get old inode and wait for I_FREEING so everything should be
fine... You can add.
  Acked-by: Jan Kara <jack@suse.cz>

									Honza
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 9d26490..4406952 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1053,13 +1053,22 @@ int insert_inode_locked(struct inode *inode)
>       struct super_block *sb = inode->i_sb;
>       ino_t ino = inode->i_ino;
>       struct hlist_head *head = inode_hashtable + hash(sb, ino);
> -     struct inode *old;
>  
>       inode->i_state |= I_LOCK|I_NEW;
>       while (1) {
> +             struct hlist_node *node;
> +             struct inode *old = NULL;
>               spin_lock(&inode_lock);
> -             old = find_inode_fast(sb, head, ino);
> -             if (likely(!old)) {
> +             hlist_for_each_entry(old, node, head, i_hash) {
> +                     if (old->i_ino != ino)
> +                             continue;
> +                     if (old->i_sb != sb)
> +                             continue;
> +                     if (old->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
> +                             continue;
> +                     break;
> +             }
> +             if (likely(!node)) {
>                       hlist_add_head(&inode->i_hash, head);
>                       spin_unlock(&inode_lock);
>                       return 0;
> @@ -1081,14 +1090,24 @@ int insert_inode_locked4(struct inode *inode,
> unsigned long hashval,
>  {
>       struct super_block *sb = inode->i_sb;
>       struct hlist_head *head = inode_hashtable + hash(sb, hashval);
> -     struct inode *old;
>  
>       inode->i_state |= I_LOCK|I_NEW;
>  
>       while (1) {
> +             struct hlist_node *node;
> +             struct inode *old = NULL;
> +
>               spin_lock(&inode_lock);
> -             old = find_inode(sb, head, test, data);
> -             if (likely(!old)) {
> +             hlist_for_each_entry(old, node, head, i_hash) {
> +                     if (old->i_sb != sb)
> +                             continue;
> +                     if (!test(old, data))
> +                             continue;
> +                     if (old->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
> +                             continue;
> +                     break;
> +             }
> +             if (likely(!node)) {
>                       hlist_add_head(&inode->i_hash, head);
>                       spin_unlock(&inode_lock);
>                       return 0;
Comment 12 Theodore Tso 2009-05-19 20:38:32 UTC
Created attachment 21436 [details]
Proposed patch from Al Viro
Comment 13 Theodore Tso 2009-06-07 19:56:04 UTC
This patch is now in the upstream kernel, so it will be in 2.6.30.