Latest working kernel version: 2.6.25, maybe up to 2.6.25-git6
Earliest failing kernel version: 2.6.25-git7
Hardware Environment: GA-P35-DS3R mobo, Intel E4300 CPU, 2 GB RAM, 500 GB WD SATA HDD
Software Environment: KDE desktop
Problem Description: some processes get into a stall, effectively killing all write access to the root partition (XFS); a lot of messeges in the logs like this (if I wait long enough) :
[ 1032.940632] INFO: task xfsdatad/0:317 blocked for more than 120 seconds.
[ 1032.940638] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1032.940642] xfsdatad/0 D ffff81000102cdc0 0 317 2
[ 1032.940683] ffff81007f9bbdd0 0000000000000046 0000000000000000 0000000000000000
[ 1032.940691] ffff81007fb335a0 ffffffff8089c4a0 ffff81007fb338e8 000000010000ea09
[ 1032.940697] 0000000000000000 0000000000000000 0000000000000003 0000000000000000
[ 1032.940703] Call Trace:
[ 1032.940711] [<ffffffff803a1f60>] xfs_end_bio_delalloc+0x0/0x20
[ 1032.940717] [<ffffffff806b8a29>] __down_write_nested+0x79/0xc0
[ 1032.940800] [<ffffffff8037f125>] xfs_ilock+0xa5/0xe0
[ 1032.940811] [<ffffffff803a1db0>] xfs_setfilesize+0x40/0xc0
[ 1032.940814] [<ffffffff803a1f70>] xfs_end_bio_delalloc+0x10/0x20
[ 1032.940817] [<ffffffff8024c8f0>] run_workqueue+0x140/0x220
[ 1032.940820] [<ffffffff8024caa0>] worker_thread+0x0/0xd0
[ 1032.940822] [<ffffffff8024cb31>] worker_thread+0x91/0xd0
[ 1032.940825] [<ffffffff80250840>] autoremove_wake_function+0x0/0x30
[ 1032.940828] [<ffffffff8024caa0>] worker_thread+0x0/0xd0
[ 1032.940830] [<ffffffff8024caa0>] worker_thread+0x0/0xd0
[ 1032.940832] [<ffffffff802503ab>] kthread+0x4b/0x80
[ 1032.940835] [<ffffffff8020c428>] child_rip+0xa/0x12
[ 1032.940837] [<ffffffff802504f4>] kthreadd+0x114/0x1a0
[ 1032.940839] [<ffffffff80250360>] kthread+0x0/0x80
[ 1032.940940] [<ffffffff8020c41e>] child_rip+0x0/0x12
[ 1032.940943] INFO: lockdep is turned off.
[ 1032.940953] INFO: task pdflush:27505 blocked for more than 120 seconds.
[ 1032.940955] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1032.940957] pdflush D ffff81000102cdc0 0 27505 2
[ 1032.940960] ffff810065bef4c0 0000000000000046 0000000000000000 ffffffff8040ce06
[ 1032.940964] ffff81006dad5960 ffffffff8089c4a0 ffff81006dad5ca8 000000010000e9f2
[ 1032.940968] 0000000000036a20 000000008038c9f7 0000000000000003 ffffffff8038a767
[ 1032.940972] Call Trace:
[ 1032.940976] [<ffffffff8040ce06>] __spin_lock_init+0x36/0x70
[ 1032.940979] [<ffffffff8038a767>] xlog_grant_push_ail+0x47/0x160
[ 1032.940982] [<ffffffff806b8a29>] __down_write_nested+0x79/0xc0
[ 1032.940984] [<ffffffff8037f125>] xfs_ilock+0xa5/0xe0
[ 1032.940987] [<ffffffff8038661b>] xfs_iomap_write_allocate+0x11b/0x3c0
[ 1032.940990] [<ffffffff806b8ea1>] _spin_lock_irqsave+0x41/0x60
[ 1032.940993] [<ffffffff8038742e>] xfs_iomap+0x23e/0x2d0
[ 1032.940995] [<ffffffff803a2067>] xfs_map_blocks+0x37/0x90
[ 1032.940997] [<ffffffff803a3576>] xfs_page_state_convert+0x296/0x640
[ 1032.941001] [<ffffffff80253635>] ktime_get_ts+0x25/0x60
[ 1032.941003] [<ffffffff806b9519>] _spin_unlock+0x29/0x50
[ 1032.941006] [<ffffffff8025367c>] ktime_get+0xc/0x50
[ 1032.941008] [<ffffffff803a3a58>] xfs_vm_writepage+0x68/0x110
[ 1032.941012] [<ffffffff8027800e>] shrink_page_list+0x52e/0x680
[ 1032.941015] [<ffffffff803ec57d>] blk_recount_segments+0x3d/0x80
[ 1032.941018] [<ffffffff8026fc7b>] mempool_alloc+0x4b/0x140
[ 1032.941020] [<ffffffff80277771>] isolate_lru_pages+0x1a1/0x240
[ 1032.941023] [<ffffffff802782c4>] shrink_inactive_list+0x164/0x450
[ 1032.941026] [<ffffffff80278993>] shrink_zone+0xb3/0x130
[ 1032.941028] [<ffffffff8027919f>] try_to_free_pages+0x24f/0x3d0
[ 1032.941031] [<ffffffff80277810>] isolate_pages_global+0x0/0x40
[ 1032.941034] [<ffffffff802728f5>] __alloc_pages_internal+0x1b5/0x460
[ 1032.941036] [<ffffffff80272c35>] __get_free_pages+0x15/0x60
[ 1032.941038] [<ffffffff803a1b5b>] kmem_alloc+0x5b/0x100
[ 1032.941041] [<ffffffff8038410a>] xfs_iflush_cluster+0x4a/0x3b0
[ 1032.941043] [<ffffffff806b9519>] _spin_unlock+0x29/0x50
[ 1032.941046] [<ffffffff80383049>] xfs_iflush_int+0x2d9/0x340
[ 1032.941048] [<ffffffff803846e0>] xfs_iflush+0x270/0x310
[ 1032.941052] [<ffffffff8039bea1>] xfs_inode_flush+0xb1/0xe0
[ 1032.941055] [<ffffffff803ab8d5>] xfs_fs_write_inode+0x25/0x70
[ 1032.941058] [<ffffffff802b901f>] __writeback_single_inode+0x25f/0x350
[ 1032.941061] [<ffffffff806b9519>] _spin_unlock+0x29/0x50
[ 1032.941064] [<ffffffff803899aa>] xfs_log_need_covered+0x7a/0xd0
[ 1032.941066] [<ffffffff802b9577>] sync_sb_inodes+0x207/0x310
[ 1032.941069] [<ffffffff802b98d2>] writeback_inodes+0xa2/0xf0
[ 1032.941071] [<ffffffff80273df6>] wb_kupdate+0xa6/0x120
[ 1032.941073] [<ffffffff80274ee0>] pdflush+0x0/0x1f0
[ 1032.941076] [<ffffffff80274ee0>] pdflush+0x0/0x1f0
[ 1032.941078] [<ffffffff80275001>] pdflush+0x121/0x1f0
[ 1032.941080] [<ffffffff80273d50>] wb_kupdate+0x0/0x120
[ 1032.941082] [<ffffffff802503ab>] kthread+0x4b/0x80
[ 1032.941084] [<ffffffff8020c428>] child_rip+0xa/0x12
[ 1032.941087] [<ffffffff802504f4>] kthreadd+0x114/0x1a0
[ 1032.941089] [<ffffffff80250360>] kthread+0x0/0x80
[ 1032.941091] [<ffffffff8020c41e>] child_rip+0x0/0x12
[ 1032.941093] INFO: lockdep is turned off.
After this - the system becomes unusable - I've been able for example to switch to the text console, and issue a "reboot" command - it didn't work; if I try to start SeaMonkey no long enough after the stall occurs - it works, but not for long - it stalls too if I try to something, requiring write access to the root partition; I tried "sync"-ing - it blocks, too.
Steps to reproduce: not sure exactly; to me it looks like a "git pull" command is triggering the problem; of all the other programs running maybe notable is ktorrent, but it happened once even when ktorrent was not started; On the system the problem occurs there are only XFS and NTFS partitions - I used the NTFS to save some of dmesg's output, since the XFS ones were of no use
See also: http://lkml.org/lkml/2008/4/30/176 - this is an email I sent to ask for help on how to debug the problem, but as I see it - all the devs are busy right now due to the merge window - hence, this bug-report
Created attachment 15996 [details]
dmesg output #1
dmesg output #1
Created attachment 15997 [details]
dmesg output #2
dmesg output #2
Created attachment 15998 [details]
kernel config used
kernel config used
Just to address you lkml post - you should post XFS bug reports
to email@example.com, otherwise they may not be seen. if you do
post to lkml, putting "XFS" in the subject line means it will
probably be noticed by the XFS folk. Better is "[BUG, XFS] <rest of subject>".
As it is, it looks this commit:
is using the wrong allocation primitive - it's allowing memory
reclaim to recurse into the filesystem and then deadlocking.
I'll attach a patch to fix it in a second.
BTW, thanks for reporting the bug with all the relevant info.
Created attachment 15999 [details]
Don't allow memory reclaim to recurse into the filesystem in inode cluster writeback.
This patch should fix it. Can you apply it, test, and post the results
to the bugzilla?
I did not put "XFS" in the subject line in my lkml message because
I wasn't sure its a XFS problem - I was hoping the traces would be
seen by someone, who can at least point me in the right direction...
I'm glad that what I included in my posts was enough for you to see
a cause for my troubles.
Currently testing the proposed patch. Will get back to you/bugzilla
Hello again, Dave!
I can confirm that with the patch you supplied on top of the
latest git tree from Linus, a have no problems whatsoever :
just finished five recompiles one after the other, while
ktorrent was running in the background, and its recompiling
the kernel as I write this!
Nice work, and again,
thanks for your time!
Ok, good to hear. I'll get this out for review and into the main
tree as quickly as possible. I'll close this bug out now. Thanks
for testing so quickly. ;)
Regression list annotation:
Handled-By : Dave Chinner <firstname.lastname@example.org>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=15999
The problem is present in 2.6.26-rc3.
References : http://lkml.org/lkml/2008/5/19/20
fixed by commit 49383b0e98ad1f69ff4c816eb1961f703df12318