Since 2.6.33.1 I am seeing various process lockups like this:

INFO: task crond:3509 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
crond         D ffff880008c0f948     0  3509      1 0x00000000
 ffff880221ae3498 0000000000000086 ffff880221ae3460 ffff880221ae345c
 0000000000000000 ffff880008c0f948 ffff880008d152c0 ffff88022504df70
 00000001002ef773 ffff880221ae3fd8 ffff880221ae2000 ffff880221ae2000
Call Trace:
 [<ffffffffa0554290>] ? nfs_wait_bit_uninterruptible+0x0/0x10 [nfs]
 [<ffffffff8135970e>] io_schedule+0x6e/0xb0
 [<ffffffffa0554299>] nfs_wait_bit_uninterruptible+0x9/0x10 [nfs]
 [<ffffffff81359e77>] __wait_on_bit+0x57/0x80
 [<ffffffffa0554290>] ? nfs_wait_bit_uninterruptible+0x0/0x10 [nfs]
 [<ffffffff81359f13>] out_of_line_wait_on_bit+0x73/0x90
 [<ffffffff81073090>] ? wake_bit_function+0x0/0x40
 [<ffffffffa0554287>] nfs_wait_on_request+0x27/0x30 [nfs]
 [<ffffffffa05593d0>] nfs_sync_mapping_wait+0x120/0x280 [nfs]
 [<ffffffffa0559ba9>] nfs_wb_page+0x79/0xd0 [nfs]
 [<ffffffffa05488bf>] nfs_release_page+0x5f/0x80 [nfs]
 [<ffffffff810dabfb>] try_to_release_page+0x2b/0x50
 [<ffffffff810e8af3>] shrink_page_list+0x453/0x5f0
 [<ffffffff810e8f59>] shrink_inactive_list+0x2c9/0x770
 [<ffffffff810088ee>] ? __switch_to+0x29e/0x360
 [<ffffffff810e3e45>] ? determine_dirtyable_memory+0x15/0x30
 [<ffffffff810e3ee2>] ? get_dirty_limits+0x22/0x2e0
 [<ffffffff810e9671>] shrink_zone+0x271/0x460
 [<ffffffff810e9a2a>] ? shrink_slab+0x13a/0x170
 [<ffffffff810ea47e>] do_try_to_free_pages+0xde/0x420
 [<ffffffff810ea94b>] try_to_free_pages+0x6b/0x70
 [<ffffffff810e7df0>] ? isolate_pages_global+0x0/0x270
 [<ffffffff810e2b6d>] __alloc_pages_nodemask+0x41d/0x700
 [<ffffffff810e4c90>] __do_page_cache_readahead+0x110/0x280
 [<ffffffff810e4e1c>] ra_submit+0x1c/0x20
 [<ffffffff810dc7df>] filemap_fault+0x37f/0x440
 [<ffffffff810f625f>] __do_fault+0x4f/0x590
 [<ffffffff8107735c>] ? lock_hrtimer_base+0x2c/0x60
 [<ffffffff810f8a0b>] handle_mm_fault+0x19b/0xa20
 [<ffffffff8135af60>] ? do_nanosleep+0x40/0xd0
 [<ffffffff81077a5f>] ? hrtimer_nanosleep+0xbf/0x180
 [<ffffffff81031683>] do_page_fault+0x133/0x320
 [<ffffffff8135c965>] page_fault+0x25/0x30

I can almost always reproduce the lockup by using tab completion in bash on an NFS mount. The NFS server is running the latest 2.6.27.45. I haven't seen this with the initial 2.6.33 release, but I did not run it for long, so I may simply have missed it there; I'll try to reproduce it on 2.6.33 as well. Maybe it's related to the stable fix:

Trond Myklebust (2):
  SUNRPC: Handle EINVAL error returns from the TCP connect operation
  NFS: Fix an allocation-under-spinlock bug

 fs/nfs/dns_resolve.c | 18
 fs/nfsd/nfs4state.c  | 2
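For reference, the 120-second threshold in the message above comes from the hung-task watchdog and can be inspected or changed through /proc (a minimal sketch; the 600-second value is just an example):

  # show the current hung-task timeout in seconds (0 disables the check)
  cat /proc/sys/kernel/hung_task_timeout_secs
  # raise the timeout instead of silencing it, so real hangs still get reported
  echo 600 > /proc/sys/kernel/hung_task_timeout_secs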
The above just means that the thread is waiting for another write request to complete. To figure out why it is hanging, you need to look at the other threads, find the one that is actually responsible for doing the writeout, and see why that one is stuck. That said, there is a known problem with a deadlock in shrink_page_list(), and I've got a patch queued up to fix it. Could you perhaps give it a try and see if it fixes your hang?
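If it helps with collecting that information, one way to get the stacks of all the other blocked threads at the moment of the hang is the SysRq interface (a sketch, assuming the kernel was built with CONFIG_MAGIC_SYSRQ; the output file name is just an example):

  # make sure the sysrq trigger is enabled
  echo 1 > /proc/sys/kernel/sysrq
  # dump the stack of every task stuck in uninterruptible (D) sleep to the kernel log
  echo w > /proc/sysrq-trigger
  # capture the result
  dmesg > blocked-tasks.txt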
Created attachment 25567 [details] NFS: Avoid a deadlock in nfs_release_page
I may be seeing a similar error. When copying large amounts (hundreds of GBs) of data over NFS, the copying process (rsync, tar, etc.) will get into a D state, then flush and kswapd enter a D state, and finally any process writing to disk will block. This occurs with the Debian 2.6.33-1~experimental (2.6.33.1) kernel and the Debian 2.6.32-10 (2.6.32.9 equivalent) kernel. I applied the provided patch to the Debian 2.6.33.1 kernel and it did not solve the problem.

Below is the call trace from the blocked process. This did not kill the process.

INFO: task rsync:17708 blocked for more than 600 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rsync         D 0000000000000002     0 17708  17707 0x00000000
 ffff880c7cfb6200 0000000000000086 0000000000000000 ffff88044b409334
 0000000000000286 000000000000f8e0 ffff88044b409fd8 0000000000015680
 0000000000015680 ffff88067bc6e900 ffff88067bc6ebf0 0000000181066af9
Call Trace:
 [<ffffffff8100f789>] ? read_tsc+0xa/0x20
 [<ffffffff8109411e>] ? delayacct_end+0x74/0x7f
 [<ffffffffa03c3c04>] ? nfs_wait_bit_uninterruptible+0x0/0xd [nfs]
 [<ffffffff812ed1cb>] ? io_schedule+0x73/0xb7
 [<ffffffffa03c3c0d>] ? nfs_wait_bit_uninterruptible+0x9/0xd [nfs]
 [<ffffffff812ed6f6>] ? __wait_on_bit+0x41/0x70
 [<ffffffff81183d3c>] ? __lookup_tag+0xad/0x11b
 [<ffffffffa03c3c04>] ? nfs_wait_bit_uninterruptible+0x0/0xd [nfs]
 [<ffffffff812ed790>] ? out_of_line_wait_on_bit+0x6b/0x77
 [<ffffffff8105f300>] ? wake_bit_function+0x0/0x23
 [<ffffffffa03c7b87>] ? nfs_sync_mapping_wait+0xfa/0x227 [nfs]
 [<ffffffffa03c7d48>] ? nfs_wb_page+0x94/0xc3 [nfs]
 [<ffffffffa03badfc>] ? nfs_release_page+0x3a/0x57 [nfs]
 [<ffffffff810b9382>] ? shrink_page_list+0x48d/0x617
 [<ffffffff8103a9a1>] ? pick_next_task_fair+0xca/0xd5
 [<ffffffff810b843d>] ? isolate_pages_global+0x1a0/0x20f
 [<ffffffff810b9ee3>] ? shrink_zone+0x710/0xae9
 [<ffffffff810b52ba>] ? __alloc_pages_nodemask+0x2a7/0x5e1
 [<ffffffff81279ab1>] ? tcp_established_options+0x2d/0xa9
 [<ffffffff8103dcac>] ? find_busiest_group+0x3b0/0x875
 [<ffffffff8100f789>] ? read_tsc+0xa/0x20
 [<ffffffff810bb121>] ? do_try_to_free_pages+0x1ce/0x31e
 [<ffffffff810bb396>] ? try_to_free_pages+0x72/0x78
 [<ffffffff810b829d>] ? isolate_pages_global+0x0/0x20f
 [<ffffffff810b53cb>] ? __alloc_pages_nodemask+0x3b8/0x5e1
 [<ffffffff81053aaf>] ? lock_timer_base+0x26/0x4b
 [<ffffffff810e1314>] ? new_slab+0x42/0x1ca
 [<ffffffff810e168c>] ? __slab_alloc+0x1f0/0x3a2
 [<ffffffff810b0a58>] ? mempool_alloc+0x55/0x106
 [<ffffffff810b0a58>] ? mempool_alloc+0x55/0x106
 [<ffffffff810e1d66>] ? kmem_cache_alloc+0x7f/0xf0
 [<ffffffff810b0a58>] ? mempool_alloc+0x55/0x106
 [<ffffffff8105f2d2>] ? autoremove_wake_function+0x0/0x2e
 [<ffffffffa03c840e>] ? nfs_writedata_alloc+0x19/0x98 [nfs]
 [<ffffffffa03c84a1>] ? nfs_flush_one+0x14/0xce [nfs]
 [<ffffffffa03c3abc>] ? nfs_pageio_doio+0x2a/0x51 [nfs]
 [<ffffffffa03c3ba8>] ? nfs_pageio_add_request+0xc5/0xd5 [nfs]
 [<ffffffffa03c6e01>] ? nfs_do_writepage+0xe0/0x103 [nfs]
 [<ffffffffa03c72e7>] ? nfs_writepages_callback+0xf/0x21 [nfs]
 [<ffffffff810b5c38>] ? write_cache_pages+0x1be/0x2a4
 [<ffffffffa03c72d8>] ? nfs_writepages_callback+0x0/0x21 [nfs]
 [<ffffffffa03c7297>] ? nfs_writepages+0xde/0x11f [nfs]
 [<ffffffffa03c848d>] ? nfs_flush_one+0x0/0xce [nfs]
 [<ffffffffa03c8336>] ? nfs_write_mapping+0x55/0x8e [nfs]
 [<ffffffffa03bb273>] ? nfs_do_fsync+0x1c/0x3c [nfs]
 [<ffffffff810e820a>] ? filp_close+0x37/0x62
 [<ffffffff810e82c9>] ? sys_close+0x94/0xcd
 [<ffffffff81008ac2>] ? system_call_fastpath+0x16/0x1b
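In case it is useful, a rough sketch of the kind of load that triggers it here (the path and size are only placeholders, not the exact commands I ran): sustained large writes to the NFS mount until memory pressure forces page reclaim.

  # write ~200 GB to the NFS mount; the hang shows up once reclaim kicks in
  dd if=/dev/zero of=/mnt/nfs/bigfile bs=1M count=200000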
Yes. Looking again at the traces, I'm thinking this might be the same bug as Bugzilla entry 15578. Can you try the proposed patch at https://bugzilla.kernel.org/attachment.cgi?id=25612 and see if that fixes the hang?
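For anyone else who wants to test it, applying the attachment is the usual kernel patch routine (a sketch, assuming the patch applies with -p1 from the top of the source tree, as kernel patches normally do; the file name is just an example):

  cd linux-2.6.33.1
  wget -O nfs-release-page.patch "https://bugzilla.kernel.org/attachment.cgi?id=25612"
  patch -p1 < nfs-release-page.patch
  # then rebuild and install the kernel as usual and reboot into it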
I've installed the proposed patch and have not had any problems yet. It seems to take many GBs of NFS writing to cause the problem, so I can't yet say if it has fixed the issue.
My problem occurred in 2.6.33 and 2.6.33.1. I've tried the patch attached to this bug, and so far it seems to solve the problem.
OK. Tentatively marking this bug as a duplicate of 15578 due to the similarity of the reported traces. *** This bug has been marked as a duplicate of bug 15578 ***