Either the rm process or more rarely the process that has the file open will block when a file that was moved to other folder is unlinked while still open by other process. This is not always reproducible, but like 20% reproducible. Here is the most common trace: [27600.716129] INFO: task rm:28925 blocked for more than 120 seconds. [27600.716139] Tainted: G W 3.12-1-rt-amd64 #1 [27600.716142] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [27600.716147] rm D ffff88007eb8a5f0 0 28925 28924 0x00000004 [27600.716159] ffff88007eb8a280 0000000000000086 ffff88004542bfd8 ffff88004542bfd8 [27600.716162] ffff88004542bfd8 ffff88004542bfd8 ffff88007eb8a280 ffff88007eb8a280 [27600.716165] 0000000000000002 ffffffff81123410 ffff88004542bd50 ffff88004542bdb8 [27600.716172] Call Trace: [27600.716185] [<ffffffff81123410>] ? wait_on_page_read+0x60/0x60 [27600.716190] [<ffffffff814aecb1>] ? schedule+0x21/0x90 [27600.716194] [<ffffffff814aeda3>] ? io_schedule+0x83/0xd0 [27600.716197] [<ffffffff81123415>] ? sleep_on_page+0x5/0x10 [27600.716201] [<ffffffff814ad594>] ? __wait_on_bit+0x54/0x80 [27600.716204] [<ffffffff8112322f>] ? wait_on_page_bit+0x7f/0x90 [27600.716209] [<ffffffff81082050>] ? wake_atomic_t_function+0x30/0x30 [27600.716213] [<ffffffff81130795>] ? pagevec_lookup+0x15/0x20 [27600.716217] [<ffffffff8113284b>] ? truncate_inode_pages_range+0x26b/0x490 [27600.716265] [<ffffffffa02f944d>] ? ext4_evict_inode+0x13d/0x540 [ext4] [27600.716270] [<ffffffff8119e055>] ? evict+0xb5/0x1b0 [27600.716279] [<ffffffff8119180b>] ? do_unlinkat+0x19b/0x250 [27600.716282] [<ffffffff81090140>] ? migrate_enable+0xa0/0x1e0 [27600.716286] [<ffffffff811c3795>] ? fsnotify_find_inode_mark+0x65/0x90 [27600.716288] [<ffffffff811c4d64>] ? dnotify_flush+0x34/0x110 [27600.716291] [<ffffffff814b7ab9>] ? system_call_fastpath+0x16/0x1b And this trace has only happened once: [49440.736350] INFO: task mplayer:30035 blocked for more than 120 seconds. [49440.736360] Tainted: G W 3.12-1-rt-amd64 #1 [49440.736362] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [49440.736365] mplayer D ffff88001d4dd9b0 0 30035 23580 0x00000000 [49440.736375] ffff88001d4dd640 0000000000000086 ffff88003e31bfd8 ffff88003e31bfd8 [49440.736377] ffff88003e31bfd8 ffff88003e31bfd8 ffff88001d4dd640 ffff88001d4dd640 [49440.736379] 0000000000000002 ffffffff81123410 ffff88003e31bd40 ffff88003e31bda0 [49440.736383] Call Trace: [49440.736396] [<ffffffff81123410>] ? wait_on_page_read+0x60/0x60 [49440.736400] [<ffffffff814aecb1>] ? schedule+0x21/0x90 [49440.736403] [<ffffffff814aeda3>] ? io_schedule+0x83/0xd0 [49440.736405] [<ffffffff81123415>] ? sleep_on_page+0x5/0x10 [49440.736407] [<ffffffff814ad594>] ? __wait_on_bit+0x54/0x80 [49440.736409] [<ffffffff8112322f>] ? wait_on_page_bit+0x7f/0x90 [49440.736412] [<ffffffff81082050>] ? wake_atomic_t_function+0x30/0x30 [49440.736416] [<ffffffff81130795>] ? pagevec_lookup+0x15/0x20 [49440.736419] [<ffffffff8113284b>] ? truncate_inode_pages_range+0x26b/0x490 [49440.736491] [<ffffffffa031944d>] ? ext4_evict_inode+0x13d/0x540 [ext4] [49440.736495] [<ffffffff8119e055>] ? evict+0xb5/0x1b0 [49440.736497] [<ffffffff8119a2b8>] ? dentry_kill+0x218/0x270 [49440.736499] [<ffffffff8119a374>] ? dput+0x64/0x110 [49440.736502] [<ffffffff811857fb>] ? __fput+0x13b/0x200 [49440.736505] [<ffffffff8107e077>] ? task_work_run+0x97/0xd0 [49440.736510] [<ffffffff810149e1>] ? do_notify_resume+0x81/0xe0 [49440.736513] [<ffffffff814b7d76>] ? int_signal+0x12/0x17 This is the kernel used: http://ftp.debian.org/debian/pool/main/l/linux/linux-image-3.12-1-rt-amd64_3.12.9-1_amd64.deb
So one note is that we are likely waiting for writeback on some page in that file to finish. Another note is that I don't think another process has the file open - the stacktrace clearly shows that we are completely deleting the file which means noone can have the file open (if you can really show otherwise, it would be a major bug). I guess the file has been written quite recently and the filesystem is quite loaded with the IO, isn't it? Can you try mounting the filesystem with 'noauto_da_alloc' mount option?
I guess that the journal mode is 'data=ordered', right? Could you please try to switch the journal mode to 'data=writeback' and look at whether or not the problem can be reproduced? In our product system, we met a hang which is caused by 'data=ordered' under a heavy IO workload. When the user uses 'rm' command to delete a file, ext4_begin_ordered_truncate() should be called with 'data=ordered' and it could trigger write back kernel thread to write out the dirty data. Then truncate_inode_pages() will wait on write back. Regards, - Zheng
Yes, the journal mode was 'data=ordered'. This problem is reproduced on this external USB drive on my laptop using a quite light workload. It is not always reproducible and I had to change the way I work so I might not be able to reproduce this bug in the near time. There was another hang when shutdown tried to umount this drive: [374641.816086] INFO: task shutdown:20015 blocked for more than 120 seconds. [374641.816096] Tainted: G W 3.12-1-rt-amd64 #1 [374641.816099] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [374641.816103] shutdown D ffff8800b4ce6af0 0 20015 1 0x00000000 [374641.816116] ffff8800b4ce6780 0000000000000086 ffff880103177fd8 ffff880103177fd8 [374641.816119] ffff880103177fd8 ffff880103177fd8 ffff8800b4ce6780 ffff8800b4ce6780 [374641.816122] 0000000000000002 ffffffff81123410 ffff880103177d90 000000000000c394 [374641.816129] Call Trace: [374641.816142] [<ffffffff81123410>] ? wait_on_page_read+0x60/0x60 [374641.816147] [<ffffffff814aecb1>] ? schedule+0x21/0x90 [374641.816151] [<ffffffff814aeda3>] ? io_schedule+0x83/0xd0 [374641.816154] [<ffffffff81123415>] ? sleep_on_page+0x5/0x10 [374641.816158] [<ffffffff814ad594>] ? __wait_on_bit+0x54/0x80 [374641.816162] [<ffffffff8112322f>] ? wait_on_page_bit+0x7f/0x90 [374641.816166] [<ffffffff81082050>] ? wake_atomic_t_function+0x30/0x30 [374641.816171] [<ffffffff811307b8>] ? pagevec_lookup_tag+0x18/0x20 [374641.816175] [<ffffffff81123318>] ? filemap_fdatawait_range+0xd8/0x150 [374641.816179] [<ffffffff8108e6b9>] ? get_parent_ip+0x9/0x20 [374641.816182] [<ffffffff8108e6b9>] ? get_parent_ip+0x9/0x20 [374641.816186] [<ffffffff814b42d7>] ? add_preempt_count+0x97/0xe0 [374641.816191] [<ffffffff8105ed50>] ? pin_current_cpu+0x90/0x1b0 [374641.817981] [<ffffffff8108e6b9>] ? get_parent_ip+0x9/0x20 [374641.817983] [<ffffffff8105ee7d>] ? unpin_current_cpu+0xd/0x60 [374641.817986] [<ffffffff811ac869>] ? sync_inodes_sb+0x169/0x1f0 [374641.817989] [<ffffffff811b1c80>] ? generic_write_sync+0x60/0x60 [374641.817991] [<ffffffff811877ba>] ? iterate_supers+0xba/0x120 [374641.817993] [<ffffffff811b1ecd>] ? sys_sync+0x2d/0x90 [374641.817996] [<ffffffff814b7ab9>] ? system_call_fastpath+0x16/0x1b I have upgraded to: http://ftp.debian.org/debian/pool/main/l/linux/linux-image-3.13-1-amd64_3.13.5-1_amd64.deb I have enabled 'data=writeback' now. So, I will report if I see this problem again.