Bug 71791 - Unlinking a file that was moved to another folder but still open by other process blocks either process (not always reproducible)
Summary: Unlinking a file that was moved to another folder but still open by other pro...
Status: NEEDINFO
Alias: None
Product: File System
Classification: Unclassified
Component: ext4 (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: fs_ext4@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-03-10 14:01 UTC by edpeur
Modified: 2014-03-18 15:35 UTC (History)
2 users (show)

See Also:
Kernel Version: 3.12.9
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description edpeur 2014-03-10 14:01:44 UTC
Either the rm process or more rarely the process that has the file open will block when a file that was moved to other folder is unlinked while still open by other process. This is not always reproducible, but like 20% reproducible.

Here is the most common trace:

[27600.716129] INFO: task rm:28925 blocked for more than 120 seconds.
[27600.716139]       Tainted: G        W    3.12-1-rt-amd64 #1
[27600.716142] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[27600.716147] rm              D ffff88007eb8a5f0     0 28925  28924 0x00000004
[27600.716159]  ffff88007eb8a280 0000000000000086 ffff88004542bfd8 ffff88004542bfd8
[27600.716162]  ffff88004542bfd8 ffff88004542bfd8 ffff88007eb8a280 ffff88007eb8a280
[27600.716165]  0000000000000002 ffffffff81123410 ffff88004542bd50 ffff88004542bdb8
[27600.716172] Call Trace:
[27600.716185]  [<ffffffff81123410>] ? wait_on_page_read+0x60/0x60
[27600.716190]  [<ffffffff814aecb1>] ? schedule+0x21/0x90
[27600.716194]  [<ffffffff814aeda3>] ? io_schedule+0x83/0xd0
[27600.716197]  [<ffffffff81123415>] ? sleep_on_page+0x5/0x10
[27600.716201]  [<ffffffff814ad594>] ? __wait_on_bit+0x54/0x80
[27600.716204]  [<ffffffff8112322f>] ? wait_on_page_bit+0x7f/0x90
[27600.716209]  [<ffffffff81082050>] ? wake_atomic_t_function+0x30/0x30
[27600.716213]  [<ffffffff81130795>] ? pagevec_lookup+0x15/0x20
[27600.716217]  [<ffffffff8113284b>] ? truncate_inode_pages_range+0x26b/0x490
[27600.716265]  [<ffffffffa02f944d>] ? ext4_evict_inode+0x13d/0x540 [ext4]
[27600.716270]  [<ffffffff8119e055>] ? evict+0xb5/0x1b0
[27600.716279]  [<ffffffff8119180b>] ? do_unlinkat+0x19b/0x250
[27600.716282]  [<ffffffff81090140>] ? migrate_enable+0xa0/0x1e0
[27600.716286]  [<ffffffff811c3795>] ? fsnotify_find_inode_mark+0x65/0x90
[27600.716288]  [<ffffffff811c4d64>] ? dnotify_flush+0x34/0x110
[27600.716291]  [<ffffffff814b7ab9>] ? system_call_fastpath+0x16/0x1b

And this trace has only happened once:

[49440.736350] INFO: task mplayer:30035 blocked for more than 120 seconds.
[49440.736360]       Tainted: G        W    3.12-1-rt-amd64 #1
[49440.736362] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[49440.736365] mplayer         D ffff88001d4dd9b0     0 30035  23580 0x00000000
[49440.736375]  ffff88001d4dd640 0000000000000086 ffff88003e31bfd8 ffff88003e31bfd8
[49440.736377]  ffff88003e31bfd8 ffff88003e31bfd8 ffff88001d4dd640 ffff88001d4dd640
[49440.736379]  0000000000000002 ffffffff81123410 ffff88003e31bd40 ffff88003e31bda0
[49440.736383] Call Trace:
[49440.736396]  [<ffffffff81123410>] ? wait_on_page_read+0x60/0x60
[49440.736400]  [<ffffffff814aecb1>] ? schedule+0x21/0x90
[49440.736403]  [<ffffffff814aeda3>] ? io_schedule+0x83/0xd0
[49440.736405]  [<ffffffff81123415>] ? sleep_on_page+0x5/0x10
[49440.736407]  [<ffffffff814ad594>] ? __wait_on_bit+0x54/0x80
[49440.736409]  [<ffffffff8112322f>] ? wait_on_page_bit+0x7f/0x90
[49440.736412]  [<ffffffff81082050>] ? wake_atomic_t_function+0x30/0x30
[49440.736416]  [<ffffffff81130795>] ? pagevec_lookup+0x15/0x20
[49440.736419]  [<ffffffff8113284b>] ? truncate_inode_pages_range+0x26b/0x490
[49440.736491]  [<ffffffffa031944d>] ? ext4_evict_inode+0x13d/0x540 [ext4]
[49440.736495]  [<ffffffff8119e055>] ? evict+0xb5/0x1b0
[49440.736497]  [<ffffffff8119a2b8>] ? dentry_kill+0x218/0x270
[49440.736499]  [<ffffffff8119a374>] ? dput+0x64/0x110
[49440.736502]  [<ffffffff811857fb>] ? __fput+0x13b/0x200
[49440.736505]  [<ffffffff8107e077>] ? task_work_run+0x97/0xd0
[49440.736510]  [<ffffffff810149e1>] ? do_notify_resume+0x81/0xe0
[49440.736513]  [<ffffffff814b7d76>] ? int_signal+0x12/0x17

This is the kernel used:
http://ftp.debian.org/debian/pool/main/l/linux/linux-image-3.12-1-rt-amd64_3.12.9-1_amd64.deb
Comment 1 Jan Kara 2014-03-11 20:14:17 UTC
So one note is that we are likely waiting for writeback on some page in that file to finish. Another note is that I don't think another process has the file open - the stacktrace clearly shows that we are completely deleting the file which means noone can have the file open (if you can really show otherwise, it would be a major bug).

I guess the file has been written quite recently and the filesystem is quite loaded with the IO, isn't it?

Can you try mounting the filesystem with 'noauto_da_alloc' mount option?
Comment 2 Zheng Liu 2014-03-16 07:20:29 UTC
I guess that the journal mode is 'data=ordered', right?  Could you please try to switch the journal mode to 'data=writeback' and look at whether or not the problem can be reproduced?

In our product system, we met a hang which is caused by 'data=ordered' under a heavy IO workload.  When the user uses 'rm' command to delete a file, ext4_begin_ordered_truncate() should be called with 'data=ordered' and it could trigger write back kernel thread to write out the dirty data.  Then truncate_inode_pages() will wait on write back.

Regards,
                                                - Zheng
Comment 3 edpeur 2014-03-18 15:35:03 UTC
Yes, the journal mode was 'data=ordered'.
This problem is reproduced on this external USB drive on my laptop using a quite light workload.
It is not always reproducible and I had to change the way I work so I might not be able to reproduce this bug in the near time.
There was another hang when shutdown tried to umount this drive:
[374641.816086] INFO: task shutdown:20015 blocked for more than 120 seconds.
[374641.816096]       Tainted: G        W    3.12-1-rt-amd64 #1
[374641.816099] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[374641.816103] shutdown        D ffff8800b4ce6af0     0 20015      1 0x00000000
[374641.816116]  ffff8800b4ce6780 0000000000000086 ffff880103177fd8 ffff880103177fd8
[374641.816119]  ffff880103177fd8 ffff880103177fd8 ffff8800b4ce6780 ffff8800b4ce6780
[374641.816122]  0000000000000002 ffffffff81123410 ffff880103177d90 000000000000c394
[374641.816129] Call Trace:
[374641.816142]  [<ffffffff81123410>] ? wait_on_page_read+0x60/0x60
[374641.816147]  [<ffffffff814aecb1>] ? schedule+0x21/0x90
[374641.816151]  [<ffffffff814aeda3>] ? io_schedule+0x83/0xd0
[374641.816154]  [<ffffffff81123415>] ? sleep_on_page+0x5/0x10
[374641.816158]  [<ffffffff814ad594>] ? __wait_on_bit+0x54/0x80
[374641.816162]  [<ffffffff8112322f>] ? wait_on_page_bit+0x7f/0x90
[374641.816166]  [<ffffffff81082050>] ? wake_atomic_t_function+0x30/0x30
[374641.816171]  [<ffffffff811307b8>] ? pagevec_lookup_tag+0x18/0x20
[374641.816175]  [<ffffffff81123318>] ? filemap_fdatawait_range+0xd8/0x150
[374641.816179]  [<ffffffff8108e6b9>] ? get_parent_ip+0x9/0x20
[374641.816182]  [<ffffffff8108e6b9>] ? get_parent_ip+0x9/0x20
[374641.816186]  [<ffffffff814b42d7>] ? add_preempt_count+0x97/0xe0
[374641.816191]  [<ffffffff8105ed50>] ? pin_current_cpu+0x90/0x1b0
[374641.817981]  [<ffffffff8108e6b9>] ? get_parent_ip+0x9/0x20
[374641.817983]  [<ffffffff8105ee7d>] ? unpin_current_cpu+0xd/0x60
[374641.817986]  [<ffffffff811ac869>] ? sync_inodes_sb+0x169/0x1f0
[374641.817989]  [<ffffffff811b1c80>] ? generic_write_sync+0x60/0x60
[374641.817991]  [<ffffffff811877ba>] ? iterate_supers+0xba/0x120
[374641.817993]  [<ffffffff811b1ecd>] ? sys_sync+0x2d/0x90
[374641.817996]  [<ffffffff814b7ab9>] ? system_call_fastpath+0x16/0x1b

I have upgraded to:
http://ftp.debian.org/debian/pool/main/l/linux/linux-image-3.13-1-amd64_3.13.5-1_amd64.deb

I have enabled 'data=writeback' now. So, I will report if I see this problem again.

Note You need to log in before you can comment on or make changes to this bug.