Bug 16456 - sync locks up often when run soon after boot
Summary: sync locks up often when run soon after boot
Status: RESOLVED OBSOLETE
Alias: None
Product: File System
Classification: Unclassified
Component: ext4
Hardware: All Linux
Importance: P1 blocking
Assignee: fs_ext4@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-07-24 17:41 UTC by Arvid Norlander
Modified: 2012-08-09 14:59 UTC

See Also:
Kernel Version: 2.6.34.1
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Arvid Norlander 2010-07-24 17:41:09 UTC
To begin with, I don't know if this is the right component. It could be the file system, block layer, device mapper, software RAID, or something else. I have no idea.

The issue is that when sync(1) is run shortly after boot, it tends to lock up. If iostat is used to check activity, it is always on the same partition (/var), and trying to unmount or remount that partition makes unmount/mount lock up in an unkillable way as well.

/var is ext4 (mounted with relatime, same as most other partitions) on top of an LVM2 LV. The single PV backing that VG is on top of software RAID 1 (/dev/md1). The software RAID is backed by two SATA drives.

This seems similar to bug #14830 but there are some differences:
 * As far as I (and lsof) can tell, there is no IO on the device at the time.
 * That issue mentions it will end after 10-20 minutes. Waiting 2 hours did not help for me. Since this seemed to slow down IO and also slow down or lock up other tasks accessing that same partition, I could not wait any longer than that. I need this system for work.
 * The call trace differs, showing another function in this case.

The only way out of the issue was rebooting. Rebooting with sysrq after trying an emergency unmount did not work, so I had to use the reset button on the case. I do not know if rebooting without the emergency unmount would have worked.

dmesg contained:
[  241.700057] INFO: task sync:2591 blocked for more than 120 seconds.
[  241.700064] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  241.700070] sync          D ffffffff8109fb65     0  2591   1408 0x00000004
[  241.700080]  ffff88005d20cd40 0000000000000086 0000000000000000 ffff88005cb2bd78
[  241.700088]  ffff88005ec16d70 ffff88005cb2bfd8 ffff88005cb2bfd8 ffff88005cb2bfd8
[  241.700095]  0000000000000000 0000000000000001 7fffffffffffffff ffff88005cb2be28
[  241.700102] Call Trace:
[  241.700116]  [<ffffffff8109fb65>] ? bdi_sched_wait+0x0/0x10
[  241.700124]  [<ffffffff8109fb6e>] ? bdi_sched_wait+0x9/0x10
[  241.700132]  [<ffffffff813bb669>] ? __wait_on_bit+0x3e/0x71
[  241.700138]  [<ffffffff813bb709>] ? out_of_line_wait_on_bit+0x6d/0x76
[  241.700145]  [<ffffffff8109fb65>] ? bdi_sched_wait+0x0/0x10
[  241.700154]  [<ffffffff81038cd8>] ? wake_bit_function+0x0/0x33
[  241.700161]  [<ffffffff8109fb5f>] ? bdi_sync_writeback+0x88/0x8e
[  241.700168]  [<ffffffff8109fb91>] ? sync_inodes_sb+0x1c/0xac
[  241.700175]  [<ffffffff810a301d>] ? __sync_filesystem+0x44/0x7f
[  241.700182]  [<ffffffff810a30df>] ? sync_filesystems+0x87/0xbd
[  241.700189]  [<ffffffff810a319c>] ? sys_sync+0x1c/0x31
[  241.700196]  [<ffffffff81002828>] ? system_call_fastpath+0x16/0x1b

This trace never got captured fully in /var/log/kernel.log. One time about half of it was included (ending in the middle of a line, and followed by messages from the next boot without a newline separating them), and another time none of it was.
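Since the log file may hold only part of the trace, it can help to pull the function names out of whatever survives so they can be compared with the trace in bug #14830. The following is a minimal sketch, not part of the original report. The regular expression assumes the 2.6.x frame format seen above (`[<address>] ? symbol+0xoffset/0xsize`), and the names `FRAME_RE`, `parse_frames`, and `unique_symbols` are made up for illustration.

```python
import re

# Matches stack frames such as:
#   [  241.700116]  [<ffffffff8109fb65>] ? bdi_sched_wait+0x0/0x10
# The "?" marker is optional; it is present on every frame in this report.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?:\?\s+)?"
    r"(?P<symbol>[\w.]+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
)


def parse_frames(dmesg_text):
    """Return one (symbol, offset, size) tuple per stack frame found."""
    frames = []
    for line in dmesg_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append(
                (
                    match.group("symbol"),
                    int(match.group("offset"), 16),
                    int(match.group("size"), 16),
                )
            )
    return frames


def unique_symbols(frames):
    """Return symbols in first-seen order with duplicates dropped."""
    seen = []
    for symbol, _offset, _size in frames:
        if symbol not in seen:
            seen.append(symbol)
    return seen


# A few frames copied from the trace in this report.
SAMPLE = """\
[  241.700116]  [<ffffffff8109fb65>] ? bdi_sched_wait+0x0/0x10
[  241.700145]  [<ffffffff8109fb65>] ? bdi_sched_wait+0x0/0x10
[  241.700161]  [<ffffffff8109fb5f>] ? bdi_sync_writeback+0x88/0x8e
[  241.700168]  [<ffffffff8109fb91>] ? sync_inodes_sb+0x1c/0xac
[  241.700189]  [<ffffffff810a319c>] ? sys_sync+0x1c/0x31
"""

if __name__ == "__main__":
    for name in unique_symbols(parse_frames(SAMPLE)):
        print(name)
```

Lines that were cut off mid-frame simply fail to match and are skipped, so a truncated kernel.log can be fed in as-is.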

I never had this issue before 2.6.34. However, I have only used this RAID1 and LVM2 setup since my old (single) disk failed about 2 months ago, so I have never run this exact setup on kernels other than 2.6.34 and 2.6.34.1. The bug only happens in about 1 out of 5 boots.

Considering that this only seems to happen on one specific partition, which has the exact same setup as /tmp and /usr, I ran fsck -vf on that file system. It did not report any problems.

I cannot _reliably_ reproduce it. It might take several tries. Since the forceful reboot I have to do after it happens requires a resync of the underlying software RAID device, it is highly inconvenient to test on this system.
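For anyone attempting to reproduce this, a small watchdog around sync can at least record whether a given boot triggered the hang. This is only a rough sketch under stated assumptions: `finishes_within` and `check_sync` are hypothetical names, `sync` is assumed to be on PATH, and the 120 second deadline simply mirrors the hung-task threshold in the dmesg output. It does not fix or work around the lockup.

```python
import subprocess
import sys
import time


def finishes_within(cmd, deadline_s, poll_s=0.1):
    """Start cmd and poll it; return (finished, process).

    The child is never waited on without a deadline: a task stuck in
    uninterruptible sleep (state D) ignores signals, so a blocking
    wait() here would hang this script as well.
    """
    proc = subprocess.Popen(cmd)
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        if proc.poll() is not None:
            return True, proc
        time.sleep(poll_s)
    # One last check in case the child exited during the final sleep.
    return proc.poll() is not None, proc


def check_sync(deadline_s=120):
    """Run sync once and report whether it returned before the deadline."""
    finished, proc = finishes_within(["sync"], deadline_s)
    if finished:
        print("sync returned normally")
    else:
        # Do not try to kill or wait for the child; just report it.
        print(f"sync (pid {proc.pid}) still running after {deadline_s} s", file=sys.stderr)
    return finished
```

Calling `check_sync()` shortly after boot and logging the result to a partition other than /var would give a per-boot record of how often the hang occurs.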

Is there any other info that would be helpful?
Comment 1 Arvid Norlander 2010-07-24 17:47:22 UTC
Forgot to mention that the system is non-SMP, x86-64 (AMD Sempron 3300+), NOHZ, using hpet. Also, resyncing the RAID is that inconvenient because it takes slightly more than 3.5 hours (large drives, but not the fastest hardware).
