Bug 12815
| Summary: | JBD: barrier-based sync failed on dm-1:8 - disabling barriers -- and then hang | | |
|---|---|---|---|
| Product: | File System | Reporter: | Joey Hess (joey) |
| Component: | ext4 | Assignee: | Eric Sandeen (sandeen) |
| Status: | RESOLVED INSUFFICIENT_DATA | | |
| Severity: | normal | CC: | sandeen, stuffcorpse, tytso, vaurora, yanfali |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.28 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description
Joey Hess
2009-03-03 21:08:46 UTC
ext4 has barriers on by default, so you'll see the "disabling barriers" message on LVM every time you mount (it does not support barriers). If you'd like to rule that out, mount with -o barrier=0. When you're hung, try echo w > /proc/sysrq-trigger (or SysRq-W) to get a list of sleeping processes. -Eric

Depending on what you see from sysrq, this is probably fixed by:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=2acf2c261b823d9d9ed954f348b97620297a36b5
which may make it to .28.y eventually...

I'm getting the JBD message not only on initial mount. Is that still expected? I'll try to get some info from sysrq if it happens again. Is there any mount option I can use to disable delayed allocation or something similar to try to work around the problem?

(In reply to comment #3)

> I'm getting the JBD message not only on initial mount. Is that still
> expected?

You should get it on each mount. When else do you see it?

> I'll try to get some info from sysrq if it happens again.
>
> Is there any mount option I can use to disable delayed allocation or
> something to try to work around the problem?

You can mount with -o nodelalloc, though there is no reason to think that this is related to the problem at this point ... it may be more productive to test a newer kernel.

> You should get it on each mount. When else do you see it?

I do see the JBD message on each mount, but I also saw one 2.5 hours after mount. Probably a red herring.
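For reference, a minimal sketch of the two diagnostics suggested above; the mount point is a hypothetical placeholder, not taken from the report:

```sh
# Rule barriers out by remounting the affected ext4 filesystem without them
# (/mnt/affected is an example mount point).
mount -o remount,barrier=0 /mnt/affected

# When the hang occurs, dump blocked (D-state) tasks to the kernel log,
# then save the full log so it can be attached to the bug.
echo w > /proc/sysrq-trigger
dmesg > sysrq-w.txt
```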
I added barrier=0 and did not see the problem again until today. This time there were no JBD messages, so those were a red herring, and there was nothing special in dmesg. sysrq shows the following; this is not all of the hung processes, but it should be representative:

```
Mar 7 12:00:54 turtle kernel: [82352.940000] sh D c0213428 0 21596 21594
Mar 7 12:00:54 turtle kernel: [82352.940000] [<c02130b4>] (schedule+0x0/0x3d0) from [<c0213b94>] (__mutex_lock_slowpath+0x6c/0x98)
Mar 7 12:00:54 turtle kernel: [82352.940000] [<c0213b28>] (__mutex_lock_slowpath+0x0/0x98) from [<c0213be0>] (mutex_lock+0x20/0x24)
Mar 7 12:00:54 turtle kernel: [82352.940000] r8:c76d379c r7:c223de98 r6:c76d3728 r5:c7685d98 r4:00000000
Mar 7 12:00:54 turtle kernel: [82352.940000] [<c0213bc0>] (mutex_lock+0x0/0x24) from [<c009e41c>] (do_lookup+0x78/0x194)
Mar 7 12:00:54 turtle kernel: [82352.940000] [<c009e3a4>] (do_lookup+0x0/0x194) from [<c00a0198>] (__link_path_walk+0x3ac/0xe24)
Mar 7 12:00:54 turtle kernel: [82352.940000] [<c009fdec>] (__link_path_walk+0x0/0xe24) from [<c00a0d98>] (path_walk+0x50/0xa0)
Mar 7 12:00:54 turtle kernel: [82352.940000] [<c00a0d48>] (path_walk+0x0/0xa0) from [<c00a0edc>] (do_path_lookup+0xf4/0x11c)
Mar 7 12:00:54 turtle kernel: [82352.940000] r6:ffffff9c r5:c223de98 r4:00000001
Mar 7 12:00:54 turtle kernel: [82352.940000] [<c00a0de8>] (do_path_lookup+0x0/0x11c) from [<c00a198c>] (user_path_at+0x5c/0x9c)
Mar 7 12:00:54 turtle kernel: [82352.940000] r7:c223df08 r6:ffffff9c r5:00000001 r4:c68d9000
Mar 7 12:00:54 turtle kernel: [82352.940000] [<c00a1930>] (user_path_at+0x0/0x9c) from [<c009a354>] (vfs_stat_fd+0x24/0x54)
Mar 7 12:00:54 turtle kernel: [82352.940000] r7:000000c3 r6:c223df08 r5:c223df40 r4:bed3da18
Mar 7 12:00:54 turtle kernel: [82352.940000] [<c009a330>] (vfs_stat_fd+0x0/0x54) from [<c009a438>] (vfs_stat+0x1c/0x20)
Mar 7 12:00:54 turtle kernel: [82352.940000] r6:000c5668 r5:c223df40 r4:bed3da18
Mar 7 12:00:54 turtle kernel: [82352.940000] [<c009a41c>] (vfs_stat+0x0/0x20) from [<c009a45c>] (sys_stat64+0x20/0x3c)
Mar 7 12:00:54 turtle kernel: [82352.940000] [<c009a43c>] (sys_stat64+0x0/0x3c) from [<c0025e00>] (ret_fast_syscall+0x0/0x3c)
Mar 7 12:00:54 turtle kernel: [82352.940000] r5:000bd6f8 r4:000c5668
Mar 7 12:00:54 turtle kernel: [82352.940000] sh D c0213428 0 22291 22289
Mar 7 12:00:54 turtle kernel: [82352.940000] [<c02130b4>] (schedule+0x0/0x3d0) from [<c0213b94>] (__mutex_lock_slowpath+0x6c/0x98)
Mar 7 12:00:54 turtle kernel: [82352.940000] [<c0213b28>] (__mutex_lock_slowpath+0x0/0x98) from [<c0213be0>] (mutex_lock+0x20/0x24)
Mar 7 12:00:54 turtle kernel: [82352.940000] r8:c76d379c r7:c4079e98 r6:c76d3728 r5:c7685d98 r4:00000000
Mar 7 12:00:54 turtle kernel: [82352.940000] [<c0213bc0>] (mutex_lock+0x0/0x24) from [<c009e41c>] (do_lookup+0x78/0x194)
Mar 7 12:00:54 turtle kernel: [82352.940000] [<c009e3a4>] (do_lookup+0x0/0x194) from [<c00a0198>] (__link_path_walk+0x3ac/0xe24)
Mar 7 12:00:54 turtle kernel: [82352.940000] [<c009fdec>] (__link_path_walk+0x0/0xe24) from [<c00a0d98>] (path_walk+0x50/0xa0)
Mar 7 12:00:54 turtle kernel: [82352.940000] [<c00a0d48>] (path_walk+0x0/0xa0) from [<c00a0edc>] (do_path_lookup+0xf4/0x11c)
Mar 7 12:00:54 turtle kernel: [82352.940000] r6:ffffff9c r5:c4079e98 r4:00000001
Mar 7 12:00:54 turtle kernel: [82352.940000] [<c00a0de8>] (do_path_lookup+0x0/0x11c) from [<c00a198c>] (user_path_at+0x5c/0x9c)
Mar 7 12:00:54 turtle kernel: [82352.940000] r7:c4079f08 r6:ffffff9c r5:00000001 r4:c6841000
Mar 7 12:00:54 turtle kernel: [82352.940000] [<c00a1930>] (user_path_at+0x0/0x9c) from [<c009a354>] (vfs_stat_fd+0x24/0x54)
Mar 7 12:00:54 turtle kernel: [82352.940000] r7:000000c3 r6:c4079f08 r5:c4079f40 r4:be967a18
Mar 7 12:00:54 turtle kernel: [82352.940000] [<c009a330>] (vfs_stat_fd+0x0/0x54) from [<c009a438>] (vfs_stat+0x1c/0x20)
Mar 7 12:00:54 turtle kernel: [82352.940000] r6:000c5668 r5:c4079f40 r4:be967a18
Mar 7 12:00:54 turtle kernel: [82352.940000] [<c009a41c>] (vfs_stat+0x0/0x20) from [<c009a45c>] (sys_stat64+0x20/0x3c)
Mar 7 12:00:54 turtle kernel: [82352.940000] [<c009a43c>] (sys_stat64+0x0/0x3c) from [<c0025e00>] (ret_fast_syscall+0x0/0x3c)
Mar 7 12:00:54 turtle kernel: [82352.940000] r5:000bd6f8 r4:000c5668
Mar 7 12:00:54 turtle kernel: [82352.940000] sh D c0213428 0 24027 24025
```

Please attach the entire output of the sysrq-w command, so that we can sort out what's relevant. Attaching it rather than pasting it in will have the added advantage of not wrapping the output :) Your trace snippets don't show any ext4 code in those callchains, but they may be waiting on some stuck ext4 process which you didn't include. (The summary probably should be changed, as this has nothing to do with barriers...) Thanks, -Eric

Hi Joey, are you still seeing this problem? Can you reproduce it, especially on a more recent kernel?

Can this bug be closed? No response from the submitter for 5 months, and it does kind of smell like a hardware problem.

Found this bug via a Google search. I'm running a 2.6.32 kernel and recently converted an ext3 partition running on soft RAID 5 to ext4. Today I found this in my dmesg:

```
[  119.414297] JBD: barrier-based sync failed on md1-8 - disabling barriers
```

Could this be triggered by turning off write caching on the individual drives? I use hdparm -W0 on all the RAID drives for improved data integrity. Is it safe to turn write caching back on with write barriers?

I created the fs using e2fsprogs-1.41.9 with the default mkfs.ext4 options:

```
ext4 = {
    features = has_journal,extents,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize
    inode_size = 256
}
```

The device is a simple RAID5 running across 3 disks.

```
# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md1 : active raid5 sdc1[2] sdb1[1] sda1[0]
      162754304 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
      bitmap: 0/156 pages [0KB], 256KB chunk

md10 : active raid5 sdc5[2] sdb5[1] sda5[0]
      1302389248 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
      bitmap: 1/156 pages [4KB], 2048KB chunk

md0 : active raid1 sdg2[0] sdd2[1]
      17818560 blocks [2/2] [UU]
      bitmap: 2/136 pages [8KB], 64KB chunk

unused devices: <none>
```

This is a plain RAID5, with ext4 running directly on the md device.
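As a side note on the write-cache question above, here is a small sketch of querying and toggling the on-drive cache with hdparm; the device names match the /proc/mdstat listing but are otherwise just examples:

```sh
# Query the current write-cache setting on each RAID member.
for dev in /dev/sda /dev/sdb /dev/sdc; do
    hdparm -W "$dev"
done

# Disable the on-drive write cache (what the reporter did with -W0) ...
hdparm -W0 /dev/sda
# ... or re-enable it.
hdparm -W1 /dev/sda
```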
dumpe2fs:

```
Filesystem volume name:   /home
Last mounted on:          /home
Filesystem UUID:          02cb8e4a-d8cf-4e8d-80e7-2fa2eb309db1
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              10174464
Block count:              40688576
Reserved block count:     2034428
Free blocks:              17097770
Free inodes:              9910627
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1014
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Tue Jan 12 22:44:15 2010
Last mount time:          Wed Jan 13 00:26:54 2010
Last write time:          Wed Jan 13 00:26:54 2010
Mount count:              3
Maximum mount count:      28
Last checked:             Tue Jan 12 22:44:15 2010
Check interval:           15552000 (6 months)
Next check after:         Sun Jul 11 23:44:15 2010
Lifetime writes:          90 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      ee5d2952-78e6-4884-bccf-e7c41411e38b
Journal backup:           inode blocks
Journal size:             128M
```

Closing original bug due to lack of info & response.

Regarding comment #10: ext3 is just telling you that a barrier write failed on the md device, so it won't try again, because it looks like the device does not support barriers. It should not have anything to do with write caches on or off on the drives, I think. You might ask the md mailing list if this configuration is expected to support barrier requests ...
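To close the loop on the barrier question, a hedged sketch of how one might check whether the md device accepts barrier writes after re-enabling them; the /home mount point is taken from the dumpe2fs output above:

```sh
# Re-enable barriers explicitly (the ext4 default) and then check the kernel
# log: if "barrier-based sync failed ... disabling barriers" reappears, the
# md device is rejecting barrier requests.
mount -o remount,barrier=1 /home
dmesg | grep -i barrier
```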