Created attachment 52752 [details] kernel log for hang #1

Since the big XFS refactoring, kernels tend to hang on one particular machine. It is not a hardware problem, since the very same machine works flawlessly with the very same configuration when it runs an older (<=2.6.26) kernel. The latest 2.6.38, as given in the email subject, is especially bad in this respect: the problem appears right after the reboot. I am filing this bug report against linux-image-2.6.37-1-686, which also hangs, but only 1-2 times per week.

To help isolate this bug, I am attaching the kernel log for two distinct hangs (with a reboot in between) and disk information. I suspect that the problem occurs only on particular types of disks, as I run almost the same configuration on multiple machines (same kernel, XFS formatted disks with the same XFS format version, same partitioning, etc.) without any problems.

As Maximilian Attems explains in Debian bug report #620216 (see: http://bugs.debian.org/620216 ), he also noticed similar hangs on an unrelated machine.
Created attachment 52762 [details] kernel log for hang #2
Created attachment 52772 [details] disk info
Your system is waiting on log space to become available. Can you post the output of xfs_info on the filesystem that is hanging, the contents of dmesg from the time the filesystem is mounted (please include the mount messages), and describe the workload that is running on your system at the time the hangs occur?

Cheers, Dave.
Created attachment 53232 [details] xfs_info and dmesg output
The condition is usually triggered when heavy disk IO is occurring on / (mounted from /dev/sda5). This heavy disk activity usually stems from running aptitude upgrade on my Debian system, i.e. heavy disk utilization from package installation.

HTH, Thomas
Can you please attach the _full_ dmesg output from the time the filesystem was mounted to the error that occurred - filtered versions are useless for diagnosis.

Also, for the record, it's a filesystem with 15 x 600MB AGs w/ a 10MB v1 log and lazy-count disabled. Looks like an old filesystem structure. Pasting it here so I don't have to keep referring back to an attachment:

xfs_info /
meta-data=/dev/disk/by-uuid/f0c24e97-9607-4f41-a1d2-2f7c70da1aa0 isize=256    agcount=15, agsize=156130 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=2247084, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=2560, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0
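The geometry figures quoted above (15 AGs of roughly 600MB each, a 10MB log) can be re-derived from the xfs_info output by multiplying block counts by the block size. The following is a minimal sketch of that arithmetic, not an XFS tool: the function name `parse_xfs_info` and the parsing approach are assumptions made for illustration, and only the numbers come from the output pasted in this comment.

```python
import re

# xfs_info output as pasted in the comment above (original filesystem).
XFS_INFO = """\
meta-data=/dev/disk/by-uuid/f0c24e97-9607-4f41-a1d2-2f7c70da1aa0 isize=256 agcount=15, agsize=156130 blks
         = sectsz=512 attr=0
data     = bsize=4096 blocks=2247084, imaxpct=25
         = sunit=0 swidth=0 blks
naming   =version 2 bsize=4096 ascii-ci=0
log      =internal bsize=4096 blocks=2560, version=1
         = sectsz=512 sunit=0 blks, lazy-count=0
realtime =none extsz=4096 blocks=0, rtextents=0
"""

def parse_xfs_info(text):
    """Hypothetical helper: return {section: {key: int}} for numeric fields."""
    sections = {}
    current = None
    for line in text.splitlines():
        label, sep, rest = line.partition("=")
        if not sep:
            continue
        label = label.strip()
        if label:
            # A non-empty label starts a new section; an empty one continues it.
            current = label
        if current is None:
            continue
        pairs = re.findall(r"(\w[\w-]*)=(\d+)", rest)
        sections.setdefault(current, {}).update((k, int(v)) for k, v in pairs)
    return sections

MIB = 1024 * 1024
info = parse_xfs_info(XFS_INFO)

# Sizes are block counts multiplied by the block size of the section.
ag_mib = info["meta-data"]["agsize"] * info["data"]["bsize"] / MIB
log_mib = info["log"]["blocks"] * info["log"]["bsize"] / MIB

print(f"AGs: {info['meta-data']['agcount']} x {ag_mib:.1f} MiB")
print(f"log: {log_mib:.1f} MiB, version {info['log']['version']}, "
      f"lazy-count={info['log']['lazy-count']}")
```

With the values above this gives about 609.9 MiB per allocation group and a 10.0 MiB log, which matches the rounded figures in the comment.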
Created attachment 53802 [details] dmesg output after hang

Right after the system hangs, there is no dmesg output that points to this hang. The first messages were printed only after some time (up to and including second 721). During this time, the system became less and less responsive: every application that needed to write to the disk started to hang (reading seemed to be fine). At second 841 the remaining messages were printed; from that time onwards the system was fully responsive again.

HTH, Thomas
Created attachment 55092 [details] hang #3 Another example of a hang. Any news on this?
This particular bug seems unrelated to the log version. I just recreated the filesystem, restored the data from the backup but the bug is still there. Current filesystem info:

# xfs_info /dev/sda5
meta-data=/dev/disk/by-uuid/042d9a74-9941-4aee-a2d8-2abf03e6e8ea isize=256    agcount=4, agsize=561771 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=2247084, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Any news on this bug? Do you need help testing?
(In reply to comment #9)
> This particular bug seems unrelated to the log version.
> I just recreated the filesystem, restored the data from the backup but the
> bug is still there.

OK.

> Any news on this bug? Do you need help testing?

Alex seems to occasionally be able to reproduce this with xfstest 234, but until I have a local reproducer, I'm not going to be able to get to the bottom of the problem. Perhaps you could have created your new filesystem with a larger log to work around the problem?
This was also my initial idea. I did not proceed since the man page states that "the size suboption is only needed if the log section of the filesystem should occupy less space than the size of the special file". OTOH it is worth a try.
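For anyone who wants to try the larger-log workaround suggested above, mkfs.xfs takes an explicit log size through the `-l size=` suboption mentioned in the man page. The sketch below only builds the command line and does not run it, since mkfs.xfs destroys the existing contents of the device. The helper name `mkfs_xfs_command` and the 64 MiB value are assumptions for illustration; nobody in this thread recommended a specific size. For comparison, the current log is 10 MiB (2560 blocks of 4096 bytes).

```python
def mkfs_xfs_command(device, log_size_mib):
    """Hypothetical helper: build (but do not run) a mkfs.xfs command line."""
    if log_size_mib <= 0:
        raise ValueError("log size must be positive")
    # "-l size=<n>m" requests an internal log of n mebibytes.
    # Running the resulting command wipes the device, so back up first.
    return ["mkfs.xfs", "-l", f"size={log_size_mib}m", device]

# 64 MiB is an illustrative value only.
cmd = mkfs_xfs_command("/dev/sda5", 64)
print(" ".join(cmd))  # prints "mkfs.xfs -l size=64m /dev/sda5"
```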
For reference purposes, the problem seems to be fixed:
- http://comments.gmane.org/gmane.comp.file-systems.xfs.general/41907
- http://bugs.debian.org/655353
- https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=17b3847

I am currently testing the new kernel, and so far the bug has not reappeared.
It seems that the above patch really fixed this bug - even under heavy load, the kernel does not stall any more. The bug can be closed.