Bug 32342

Summary: kernel 2.6.38 for 686 hangs with XFS
Product: File System
Reporter: tprokos+bugs
Component: XFS
Assignee: Christoph Hellwig (hch)
Status: RESOLVED CODE_FIX
Severity: high
CC: david, hch, tprokos+bugs
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.38
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
- kernel log for hang #1
- kernel log for hang #2
- disk info
- xfs_info and dmesg output
- dmesg output after hang
- hang #3

Description tprokos+bugs 2011-03-31 13:38:34 UTC
Created attachment 52752 [details]
kernel log for hang #1

Since the big XFS refactoring, kernels tend to hang on one particular machine. It is not a hardware problem, since the very same machine works flawlessly with the very same configuration when it runs an older (<=2.6.26) kernel.

Especially bad in this respect is the latest 2.6.38 as given in the email subject: The problem appears right after the reboot. I am filing this bug report on linux-image-2.6.37-1-686 which also shows hangs but only 1-2 times per week.

In order to isolate this bug, I am attaching the kernel log for two distinct hangs (with a reboot in between) and disk information. I suspect that the problem occurs only on special types of disks, as I run almost the same configuration on multiple machines (same kernel, XFS formatted disks with the same XFS format version, same partitioning, etc.) without having any problems.

As Maximilian Attems explains in the Debian bug report #620216 (see: http://bugs.debian.org/620216 ) he also noticed similar hangs on an unrelated machine.
Comment 1 tprokos+bugs 2011-03-31 13:39:20 UTC
Created attachment 52762 [details]
kernel log for hang #2
Comment 2 tprokos+bugs 2011-03-31 13:39:50 UTC
Created attachment 52772 [details]
disk info
Comment 3 Dave Chinner 2011-04-01 01:32:16 UTC
Your system is waiting on log space to become available. Can you post the output of xfs_info on the filesystem that is hanging, the contents of dmesg from the time the filesystem is mounted (please include the mount messages), and describe the workload that is running on your system at the time the hangs occur?

Cheers,

Dave.
Comment 4 tprokos+bugs 2011-04-02 13:32:38 UTC
Created attachment 53232 [details]
xfs_info and dmesg output
Comment 5 tprokos+bugs 2011-04-02 13:37:03 UTC
The condition is usually triggered when heavy disk IO is occurring on / (mounted from /dev/sda5).

This heavy disk activity usually stems from running aptitude upgrade on my Debian system, i.e. heavy disk utilization from package installation.

HTH,
Thomas
Comment 6 Dave Chinner 2011-04-02 23:56:38 UTC
Can you please attach the _full_ dmesg output from the time the filesystem was mounted to the error that occurred - filtered versions are useless for diagnosis.

Also for note, it's a filesystem with 15 x 600MB AGs w/ a 10MB v1 log and lazy-count disabled. Looks like an old filesystem structure. Pasting it here so I don't have to keep referring back to an attachment:

xfs_info /
meta-data=/dev/disk/by-uuid/f0c24e97-9607-4f41-a1d2-2f7c70da1aa0 isize=256    agcount=15, agsize=156130 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=2247084, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=2560, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0
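
The sizes quoted above can be checked against the block counts in the xfs_info output. A minimal sketch of that arithmetic in Python, using values copied from the output; the helper name blocks_to_mib is made up for illustration and is not part of any XFS tool:

```python
# Values copied from the xfs_info output above.
BLOCK_SIZE = 4096        # bsize, in bytes
AG_COUNT = 15            # agcount
AG_SIZE_BLOCKS = 156130  # agsize, in blocks
DATA_BLOCKS = 2247084    # data section, in blocks
LOG_BLOCKS = 2560        # internal log, in blocks


def blocks_to_mib(blocks, block_size=BLOCK_SIZE):
    """Convert a count of filesystem blocks to MiB (hypothetical helper)."""
    return blocks * block_size / (1024 * 1024)


ag_mib = blocks_to_mib(AG_SIZE_BLOCKS)
log_mib = blocks_to_mib(LOG_BLOCKS)
data_mib = blocks_to_mib(DATA_BLOCKS)

print(f"AG size:   {ag_mib:.0f} MiB x {AG_COUNT} AGs")
print(f"log size:  {log_mib:.0f} MiB")
print(f"data size: {data_mib:.0f} MiB")
```

The log comes out at exactly 10 MiB and each full AG at roughly 610 MiB, which matches the "15 x 600MB AGs w/ a 10MB v1 log" description.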
Comment 7 tprokos+bugs 2011-04-07 17:40:02 UTC
Created attachment 53802 [details]
dmesg output after hang

Right after the system hangs, there is no dmesg output that points to this hang. Only after some time were the first messages printed (up to and including second 721). During this time, the system was getting less and less responsive; every application that needed to write to the disk started to hang (reading seemed to be fine). At second 841 the remaining messages were printed, and from then on the system was fully responsive again.

HTH, Thomas
Comment 8 tprokos+bugs 2011-04-23 08:37:16 UTC
Created attachment 55092 [details]
hang #3

Another example of a hang.
Any news on this?
Comment 9 tprokos+bugs 2011-07-03 09:28:39 UTC
This particular bug seems unrelated to the log version.
I just recreated the filesystem and restored the data from the backup, but the bug is still there.

Current filesystem info:
# xfs_info /dev/sda5
meta-data=/dev/disk/by-uuid/042d9a74-9941-4aee-a2d8-2abf03e6e8ea isize=256    agcount=4, agsize=561771 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=2247084, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


Any news on this bug? Do you need help testing?
Comment 10 Dave Chinner 2011-07-04 00:34:49 UTC
(In reply to comment #9)
> This particular bug seems unrelated to the log version.
> I just recreated the filesystem, restored the data from the backup but the
> bug is still there.

OK.

> Any news on this bug? Do you need help testing?

Alex seems to occasionally be able to reproduce this with xfstest 234, but until I have a local reproducer, I'm not going to be able to get to the bottom of the problem.

Perhaps you could have created your new filesystem with a larger log to work around the problem?
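
For reference, both xfs_info outputs in this report show the same log geometry apart from the version (blocks=2560 at bsize=4096), so recreating the filesystem did not change the log size. A small Python sketch that extracts the size from the two log lines; the function name is hypothetical, and the parsing assumes the line format shown in those outputs:

```python
import re

# Log lines copied from the two xfs_info outputs in this report.
OLD_LOG = "log      =internal               bsize=4096   blocks=2560, version=1"
NEW_LOG = "log      =internal               bsize=4096   blocks=2560, version=2"


def log_size_bytes(log_line):
    """Return the log size in bytes from an xfs_info 'log' line.

    Hypothetical helper: it assumes the line carries bsize= and blocks=
    fields formatted as in the outputs quoted in this report.
    """
    fields = dict(re.findall(r"(\w+)=(\w+)", log_line))
    return int(fields["bsize"]) * int(fields["blocks"])


old_size = log_size_bytes(OLD_LOG)
new_size = log_size_bytes(NEW_LOG)
print(old_size // 2**20, "MiB ->", new_size // 2**20, "MiB")
```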
Comment 11 tprokos+bugs 2011-07-04 09:21:11 UTC
This was also my initial idea. I did not proceed since the man page states that "the size suboption is only needed if the log section of the filesystem should occupy less space than the size of the special file".

OTOH it is worth a try.
Comment 12 tprokos+bugs 2012-02-06 10:12:16 UTC
For reference purposes, the problem seems to be fixed:
- http://comments.gmane.org/gmane.comp.file-systems.xfs.general/41907
- http://bugs.debian.org/655353
- https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=17b3847

I am currently testing the new kernel, and so far the bug has not reappeared.
Comment 13 tprokos+bugs 2012-02-06 19:18:57 UTC
It seems that the above patch really fixed this bug - even under heavy load, the kernel does not stall any more.

The bug can be closed.