Bug 13621

Summary: xfs hangs with assertion failed
Product: File System Reporter: Johannes Engel (jcnengel)
Component: XFS Assignee: Christoph Hellwig (hch)
Status: CLOSED CODE_FIX    
Severity: high CC: charles, hch, rjw, sandeen
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.30 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 13070    
Attachments: kernel log of the xfs failure
git bisect log
kernel configuration
kernel configuration
kernel log of the xfs failure from 2.6.29-rc5 (after reverting the critical patch)
Patch to fix spinlock

Description Johannes Engel 2009-06-25 10:07:11 UTC
Mounting an external HD is fine, but extensive r/w (rsync) or attempting to unmount the device makes the kernel hang in the following sense: the device stays mounted, no operations are possible, and every process accessing the device becomes unkillable. Apparently no major damage to the fs is caused. Log is attached.
Comment 1 Johannes Engel 2009-06-25 10:09:15 UTC
Created attachment 22092 [details]
kernel log of the xfs failure
Comment 2 Eric Sandeen 2009-06-28 03:47:02 UTC
From Christoph on the mailing list:

> I have no idea how this could trigger correctly.  If you look at
> the callers of xlog_state_want_sync all of them take the log
> less than five lines from the call and never have a branch before
> taking the log and calling the function.

Also note that the machine in question had only 1 CPU.  Stack corruption perhaps?  Is this a 4kstacks kernel?
Comment 3 Eric Sandeen 2009-06-28 03:55:14 UTC
"but extensive r/w (rsync) or attempting to
unmount the device makes the kernel hang in the following sense:"

Do you mean that it hangs on every unmount?  If so maybe I could lazily ask if you'd be willing to bisect it, if it looks like a regression.  :)
Comment 4 Johannes Engel 2009-06-28 11:08:25 UTC
This is not a 4kstacks kernel. I don't know when I will find the time to bisect; right now I am quite busy. But I will keep that in mind.
Comment 5 Johannes Engel 2009-06-28 11:09:09 UTC
(In reply to comment #3)
> Do you mean that it hangs on every unmount?
Yes, it does.
Comment 6 Johannes Engel 2009-07-11 17:29:51 UTC
Created attachment 22312 [details]
git bisect log

I ran a bisect restricting myself to the folder fs/xfs only. That spits out the culprit as commit d2859751cd0bf586941ffa7308635a293f943c17 (see attachment).

If that cannot be the core issue for reasons I do not see, someone might have to run a full bisect for which I do not have enough time at the moment.
Comment 7 Eric Sandeen 2009-07-11 20:38:28 UTC
Thanks for running the bisect, we'll take a look at that commit.
Comment 8 Eric Sandeen 2009-07-11 20:51:18 UTC
Could you add your .config as well, just in case there's something unique there?  I don't immediately see how that commit would affect this problem, but that's just from a quick look.

Thanks,
-Eric
Comment 9 Johannes Engel 2009-07-11 20:53:29 UTC
Created attachment 22316 [details]
kernel configuration

Of course. Thanks for looking into this. :)
Comment 10 Eric Sandeen 2009-07-16 22:12:07 UTC
Seems that kernel config was from 2.6.28, do you have the 2.6.30 version?

Thanks,
-eric
Comment 11 Johannes Engel 2009-07-17 08:33:48 UTC
Created attachment 22387 [details]
kernel configuration

Here is one I compiled 2.6.31-rc3 with. Is that ok?
Comment 12 Eric Sandeen 2009-07-17 14:48:03 UTC
That depends, does the resulting built kernel still hang this way? :)
Comment 13 Johannes Engel 2009-08-06 20:22:24 UTC
Still there with 2.6.31-rc5.
Comment 14 Christoph Hellwig 2009-08-06 20:46:56 UTC
d2859751cd0bf586941ffa7308635a293f943c17 actually got backed out not much later.  Please try revision 3a011a171906a3a51a43bb860fb7c66a64cab140 which is the commit reverting it.
Comment 15 Johannes Engel 2009-08-06 20:53:56 UTC
This revert should be part of 2.6.30-rc1 and thus of 2.6.31-rc5, shouldn't it?
Comment 16 Christoph Hellwig 2009-08-06 21:14:28 UTC
In this case the reverts obviously confused the git bisect results.  So trying the first version with that patch reverted will get us another possible anchor to look for the offending commit.
Comment 17 Johannes Engel 2009-08-07 10:11:23 UTC
Created attachment 22632 [details]
kernel log of the xfs failure from 2.6.29-rc5 (after reverting the critical patch)

I tried commit 3a011a171906a3a51a43bb860fb7c66a64cab140 and the umount still does not work. Nonetheless, the error seems to be a bit different...
Comment 18 Christoph Hellwig 2009-08-10 02:29:05 UTC
Are you running a uni-processor kernel maybe?  From looking around at the implementations I fear that spin_is_locked doesn't work correctly on uni-processor kernels with CONFIG_PREEMPT, although I can't find any definitive documentation on it.

Try replacing the

    ASSERT(spin_is_locked(&log->l_icloglock));

that tripped for you with

    assert_spin_locked(&log->l_icloglock);

Alternatively, try just reverting commit 39e2defe73106ca2e1c85e5286038a0a13f49513, which shouldn't cause a problem but introduced this spin_is_locked assert. Interestingly, it is the only non-negated one in XFS, and one of very few all over the tree.
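
For readers following along, the suspicion in comment 18 can be illustrated with a small userspace model. This is only a sketch, not kernel code: every toy_* name below is hypothetical, and the "UP" behaviour is an assumption modelled on a uniprocessor build that does not track lock state, so that a spin_is_locked-style query always reports "not locked" while an assert_spin_locked-style check does nothing.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock for illustration only; this is NOT the kernel's spinlock_t. */
struct toy_spinlock {
    bool held;
};

static void toy_spin_lock(struct toy_spinlock *lock)   { lock->held = true; }
static void toy_spin_unlock(struct toy_spinlock *lock) { lock->held = false; }

/* SMP-style query: reports the real lock state. */
static bool toy_spin_is_locked_smp(const struct toy_spinlock *lock)
{
    return lock->held;
}

/*
 * UP-style query (assumption): no lock state is tracked, so the
 * query is a constant "not locked" regardless of what the caller did.
 */
static bool toy_spin_is_locked_up(const struct toy_spinlock *lock)
{
    (void)lock;
    return false;
}

/*
 * Models ASSERT(spin_is_locked(lock)).
 * Returns true if the assertion would pass, false if it would trip.
 */
static bool toy_assert_via_is_locked(const struct toy_spinlock *lock, bool up_build)
{
    return up_build ? toy_spin_is_locked_up(lock)
                    : toy_spin_is_locked_smp(lock);
}

/*
 * Models assert_spin_locked(lock).
 * Assumption: on a UP build the check is a no-op and therefore always
 * passes; on an SMP build it checks the real lock state.
 */
static bool toy_assert_spin_locked(const struct toy_spinlock *lock, bool up_build)
{
    return up_build ? true : toy_spin_is_locked_smp(lock);
}
```

In this model, a caller that correctly holds the lock still trips toy_assert_via_is_locked() on a "UP" build, which matches the symptom reported here, whereas toy_assert_spin_locked() passes on both kinds of build.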
Comment 19 Johannes Engel 2009-08-10 08:49:13 UTC
Created attachment 22659 [details]
Patch to fix spinlock

Indeed, I am running a uniprocessor machine. Thanks for the hint, Christoph. The attached patch fixes the issue for me.
Comment 20 Rafael J. Wysocki 2009-08-10 13:45:28 UTC
Handled-By : Christoph Hellwig <hch@lst.de>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=22659
Comment 21 Rafael J. Wysocki 2009-08-12 21:09:53 UTC
Fixed by commit a8914f3a6d72c97328597a556a99daaf5cc288ae .