Bug 16278 - lvm snapshot causes deadlock in 2.6.35
Summary: lvm snapshot causes deadlock in 2.6.35
Alias: None
Product: File System
Classification: Unclassified
Component: ext4
Hardware: All Linux
Importance: P1 normal
Assignee: Eric Sandeen
Depends on:
Blocks: 16055
Reported: 2010-06-23 16:55 UTC by Phillip Susi
Modified: 2010-08-29 22:57 UTC
CC List: 3 users

See Also:
Kernel Version: 2.6.35-rc[123]
Regression: Yes
Bisected commit-id:

Proposed patch (1.57 KB, application/octet-stream)
2010-06-24 15:27 UTC, Eric Sandeen

Description Phillip Susi 2010-06-23 16:55:38 UTC
Copying from: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/595489

Attempting to snapshot the root lv causes a deadlock in 2.6.35 when it suspends the root lv device to replace the table. The lvcreate -n snap -s -L 1g lv/root command hangs and cannot be killed, no further I/O is possible, and the system must be hard rebooted with magic SysRq.

Narrowed down the cause to commit 6b0310fbf087ad6: "ext4: don't return to userspace after freezing the fs with a mutex held". After reverting this change, the problem goes away.
Comment 1 Eric Sandeen 2010-06-24 15:27:57 UTC
Created attachment 26933
Proposed patch

I've sent this patch upstream. I think it should solve the most egregious problem here, though I think other deadlocks remain (they look unrelated to the commit mentioned in the original comment).
Comment 2 Phillip Susi 2010-06-25 13:40:01 UTC
I tested this patch last night and it seemed to fix it.
Comment 3 Eric Sandeen 2010-06-25 14:41:07 UTC
Good deal.

FWIW there is at least one other deadlock possible; it seems that the flushing during freeze isn't pushing all transactions out (or something...) and writeback tries to do more post-freeze.  This takes s_umount and gets stopped in jbd, and then thaw wants s_umount to unfreeze the fs.  This results in a) an inconsistent snapshot, and b) a stuck unfreeze (or snapshot create, which does unfreeze post-snap).  I'm working on that.
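
[Illustration] The lock ordering described above can be sketched as a toy model in user space. This is not kernel code: the function and variable names are hypothetical, a plain Python lock stands in for s_umount, and timeouts are added only so the example returns instead of hanging.

```python
import threading


def simulate_freeze_deadlock(timeout=0.2):
    """Toy model of the writeback/thaw ordering; all names are hypothetical."""
    s_umount = threading.Lock()        # stands in for the superblock's s_umount
    thawed = threading.Event()         # set once the filesystem is unfrozen
    writeback_holds_lock = threading.Event()
    result = {}

    def writeback():
        # Writeback takes s_umount, then blocks in the journal layer
        # because the filesystem is still frozen.
        with s_umount:
            writeback_holds_lock.set()
            # Wait longer than thaw's timeout so the lock stays held.
            result["writeback_finished"] = thawed.wait(timeout * 3)

    def thaw():
        # Make sure writeback already owns the lock before trying to thaw.
        writeback_holds_lock.wait()
        # Thaw needs s_umount before it can unfreeze the filesystem.
        got_lock = s_umount.acquire(timeout=timeout)
        result["thaw_got_lock"] = got_lock
        if got_lock:
            thawed.set()
            s_umount.release()

    threads = [threading.Thread(target=writeback),
               threading.Thread(target=thaw)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


if __name__ == "__main__":
    print(simulate_freeze_deadlock())
```

In this model neither side makes progress: thaw never gets the lock, so writeback is never released. With the timeouts removed, both threads would block forever, which corresponds to the stuck unfreeze described above.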
Comment 4 Eric Sandeen 2010-06-28 19:58:03 UTC
Just to be clear, the patch in comment #1 should resolve the actual regression. The blathering in comment #3 needs further scrutiny, but I think it is a symptom of a problem which has existed for a while.

So let's not hold up on the primary fix here.
Comment 5 Rafael J. Wysocki 2010-06-28 21:08:28 UTC
Handled-By : Eric Sandeen <sandeen@redhat.com>
Patch : https://bugzilla.kernel.org/attachment.cgi?id=26933
Comment 6 Phillip Susi 2010-07-13 17:17:25 UTC
It seems this patch still has not been applied to Linus's tree.  Any idea why?
Comment 7 Rafael J. Wysocki 2010-08-01 13:38:56 UTC
@Eric: Any chance to push the patch upstream?
Comment 8 Rafael J. Wysocki 2010-08-29 22:57:16 UTC
Fixed by commit 437f88cc031ffe7f37f3e705367f4fe1f4be8b0f.
