Bug 41602
| Summary: | DRBD: possible deadlock in Ahead mode | | |
|---|---|---|---|
| Product: | IO/Storage | Reporter: | Cyril B. (cbay) |
| Component: | Other | Assignee: | Lars Ellenberg (lars) |
| Status: | RESOLVED INVALID | | |
| Severity: | normal | CC: | lars |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 3.0.1 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description
Cyril B.
2011-08-23 11:29:37 UTC

--- Comment #1 from Lars Ellenberg (lars) ---

This looks more like a misconfiguration or a misunderstanding, though I have to admit that the call trace does not make much sense to me. DRBD uses two TCP connections. For some reason, sendmsg on the bulk connection makes no progress, even though full round trips on what we call the meta connection work fine. Do you have an MTU problem?

To have DRBD detect and automatically recover from situations like this, you can use the "knock out counter" feature, e.g. add `ko-count 4;` to your drbd.conf. More on this in the DRBD User's Guide:
http://www.drbd.org/users-guide/
http://www.drbd.org/users-guide/re-drbdconf.html

--- Comment #2 from Cyril B. (cbay) 2011-08-24 15:14:47 ---

I can't rule out a misconfiguration, but I've been using DRBD since 8.0 on many servers and I've never seen anything like this happen. It started only very recently, a few days ago, after I began using Linux 3.0 on a few servers.

Even if it is a temporary network failure that I've never had before:

- How come even a `drbdadm disconnect` doesn't work (i.e. hangs), so that I have to reboot the primary server to be able to write to my filesystem again?
- Isn't Ahead mode supposed to not block I/O if the buffer is full and/or the secondary server is not accessible?

--- Lars Ellenberg (reply by email) ---

On Wed, Aug 24, 2011 at 03:14:47PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=41602
>
> --- Comment #2 from Cyril B. <cbay@excellency.fr> 2011-08-24 15:14:47 ---
>
> - how comes even a drbdadm disconnect doesn't work (i.e. hangs), and I have
> to reboot the primary server to be able to write on my filesystem again?

Arguably a design bug: the default disconnect waits for all current requests to be handled.
You can try to --force the disconnect, though.

> - isn't the Ahead mode supposed to not block IO if the buffer is full and/or
> the secondary server is not accessible?

Well, yes, but it still needs to send some bytes (as opposed to the full data block) to the peer. And if sendmsg blocks, times out, or makes no progress for whatever reason, we cannot do anything about that.

--- Cyril B. ---

Thanks, that makes sense.
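For reference, the knock-out counter Lars mentions is a `net`-section option in drbd.conf. A sketch of what that might look like (resource name `r0` is a placeholder; exact syntax and semantics should be checked against the drbd.conf manual for your DRBD version):

```
# drbd.conf -- illustrative fragment, DRBD 8.x style
resource r0 {
  net {
    # If the peer fails to complete a single write request within
    # ko-count * timeout, it is kicked out of the cluster instead
    # of leaving the primary stuck waiting on it.
    ko-count 4;
  }
  # ... remaining resource configuration ...
}
```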
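Lars's point about sendmsg can be illustrated outside DRBD with plain sockets: once the peer stops draining its receive buffer, a blocking send eventually makes no progress at all, and only a timeout (playing the role DRBD's knock-out counter plays) bounds the wait. A minimal Python sketch of that failure mode, not DRBD code:

```python
import socket

# One end never reads: this simulates a peer that is reachable
# but makes no progress draining data (the "stuck bulk connection").
a, b = socket.socketpair()

# Analogous to ko-count: give up after a bounded wait instead of
# hanging forever inside send().
a.settimeout(0.5)

sent = 0
stalled = False
try:
    while True:
        # Succeeds while kernel socket buffers have room, then blocks
        # and finally raises socket.timeout once they are full.
        sent += a.send(b"x" * 65536)
except socket.timeout:
    stalled = True

print(stalled, sent > 0)
```

Without the `settimeout` call, the loop would block indefinitely, which is essentially what the hung primary was experiencing.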