Bug 10938 - lockd error causes system hang requiring server reboot
Status: CLOSED DUPLICATE of bug 10939
Alias: None
Product: File System
Classification: Unclassified
Component: NFS
Hardware: All Linux
Importance: P1 high
Assignee: Trond Myklebust
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-06-19 11:42 UTC by Bryan Hockey
Modified: 2010-02-03 21:08 UTC

See Also:
Kernel Version: 2.6.18-92.1.1.el5
Subsystem:
Regression: No
Bisected commit-id:


Description Bryan Hockey 2008-06-19 11:42:44 UTC
Note: Sorry for the mistakes, but the "Bug Filing FAQ" produced a 404
Latest working kernel version:
Earliest failing kernel version: 2.6.18-92.1.1.el5 (earliest known)
Distribution: Red Hat (also: Ubuntu)
Hardware Environment: Dell PowerEdge 2950 (dual Xeon, 8GB RAM), Dell PV220S disk storage
Software Environment: RHEL 5.2
Problem Description:
Intermittently, all NFS clients hang, and only a server reboot clears the condition.  This seems to happen across different distros and different kernel versions, hence the report here instead of in Red Hat's Bugzilla.

Client- and server-side /var/log/messages both show:
lockd: server <ip> not responding, timed out

Restarting nfs on the server fails at "starting nfs daemon", giving this error:
lockd_down: lockd failed to exit, clearing pid
The restart attempt also leaves a second instance of [lockd] running, where before there was one.

Pinging and ssh'ing to the server continue to function throughout.
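
For anyone trying to confirm the same symptom, here is a minimal sketch that counts the lockd timeout message quoted above, per server, in a syslog-style file. The regular expression is an assumption based only on the message as quoted in this report (the shape of the server field may differ on other systems), and the function name count_lockd_timeouts is hypothetical.

```python
import re
from collections import Counter

# Pattern for the timeout message quoted in this report.
# Assumption: the server field is a single whitespace-free token (IP or hostname).
LOCKD_TIMEOUT = re.compile(r"lockd: server (\S+) not responding, timed out")


def count_lockd_timeouts(lines):
    """Return a Counter mapping each server to its number of timeout messages.

    `lines` can be any iterable of strings, such as an open log file.
    """
    hits = Counter()
    for line in lines:
        match = LOCKD_TIMEOUT.search(line)
        if match:
            hits[match.group(1)] += 1
    return hits
```

To use it, pass an open handle for the log in question (for example /var/log/messages on the client or the server) and inspect the returned counts. A non-empty result only shows that the message was logged; it does not by itself establish the cause of the hang.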

The bug seems to be a kernel issue, as it has appeared in different kernel versions across different distributions.
The same problem appears to be reported against Ubuntu's 2.6.22 kernel: https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.22/+bug/181996
That page contains all necessary system messages and significant debugging output, which I'm not going to re-post here.

Other possibly related reports:
http://www.mail-archive.com/linux-nfs@vger.kernel.org/msg01373.html
https://bugzilla.redhat.com/show_bug.cgi?id=430160

Steps to reproduce:
According to "the.jxc" on the Ubuntu Launchpad bug linked above, "The failure is very regular. It happens whenever the garbage collection is performed as a result of a lock request."  I can't be much more helpful than that.

Let me know if more information is needed, or if this is a duplicate of another submission (my search produced no results).
Comment 1 Trond Myklebust 2008-06-19 12:17:13 UTC

*** This bug has been marked as a duplicate of bug 10939 ***
