Bug 10938

Summary: lockd error causes system hang requiring server reboot
Product: File System
Reporter: Bryan Hockey (bryan.hockey)
Component: NFS
Assignee: Trond Myklebust (trondmy)
Status: CLOSED DUPLICATE    
Severity: high    
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.18-92.1.1.el5
Subsystem:
Regression: No
Bisected commit-id:

Description Bryan Hockey 2008-06-19 11:42:44 UTC
Note: Sorry for the mistakes, but the "Bug Filing FAQ" produced a 404
Latest working kernel version:
Earliest failing kernel version: earliest known: 2.6.18-92.1.1.el5
Distribution: Red Hat (also: Ubuntu)
Hardware Environment: Dell PowerEdge 2950 (dual Xeon, 8GB RAM), Dell PV220S disk storage
Software Environment: RHEL 5.2
Problem Description:
Intermittently, all clients hang, and the only remedy is a server reboot.  This seems to happen across different distros and kernel versions, hence the post here instead of on Red Hat's bugzilla.

Client- and server-side /var/log/messages both show:
lockd: server <ip> not responding, timed out
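
To check how often and against which server this message appears, a log scan along the following lines may help. This is only a sketch: the function name find_lockd_timeouts is made up for illustration, and the pattern assumes the message text is exactly as quoted above, with the server address as a single whitespace-free token.

```python
import re

# Matches the message quoted in this report; the server field is captured
# loosely as any run of non-whitespace characters (an assumption).
LOCKD_TIMEOUT = re.compile(r"lockd: server (\S+) not responding, timed out")


def find_lockd_timeouts(lines):
    """Return (line_number, server) pairs for each lockd timeout message."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = LOCKD_TIMEOUT.search(line)
        if match:
            hits.append((number, match.group(1)))
    return hits
```

It could be fed an open file object for /var/log/messages (reading that file usually requires root), on the client and on the server, to compare when the timeouts start on each side.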

Restarting nfs on the server fails at "starting nfs daemon", giving this error:
lockd_down: lockd failed to exit, clearing pid
Doing this also leaves a second instance of [lockd] running, where before there was one.

Pinging and ssh'ing to the server continue to function throughout.

The bug seems to be a kernel issue, as it has appeared across different distributions and kernel versions.
This seems to be the same problem, in Ubuntu 2.6.22: https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.22/+bug/181996
That page contains all necessary system messages and significant debugging output, which I'm not going to bother re-posting here.  

Other perhaps related problems:
http://www.mail-archive.com/linux-nfs@vger.kernel.org/msg01373.html
https://bugzilla.redhat.com/show_bug.cgi?id=430160

Steps to reproduce:
According to "the.jxc" on that first link above, "The failure is very regular. It happens whenever the garbage collection is performed as a result of a lock request."  I can't be much more helpful than that.

Let me know if more information is needed, or if this is a duplicate of another submission (my search produced no results).

Comment 1 Trond Myklebust 2008-06-19 12:17:13 UTC

*** This bug has been marked as a duplicate of bug 10939 ***