10939 – lockd error causes system hang requiring server reboot

Status:	REJECTED DOCUMENTED

Alias:	None

Product:	File System
Classification:	Unclassified
Component:	NFS
Hardware:	All Linux

Importance:	P1 high
Assignee:	Trond Myklebust

URL:
Keywords:

Duplicates (1):	10938
Depends on:
Blocks:

Reported:	2008-06-19 12:11 UTC by Bryan Hockey
Modified:	2008-09-26 06:44 UTC
CC List:	1 user

See Also:
Kernel Version:	2.6.18-92.1.1.el5
Subsystem:
Regression:	---
Bisected commit-id:

Attachments
dmesg output (plaintext) (119.97 KB, text/plain) 2008-06-19 14:30 UTC, Bryan Hockey

Description Bryan Hockey 2008-06-19 12:11:00 UTC

Note: Sorry for the mistakes, but the "Bug Filing FAQ" produced a 404
Latest working kernel version:
Earliest failing kernel version: earliest known: 2.6.18-92.1.1.el5
Distribution: Red Hat (also: Ubuntu)
Hardware Environment: Dell PowerEdge 2950 (dual Xeon, 8GB RAM), Dell PV220S disk storage
Software Environment: RHEL 5.2
Problem Description:
Intermittently all clients will hang.  This can only be remedied with a server reboot.  This seems to happen across different distros and different kernel versions, hence the post here instead of to Red Hat's bugzilla.

Client- and server-side /var/log/messages both show:
lockd: server <ip> not responding, timed out
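To see how often and against which servers the timeout is logged, the message above can be counted per address. This is a minimal sketch, not part of the original report: the function name `count_lockd_timeouts` is hypothetical, and it assumes the message text appears in the log exactly as quoted, after whatever timestamp and host prefix syslog adds.

```python
import re
from collections import Counter

# Assumed message shape, copied from the line quoted in this report.
# Real syslog lines carry a timestamp/host prefix, so we search, not match.
TIMEOUT_RE = re.compile(r"lockd: server (\S+) not responding, timed out")


def count_lockd_timeouts(lines):
    """Return a Counter mapping server address -> number of timeout messages."""
    hits = Counter()
    for line in lines:
        found = TIMEOUT_RE.search(line)
        if found:
            hits[found.group(1)] += 1
    return hits


if __name__ == "__main__":
    # /var/log/messages is the log file named in this report; reading it
    # usually requires root.
    with open("/var/log/messages", errors="replace") as log:
        for server, count in count_lockd_timeouts(log).most_common():
            print(server, count)
```

Running it on both a client and the server would show whether the timeouts cluster around the times of the hangs.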

Restarting nfs on the server fails at "starting nfs daemon," giving this error:
lockd_down: lockd failed to exit, clearing pid
Doing this also leaves a second instance of [lockd] running, where before there was one.
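The duplicate [lockd] instance can be confirmed by listing the matching PIDs. This is a rough sketch under stated assumptions: `find_lockd_pids` is a hypothetical helper, and it expects two-column "pid comm" text such as the output of `ps -e -o pid= -o comm=`.

```python
def find_lockd_pids(ps_output):
    """Return the PIDs of all processes whose command name is exactly 'lockd'.

    Expects "pid comm" lines (assumed input format, e.g. from
    `ps -e -o pid= -o comm=`); lines that do not fit are ignored.
    """
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].isdigit() and parts[1].strip() == "lockd":
            pids.append(int(parts[0]))
    return pids
```

More than one PID in the result after a failed restart would match the symptom described here.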

Pinging and ssh'ing to the server continue to function throughout.

The bug seems to be a kernel issue, as it has appeared with different kernel versions across different distributions.
This seems to be the same problem, in Ubuntu 2.6.22: https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.22/+bug/181996
That page contains all necessary system messages and significant debugging output, which I'm not going to bother re-posting here.  

Other perhaps related problems:
http://www.mail-archive.com/linux-nfs@vger.kernel.org/msg01373.html
https://bugzilla.redhat.com/show_bug.cgi?id=430160

Steps to reproduce:
According to "the.jxc" on that first link above, "The failure is very regular. It happens whenever the garbage collection is performed as a result of a lock request."  I can't be much more helpful than that.

Let me know if more information is needed, or if this is a duplicate of another submission (my search produced no results).

Comment 1 Trond Myklebust 2008-06-19 12:15:39 UTC

What does 'echo t >/proc/sysrq-trigger' tell you about the hanging lockd process?

Comment 2 Trond Myklebust 2008-06-19 12:17:13 UTC

*** Bug 10938 has been marked as a duplicate of this bug. ***

Comment 3 Bryan Hockey 2008-06-19 14:30:34 UTC

Created attachment 16554
dmesg output (plaintext)

Source: server
NOT during the problem.

Comment 4 Bryan Hockey 2008-06-19 14:32:23 UTC

I just submitted our server dmesg output after entering that command you suggested (I assume I did that right...).  I wasn't sure if you just wanted that or if you wanted it when the issue was actually occurring, so I'll try to grab it the next time it rears its ugly head.

Comment 5 Trond Myklebust 2008-06-19 15:25:16 UTC

Sorry. I should have been more specific.

If the hang occurs again on the server, can you try getting a dump using the
above command?
It looked as if there are a lot of processes running on it, so if you use
dmesg, could you please use 'dmesg -s 90000' instead, in order to get
the maximum information possible? I wasn't able to find 'lockd' at all in the
above output.
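With a large task dump it can help to pull out only the lockd entry. The following is a rough sketch rather than anything from the kernel tree: `extract_task_blocks` is a hypothetical helper, and it assumes each task in the dump starts with a line of the form "<name> <state letter> ..." at column zero with no timestamp prefix. Dump formats differ between kernel versions, so the pattern may need adjusting.

```python
import re

# Assumed header shape: command name, whitespace, one-letter task state.
# This is a guess at the sysrq-t layout, not a documented format.
TASK_HEADER = re.compile(r"^(\S+)\s+([RSDTZX])\s")


def extract_task_blocks(dump_text, name="lockd"):
    """Return one string per task called `name`: header line plus its trace."""
    blocks = []
    current = None  # lines of the block being collected, or None
    for line in dump_text.splitlines():
        header = TASK_HEADER.match(line)
        if header:
            # A new task starts; keep collecting only if it is the one we want.
            current = [line] if header.group(1) == name else None
            if current is not None:
                blocks.append(current)
        elif current is not None:
            current.append(line)
    return ["\n".join(block) for block in blocks]
```

The input would be the text captured with 'dmesg -s 90000' after writing 't' to /proc/sysrq-trigger. An empty result means either that lockd is missing from the dump or that the header assumption does not hold for that kernel.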

Comment 6 bfields 2008-06-20 12:06:27 UTC

What's the most recent kernel you've tried?  Someone on the Ubuntu bug report claims their problem was addressed by two mutex fixes from Trond referenced here:

https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/181996/comments/15

Also, two lockd's shouldn't be permitted to run at once as of d751a7cd0695554498f25d3026ca6710dbb3698f "NLM: Convert lockd to use kthreads".

Comment 7 Jeff Layton 2008-06-20 14:48:09 UTC

> Restarting nfs on the server fails at "starting nfs daemon," giving this
> error:
> lockd_down: lockd failed to exit, clearing pid
> Doing this also leaves a second instance of [lockd] running, where
> before there was one.

Sounds like lockd was hung hard and wouldn't come down for some reason. The problem with this symptom is that there are a lot of different possible causes. There might even be more than one problem at play here.

The stack trace for lockd is going to be essential for tracking down the cause. 

I'm not sure that the patches that Bruce references in comment #6 are relevant for RHEL5. They might be, but I thought that was a lock inversion that was introduced by a patch that's not in RHEL5. I could be wrong on this though.

I'd still recommend opening a case with RH support for the problem in RHEL5.

Comment 8 Bryan Hockey 2008-07-14 09:25:12 UTC

Sorry for the lack of a response.  At this point I think the issue is fixed, via the installation of this firmware update here: 
SAS RAID: Dell SAS 6/iR Integrated, Firmware, Multi Language, Multi System, v.00.20.48.00.06.14.10.00 , A03
SAS 6/iR Integrated Firmware Release Firmware: 00.20.48.00 BIOS: 06.14.10.00 
http://support.dell.com/support/downloads/download.aspx?c=us&l=en&s=gen&releaseid=R176968&SystemID=PWE_2950&servicetag=8062LF1&os=WNET&osl=en&deviceid=13856&devlib=0&typecnt=0&vercnt=2&catid=-1&impid=-1&formatcnt=5&libid=46&fileid=241280
