Note: Sorry for the mistakes, but the "Bug Filing FAQ" produced a 404.

Latest working kernel version:
Earliest failing kernel version: earliest known: 2.6.18-92.1.1.el5
Distribution: Red Hat (also: Ubuntu)
Hardware Environment: Dell PowerEdge 2950 (dual Xeon, 8GB RAM), Dell PV220S disk storage
Software Environment: RHEL 5.2

Problem Description:
Intermittently, all clients hang. This can only be remedied with a server reboot. It seems to happen across different distros and different kernel versions, hence the post here instead of to Red Hat's bugzilla.

Client- and server-side /var/log/messages both show:
lockd: server <ip> not responding, timed out

Restarting nfs on the server fails at "starting nfs daemon," giving this error:
lockd_down: lockd failed to exit, clearing pid

Doing this also causes a second instance of [lockd] to be running, where before there was one. Pinging and ssh'ing to the server continue to function throughout.

The bug seems to be a kernel issue, as it has appeared across different distributions and kernel versions. This seems to be the same problem, in Ubuntu 2.6.22:
https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.22/+bug/181996
That page contains all necessary system messages and significant debugging output, which I'm not going to re-post here.

Other possibly related problems:
http://www.mail-archive.com/linux-nfs@vger.kernel.org/msg01373.html
https://bugzilla.redhat.com/show_bug.cgi?id=430160

Steps to reproduce:
According to "the.jxc" on the Ubuntu bug linked above, "The failure is very regular. It happens whenever the garbage collection is performed as a result of a lock request." I can't be much more helpful than that.

Let me know if more information is needed, or if this is a duplicate of another submission (my search produced no results).
What does 'echo t >/proc/sysrq-trigger' tell you about the hanging lockd process?
*** Bug 10938 has been marked as a duplicate of this bug. ***
Created attachment 16554: dmesg output (plaintext)
Source: server, NOT during the problem.
I just submitted our server dmesg output after entering that command you suggested (I assume I did that right...). I wasn't sure if you just wanted that or if you wanted it when the issue was actually occurring, so I'll try to grab it the next time it rears its ugly head.
Sorry. I should have been more specific. If the hang occurs again on the server, can you try getting a dump using the above command? It looked as if there are a lot of processes running on it, so if you use dmesg, could you please use 'dmesg -s 90000' instead, in order to get the maximum information possible. I wasn't able to find 'lockd' at all in the above output.
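The capture steps described above could be scripted roughly as follows. This is only a sketch under stated assumptions: the function names `task_trace` and `capture_lockd_trace`, the output file `lockd-hang.txt`, and the 25-line context window are illustrative choices, not anything taken from this report. It assumes a root shell on the NFS server while the clients are hung.

```shell
# Sketch only; names and the output file are assumptions for illustration.

# Print every line mentioning a task name plus the lines that follow it,
# which is where that task's call trace appears in a SysRq-t dump.
# Usage: task_trace NAME [LINES] < dump.txt
task_trace() {
    grep -A "${2:-25}" -- "$1"
}

# Run as root on the server while the hang is in progress.
capture_lockd_trace() {
    echo t > /proc/sysrq-trigger       # dump all task states to the kernel log
    dmesg -s 90000 > lockd-hang.txt    # large buffer size, as requested above
    task_trace lockd < lockd-hang.txt  # show the lockd entry and what follows
}
```

Calling `capture_lockd_trace` during a hang leaves the full dump in `lockd-hang.txt` for attaching to the bug. The line count is a guess, so if the trace looks truncated, rerun `task_trace lockd 50 < lockd-hang.txt` against the saved file.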
What's the most recent kernel you've tried? Someone on the Ubuntu bug report claims their problem was addressed by two mutex fixes from Trond referenced here:
https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/181996/comments/15

Also, two lockd's shouldn't be permitted to run at once as of d751a7cd0695554498f25d3026ca6710dbb3698f "NLM: Convert lockd to use kthreads".
> Restarting nfs on the server fails at "starting nfs daemon," giving this error:
> lockd_down: lockd failed to exit, clearing pid
> Doing this also causes a second instance of [lockd] to be running, where
> before there was one.

Sounds like lockd was hung hard and wouldn't come down for some reason. The problem with this symptom is that there are a lot of different possible causes. There might even be more than one problem at play here. The stack trace for lockd is going to be essential for tracking down the cause.

I'm not sure that the patches that Bruce references in comment #6 are relevant for RHEL5. They might be, but I thought that was a lock inversion that was introduced by a patch that's not in RHEL5. I could be wrong on this though. I'd still recommend opening a case with RH support for the problem in RHEL5.
Sorry for the lack of a response. At this point I think the issue is fixed, via the installation of this firmware update:

SAS RAID: Dell SAS 6/iR Integrated, Firmware, Multi Language, Multi System, v.00.20.48.00.06.14.10.00, A03
SAS 6/iR Integrated Firmware Release
Firmware: 00.20.48.00
BIOS: 06.14.10.00

http://support.dell.com/support/downloads/download.aspx?c=us&l=en&s=gen&releaseid=R176968&SystemID=PWE_2950&servicetag=8062LF1&os=WNET&osl=en&deviceid=13856&devlib=0&typecnt=0&vercnt=2&catid=-1&impid=-1&formatcnt=5&libid=46&fileid=241280