Most recent kernel where this bug did not occur:
Distribution: Fedora Core release 3 (Heidelberg)
Hardware Environment:
Software Environment: File system mounted with "mount -t nfs". No specific option.

Problem Description:
The client sends an NLM_LOCK request for a blocking range lock. The server cannot grant this lock because it is already held by another client, so it replies with BLOCKED status. The client is expected to wait for the server to send an NLM_GRANTED callback. Instead, the client loops, retrying the NLM_LOCK request (it resends about 0.3 ms after the BLOCKED reply is received).

Steps to reproduce:
I use fcntl() to take the blocking range lock from 2 different processes. Following is a snoop trace of the issue:

// take primary lock
13 41:04.87460 newlnxjlr->bx19s2_r8 NLM C LOCK4 OH=3412 FH=44E2 PID=5064 Region=10:20
17 41:04.88221 bx19s2_r8->newlnxjlr NLM R LOCK4 OH=3412 granted
// set a pending lock from another process
19 41:06.42073 newlnxjlr->bx19s2_r8 NLM C LOCK4 OH=3512 FH=44E2 PID=5065 Region=10:20
20 41:06.42118 bx19s2_r8->newlnxjlr NLM R LOCK4 OH=3512 blocked
// Unexpected retries (indefinite) from Linux?
22 41:06.42153 newlnxjlr->bx19s2_r8 NLM C LOCK4 OH=3512 FH=44E2 PID=5065 Region=10:20
23 41:06.43169 bx19s2_r8->newlnxjlr NLM R LOCK4 OH=3512 blocked
24 41:06.43204 newlnxjlr->bx19s2_r8 NLM C LOCK4 OH=3512 FH=44E2 PID=5065 Region=10:20
25 41:06.44339 bx19s2_r8->newlnxjlr NLM R LOCK4 OH=3512 blocked
26 41:06.44370 newlnxjlr->bx19s2_r8 NLM C LOCK4 OH=3512 FH=44E2 PID=5065 Region=10:20
27 41:06.45511 bx19s2_r8->newlnxjlr NLM R LOCK4 OH=3512 blocked
28 41:06.45541 newlnxjlr->bx19s2_r8 NLM C LOCK4 OH=3512 FH=44E2 PID=5065 Region=10:20
29 41:06.46683 bx19s2_r8->newlnxjlr NLM R LOCK4 OH=3512 blocked
...
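For anyone trying to reproduce this, the two-process scenario above could be sketched roughly as follows. This is not the reporter's test program (which was not attached); it is a minimal Python stand-in that uses fcntl.lockf(), a wrapper around the fcntl() byte-range locking calls. The function name contend, the hold_seconds parameter, and the reading of "Region=10:20" as offset 10, length 20 are all assumptions made for illustration.

```python
import fcntl
import os
import time

# Assumption: "Region=10:20" in the trace means offset 10, length 20.
START, LENGTH = 10, 20


def contend(path, hold_seconds=1.0):
    """Hold a range lock, then return how long a second process waited for it."""
    with open(path, "r+b") as holder:
        # Primary lock: blocking request, granted immediately.
        fcntl.lockf(holder, fcntl.LOCK_EX, LENGTH, START, os.SEEK_SET)
        read_end, write_end = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: a second lock owner asking for the same range.
            try:
                os.close(read_end)
                with open(path, "r+b") as waiter:
                    t0 = time.monotonic()
                    # Blocks until the parent unlocks the range.
                    fcntl.lockf(waiter, fcntl.LOCK_EX, LENGTH, START, os.SEEK_SET)
                    waited = time.monotonic() - t0
                os.write(write_end, repr(waited).encode())
            finally:
                os._exit(0)
        os.close(write_end)
        time.sleep(hold_seconds)  # the child's request is pending during this time
        fcntl.lockf(holder, fcntl.LOCK_UN, LENGTH, START, os.SEEK_SET)
        data = os.read(read_end, 64)
        os.close(read_end)
        os.waitpid(pid, 0)
    return float(data)
```

Calling contend() on an existing file on the NFS mount while capturing traffic on the wire should show whether the second request is followed by a single BLOCKED reply or by the repeated NLM_LOCK calls reported here. On a local file system it only demonstrates that the second process blocks until the first one unlocks.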
This is most deliberate, and has _always_ been a feature of NLM. It is needed as a workaround for those servers that may drop NLM requests (see the comment before nlmclnt_lock()).
The other thing that may trigger it is signals. If the userland process has set up alarms or other signals that are not blocked, then the Single UNIX Specification says that we must abort the syscall (and the Linux convention is then to return ERESTARTSYS).
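To illustrate the signal case from the userland side, here is a small sketch showing that a pending alarm does abort a blocking range-lock request. It is only an approximation of the situation described above: Python installs its handlers without SA_RESTART, so the process sees the interruption directly instead of the kernel restarting the call. The names lock_interrupted_by_alarm, Interrupted, and alarm_after are hypothetical.

```python
import fcntl
import os
import signal


class Interrupted(Exception):
    """Raised from the SIGALRM handler to abandon the blocking lock call."""


def _on_alarm(signum, frame):
    raise Interrupted()


def lock_interrupted_by_alarm(path, alarm_after=0.2):
    """Return True if an alarm aborted a blocking lock request in a child."""
    with open(path, "r+b") as holder:
        # Parent keeps bytes 10..29 locked for the whole experiment.
        fcntl.lockf(holder, fcntl.LOCK_EX, 20, 10, os.SEEK_SET)
        pid = os.fork()
        if pid == 0:
            status = 1  # unexpected failure unless changed below
            try:
                signal.signal(signal.SIGALRM, _on_alarm)
                signal.setitimer(signal.ITIMER_REAL, alarm_after)
                with open(path, "r+b") as waiter:
                    try:
                        # Blocking request for the range the parent holds.
                        fcntl.lockf(waiter, fcntl.LOCK_EX, 20, 10, os.SEEK_SET)
                        status = 0  # lock granted (not expected here)
                    except Interrupted:
                        status = 42  # the signal aborted the wait
            finally:
                os._exit(status)
        _, wait_status = os.waitpid(pid, 0)
        return os.WEXITSTATUS(wait_status) == 42
```

If the application in this report arms a timer or receives other unblocked signals while waiting for the lock, each interruption would, per the explanation above, end the pending call and lead to a fresh request.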
I'm a bit surprised by such quick retries, but that's OK. Thanks!
Please consider that immediate retries can flood the server and are not in accordance with the protocol. The client is supposed to wait for a server callback. If some unfair servers may drop the blocked request, the client should perform retries, but no faster than once every 5 or 10 milliseconds. Here we can see a delay of only about 300 microseconds between the server reply and the client retry; I consider this an immediate retry, and it will cause unnecessary processing on the server side.

Jean-Louis.
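The delay quoted above can be checked directly against the snoop trace. The sketch below takes the timestamps of each "blocked" reply and the NLM_LOCK call that follows it (packets 20/22, 23/24, 25/26 and 27/28 in the report) and prints the gap in microseconds. The helper names to_seconds and gaps_us are made up for this example.

```python
# (timestamp of "blocked" reply, timestamp of the next NLM_LOCK call),
# copied from packets 20/22, 23/24, 25/26 and 27/28 of the trace.
PAIRS = [
    ("41:06.42118", "41:06.42153"),
    ("41:06.43169", "41:06.43204"),
    ("41:06.44339", "41:06.44370"),
    ("41:06.45511", "41:06.45541"),
]


def to_seconds(stamp):
    """Convert a snoop 'MM:SS.fffff' timestamp to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


def gaps_us(pairs):
    """Gap between each reply and the following retry, in microseconds."""
    return [round((to_seconds(retry) - to_seconds(reply)) * 1e6)
            for reply, retry in pairs]


print(gaps_us(PAIRS))  # [350, 350, 310, 300]
```

The gaps are all between 300 and 350 microseconds, which is consistent with the "about 0.3 ms" figure and far below both the 5 to 10 ms floor suggested here and any retry interval measured in seconds.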
The standard retry time is 30 seconds. I suggest looking at a strace log to find out if this is a situation where signals are causing the client to abort the syscall with ERESTARTSYS and then retry.
Please reopen this bug if:
- it is still present in kernel 2.6.17, and
- you can provide the requested information.