Bug 5954

Summary: Client performs retries of NLM_LOCK request upon BLOCKED status
Product: File System
Reporter: Jean-Louis ROCHETTE (rochette_jean-louis)
Component: NFS
Assignee: Trond Myklebust (trondmy)
Status: REJECTED INSUFFICIENT_DATA
Severity: normal
CC: bunk
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: 2.6.9-1.667smp
Subsystem:
Regression: ---
Bisected commit-id:

Description Jean-Louis ROCHETTE 2006-01-25 06:41:42 UTC
Most recent kernel where this bug did not occur:
Distribution: Fedora Core release 3 (Heidelberg)
Hardware Environment: 
Software Environment: File system mounted with "mount -t nfs". No specific 
options.
Problem Description: The client sends an NLM_LOCK request for a blocking range 
lock. The server cannot grant this lock because it is already held by another 
client, so it replies with BLOCKED status. The client is expected to wait 
for the server to send the NLM_GRANTED callback. Instead, the client loops, 
retrying the NLM_LOCK request (each resend goes out about 0.3 ms after the 
BLOCKED reply is received).

Steps to reproduce: I use fcntl() to take the same blocking range lock from two 
different processes.

Following is a snoop trace of the issue:
// take primary lock
  13 41:04.87460 newlnxjlr->bx19s2_r8 NLM C LOCK4 OH=3412 FH=44E2 PID=5064 
Region=10:20
  17 41:04.88221 bx19s2_r8->newlnxjlr NLM R LOCK4 OH=3412 granted

// set a pending lock from another process
  19 41:06.42073 newlnxjlr->bx19s2_r8 NLM C LOCK4 OH=3512 FH=44E2 PID=5065 
Region=10:20
  20 41:06.42118 bx19s2_r8->newlnxjlr NLM R LOCK4 OH=3512 blocked
 
// Unexpected (indefinite) retries from Linux?
  22 41:06.42153 newlnxjlr->bx19s2_r8 NLM C LOCK4 OH=3512 FH=44E2 PID=5065 
Region=10:20
  23 41:06.43169 bx19s2_r8->newlnxjlr NLM R LOCK4 OH=3512 blocked
  24 41:06.43204 newlnxjlr->bx19s2_r8 NLM C LOCK4 OH=3512 FH=44E2 PID=5065 
Region=10:20
  25 41:06.44339 bx19s2_r8->newlnxjlr NLM R LOCK4 OH=3512 blocked
  26 41:06.44370 newlnxjlr->bx19s2_r8 NLM C LOCK4 OH=3512 FH=44E2 PID=5065 
Region=10:20
  27 41:06.45511 bx19s2_r8->newlnxjlr NLM R LOCK4 OH=3512 blocked
  28 41:06.45541 newlnxjlr->bx19s2_r8 NLM C LOCK4 OH=3512 FH=44E2 PID=5065 
Region=10:20
  29 41:06.46683 bx19s2_r8->newlnxjlr NLM R LOCK4 OH=3512 blocked
 ...
Comment 1 Trond Myklebust 2006-01-25 07:15:14 UTC
This is most deliberate, and has _always_ been a feature of NLM. It is needed as
a workaround for those servers that may drop NLM requests (see the comment
before nlmclnt_lock()).
Comment 2 Trond Myklebust 2006-01-25 07:18:53 UTC
The other thing that may trigger it is signals. If the userland process has set
up alarms or other signals that are not blocked, then the Single UNIX
Specification says that we must abort the syscall (and the Linux convention is
then to return ERESTARTSYS).
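
The signal case can be observed from userland with a sketch like the following. The function lock_with_alarm() and its parameters are hypothetical. It arms a one-shot SIGALRM and then sleeps in F_SETLKW. With sa_flags set to 0 the call should come back with EINTR when the alarm fires. With SA_RESTART, Linux restarts the call transparently, so the process keeps waiting, and on an NFS mount each restart may plausibly show up as a fresh NLM_LOCK request. That last point is an assumption based on the comment above, not something verified here.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* The handler only has to exist so that SIGALRM interrupts the syscall
 * instead of killing the process. */
static void on_alarm(int sig)
{
    (void)sig;
}

/* Wait for a write lock on bytes [start, start + len) of fd while a
 * one-shot SIGALRM is armed to fire after timeout_ms milliseconds.
 * sa_flags is passed to sigaction(): 0 lets the signal abort the call,
 * SA_RESTART asks the kernel to restart it.
 * Returns the fcntl() result; errno is preserved on failure. */
int lock_with_alarm(int fd, off_t start, off_t len, long timeout_ms,
                    int sa_flags)
{
    struct sigaction sa;
    struct itimerval tv;
    struct flock fl;
    int rc, saved;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sa.sa_flags = sa_flags;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) == -1)
        return -1;

    memset(&tv, 0, sizeof tv);
    tv.it_value.tv_sec = timeout_ms / 1000;
    tv.it_value.tv_usec = (timeout_ms % 1000) * 1000;
    if (setitimer(ITIMER_REAL, &tv, NULL) == -1)
        return -1;

    memset(&fl, 0, sizeof fl);
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    rc = fcntl(fd, F_SETLKW, &fl);
    saved = errno;

    /* Disarm the timer in case the lock was granted before it fired. */
    memset(&tv, 0, sizeof tv);
    setitimer(ITIMER_REAL, &tv, NULL);
    errno = saved;
    return rc;
}
```

If a process under test has handlers like this installed with a short timer, frequent signals would be one candidate explanation for very rapid retries.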
Comment 3 Jean-Louis ROCHETTE 2006-01-25 07:48:13 UTC
I'm a bit surprised by such quick retries. But that's OK. Thanks!
Comment 4 Jean-Louis ROCHETTE 2006-01-27 00:19:44 UTC
Please consider that immediate retries can flood the server and are not in 
accordance with the protocol. The client is supposed to wait for a server 
callback. If some misbehaving servers may drop the blocked request, the client 
should perform retries, but no faster than one every 5 to 10 milliseconds.
Here we can see a delay of only 300 microseconds between the server reply and 
the client retry; I think this counts as an immediate retry, and it will cause 
unnecessary processing on the server side.
Jean-Louis.
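
The delays can be checked against the snoop trace in the description with a short calculation. This is only a sketch: it assumes the trace timestamps are minutes:seconds with a fractional part, the packet table is transcribed by hand from frames 20 to 29, and the names trace, gap_us() and print_gaps() are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* One packet from the snoop trace. t_us is microseconds since 41:00,
 * so 41:06.42118 becomes 6421180. */
struct pkt {
    int frame;
    long t_us;
    int is_reply;   /* 1 = server "blocked" reply, 0 = client LOCK4 call */
};

static const struct pkt trace[] = {
    {20, 6421180, 1}, {22, 6421530, 0},
    {23, 6431690, 1}, {24, 6432040, 0},
    {25, 6443390, 1}, {26, 6443700, 0},
    {27, 6455110, 1}, {28, 6455410, 0},
    {29, 6466830, 1},
};

enum { NPKT = sizeof trace / sizeof trace[0] };

/* Microseconds between packet i and the packet that follows it. */
long gap_us(size_t i)
{
    return trace[i + 1].t_us - trace[i].t_us;
}

/* Print every gap, labelled by which side was responsible for it. */
void print_gaps(void)
{
    size_t i;

    for (i = 0; i + 1 < NPKT; i++)
        printf("frame %d -> %d: %ld us (%s)\n",
               trace[i].frame, trace[i + 1].frame, gap_us(i),
               trace[i].is_reply ? "client retry delay"
                                 : "server reply time");
}
```

On these numbers the client resends 300 to 350 microseconds after each blocked reply, while the server takes roughly 10 to 11.4 ms to answer each request.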
Comment 5 Trond Myklebust 2006-01-27 04:03:20 UTC
The standard retry time is 30 seconds.

I suggest looking at a strace log to find out if this is a situation where
signals are causing the client to abort the syscall with ERESTARTSYS and then retry.
Comment 6 Adrian Bunk 2006-08-22 15:01:44 UTC
Please reopen this bug if:
- it is still present in kernel 2.6.17 and
- you can provide the requested information.