Bug 5956

Summary: Linux client sets block=false in NLM_CANCEL requests
Product: File System
Reporter: Jean-Louis ROCHETTE (rochette_jean-louis)
Component: NFS
Assignee: Trond Myklebust (trondmy)
Status: CLOSED CODE_FIX    
Severity: normal    
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.9-1.667smp
Subsystem:
Regression: ---
Bisected commit-id:
Attachments: Fix arguments to NLM_CANCEL call

Description Jean-Louis ROCHETTE 2006-01-25 07:38:41 UTC
Most recent kernel where this bug did not occur:
Distribution: Fedora Core release 3 (Heidelberg)
Hardware Environment:
Software Environment: File system mounted with "mount -t nfs", with no specific
options.
Problem Description: A process with a blocked byte-range lock request is killed.
The client then sends an NLM_CANCEL request to the server to cancel the blocked
lock request, but it sets "block=false" in that NLM_CANCEL request.
As a result, the cancel request doesn't match the pending lock request, which
has block=true, and the server replies to the NLM_CANCEL with NLM_DENIED status.
The NLM protocol says that the block field in NLM_CANCEL must be set to true.
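
To illustrate why the mismatch leads to a denial, here is a minimal sketch of an exact-match comparison between a pending LOCK and an incoming CANCEL. The struct and function names are hypothetical and simplified (the file handle and caller_name are left out); they are not the kernel's lockd structures or any particular server's code.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified view of the NLM arguments seen in the trace.
 * Illustrative names only; file handle and caller_name are omitted. */
struct nlm_req_sketch {
    bool     block;
    bool     exclusive;
    char     owner[32];   /* e.g. "5065@newlnxjlr" */
    int32_t  svid;
    uint64_t l_offset;
    uint64_t l_len;
};

/* Returns true only if the CANCEL arguments are identical to the
 * pending LOCK arguments, as a strict server might require. */
static bool cancel_matches_lock(const struct nlm_req_sketch *lock,
                                const struct nlm_req_sketch *cancel)
{
    return lock->block == cancel->block &&
           lock->exclusive == cancel->exclusive &&
           lock->svid == cancel->svid &&
           lock->l_offset == cancel->l_offset &&
           lock->l_len == cancel->l_len &&
           strcmp(lock->owner, cancel->owner) == 0;
}
```

With the values from the trace, a CANCEL that differs from the LOCK only in block (false instead of true) fails this comparison, which is consistent with the NLM_DENIED reply.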

Steps to reproduce: I use fcntl() to request the same blocking range lock from
2 different processes, so that the second request blocks. Then I use
"kill -9 <pid>" to kill the process whose lock request is blocked.
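
The locking step can be sketched roughly as follows. This is not the reporter's test program: the helper name is hypothetical, and the exclusive lock type is an assumption taken from "exclusive: Yes" in the trace.

```c
#include <fcntl.h>
#include <unistd.h>

/* Requests an exclusive lock on bytes [start, start + len) of fd and
 * sleeps until the lock is granted. Returns 0 on success, -1 on error
 * (errno is set by fcntl). fd must be open for writing. */
static int lock_range_blocking(int fd, off_t start, off_t len)
{
    struct flock fl = {0};

    fl.l_type   = F_WRLCK;   /* exclusive (write) lock */
    fl.l_whence = SEEK_SET;  /* l_start is relative to start of file */
    fl.l_start  = start;
    fl.l_len    = len;

    /* F_SETLKW waits for a conflicting lock to be released. */
    return fcntl(fd, F_SETLKW, &fl);
}
```

Calling lock_range_blocking(fd, 10, 20) from two processes on the same file of the NFS mount corresponds to l_offset: 10 and l_len: 20 in the trace. The second caller sleeps inside fcntl() until it is killed.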

Frame 1646 (266 bytes on wire, 266 bytes captured)
Internet Protocol, Src: 10.64.220.148 , Dst: 192.168.8.92
Transmission Control Protocol, Src Port: 796 , Dst Port: 59915
Remote Procedure Call, XID:0xfbd662da    // XID++ at each retry
Network Lock Manager Protocol
    V4 Procedure: LOCK (2)
    cookie: <DATA>
        length: 4
        contents: <DATA> // 0x35120000
    block: Yes
    exclusive: Yes
    lock
        caller_name: newlnxjlr
        fh
        owner: <DATA>
            contents: <DATA>   // "5065@newlnxjlr"
        svid: 5065
        l_offset: 10
        l_len: 20
    reclaim: No

Frame 1647 (106 bytes on wire, 106 bytes captured)
Internet Protocol, Src: 192.168.8.92 , Dst: 10.64.220.148
Transmission Control Protocol, Src Port: 59915 , Dst Port: 796
Remote Procedure Reply XID:0xfbd662da
Network Lock Manager Protocol
    V4 Procedure: LOCK (2)
    cookie: <DATA>
        length: 4
        contents: <DATA> // 0x35120000 -> 0x1235
    stat: NLM_BLOCKED (3)

Frame 1648 (258 bytes on wire, 258 bytes captured)
Internet Protocol, Src: 10.64.220.148 , Dst: 192.168.8.92
Transmission Control Protocol, Src Port: 796 , Dst Port: 59915
Remote Procedure Call, XID:0xfcd662da
Network Lock Manager Protocol
    V4 Procedure: CANCEL (3)
    cookie: <DATA>
        length: 4
        contents: <DATA> // 0x36120000 -> 0x1236
    block: No   <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    exclusive: Yes
    lock
        caller_name: newlnxjlr
        fh
        owner: <DATA>
            contents: <DATA>   // "5065@newlnxjlr"
        svid: 5065
        l_offset: 10
        l_len: 20

Frame 1649 (106 bytes on wire, 106 bytes captured)
Internet Protocol, Src: 192.168.8.92 , Dst: 10.64.220.148
Transmission Control Protocol, Src Port: 59915 , Dst Port: 796
Remote Procedure Reply XID:0xfcd662da
Network Lock Manager Protocol
    V4 Procedure: CANCEL (3)
    cookie: <DATA>
        length: 4
        contents: <DATA> // 0x36120000 -> 0x1236
    stat: NLM_DENIED (1)
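
The "->" annotations on the cookie contents appear to read the four cookie bytes as a little-endian 32-bit counter. A minimal sketch of that conversion, assuming this interpretation (the function name is hypothetical):

```c
#include <stdint.h>

/* Interprets 4 cookie bytes, in wire order, as a little-endian
 * 32-bit value. */
static uint32_t cookie_le32(const unsigned char b[4])
{
    return (uint32_t)b[0] |
           ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) |
           ((uint32_t)b[3] << 24);
}
```

Read this way, the bytes 35 12 00 00 of the LOCK call give 0x1235 and the bytes 36 12 00 00 of the CANCEL call give 0x1236, so in this trace the CANCEL carries a different cookie from the LOCK it is meant to cancel.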
Comment 1 Diego Calleja 2006-01-25 08:29:35 UTC
Could you try 2.6.15?
Comment 2 Trond Myklebust 2006-01-25 09:16:22 UTC
Created attachment 7142
Fix arguments to NLM_CANCEL call

 The OpenGroup docs state that the arguments "block", "exclusive" and
 "alock" must exactly match the arguments for the lock call that we are
 trying to cancel.
 Currently, "block" is always set to false, which is wrong.

 See bug# 5956 on bugzilla.kernel.org.

 Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Comment 3 Trond Myklebust 2006-03-04 13:58:25 UTC
Patch was applied to 2.6.16-rc2