Most recent kernel where this bug did not occur:
Distribution: Fedora Core release 3 (Heidelberg)
Hardware Environment:
Software Environment: File system mounted with "mount -t nfs". No specific options.

Problem Description:
A process with a BLOCKED range lock request is killed. It sends an NLM_CANCEL request to the server to cancel the blocked range lock request. The problem is that it sets "block=false" in the NLM_CANCEL request, so the cancel request does not match the pending lock request, which has block=true. The server replies to the NLM_CANCEL with DENIED status. The NLM protocol says that the block field in NLM_CANCEL must be set to true.

Steps to reproduce:
I use fcntl() to request a blocking range lock from 2 different processes. Then I use "kill -9 <pid>" to kill the process with the blocking lock request.

Frame 1646 (266 bytes on wire, 266 bytes captured)
Internet Protocol, Src: 10.64.220.148, Dst: 192.168.8.92
Transmission Control Protocol, Src Port: 796, Dst Port: 59915
Remote Procedure Call, XID:0xfbd662da // XID++ at each retry
Network Lock Manager Protocol V4
    Procedure: LOCK (2)
    cookie: <DATA>
        length: 4
        contents: <DATA> // 0x35120000
    block: Yes
    exclusive: Yes
    lock
        caller_name: newlnxjlr
        fh
        owner: <DATA>
            contents: <DATA> // "5065@newlnxjlr"
        svid: 5065
        l_offset: 10
        l_len: 20
    reclaim: No

Frame 1647 (106 bytes on wire, 106 bytes captured)
Internet Protocol, Src: 192.168.8.92, Dst: 10.64.220.148
Transmission Control Protocol, Src Port: 59915, Dst Port: 796
Remote Procedure Reply XID:0xfbd662da
Network Lock Manager Protocol V4
    Procedure: LOCK (2)
    cookie: <DATA>
        length: 4
        contents: <DATA> // 0x35120000 -> 0x1235
    stat: NLM_BLOCKED (3)

Frame 1648 (258 bytes on wire, 258 bytes captured)
Internet Protocol, Src: 10.64.220.148, Dst: 192.168.8.92
Transmission Control Protocol, Src Port: 796, Dst Port: 59915
Remote Procedure Call, XID:0xfcd662da
Network Lock Manager Protocol V4
    Procedure: CANCEL (3)
    cookie: <DATA>
        length: 4
        contents: <DATA> // 0x36120000 -> 0x1236
    block: No <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    exclusive: Yes
    lock
        caller_name: newlnxjlr
        fh
        owner: <DATA>
            contents: <DATA> // "5065@newlnxjlr"
        svid: 5065
        l_offset: 10
        l_len: 20

Frame 1649 (106 bytes on wire, 106 bytes captured)
Internet Protocol, Src: 192.168.8.92, Dst: 10.64.220.148
Transmission Control Protocol, Src Port: 59915, Dst Port: 796
Remote Procedure Reply XID:0xfcd662da
Network Lock Manager Protocol V4
    Procedure: CANCEL (3)
    cookie: <DATA>
        length: 4
        contents: <DATA> // 0x36120000 -> 0x1236
    stat: NLM_DENIED (1)
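For anyone who wants to reproduce this, the locking step can be sketched as a small C helper. This is only an illustration, not the reporter's actual test program: the function name lock_range_blocking is made up here, and the use of an exclusive (write) lock is inferred from "exclusive: Yes" in the trace.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the original report): request an
 * exclusive lock on `len` bytes starting at `start` and block until
 * it is granted. Returns 0 on success, -1 on error (errno is set by
 * fcntl).
 */
int lock_range_blocking(int fd, off_t start, off_t len)
{
    struct flock fl = {0};

    fl.l_type = F_WRLCK;     /* exclusive lock, as in the trace */
    fl.l_whence = SEEK_SET;  /* l_start is an absolute file offset */
    fl.l_start = start;
    fl.l_len = len;

    /* F_SETLKW waits if another process already holds the range. */
    return fcntl(fd, F_SETLKW, &fl);
}
```

Calling lock_range_blocking(fd, 10, 20) from two processes on the same file under the NFS mount should leave the second caller waiting, matching l_offset 10 and l_len 20 in the trace. Killing that second process with "kill -9 <pid>" is what triggers the NLM_CANCEL shown above.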
Could you try 2.6.15?
Created attachment 7142
Fix arguments to NLM_CANCEL call

The OpenGroup docs state that the arguments "block", "exclusive" and "alock" must exactly match the arguments for the lock call that we are trying to cancel. Currently, "block" is always set to false, which is wrong.

See bug #5956 on bugzilla.kernel.org.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
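To illustrate the matching rule described in the patch, here is a toy model in C. It is not the kernel code and not the contents of the attachment: struct lock_args, make_cancel_args and server_cancel are hypothetical names, and the server-side comparison is an assumption inferred from the DENIED reply in the trace. The status values are the ones shown in the trace, plus 0 for a granted reply.

```c
#include <stdbool.h>
#include <stdint.h>

/* Status values: 1 and 3 appear in the trace, 0 is the granted reply. */
enum { NLM_GRANTED = 0, NLM_DENIED = 1, NLM_BLOCKED = 3 };

/* Hypothetical, simplified stand-in for the NLM lock arguments. */
struct lock_args {
    bool     block;
    bool     exclusive;
    int32_t  svid;
    uint64_t l_offset;
    uint64_t l_len;
};

/* Build the CANCEL arguments by copying the original LOCK arguments,
 * so that "block", "exclusive" and the lock itself match exactly. */
struct lock_args make_cancel_args(const struct lock_args *lock)
{
    return *lock;
}

/* Assumed server behaviour: the cancel succeeds only if every field
 * matches the pending blocked request. */
int server_cancel(const struct lock_args *pending,
                  const struct lock_args *cancel)
{
    bool same = pending->block == cancel->block &&
                pending->exclusive == cancel->exclusive &&
                pending->svid == cancel->svid &&
                pending->l_offset == cancel->l_offset &&
                pending->l_len == cancel->l_len;

    return same ? NLM_GRANTED : NLM_DENIED;
}
```

With the values from the trace (block=true, exclusive=true, svid 5065, offset 10, length 20), a cancel that forces block to false is denied in this model, while one built with make_cancel_args matches the pending request.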
Patch was applied to 2.6.16-rc2