Bug 7923 - NFS locking problem
Summary: NFS locking problem
Status: CLOSED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: NFS (show other bugs)
Hardware: i386 Linux
: P2 normal
Assignee: Neil Brown
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-02-02 04:49 UTC by Jan "Yenya" Kasprzak
Modified: 2008-09-23 10:57 UTC (History)
4 users (show)

See Also:
Kernel Version: 2.6.20-rc7
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
x86_64 server, IRIX client (tcpdump -w) (3.31 KB, application/octet-stream)
2007-02-02 04:50 UTC, Jan "Yenya" Kasprzak
Details
i386 server, IRIX client (tcpdump -w) (4.41 KB, application/octet-stream)
2007-02-02 04:51 UTC, Jan "Yenya" Kasprzak
Details
Linux x86_64 server vs. Linux i386 client (tcpdump -w) (6.20 KB, application/octet-stream)
2007-02-02 08:04 UTC, Jan "Yenya" Kasprzak
Details
NLM: Fix double free in __nlm_async_call (1.72 KB, patch)
2007-02-02 15:31 UTC, Trond Myklebust
Details | Diff
RPC: Fix double free in portmapper code (1.19 KB, patch)
2007-02-03 16:06 UTC, Trond Myklebust
Details | Diff

Description Jan "Yenya" Kasprzak 2007-02-02 04:49:02 UTC
Most recent kernel where this bug did *NOT* occur: 2.6.11.10

Distribution: RHEL4 update 3 (x86_64), Fedora 6 (both i386 and x86_64); IRIX 6.5.22

Hardware Environment: HP DL585 (4x Opteron 848), Athlon FX51, Pentium 4 3.0GHz;
SGI Origin2000 (IP27, 8x R12000)

Software Environment: nfs-utils 1.0.6 from RHEL4, also 1.0.10-5 from FC6

Problem Description:
After upgrading the NFS server (HP DL585) from 2.6.11.10 to 2.6.20-rc7 we see
problems with locking from an IRIX 6.5.22 client. After a fresh reboot of both
server and client, the locking sometimes works, but then after some time the
fcntl(,F_SETLK,) just locks up, the process gets stuck in this syscall, and adds
+1 to the system load average on the client. It can be suspended using ^Z on the
terminal, but cannot be killed even with SIGKILL. It mainly affects our users
who read their mail using mutt on the NFS client, because mutt uses the same
syscall for locking. 2.6.11.10 was without problems.

Steps to reproduce:
On a Linux box:
mkdir /tmp/test
ls -l / > /tmp/test/file
echo '/tmp/test  irix.box(ro,sync)' >>/etc/exports
[ start nfs or run exportfs -r ]

On an IRIX box:
mount -t nfs -o vers=3,ro,soft,intr linuxbox:/tmp/test /mnt
cat > lktest.c <<EOF
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

int main ()
{
        struct flock fl;
        int rv;

        fl.l_type   = F_RDLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start  = 0;
        fl.l_len    = 0;

        fprintf(stderr,  "Trying to lock stdin:\n");
        rv = fcntl(0, F_SETLK, &fl);
        if (rv == 1) {
                perror("fcntl(0, F_SETLK, ...)");
                return 1;
        } else {
                fprintf(stderr, "fcntl(0, F_SETLK, ...) == %d\n", rv);
                return 0;
        }
}
EOF

cc -o lktest lktest.c

./lktest < /mnt/file
- it prints "Trying to lock stdin:", but nothing else (i.e. it gets stuck inside
the fcntl call).

Additional info:
- tried both NFSv2 and NFSv3, rw and ro, soft,intr and without soft,intr with
the same results.
- the behaviour does not depend on SMP or UP on the Linux server.
- the behaviour is the same on 32-bit and 64-bit Linux server.
- I will attach two packet dumps of the NFS traffic - one is between the
FC6/x86_64 server and IRIX client, and the other is between FC6/i386 server
and the same IRIX client.
- the client gets stuck in fcntl() even for a freshly created file without any
locks held.
Comment 1 Jan "Yenya" Kasprzak 2007-02-02 04:50:34 UTC
Created attachment 10254 [details]
x86_64 server, IRIX client (tcpdump -w)
Comment 2 Jan "Yenya" Kasprzak 2007-02-02 04:51:44 UTC
Created attachment 10255 [details]
i386 server, IRIX client (tcpdump -w)
Comment 3 Jan "Yenya" Kasprzak 2007-02-02 08:02:48 UTC
Hmm, I have found that the problem is not IRIX-specific - the fcntl() locking
fails also from the Linux client (I will attach tcpdump soon) to the same
server. Apparently it is a problem in lockd on the NFS server side. When I tried
to kill -9 lockd, I've got the following oops:

general protection fault: 0000 [1] SMP 
CPU 0 
Modules linked in: ohci_hcd usbcore i2c_amd756 i2c_core k8temp hwmon qla2xxx
amd74xx ide_core
Pid: 5144, comm: lockd Not tainted 2.6.20-rc7 #1
RIP: 0010:[<ffffffff80334a1e>]  [<ffffffff80334a1e>] nlm_release_host+0x24/0x77
RSP: 0018:ffff8105f1211e50  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 2020202033363542 RCX: ffffffff805cda48
RDX: ffffffff805cda48 RSI: 0000000000000046 RDI: 2020202033363542
RBP: ffff8101547d7c00 R08: ffffffff805cda48 R09: 0000000000000100
R10: 0000000000000000 R11: ffffffff802664da R12: 00000000ffffffdb
R13: 000000000000000a R14: ffff8105f1211eb0 R15: ffffffff804d45a0
FS:  00002b5cafdd4b00(0000) GS:ffffffff8060f000(0000) knlGS:00000000f7fa36c0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000005b3bb4 CR3: 000000035a6e5000 CR4: 00000000000006e0
Process lockd (pid: 5144, threadinfo ffff8105f1210000, task ffff8105f8e4c870)
Stack:  ffff8101547d7c00 ffffffff803339e3 ffff810142abaac0 ffffffff80333d13
 ffff81001ee5dec0 ffff81001ee5dec0 ffff8101547d7c38 ffff8101547d7c00
 ffff81014be90480 7fffffffffffffff 00000000fffffff4 ffffffff80333d63
Call Trace:
 [<ffffffff803339e3>] nlm_release_call+0xd/0x24
 [<ffffffff80333d13>] __nlm_async_call+0x7f/0x8f
 [<ffffffff80333d63>] nlm_async_call+0x40/0x45
 [<ffffffff803365c9>] nlmsvc_retry_blocked+0x172/0x199
 [<ffffffff80335417>] lockd+0xe3/0x227
 [<ffffffff80335334>] lockd+0x0/0x227
 [<ffffffff80335334>] lockd+0x0/0x227
 [<ffffffff80254328>] child_rip+0xa/0x12
 [<ffffffff80335334>] lockd+0x0/0x227
 [<ffffffff8025431e>] child_rip+0x0/0x12


Code: 83 7b 74 00 79 04 0f 0b eb fe f0 ff 4b 74 0f 94 c0 84 c0 74 
RIP  [<ffffffff80334a1e>] nlm_release_host+0x24/0x77
 RSP <ffff8105f1211e50>
Comment 4 Jan "Yenya" Kasprzak 2007-02-02 08:04:19 UTC
Created attachment 10257 [details]
Linux x86_64 server vs. Linux i386 client (tcpdump -w)

It seems the NFS server does not reply to the NLM v4 LOCK call.
Comment 5 Jan "Yenya" Kasprzak 2007-02-02 09:13:29 UTC
I have tried the stable version (2.6.19.2) on the NFS server, and ran
while ./lktest < lktest.c; do date; sleep 1; done
to see when it fails. The locking has failed after ~15 minutes of uptime.
When trying to do "/etc/init.d/nfslock restart", I've got the following oops:


Feb  2 18:17:53 anxur kernel: lockd: couldn't create RPC handle for nymfe27
Feb  2 18:17:53 anxur kernel: lockd: couldn't create RPC handle for nymfe27
Feb  2 18:17:53 anxur kernel: ----------- [cut here ] --------- [please bite her
e ] ---------
Feb  2 18:17:53 anxur kernel: Kernel BUG at fs/lockd/host.c:288
Feb  2 18:17:53 anxur kernel: invalid opcode: 0000 [1] SMP 
Feb  2 18:17:53 anxur kernel: CPU 2 
Feb  2 18:17:53 anxur kernel: Modules linked in: ohci_hcd usbcore i2c_amd756 i2c
_core k8temp hwmon qla2xxx amd74xx ide_core
Feb  2 18:17:53 anxur kernel: Pid: 5233, comm: lockd Not tainted 2.6.19.2 #2
Feb  2 18:17:53 anxur kernel: RIP: 0010:[<ffffffff80337436>]  [<ffffffff80337436
>] nlm_release_host+0x2f/0x94
Feb  2 18:17:53 anxur kernel: RSP: 0018:ffff8105ee4a9d80  EFLAGS: 00010286
Feb  2 18:17:53 anxur kernel: RAX: 00000000ffffffff RBX: ffff8101f8d449c0 RCX: f
fff8101e99bda28
Feb  2 18:17:53 anxur kernel: RDX: ffff8101e4bd8334 RSI: ffffffff803385e5 RDI: f
fff8101f8d449c0
Feb  2 18:17:53 anxur kernel: RBP: ffff8101bcd8f2c0 R08: ffff8105ee4a8000 R09: 0
5000000e5cb3273
Feb  2 18:17:53 anxur kernel: R10: 05000000e5cb3273 R11: ffff8104000efa40 R12: f
fff8101e99bda10
Feb  2 18:17:53 anxur kernel: R13: ffffffff8033a178 R14: 0000000000000000 R15: 0
0000000fffffff4
Feb  2 18:17:53 anxur kernel: FS:  00002b75b4728b00(0000) GS:ffff8104000ef640(00
00) knlGS:00000000f66f9bb0
Feb  2 18:17:53 anxur kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb  2 18:17:53 anxur kernel: CR2: 0000000000459520 CR3: 00000005c0eb5000 CR4: 0
0000000000006e0
Feb  2 18:17:53 anxur nfslock: lockd -KILL succeeded
Feb  2 18:17:53 anxur kernel: Process lockd (pid: 5233, threadinfo ffff8105ee4a8
000, task ffff8105f1d44730)
Feb  2 18:17:53 anxur kernel: Stack:  ffff8101e4bd8000 ffffffff8033640a ffff8101
e99bda40 ffffffff80338660
Feb  2 18:17:53 anxur rpc.statd[4498]: Caught signal 15, un-registering and exit
ing.

Comment 6 Jan "Yenya" Kasprzak 2007-02-02 09:15:45 UTC
Hmm, in comment #5 it seems not all messages got through syslog. I am adding the
dmesg output of the same incident as well:

lockd: couldn't create RPC handle for nymfe27
lockd: couldn't create RPC handle for nymfe27
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at fs/lockd/host.c:288
invalid opcode: 0000 [1] SMP 
CPU 2 
Modules linked in: ohci_hcd usbcore i2c_amd756 i2c_core k8temp hwmon qla2xxx
amd74xx ide_core
Pid: 5233, comm: lockd Not tainted 2.6.19.2 #2
RIP: 0010:[<ffffffff80337436>]  [<ffffffff80337436>] nlm_release_host+0x2f/0x94
RSP: 0018:ffff8105ee4a9d80  EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff8101f8d449c0 RCX: ffff8101e99bda28
RDX: ffff8101e4bd8334 RSI: ffffffff803385e5 RDI: ffff8101f8d449c0
RBP: ffff8101bcd8f2c0 R08: ffff8105ee4a8000 R09: 05000000e5cb3273
R10: 05000000e5cb3273 R11: ffff8104000efa40 R12: ffff8101e99bda10
R13: ffffffff8033a178 R14: 0000000000000000 R15: 00000000fffffff4
FS:  00002b75b4728b00(0000) GS:ffff8104000ef640(0000) knlGS:00000000f66f9bb0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000459520 CR3: 00000005c0eb5000 CR4: 00000000000006e0
Process lockd (pid: 5233, threadinfo ffff8105ee4a8000, task ffff8105f1d44730)
Stack:  ffff8101e4bd8000 ffffffff8033640a ffff8101e99bda40 ffffffff80338660
 ffff8101bcd8f2c0 ffff8101bcd8f2c0 ffffffff803385e5 ffffffff8022f79d
 ffffffff805595a0 ffff8101bcd8f2c0 ffff8101e99bd980 ffffffff80338713
Call Trace:
 [<ffffffff8033640a>] nlm_release_call+0xd/0x2a
 [<ffffffff80338660>] nlmsvc_free_block+0x7b/0x90
 [<ffffffff803385e5>] nlmsvc_free_block+0x0/0x90
 [<ffffffff8022f79d>] kref_put+0x7a/0x85
 [<ffffffff80338713>] nlmsvc_traverse_blocks+0x9e/0xa3
 [<ffffffff8033a178>] nlmsvc_is_client+0x0/0x1e
 [<ffffffff80339e8b>] nlm_traverse_files+0x78/0x21d
 [<ffffffff80273cd2>] __cond_resched+0x1c/0x44
 [<ffffffff80257daa>] cond_resched+0x2e/0x39
 [<ffffffff804b54e3>] svc_recv+0x18f/0x3d8
 [<ffffffff802734f5>] default_wake_function+0x0/0xe
 [<ffffffff80337e40>] lockd+0xcc/0x227
 [<ffffffff80337d74>] lockd+0x0/0x227
 [<ffffffff80337d74>] lockd+0x0/0x227
 [<ffffffff802552b8>] child_rip+0xa/0x12
 [<ffffffff80337d74>] lockd+0x0/0x227
 [<ffffffff802552ae>] child_rip+0x0/0x12


Code: 0f 0b 68 a7 71 4f 80 c2 20 01 f0 ff 4b 74 0f 94 c0 84 c0 74 
RIP  [<ffffffff80337436>] nlm_release_host+0x2f/0x94
 RSP <ffff8105ee4a9d80>
 svc: unknown version (3)
svc: unknown version (3)
svc: unknown version (3)
general protection fault: 0000 [2] SMP 
CPU 2 
Modules linked in: ohci_hcd usbcore i2c_amd756 i2c_core k8temp hwmon qla2xxx
amd74xx ide_core
Pid: 23768, comm: procmail Not tainted 2.6.19.2 #2
RIP: 0010:[<ffffffff80273528>]  [<ffffffff80273528>] __wake_up_common+0x25/0x61
RSP: 0018:ffff810198a85de8  EFLAGS: 00010046
RAX: ffff8101e4bd8008 RBX: 0000000000000001 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8101e4bd8000
RBP: ffff810198a85e18 R08: 38313a3831203220 R09: ffff810400011440
R10: 0000003e4ee31e88 R11: ffff81040005c680 R12: ffff8101e4bd8000
R13: 0000000000000001 R14: 0000000000000246 R15: 0000000000000000
FS:  00002b7776acedc0(0000) GS:ffff8104000ef640(0000) knlGS:00000000f7fcb6c0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003e4ee2e740 CR3: 0000000000201000 CR4: 00000000000006e0
Process procmail (pid: 23768, threadinfo ffff810198a84000, task ffff8101f071ca70)
Stack:  0000000100000000 ffff8101e4bd8000 0000000000000000 0000000000000001
 0000000000000246 0000000000000001 ffff810198a85e58 ffffffff802296ac
 00000000ffffffff 0000000000000001 0000000000000000 ffff8101eef63278
Call Trace:
 [<ffffffff802296ac>] __wake_up+0x35/0x4b
 [<ffffffff8023dd2d>] pipe_release+0x66/0xa9
 [<ffffffff802103bf>] __fput+0x9e/0x179
 [<ffffffff80220258>] filp_close+0x59/0x61
 [<ffffffff80232cd2>] put_files_struct+0x69/0xac
 [<ffffffff80212cef>] do_exit+0x233/0x841
 [<ffffffff8024280a>] cpu_idle+0x0/0x6f
 [<ffffffff802544fe>] system_call+0x7e/0x83


Code: 4d 8b 30 49 39 c0 74 28 49 8d 78 e8 45 8b 68 e8 4c 89 f9 8b 
RIP  [<ffffffff80273528>] __wake_up_common+0x25/0x61
 RSP <ffff810198a85de8>
 <1>Fixing recursive fault but reboot is needed!
Comment 7 Andrew Morton 2007-02-02 13:03:34 UTC
reassign to neilb: this is an NFS server problem, I think.

A pretty serious one too - the kernel oopsed.
Comment 8 Trond Myklebust 2007-02-02 15:31:35 UTC
Created attachment 10267 [details]
NLM: Fix double free in __nlm_async_call

rpc_call_async() will always call rpc_release_calldata(), so it is an
error for __nlm_async_call() to do so as well.

Does this suffice to fix your bug?
Comment 9 Andrew Morton 2007-02-02 15:48:25 UTC
whee, thanks.  Jan, please test this as soon as possible, thanks.
Comment 10 Jan "Yenya" Kasprzak 2007-02-02 17:05:30 UTC
Tried patch in the comment #8. The server now has an uptime of 30 minutes, with
some locking tests running on its NFS clients. I tried "/etc/init.d/nfslock
restart", and it did not crash, and locking still works. So it is definitely an
improvement (thanks for the fast reply!).

However, it is Saturday 02am in this part of the world, so the traffic on the
NFS server is low (not to mention the main NFS load is from students'
workstations, which are not much used during the weekend). I would probably wait
till Monday evening for a final ACK.
Comment 11 Jan "Yenya" Kasprzak 2007-02-03 03:23:03 UTC
The locking problem is probably gone, but there are still problems:
I ran the following testing loop to see whether the locking fails:

while ./lktest < lktest.c; do date; sleep 1; done

on the Linux NFS client (FC6-x86_64, 2.6.20-rc7 kernel), and the loop sometimes
exits with "lktest.c: no such file or directory". The total runtime of the loop
varies somewhere between 5 minutes and 1 hour. When I restart the loop, it
continues for another few minutes between exiting with the same error.

Moreover, the NFS server has oopsed in __d_lookup (see bellow). I have already
seen this oops (see http://lkml.org/lkml/2005/10/22/13 as an example - it was
2.6.14-rc4), which is the reason I still run 2.6.11.10. In fact, every time I
try to upgrade from 2.6.11.10 to something newer, the oops in __d_lookup (called
from the sys_lstat()) appears  usually no later than 1 day since the reboot. We
have already replaced the server mainboard and some other components, so I think
HW failure is unlikely (but who knows). Should I open a separate bug for the
__d_lookup problem?

Unable to handle kernel paging request at 00000000fffffff4 RIP: 
 [<ffffffff80208dea>] __d_lookup+0x72/0x112
PGD 5cea88067 PUD 0 
Oops: 0000 [46] SMP 
CPU 2 
Modules linked in: ohci_hcd usbcore i2c_amd756 i2c_core k8temp hwmon qla2xxx
amd74xx ide_core
Pid: 32250, comm: rsync Not tainted 2.6.20-rc7 #96
RIP: 0010:[<ffffffff80208dea>]  [<ffffffff80208dea>] __d_lookup+0x72/0x112
RSP: 0018:ffff81012f09bbf8  EFLAGS: 00010202
RAX: 00000000fffffff4 RBX: ffff81067d184080 RCX: 0000000000000016
RDX: 0000000000117295 RSI: 018723826d117295 RDI: ffff8103b4ad7188
RBP: 00000000fffffff4 R08: ffff81012f09be48 R09: 0000000000000246
R10: 0000000000000246 R11: 0000000000000246 R12: ffff8103b4ad7188
R13: ffff81012f09bcc8 R14: 00000000bd3c3a0d R15: 0000000000000006
FS:  00002ad0394e23a0(0000) GS:ffff8104000ef640(0000) knlGS:00000000f6ecebb0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000fffffff4 CR3: 00000005b8630000 CR4: 00000000000006e0
Process rsync (pid: 32250, threadinfo ffff81012f09a000, task ffff810138a7d850)
Stack:  0000000000000246 0000000000000000 ffff810578fc9015 0000000000000000
 ffff81012f09be48 ffff8103b4ad55e8 0000000000000000 ffff81012f09be48
 ffff81012f09bcc8 ffffffff8020c3ef ffff81012f09be48 ffff8104002ba580
Call Trace:
 [<ffffffff8020c3ef>] do_lookup+0x2a/0x1b8
 [<ffffffff8020971e>] __link_path_walk+0x894/0xccb
 [<ffffffff80299598>] zone_statistics+0x41/0x63
 [<ffffffff8020dc88>] link_path_walk+0x4c/0xc2
 [<ffffffff8020c25e>] do_path_lookup+0x1b0/0x20c
 [<ffffffff802200ac>] __user_walk_fd+0x37/0x4c
 [<ffffffff80238ce2>] vfs_lstat_fd+0x18/0x47
 [<ffffffff80226723>] sys_newlstat+0x19/0x31
 [<ffffffff8025357e>] system_call+0x7e/0x83


Code: 48 8b 45 00 0f 18 08 48 8d 5d e8 44 39 73 30 75 72 4c 39 63 
RIP  [<ffffffff80208dea>] __d_lookup+0x72/0x112
 RSP <ffff81012f09bbf8>
CR2: 00000000fffffff4
Comment 12 Trond Myklebust 2007-02-03 10:16:31 UTC
Yes, please file the d_lookup() issue as a separate bug. That looks like
dcache corruption rather than an NFS/NLM specific thing.

When you file it, please remember to list the different types of filesystem that
are in use on your machine in addition to which filesystem the "rsync" task was
running on (if you know).
Comment 13 Jan "Yenya" Kasprzak 2007-02-03 12:28:02 UTC
The __d_lookup() problem filled as bug #7929.

I am not sure about the other problem ("no such file or directory" when opening
a file). Maybe it should be filled as a separate bug as well?

I ran "while cat /some/file >/dev/null; do date; sleep 1; done", and the loop
also exited with "/some/file: no such file or directory" after several
iterations (/some/file is NFS-mounted). The NFS client was 2.6.20-rc7 Fedora
6/x86_64 UP, the NFS server this time was different - 2.6.19-rc4 Fedora 5/x86_64
SMP.
Comment 14 Trond Myklebust 2007-02-03 13:24:59 UTC
About the "file not found" issue. Have you turned off subtree checking on the
server using the "no_subtree_check" option? There was an issue (that should be
fixed in newer kernels) whereby lookups on the server would fail if you had
subtree checking enabled.
Comment 15 Jan "Yenya" Kasprzak 2007-02-03 13:42:31 UTC
One of the servers (comment #11) has no_subtree_check option enabled,
the other (comment #13) has not.
Comment 16 Trond Myklebust 2007-02-03 16:06:57 UTC
Created attachment 10275 [details]
RPC: Fix double free in portmapper code

This fixes a similar issue in the kernel's RPC portmapper code.
Comment 17 Andrew Morton 2007-02-03 17:35:44 UTC
Trond, should I whizz that into 2.6.20?

And 2.6.19?
Comment 18 Trond Myklebust 2007-02-03 19:45:09 UTC
Yes. I think these 2 are both nasty enough to warrant fast-tracking into mainline.

AFAICS the NLM problem has existed since at least 2.6.17, whereas the portmapper
problem started with 2.6.19.
Comment 19 Jan "Yenya" Kasprzak 2007-03-26 13:02:16 UTC
In order to work around the Oops from comment #11 (subsequently reported as bug#7929), I have moved all my XFS filesystems to ext3, so I am now able to get
uptimes longer than few hours with something newer than 2.6.11.10. However, in
this longer timeframe, the problems from the original bug report of this bug
(incl. comments #1 and #2) are back. After two days of uptime (2.6.21-rc4, so
patches from the comment #8 and comment #16 are already in), my NFS clients
started to hang on fcntl lock operations. I am even not able to kill -9 lockd,
it just remains running. Restarting NFS services on the server does not help. In
dmesg output there is nothing special, maybe the following lines:

rpcsvc: received unknown control message:-2145049676/-1
rpcsvc: received unknown control message:-2145049676/-1
rpcsvc: received unknown control message:-2145049676/-1
rpcsvc: received unknown control message:-2145049676/-1
rpcsvc: received unknown control message:-2145049676/-1
rpcsvc: received unknown control message:-2145049676/-1
svc: unknown version (0 for prog 100003, nfsd)
rpcsvc: received unknown control message:-2145049676/-1
svc: unknown version (0 for prog 100003, nfsd)
rpcsvc: received unknown control message:-2145049676/-1
rpcsvc: received unknown control message:-2145049676/-1
svc: unknown version (0 for prog 100003, nfsd)
svc: unknown version (0 for prog 100003, nfsd)
rpcsvc: received unknown control message:-2145049676/-1
rpcsvc: received unknown control message:-2145049676/-1
rpcsvc: received unknown control message:-2145049676/-1
svc: unknown version (0 for prog 100003, nfsd)
rpcsvc: received unknown control message:-2145049676/-1
rpcsvc: received unknown control message:-2145049676/-1
rpcsvc: received unknown control message:-2145049676/-1
svc: unknown version (0 for prog 100003, nfsd)

(and the "svc: unknown version" has been there in 2.6.11.10 as well, IIRC).
Since I don't have XFS anymore, I will probably try RHEL4 or 5 kernels.
Comment 20 Natalie Protasevich 2007-07-06 23:03:33 UTC
Jan,
Have you resolved the issues, any updates on the problem?
Thanks.
Comment 21 Jan "Yenya" Kasprzak 2007-07-09 03:13:14 UTC
I have downgraded my system back to the RHEL4 kernel, so I do not see this issue now. I may try the newer kernel probably in August, as I cannot reboot my server now.
Comment 22 Jan "Yenya" Kasprzak 2007-08-30 01:26:40 UTC
Last week we have upgraded the system to RHEL5, and so far it seems that NFS problems are not present in the 2.6.18-8.1.8.el5 kernel.

Note You need to log in before you can comment on or make changes to this bug.