Created attachment 24991: Config for kernel 2.6.32.8

Running 2.6.32.8 on SLES10 SP2 on a Sun X4540, built with gcc version 4.1.2 20070115 (SUSE Linux). There is a bonding (802.3ad) interface with the four network interfaces enslaved. Some clients report poor NFS performance. I'm using only NFSv3, without ACLs or quotas.

At first I compiled nfsd into the kernel. When the crash occurred, the system hung and served neither NFS nor smbd. I changed it to a modular configuration. After that it seemed to get better, but performance was still slow and there were occasional disconnections. The log shows:

Feb 11 16:23:05 sStorage kernel: rpc-srv/tcp: nfsd: got error -104 when sending 140 bytes - shutting down socket
Feb 11 16:23:12 sStorage kernel: ------------[ cut here ]------------
Feb 11 16:23:12 sStorage kernel: WARNING: at lib/kref.c:43 kref_get+0x2d/0x30()
Feb 11 16:23:12 sStorage kernel: Hardware name: Sun Fire X4540
Feb 11 16:23:12 sStorage kernel: Modules linked in: nfsd exportfs nfs lockd sunrpc
Feb 11 16:23:12 sStorage kernel: Pid: 6569, comm: nfsd Tainted: G D W 2.6.32.8 #2
Feb 11 16:23:12 sStorage kernel: Call Trace:
Feb 11 16:23:12 sStorage kernel: [<ffffffffa0017306>] ? svc_xprt_free+0x46/0x60 [sunrpc]
Feb 11 16:23:12 sStorage kernel: [<ffffffff811cbf7d>] ? kref_get+0x2d/0x30
Feb 11 16:23:12 sStorage kernel: [<ffffffff8103c887>] warn_slowpath_common+0x87/0xb0
Feb 11 16:23:12 sStorage kernel: [<ffffffff8103c8bf>] warn_slowpath_null+0xf/0x20
Feb 11 16:23:12 sStorage kernel: [<ffffffff811cbf7d>] kref_get+0x2d/0x30
Feb 11 16:23:12 sStorage kernel: [<ffffffffa0017fe6>] svc_recv+0x406/0x860 [sunrpc]
Feb 11 16:23:12 sStorage kernel: [<ffffffffa000ac20>] ? svc_process+0x2a0/0x770 [sunrpc]
Feb 11 16:23:12 sStorage kernel: [<ffffffff81036950>] ? default_wake_function+0x0/0x10
Feb 11 16:23:12 sStorage kernel: [<ffffffffa00927f0>] ? nfsd+0x0/0x140 [nfsd]
Feb 11 16:23:12 sStorage kernel: [<ffffffffa0092881>] nfsd+0x91/0x140 [nfsd]
Feb 11 16:23:12 sStorage kernel: [<ffffffff810511ee>] kthread+0x8e/0xa0
Feb 11 16:23:12 sStorage kernel: [<ffffffff8100cb5a>] child_rip+0xa/0x20
Feb 11 16:23:12 sStorage kernel: [<ffffffff81051160>] ? kthread+0x0/0xa0
Feb 11 16:23:12 sStorage kernel: [<ffffffff8100cb50>] ? child_rip+0x0/0x20
Feb 11 16:23:12 sStorage kernel: ---[ end trace e781cc98ce2aa42e ]---
Feb 11 16:23:12 sStorage kernel: kernel BUG at fs/inode.c:1343!
Feb 11 16:23:12 sStorage kernel: CPU 10
Feb 11 16:23:12 sStorage kernel: Modules linked in: nfsd exportfs nfs lockd sunrpc
Feb 11 16:23:12 sStorage kernel: Process nfsd (pid: 6569, threadinfo ffff880813e9a000, task ffff880813e99950)
Feb 11 16:23:12 sStorage kernel: ffff880813e9bdb0 ffff880613b92800 ffff880813e9bd70 ffffffff813b413a
Feb 11 16:23:12 sStorage kernel: <0> ffff880613b92800 ffffffffa00210c0 ffff880813e9bd90 ffffffffa000c358
Feb 11 16:23:12 sStorage kernel: <0> ffff880813e9bd90 ffff880613b92800 ffff880813e9bdb0 ffffffffa00172fe
Feb 11 16:23:12 sStorage kernel: [<ffffffff813b413a>] sock_release+0x7a/0x80
Feb 11 16:23:12 sStorage kernel: [<ffffffffa000c358>] svc_sock_free+0x48/0x60 [sunrpc]
Feb 11 16:23:12 sStorage kernel: [<ffffffffa00172fe>] svc_xprt_free+0x3e/0x60 [sunrpc]
Feb 11 16:23:12 sStorage kernel: [<ffffffffa00172c0>] ? svc_xprt_free+0x0/0x60 [sunrpc]
Feb 11 16:23:12 sStorage kernel: [<ffffffff811cbf17>] kref_put+0x37/0x70
Feb 11 16:23:12 sStorage kernel: [<ffffffffa0017054>] svc_xprt_put+0x14/0x20 [sunrpc]
Feb 11 16:23:12 sStorage kernel: [<ffffffffa0017267>] svc_xprt_release+0xd7/0xf0 [sunrpc]
Feb 11 16:23:12 sStorage kernel: [<ffffffffa001841a>] svc_recv+0x83a/0x860 [sunrpc]
Feb 11 16:23:12 sStorage kernel: [<ffffffffa000ac20>] ? svc_process+0x2a0/0x770 [sunrpc]
Feb 11 16:23:12 sStorage kernel: [<ffffffff81036950>] ? default_wake_function+0x0/0x10
Feb 11 16:23:12 sStorage kernel: [<ffffffffa00927f0>] ? nfsd+0x0/0x140 [nfsd]
Feb 11 16:23:12 sStorage kernel: [<ffffffffa0092881>] nfsd+0x91/0x140 [nfsd]
Feb 11 16:23:12 sStorage kernel: [<ffffffff810511ee>] kthread+0x8e/0xa0
Feb 11 16:23:12 sStorage kernel: [<ffffffff8100cb5a>] child_rip+0xa/0x20
Feb 11 16:23:12 sStorage kernel: [<ffffffff81051160>] ? kthread+0x0/0xa0
Feb 11 16:23:12 sStorage kernel: [<ffffffff8100cb50>] ? child_rip+0x0/0x20
Feb 11 16:23:12 sStorage kernel: RSP <ffff880813e9bd40>
Feb 11 16:23:12 sStorage kernel: ---[ end trace e781cc98ce2aa42f ]---
Hm, we don't have an fs/nfsd category, so I put it in fs/nfs.
This could be the same as http://marc.info/?t=126349257700007&r=1&w=2, which I haven't figured out yet. Is this the first WARNING you got? I assume you're not using RDMA or Kerberos?
It could be, though I have no information about NFSv4. You assume right: it's just NFSv3 without Kerberos. There is no InfiniBand either, and I suppose there is no point in using RDMA without it. It's just an Ethernet network (bonding with 802.3ad), with Linux clients. I have tried 2.6.32.2, 2.6.32.6 and 2.6.32.8, and all behave the same way. I have now gone back to 2.6.29.6. It's a production system, so on Monday I will know whether it runs better.
*** Bug 15324 has been marked as a duplicate of this bug. ***
Got this oops over the weekend and lost all the exported disks; the clients just got permission denied errors.

Linux echo24 2.6.32.2 #1 SMP Tue Dec 29 09:14:14 WST 2009 x86_64 x86_64 x86_64 GNU/Linux

Feb 20 10:17:19 echo24 kernel: BUG: unable to handle kernel NULL pointer dereference at (null)
Feb 20 10:17:19 echo24 kernel: IP: [<ffffffff811a9086>] _atomic_dec_and_lock+0xa/0x50
Feb 20 10:17:19 echo24 kernel: PGD 0
Feb 20 10:17:19 echo24 kernel: Oops: 0000 [#2] SMP
Feb 20 10:17:19 echo24 kernel: last sysfs file: /sys/devices/platform/i5k_amb.0/temp2_input
Feb 20 10:17:19 echo24 kernel: CPU 7
Feb 20 10:17:19 echo24 kernel: Modules linked in: nfsd exportfs mx_driver(P) mx_mcp(P) nfs lockd nfs_acl auth_rpcgss sunrpc ipv6 ext4 jbd2 crc16 dm_multipath uinput e1000e mptsas mptscsih i2c_i801 iTCO_wdt iTCO_vendor_support pcspkr mptbase i5k_amb i2c_core shpchp hwmon serio_raw ioatdma scsi_transport_sas dca ata_generic [last unloaded: myri10ge]
Feb 20 10:17:19 echo24 kernel: Pid: 2755, comm: rpc.mountd Tainted: P D W 2.6.32.2 #1 X7DWE
Feb 20 10:17:19 echo24 kernel: RIP: 0010:[<ffffffff811a9086>] [<ffffffff811a9086>] _atomic_dec_and_lock+0xa/0x50
Feb 20 10:17:19 echo24 kernel: RSP: 0018:ffff88020ac67c58 EFLAGS: 00010296
Feb 20 10:17:19 echo24 kernel: RAX: 0000000000000021 RBX: 0000000000000000 RCX: 00000000000000eb
Feb 20 10:17:19 echo24 kernel: RDX: 0000000000000000 RSI: ffffffffa01b1a10 RDI: 0000000000000000
Feb 20 10:17:19 echo24 kernel: RBP: ffff88020ac67c68 R08: 00000000000000ec R09: ffff88020ac67ca8
Feb 20 10:17:19 echo24 kernel: R10: ffff88020ac67ca8 R11: 0000000000000000 R12: ffffffffa01b1a10
Feb 20 10:17:19 echo24 kernel: R13: ffffffffa01b0c40 R14: ffff88022f249000 R15: 0000000000000000
Feb 20 10:17:19 echo24 kernel: FS: 00007fb3a2c1f740(0000) GS:ffff8800283c0000(0000) knlGS:0000000000000000
Feb 20 10:17:19 echo24 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 20 10:17:19 echo24 kernel: CR2: 0000000000000000 CR3: 00000001fd0e4000 CR4: 00000000000406e0
Feb 20 10:17:19 echo24 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 20 10:17:19 echo24 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Feb 20 10:17:19 echo24 kernel: Process rpc.mountd (pid: 2755, threadinfo ffff88020ac66000, task ffff88020ade8000)
Feb 20 10:17:19 echo24 kernel: Stack:
Feb 20 10:17:19 echo24 kernel: 0000000000000000 ffffffffa01a0a26 ffff88020ac67c88 ffffffffa01a0819
Feb 20 10:17:19 echo24 kernel: <0> ffff88022f249000 ffff880204a8b780 ffff88020ac67ca8 ffffffffa01a0a48
Feb 20 10:17:19 echo24 kernel: <0> ffff88020ac67cd8 ffff880204a8b798 ffff88020ac67cc8 ffffffff811ab411
Feb 20 10:17:19 echo24 kernel: Call Trace:
Feb 20 10:17:19 echo24 kernel: [<ffffffffa01a0a26>] ? ip_map_put+0x0/0x2e [sunrpc]
Feb 20 10:17:19 echo24 kernel: [<ffffffffa01a0819>] auth_domain_put+0x18/0x54 [sunrpc]
Feb 20 10:17:19 echo24 kernel: [<ffffffffa01a0a48>] ip_map_put+0x22/0x2e [sunrpc]
Feb 20 10:17:19 echo24 kernel: [<ffffffff811ab411>] kref_put+0x43/0x4f
Feb 20 10:17:19 echo24 kernel: [<ffffffffa01a4c1b>] cache_put+0x2d/0x2f [sunrpc]
Feb 20 10:17:19 echo24 kernel: [<ffffffffa01a58a9>] cache_clean+0x1dd/0x1f1 [sunrpc]
Feb 20 10:17:19 echo24 kernel: [<ffffffffa01a5920>] cache_flush+0x23/0x4c [sunrpc]
Feb 20 10:17:19 echo24 kernel: [<ffffffffa031391c>] svc_export_parse+0x52d/0x5ac [nfsd]
Feb 20 10:17:19 echo24 kernel: [<ffffffffa01a4a58>] cache_do_downcall+0x39/0x4e [sunrpc]
Feb 20 10:17:19 echo24 kernel: [<ffffffffa01a5503>] cache_write+0xc8/0x135 [sunrpc]
Feb 20 10:17:19 echo24 kernel: [<ffffffffa01a55a3>] cache_write_procfs+0x19/0x1b [sunrpc]
Feb 20 10:17:19 echo24 kernel: [<ffffffff81132ef8>] proc_reg_write+0x72/0x8c
Feb 20 10:17:19 echo24 kernel: [<ffffffff810ecb19>] vfs_write+0xab/0x105
Feb 20 10:17:19 echo24 kernel: [<ffffffff810ecc37>] sys_write+0x47/0x6f
Feb 20 10:17:19 echo24 kernel: [<ffffffff81010c42>] system_call_fastpath+0x16/0x1b
Feb 20 10:17:19 echo24 kernel: Code: 41 5d c9 c3 55 be 40 00 00 00 48 89 e5 e8 d5 02 00 00 ba 40 00 00 00 83 f8 40 0f 4f c2 c9 c3 90 90 55 48 89 e5 41 54 49 89 f4 53 <8b> 0f 48 89 fb 83 f9 01 74 18 8d 41 ff 48 63 d1 48 63 f0 48 89
Feb 20 10:17:19 echo24 kernel: RIP [<ffffffff811a9086>] _atomic_dec_and_lock+0xa/0x50
Feb 20 10:17:19 echo24 kernel: RSP <ffff88020ac67c58>
Feb 20 10:17:19 echo24 kernel: CR2: 0000000000000000
Feb 20 10:17:19 echo24 kernel: ---[ end trace dacd8fe1ce7d497c ]---
Created attachment 25285: fix refcnt bugs

Could you try the attached patch?
Been running 2.6.33 with the patch for 2 days with no errors to report.
I think I am seeing the same problem. It causes nfsd processes to die: I started the machine with 24 nfsd processes, and a few days later it was down to 8. Could it also be related to using bonding on the Ethernet ports? This machine runs Gentoo's version of the 2.6.32 kernel (2.6.32-gentoo-r7):

Jun 22 17:53:43 server2 kernel: ------------[ cut here ]------------
Jun 22 17:53:43 server2 kernel: WARNING: at lib/kref.c:43 kref_get+0x1b/0x22()
Jun 22 17:53:43 server2 kernel: Hardware name: System Product Name
Jun 22 17:53:43 server2 kernel: Modules linked in: hwmon_vid bonding ns83820 sky2 tg3 libphy atl1e pdc202xx_new r128 siimage asus_atk0110 forcedeth
Jun 22 17:53:43 server2 kernel: Pid: 7307, comm: nfsd Tainted: G D W 2.6.32-gentoo-r7 #1
Jun 22 17:53:43 server2 kernel: Call Trace:
Jun 22 17:53:43 server2 kernel: [<c10294f3>] warn_slowpath_common+0x65/0x7c
Jun 22 17:53:43 server2 kernel: [<c117550d>] ? kref_get+0x1b/0x22
Jun 22 17:53:43 server2 kernel: [<c1029517>] warn_slowpath_null+0xd/0x10
Jun 22 17:53:43 server2 kernel: [<c117550d>] kref_get+0x1b/0x22
Jun 22 17:53:43 server2 kernel: [<c1398f9b>] svc_recv+0x22b/0x689
Jun 22 17:53:43 server2 kernel: [<c10264fa>] ? default_wake_function+0x0/0xd
Jun 22 17:53:43 server2 kernel: [<c111c70d>] nfsd+0x8c/0x10b
Jun 22 17:53:43 server2 kernel: [<c111c681>] ? nfsd+0x0/0x10b
Jun 22 17:53:43 server2 kernel: [<c103cada>] kthread+0x5f/0x64
Jun 22 17:53:43 server2 kernel: [<c103ca7b>] ? kthread+0x0/0x64
Jun 22 17:53:43 server2 kernel: [<c1003c27>] kernel_thread_helper+0x7/0x10
Jun 22 17:53:43 server2 kernel: ---[ end trace 80ce67f68fd830be ]---
Jun 22 17:53:43 server2 kernel: ------------[ cut here ]------------
Jun 22 17:53:43 server2 kernel: klogd 1.4.1, ---------- state change ----------
Jun 22 17:53:43 server2 kernel: kernel BUG at fs/inode.c:1343!
Jun 22 17:53:43 server2 kernel: invalid opcode: 0000 [#3] SMP
Jun 22 17:53:43 server2 kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:09.0/host3/uevent
Jun 22 17:53:43 server2 kernel: Modules linked in: hwmon_vid bonding ns83820 sky2 tg3 libphy atl1e pdc202xx_new r128 siimage asus_atk0110 forcedeth
Jun 22 17:53:43 server2 kernel:
Jun 22 17:53:43 server2 kernel: Pid: 7307, comm: nfsd Tainted: G D W (2.6.32-gentoo-r7 #1) System Product Name
Jun 22 17:53:43 server2 kernel: EIP: 0060:[<c10a0cf1>] EFLAGS: 00010246 CPU: 1
Jun 22 17:53:43 server2 kernel: EIP is at iput+0x13/0x4d
Jun 22 17:53:43 server2 kernel: EAX: d5c19ce8 EBX: d5c19ce8 ECX: f5c17e40 EDX: 00000000
Jun 22 17:53:43 server2 kernel: ESI: 00000000 EDI: f7352030 EBP: f554ff0c ESP: f554ff08
Jun 22 17:53:43 server2 kernel: DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Jun 22 17:53:43 server2 kernel: Process nfsd (pid: 7307, ti=f554e000 task=f7352030 task.ti=f554e000)
Jun 22 17:53:43 server2 kernel: Stack:
Jun 22 17:53:43 server2 kernel: d5c19cc0 f554ff1c c130eaba f62cd200 00000000 f554ff28 c1390973 f62cd200
Jun 22 17:53:43 server2 kernel: <0> f554ff38 c13995f6 f62cd208 c13995ce f554ff48 c11754e9 f54eb000 f62cd200
Jun 22 17:53:43 server2 kernel: <0> f554ff50 c1398aec f554ff64 c1398cb7 f54eb000 fffffff5 f7352030 f554ffa4
Jun 22 17:53:43 server2 kernel: Call Trace:
Jun 22 17:53:43 server2 kernel: [<c130eaba>] ? sock_release+0x49/0x59
Jun 22 17:53:43 server2 kernel: [<c1390973>] ? svc_sock_free+0x37/0x43
Jun 22 17:53:43 server2 kernel: [<c13995f6>] ? svc_xprt_free+0x28/0x33
Jun 22 17:53:43 server2 kernel: [<c13995ce>] ? svc_xprt_free+0x0/0x33
Jun 22 17:53:43 server2 kernel: [<c11754e9>] ? kref_put+0x39/0x42
Jun 22 17:53:43 server2 kernel: [<c1398aec>] ? svc_xprt_put+0x10/0x12
Jun 22 17:53:43 server2 kernel: [<c1398cb7>] ? svc_xprt_release+0xa7/0xaf
Jun 22 17:53:43 server2 kernel: [<c13993ab>] ? svc_recv+0x63b/0x689
Jun 22 17:53:43 server2 kernel: [<c10264fa>] ? default_wake_function+0x0/0xd
Jun 22 17:53:43 server2 kernel: [<c111c70d>] ? nfsd+0x8c/0x10b
Jun 22 17:53:43 server2 kernel: [<c111c681>] ? nfsd+0x0/0x10b
Jun 22 17:53:43 server2 kernel: [<c103cada>] ? kthread+0x5f/0x64
Jun 22 17:53:43 server2 kernel: [<c103ca7b>] ? kthread+0x0/0x64
Jun 22 17:53:43 server2 kernel: [<c1003c27>] ? kernel_thread_helper+0x7/0x10
Jun 22 17:53:43 server2 kernel: Code: 11 89 f0 89 55 f0 e8 6e 25 00 00 8b 55 f0 85 c0 74 d4 5a 5b 5e 5f 5d c3 55 85 c0 89 e5 53 89 c3 74 40 83 b8 38 01 00 00 40 75 04 <0f> 0b eb fe 8d 40 24 ba 28 ca 61 c1 e8 76 2b 0d 00 85 c0 74 22
Jun 22 17:53:43 server2 kernel: EIP: [<c10a0cf1>] iput+0x13/0x4d SS:ESP 0068:f554ff08
Jun 22 17:53:43 server2 kernel: ---[ end trace 80ce67f68fd830bf ]---
Tomorrow morning I will switch a production system to 2.6.34, which seems to be more stable. It uses bonding in 802.3ad (link aggregation) mode.