Bug 15274 - NFSD hangs connection
Summary: NFSD hangs connection
Status: RESOLVED OBSOLETE
Alias: None
Product: File System
Classification: Unclassified
Component: NFS
Hardware: All Linux
Importance: P1 normal
Assignee: bfields
URL:
Keywords:
Duplicates: 15324
Depends on:
Blocks:
 
Reported: 2010-02-11 16:35 UTC by Saxa
Modified: 2012-07-05 15:33 UTC
CC List: 8 users

See Also:
Kernel Version: 2.6.32.8
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Config for kernel 2.6.32.8 (11.81 KB, application/octet-stream)
2010-02-11 16:35 UTC, Saxa
fix refcnt bugs (2.02 KB, patch)
2010-03-01 03:50 UTC, bfields

Description Saxa 2010-02-11 16:35:24 UTC
Created attachment 24991 [details]
Config for kernel 2.6.32.8

Running 2.6.32.8 on SLES10 SP2 on a Sun Fire X4540.
gcc version 4.1.2 20070115 (SUSE Linux)

There is a bonding (802.3ad) interface with the 4 network interfaces enslaved.

Some clients report poor NFS performance. I'm using just NFSv3, without ACLs or quotas.

At first I compiled nfsd into the kernel. When the crash occurs, the system hangs and serves neither NFS nor smbd.

I then changed nfsd to a module configuration. That seems to work better, but there is still slow performance and occasional disconnections.

The log shows these warnings:

Feb 11 16:23:05 sStorage kernel: rpc-srv/tcp: nfsd: got error -104 when sending 140 bytes - shutting down socket
Feb 11 16:23:12 sStorage kernel: ------------[ cut here ]------------
Feb 11 16:23:12 sStorage kernel: WARNING: at lib/kref.c:43 kref_get+0x2d/0x30()
Feb 11 16:23:12 sStorage kernel: Hardware name: Sun Fire X4540
Feb 11 16:23:12 sStorage kernel: Modules linked in: nfsd exportfs nfs lockd sunrpc
Feb 11 16:23:12 sStorage kernel: Pid: 6569, comm: nfsd Tainted: G      D W  2.6.32.8 #2
Feb 11 16:23:12 sStorage kernel: Call Trace:
Feb 11 16:23:12 sStorage kernel:  [<ffffffffa0017306>] ? svc_xprt_free+0x46/0x60 [sunrpc]
Feb 11 16:23:12 sStorage kernel:  [<ffffffff811cbf7d>] ? kref_get+0x2d/0x30
Feb 11 16:23:12 sStorage kernel:  [<ffffffff8103c887>] warn_slowpath_common+0x87/0xb0
Feb 11 16:23:12 sStorage kernel:  [<ffffffff8103c8bf>] warn_slowpath_null+0xf/0x20
Feb 11 16:23:12 sStorage kernel:  [<ffffffff811cbf7d>] kref_get+0x2d/0x30
Feb 11 16:23:12 sStorage kernel:  [<ffffffffa0017fe6>] svc_recv+0x406/0x860 [sunrpc]
Feb 11 16:23:12 sStorage kernel:  [<ffffffffa000ac20>] ? svc_process+0x2a0/0x770 [sunrpc]
Feb 11 16:23:12 sStorage kernel:  [<ffffffff81036950>] ? default_wake_function+0x0/0x10
Feb 11 16:23:12 sStorage kernel:  [<ffffffffa00927f0>] ? nfsd+0x0/0x140 [nfsd]
Feb 11 16:23:12 sStorage kernel:  [<ffffffffa0092881>] nfsd+0x91/0x140 [nfsd]
Feb 11 16:23:12 sStorage kernel:  [<ffffffff810511ee>] kthread+0x8e/0xa0
Feb 11 16:23:12 sStorage kernel:  [<ffffffff8100cb5a>] child_rip+0xa/0x20
Feb 11 16:23:12 sStorage kernel:  [<ffffffff81051160>] ? kthread+0x0/0xa0
Feb 11 16:23:12 sStorage kernel:  [<ffffffff8100cb50>] ? child_rip+0x0/0x20
Feb 11 16:23:12 sStorage kernel: ---[ end trace e781cc98ce2aa42e ]---
Feb 11 16:23:12 sStorage kernel: kernel BUG at fs/inode.c:1343!
Feb 11 16:23:12 sStorage kernel: CPU 10
Feb 11 16:23:12 sStorage kernel: Modules linked in: nfsd exportfs nfs lockd sunrpc
Feb 11 16:23:12 sStorage kernel: Process nfsd (pid: 6569, threadinfo ffff880813e9a000, task ffff880813e99950)
Feb 11 16:23:12 sStorage kernel:  ffff880813e9bdb0 ffff880613b92800 ffff880813e9bd70 ffffffff813b413a
Feb 11 16:23:12 sStorage kernel: <0> ffff880613b92800 ffffffffa00210c0 ffff880813e9bd90 ffffffffa000c358
Feb 11 16:23:12 sStorage kernel: <0> ffff880813e9bd90 ffff880613b92800 ffff880813e9bdb0 ffffffffa00172fe
Feb 11 16:23:12 sStorage kernel:  [<ffffffff813b413a>] sock_release+0x7a/0x80
Feb 11 16:23:12 sStorage kernel:  [<ffffffffa000c358>] svc_sock_free+0x48/0x60 [sunrpc]
Feb 11 16:23:12 sStorage kernel:  [<ffffffffa00172fe>] svc_xprt_free+0x3e/0x60 [sunrpc]
Feb 11 16:23:12 sStorage kernel:  [<ffffffffa00172c0>] ? svc_xprt_free+0x0/0x60 [sunrpc]
Feb 11 16:23:12 sStorage kernel:  [<ffffffff811cbf17>] kref_put+0x37/0x70
Feb 11 16:23:12 sStorage kernel:  [<ffffffffa0017054>] svc_xprt_put+0x14/0x20 [sunrpc]
Feb 11 16:23:12 sStorage kernel:  [<ffffffffa0017267>] svc_xprt_release+0xd7/0xf0 [sunrpc]
Feb 11 16:23:12 sStorage kernel:  [<ffffffffa001841a>] svc_recv+0x83a/0x860 [sunrpc]
Feb 11 16:23:12 sStorage kernel:  [<ffffffffa000ac20>] ? svc_process+0x2a0/0x770 [sunrpc]
Feb 11 16:23:12 sStorage kernel:  [<ffffffff81036950>] ? default_wake_function+0x0/0x10
Feb 11 16:23:12 sStorage kernel:  [<ffffffffa00927f0>] ? nfsd+0x0/0x140 [nfsd]
Feb 11 16:23:12 sStorage kernel:  [<ffffffffa0092881>] nfsd+0x91/0x140 [nfsd]
Feb 11 16:23:12 sStorage kernel:  [<ffffffff810511ee>] kthread+0x8e/0xa0
Feb 11 16:23:12 sStorage kernel:  [<ffffffff8100cb5a>] child_rip+0xa/0x20
Feb 11 16:23:12 sStorage kernel:  [<ffffffff81051160>] ? kthread+0x0/0xa0
Feb 11 16:23:12 sStorage kernel:  [<ffffffff8100cb50>] ? child_rip+0x0/0x20
Feb 11 16:23:12 sStorage kernel:  RSP <ffff880813e9bd40>
Feb 11 16:23:12 sStorage kernel: ---[ end trace e781cc98ce2aa42f ]---
Comment 1 Andrew Morton 2010-02-12 19:35:53 UTC
hm, we don't have a fs/nfsd category, so I put it in fs/nfs.
Comment 2 bfields 2010-02-12 20:09:49 UTC
Could be the same as:

http://marc.info/?t=126349257700007&r=1&w=2

which I haven't figured out yet.

Is this the first WARNING you got?  I assume you're not using RDMA or kerberos?
Comment 3 Saxa 2010-02-13 17:48:37 UTC
It could be. I have no information about NFSv4.

You assume right. It's just NFSv3 without kerberos. There is no InfiniBand either, and I suppose there is no sense in using RDMA without it.

Just an ethernet network (bonding with 802.3ad), and the clients are Linux too.

I have tried 2.6.32.[2,6,8] and they all behave the same way.
I have now gone back to 2.6.29.6. It's a production system, and on Monday I will know if it runs better.
Comment 4 bfields 2010-02-16 17:49:19 UTC
*** Bug 15324 has been marked as a duplicate of this bug. ***
Comment 5 Franco Broi 2010-02-22 01:25:18 UTC
Got this oops over the weekend and lost all the exported disks; the clients just got permission denied errors.

Linux echo24 2.6.32.2 #1 SMP Tue Dec 29 09:14:14 WST 2009 x86_64 x86_64 x86_64 GNU/Linux


Feb 20 10:17:19 echo24 kernel: BUG: unable to handle kernel NULL pointer dereference at (null)
Feb 20 10:17:19 echo24 kernel: IP: [<ffffffff811a9086>] _atomic_dec_and_lock+0xa/0x50
Feb 20 10:17:19 echo24 kernel: PGD 0
Feb 20 10:17:19 echo24 kernel: Oops: 0000 [#2] SMP
Feb 20 10:17:19 echo24 kernel: last sysfs file: /sys/devices/platform/i5k_amb.0/temp2_input
Feb 20 10:17:19 echo24 kernel: CPU 7
Feb 20 10:17:19 echo24 kernel: Modules linked in: nfsd exportfs mx_driver(P) mx_mcp(P) nfs lockd nfs_acl auth_rpcgss sunrpc ipv6 ext4 jbd2 crc16 dm_multipath uinput e1000e mptsas mptscsih i2c_i801 iTCO_wdt iTCO_vendor_support pcspkr mptbase i5k_amb i2c_core shpchp hwmon serio_raw ioatdma scsi_transport_sas dca ata_generic [last unloaded: myri10ge]
Feb 20 10:17:19 echo24 kernel: Pid: 2755, comm: rpc.mountd Tainted: P      D W  2.6.32.2 #1 X7DWE
Feb 20 10:17:19 echo24 kernel: RIP: 0010:[<ffffffff811a9086>]  [<ffffffff811a9086>] _atomic_dec_and_lock+0xa/0x50
Feb 20 10:17:19 echo24 kernel: RSP: 0018:ffff88020ac67c58  EFLAGS: 00010296
Feb 20 10:17:19 echo24 kernel: RAX: 0000000000000021 RBX: 0000000000000000 RCX: 00000000000000eb
Feb 20 10:17:19 echo24 kernel: RDX: 0000000000000000 RSI: ffffffffa01b1a10 RDI: 0000000000000000
Feb 20 10:17:19 echo24 kernel: RBP: ffff88020ac67c68 R08: 00000000000000ec R09: ffff88020ac67ca8
Feb 20 10:17:19 echo24 kernel: R10: ffff88020ac67ca8 R11: 0000000000000000 R12: ffffffffa01b1a10
Feb 20 10:17:19 echo24 kernel: R13: ffffffffa01b0c40 R14: ffff88022f249000 R15: 0000000000000000
Feb 20 10:17:19 echo24 kernel: FS:  00007fb3a2c1f740(0000) GS:ffff8800283c0000(0000) knlGS:0000000000000000
Feb 20 10:17:19 echo24 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 20 10:17:19 echo24 kernel: CR2: 0000000000000000 CR3: 00000001fd0e4000 CR4: 00000000000406e0
Feb 20 10:17:19 echo24 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 20 10:17:19 echo24 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Feb 20 10:17:19 echo24 kernel: Process rpc.mountd (pid: 2755, threadinfo ffff88020ac66000, task ffff88020ade8000)
Feb 20 10:17:19 echo24 kernel: Stack:
Feb 20 10:17:19 echo24 kernel: 0000000000000000 ffffffffa01a0a26 ffff88020ac67c88 ffffffffa01a0819
Feb 20 10:17:19 echo24 kernel: <0> ffff88022f249000 ffff880204a8b780 ffff88020ac67ca8 ffffffffa01a0a48
Feb 20 10:17:19 echo24 kernel: <0> ffff88020ac67cd8 ffff880204a8b798 ffff88020ac67cc8 ffffffff811ab411
Feb 20 10:17:19 echo24 kernel: Call Trace:
Feb 20 10:17:19 echo24 kernel: [<ffffffffa01a0a26>] ? ip_map_put+0x0/0x2e [sunrpc]
Feb 20 10:17:19 echo24 kernel: [<ffffffffa01a0819>] auth_domain_put+0x18/0x54 [sunrpc]
Feb 20 10:17:19 echo24 kernel: [<ffffffffa01a0a48>] ip_map_put+0x22/0x2e [sunrpc]
Feb 20 10:17:19 echo24 kernel: [<ffffffff811ab411>] kref_put+0x43/0x4f
Feb 20 10:17:19 echo24 kernel: [<ffffffffa01a4c1b>] cache_put+0x2d/0x2f [sunrpc]
Feb 20 10:17:19 echo24 kernel: [<ffffffffa01a58a9>] cache_clean+0x1dd/0x1f1 [sunrpc]
Feb 20 10:17:19 echo24 kernel: [<ffffffffa01a5920>] cache_flush+0x23/0x4c [sunrpc]
Feb 20 10:17:19 echo24 kernel: [<ffffffffa031391c>] svc_export_parse+0x52d/0x5ac [nfsd]
Feb 20 10:17:19 echo24 kernel: [<ffffffffa01a4a58>] cache_do_downcall+0x39/0x4e [sunrpc]
Feb 20 10:17:19 echo24 kernel: [<ffffffffa01a5503>] cache_write+0xc8/0x135 [sunrpc]
Feb 20 10:17:19 echo24 kernel: [<ffffffffa01a55a3>] cache_write_procfs+0x19/0x1b [sunrpc]
Feb 20 10:17:19 echo24 kernel: [<ffffffff81132ef8>] proc_reg_write+0x72/0x8c
Feb 20 10:17:19 echo24 kernel: [<ffffffff810ecb19>] vfs_write+0xab/0x105
Feb 20 10:17:19 echo24 kernel: [<ffffffff810ecc37>] sys_write+0x47/0x6f
Feb 20 10:17:19 echo24 kernel: [<ffffffff81010c42>] system_call_fastpath+0x16/0x1b
Feb 20 10:17:19 echo24 kernel: Code: 41 5d c9 c3 55 be 40 00 00 00 48 89 e5 e8 d5 02 00 00 ba 40 00 00 00 83 f8 40 0f 4f c2 c9 c3 90 90 55 48 89 e5 41 54 49 89 f4 53 <8b> 0f 48 89 fb 83 f9 01 74 18 8d 41 ff 48 63 d1 48 63 f0 48 89
Feb 20 10:17:19 echo24 kernel: RIP  [<ffffffff811a9086>] _atomic_dec_and_lock+0xa/0x50
Feb 20 10:17:19 echo24 kernel: RSP <ffff88020ac67c58>
Feb 20 10:17:19 echo24 kernel: CR2: 0000000000000000
Feb 20 10:17:19 echo24 kernel: ---[ end trace dacd8fe1ce7d497c ]---
Comment 6 bfields 2010-03-01 03:50:00 UTC
Created attachment 25285 [details]
fix refcnt bugs

Could you try the attached?
Comment 7 Franco Broi 2010-03-04 02:38:51 UTC
Been running 2.6.33 with the patch for 2 days with no errors to report.
Comment 8 simon+kernelbugzilla 2010-06-25 19:27:03 UTC
I think that I am seeing the same problem. It results in nfsd processes dying. I started the machine with 24 nfsd processes and a few days later, it was down to 8 running processes. 

Is it also related to using bonding on the Ethernet ports? 

This is running Gentoo's version of a 2.6.32 kernel (2.6.32-gentoo-r7).

Jun 22 17:53:43 server2 kernel: ------------[ cut here ]------------
Jun 22 17:53:43 server2 kernel: WARNING: at lib/kref.c:43 kref_get+0x1b/0x22()
Jun 22 17:53:43 server2 kernel: Hardware name: System Product Name
Jun 22 17:53:43 server2 kernel: Modules linked in: hwmon_vid bonding ns83820 sky2 tg3 libphy atl1e pdc202xx_new r128 siimage asus_atk0110 forcedeth
Jun 22 17:53:43 server2 kernel: Pid: 7307, comm: nfsd Tainted: G      D W  2.6.32-gentoo-r7 #1
Jun 22 17:53:43 server2 kernel: Call Trace:
Jun 22 17:53:43 server2 kernel:  [<c10294f3>] warn_slowpath_common+0x65/0x7c
Jun 22 17:53:43 server2 kernel:  [<c117550d>] ? kref_get+0x1b/0x22
Jun 22 17:53:43 server2 kernel:  [<c1029517>] warn_slowpath_null+0xd/0x10
Jun 22 17:53:43 server2 kernel:  [<c117550d>] kref_get+0x1b/0x22
Jun 22 17:53:43 server2 kernel:  [<c1398f9b>] svc_recv+0x22b/0x689
Jun 22 17:53:43 server2 kernel:  [<c10264fa>] ? default_wake_function+0x0/0xd
Jun 22 17:53:43 server2 kernel:  [<c111c70d>] nfsd+0x8c/0x10b
Jun 22 17:53:43 server2 kernel:  [<c111c681>] ? nfsd+0x0/0x10b
Jun 22 17:53:43 server2 kernel:  [<c103cada>] kthread+0x5f/0x64
Jun 22 17:53:43 server2 kernel:  [<c103ca7b>] ? kthread+0x0/0x64
Jun 22 17:53:43 server2 kernel:  [<c1003c27>] kernel_thread_helper+0x7/0x10
Jun 22 17:53:43 server2 kernel: ---[ end trace 80ce67f68fd830be ]---
Jun 22 17:53:43 server2 kernel: ------------[ cut here ]------------
Jun 22 17:53:43 server2 kernel: klogd 1.4.1, ---------- state change ----------
Jun 22 17:53:43 server2 kernel: kernel BUG at fs/inode.c:1343!
Jun 22 17:53:43 server2 kernel: invalid opcode: 0000 [#3] SMP
Jun 22 17:53:43 server2 kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:09.0/host3/uevent
Jun 22 17:53:43 server2 kernel: Modules linked in: hwmon_vid bonding ns83820 sky2 tg3 libphy atl1e pdc202xx_new r128 siimage asus_atk0110 forcedeth
Jun 22 17:53:43 server2 kernel:
Jun 22 17:53:43 server2 kernel: Pid: 7307, comm: nfsd Tainted: G      D W  (2.6.32-gentoo-r7 #1) System Product Name
Jun 22 17:53:43 server2 kernel: EIP: 0060:[<c10a0cf1>] EFLAGS: 00010246 CPU: 1
Jun 22 17:53:43 server2 kernel: EIP is at iput+0x13/0x4d
Jun 22 17:53:43 server2 kernel: EAX: d5c19ce8 EBX: d5c19ce8 ECX: f5c17e40 EDX: 00000000
Jun 22 17:53:43 server2 kernel: ESI: 00000000 EDI: f7352030 EBP: f554ff0c ESP: f554ff08
Jun 22 17:53:43 server2 kernel:  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Jun 22 17:53:43 server2 kernel: Process nfsd (pid: 7307, ti=f554e000 task=f7352030 task.ti=f554e000)
Jun 22 17:53:43 server2 kernel: Stack:
Jun 22 17:53:43 server2 kernel:  d5c19cc0 f554ff1c c130eaba f62cd200 00000000 f554ff28 c1390973 f62cd200
Jun 22 17:53:43 server2 kernel: <0> f554ff38 c13995f6 f62cd208 c13995ce f554ff48 c11754e9 f54eb000 f62cd200
Jun 22 17:53:43 server2 kernel: <0> f554ff50 c1398aec f554ff64 c1398cb7 f54eb000 fffffff5 f7352030 f554ffa4
Jun 22 17:53:43 server2 kernel: Call Trace:
Jun 22 17:53:43 server2 kernel:  [<c130eaba>] ? sock_release+0x49/0x59
Jun 22 17:53:43 server2 kernel:  [<c1390973>] ? svc_sock_free+0x37/0x43
Jun 22 17:53:43 server2 kernel:  [<c13995f6>] ? svc_xprt_free+0x28/0x33
Jun 22 17:53:43 server2 kernel:  [<c13995ce>] ? svc_xprt_free+0x0/0x33
Jun 22 17:53:43 server2 kernel:  [<c11754e9>] ? kref_put+0x39/0x42
Jun 22 17:53:43 server2 kernel:  [<c1398aec>] ? svc_xprt_put+0x10/0x12
Jun 22 17:53:43 server2 kernel:  [<c1398cb7>] ? svc_xprt_release+0xa7/0xaf
Jun 22 17:53:43 server2 kernel:  [<c13993ab>] ? svc_recv+0x63b/0x689
Jun 22 17:53:43 server2 kernel:  [<c10264fa>] ? default_wake_function+0x0/0xd
Jun 22 17:53:43 server2 kernel:  [<c111c70d>] ? nfsd+0x8c/0x10b
Jun 22 17:53:43 server2 kernel:  [<c111c681>] ? nfsd+0x0/0x10b
Jun 22 17:53:43 server2 kernel:  [<c103cada>] ? kthread+0x5f/0x64
Jun 22 17:53:43 server2 kernel:  [<c103ca7b>] ? kthread+0x0/0x64
Jun 22 17:53:43 server2 kernel:  [<c1003c27>] ? kernel_thread_helper+0x7/0x10
Jun 22 17:53:43 server2 kernel: Code: 11 89 f0 89 55 f0 e8 6e 25 00 00 8b 55 f0 85 c0 74 d4 5a 5b 5e 5f 5d c3 55 85 c0 89 e5 53 89 c3 74 40 83 b8 38 01 00 00 40 75 04 <0f> 0b eb fe 8d 40 24 ba 28 ca 61 c1 e8 76 2b 0d 00 85 c0 74 22
Jun 22 17:53:43 server2 kernel: EIP: [<c10a0cf1>] iput+0x13/0x4d SS:ESP 0068:f554ff08
Jun 22 17:53:43 server2 kernel: ---[ end trace 80ce67f68fd830bf ]---
Comment 9 Saxa 2010-06-27 17:14:37 UTC
Tomorrow morning I will run 2.6.34 on a production system.

It seems to be more stable.

It uses bonding in 802.3ad link aggregation mode.
