Bug 20702 - Kernel bug, possible double free, effecting kernel.org machines
Summary: Kernel bug, possible double free, effecting kernel.org machines
Status: CLOSED OBSOLETE
Alias: None
Product: Memory Management
Classification: Unclassified
Component: Slab Allocator (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Andrew Morton
URL:
Keywords:
: 20672 20682 (view as bug list)
Depends on:
Blocks:
 
Reported: 2010-10-18 19:05 UTC by John 'Warthog9' Hawley
Modified: 2012-05-17 15:56 UTC (History)
4 users (show)

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description John 'Warthog9' Hawley 2010-10-18 19:05:27 UTC
Discovered this on one of the kernel.org machines - it's been happening semi-consistently on a pair of boxes.  Seems like a double free somewhere and at that point the whole box falls over dead basically.

------------[ cut here ]------------
kernel BUG at mm/slub.c:2835!
invalid opcode: 0000 [#1] SMP 
last sysfs file: /sys/kernel/mm/ksm/run
CPU 1 
Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd nfs_acl auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q garp stp llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs exportfs tg3 hpwdt amd64_edac_mod i2c_amd756 i2c_core edac_core shpchp k8temp amd_rng edac_mce_amd microcode pata_acpi ata_generic cciss pata_amd [last unloaded: scsi_wait_scan]

Pid: 1713, comm: snmpd Not tainted 2.6.34.7-56.fc13.x86_64 #1 /ProLiant DL385 G1
RIP: 0010:[<ffffffff811006d6>]  [<ffffffff811006d6>] kfree+0x5e/0xcb
RSP: 0018:ffff8801f6433df8  EFLAGS: 00010246
RAX: 0040000000000400 RBX: ffff8803ed0eb9b0 RCX: ffff8803e9c92340
RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002
RBP: ffff8801f6433e18 R08: ffff8803e9c92958 R09: 0000000000000000
R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002
R13: ffffffff81125e27 R14: ffffffff8115dbdc R15: ffff8803dd19ea80
FS:  00007ff4a31917a0(0000) GS:ffff880207400000(0000) knlGS:00000000f76fa6d0
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff4a31b2000 CR3: 00000001f6455000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process snmpd (pid: 1713, threadinfo ffff8801f6432000, task ffff8801f5688000)
Stack:
 ffff8801f6433e18 ffff8803ed0eb9b0 ffff8803f61ec480 ffff8803ed0eb9b0
<0> ffff8801f6433e48 ffffffff81125e27 ffff8801f6433e38 ffff8803dd19ea80
<0> ffff8803ed0eb9b0 ffff8803e9c92940 ffff8801f6433e78 ffffffff8115dc10
Call Trace:
 [<ffffffff81125e27>] seq_release_private+0x28/0x44
 [<ffffffff8115dc10>] seq_release_net+0x34/0x3d
 [<ffffffff81155ada>] proc_reg_release+0xd3/0xf0
 [<ffffffff8110efbb>] __fput+0x12a/0x1dc
 [<ffffffff8110f087>] fput+0x1a/0x1c
 [<ffffffff8110c0f7>] filp_close+0x68/0x72
 [<ffffffff8110c19e>] sys_close+0x9d/0xd2
 [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7 e8 67 e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b eb fe 48 89 f7 e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e 
RIP  [<ffffffff811006d6>] kfree+0x5e/0xcb
 RSP <ffff8801f6433df8>
---[ end trace 1a4b1fd758dd1fdb ]---
general protection fault: 0000 [#2] SMP 
last sysfs file: /sys/devices/pci0000:00/0000:00:19.3/name
CPU 1 
Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd nfs_acl auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q garp stp llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs exportfs tg3 hpwdt amd64_edac_mod i2c_amd756 i2c_core edac_core shpchp k8temp amd_rng edac_mce_amd microcode pata_acpi ata_generic cciss pata_amd [last unloaded: scsi_wait_scan]

Pid: 16274, comm: snmpd Tainted: G      D    2.6.34.7-56.fc13.x86_64 #1 /ProLiant DL385 G1
RIP: 0010:[<ffffffff8110136b>]  [<ffffffff8110136b>] __kmalloc_track_caller+0xe3/0x14c
RSP: 0018:ffff8800380dbcc8  EFLAGS: 00010006
RAX: 0000000000000000 RBX: 0003000000000000 RCX: 000000000000000b
RDX: 0000000100000000 RSI: 00000000000006a9 RDI: ffffffff8177d901
RBP: ffff8800380dbd18 R08: ffff880207412570 R09: ffff8800380dbe88
R10: ffff8800380dbf28 R11: 0000000000000000 R12: ffffffff81a28520
R13: 000000000000000b R14: 00000000000000d0 R15: 00000000000000d0
FS:  00007f4afe3857a0(0000) GS:ffff880207400000(0000) knlGS:00000000f6d92b70
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f4afc38cd50 CR3: 00000000f1fe8000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process snmpd (pid: 16274, threadinfo ffff8800380da000, task ffff8801ccdcddc0)
Stack:
 ffff8800380dbcf8 0000000000000003 ffffffff81117373 0000000000000246
<0> ffff8800380dbd18 ffff8803ed0178e0 ffff8803ed017840 ffff8801f0938a50
<0> ffff8801f09046c0 0000000000000000 ffff8800380dbd48 ffffffff810dd233
Call Trace:
 [<ffffffff81117373>] ? vfs_rename+0xb2/0x3e1
 [<ffffffff810dd233>] kstrdup+0x31/0x49
 [<ffffffff81117373>] vfs_rename+0xb2/0x3e1
 [<ffffffff81116523>] ? __lookup_hash+0x55/0xf1
 [<ffffffff811ce973>] ? security_inode_permission+0x21/0x23
 [<ffffffff81118cfd>] sys_renameat+0x193/0x20c
 [<ffffffff810e4e5a>] ? handle_mm_fault+0x452/0x97b
 [<ffffffff81450195>] ? do_page_fault+0x28e/0x2bb
 [<ffffffff81118d91>] sys_rename+0x1b/0x1d
 [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
Code: 90 66 90 48 89 45 c8 fa 66 66 90 66 66 90 65 4c 8b 04 25 90 e8 00 00 49 8b 04 24 49 01 c0 49 8b 18 48 85 db 74 0e 49 63 44 24 18 <48> 8b 04 03 49 89 00 eb 15 48 8b 4d c0 83 ca ff 44 89 fe 4c 89 
RIP  [<ffffffff8110136b>] __kmalloc_track_caller+0xe3/0x14c
 RSP <ffff8800380dbcc8>
---[ end trace 1a4b1fd758dd1fdc ]---
block drbd1: write: error=-95 s=1658s
block drbd1: Method to ensure write ordering: flush
block drbd1: local disk flush failed with status -95
block drbd1: Method to ensure write ordering: drain
o2net: accepted connection from node demeter2.kernel.org (num 2) at 172.20.0.20:7777
ocfs2_dlm: Node 2 joins domain FC86A681BA714C7AA126836FFC1D4C8C
ocfs2_dlm: Nodes in domain ("FC86A681BA714C7AA126836FFC1D4C8C"): 1 2 
general protection fault: 0000 [#3] SMP 
last sysfs file: /sys/kernel/mm/ksm/run
CPU 1 
Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd nfs_acl auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q garp stp llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs exportfs tg3 hpwdt amd64_edac_mod i2c_amd756 i2c_core edac_core shpchp k8temp amd_rng edac_mce_amd microcode pata_acpi ata_generic cciss pata_amd [last unloaded: scsi_wait_scan]

Pid: 29724, comm: lsof Tainted: G      D    2.6.34.7-56.fc13.x86_64 #1 /ProLiant DL385 G1
RIP: 0010:[<ffffffff81101615>]  [<ffffffff81101615>] __kmalloc+0xeb/0x150
RSP: 0018:ffff8801c9cd5be8  EFLAGS: 00010006
RAX: 0000000000000000 RBX: 0003000000000000 RCX: ffffffff811264ff
RDX: 0000000100000000 RSI: 00000000000006a9 RDI: ffffffff8177d901
RBP: ffff8801c9cd5c28 R08: ffff880207412570 R09: ffff8803ea9aed80
R10: ffff8803ce0767c0 R11: 0000000000000000 R12: ffffffff81a28520
R13: 0000000000000010 R14: 00000000000080d0 R15: 00000000000080d0
FS:  00007f364411c7a0(0000) GS:ffff880207400000(0000) knlGS:00000000f5990b70
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004091c0 CR3: 000000019f488000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process lsof (pid: 29724, threadinfo ffff8801c9cd4000, task ffff8801cd6c8000)
Stack:
 ffffffff811264ff 0000000000000246 ffffc900013ed638 ffff8803ee7dfc00
<0> ffffffff81666660 ffff8803ce0765b0 ffff8801f6bed1f8 ffff8803ee7dfc00
<0> ffff8801c9cd5c58 ffffffff811264ff ffff8801c9cd5c58 ffffffff81e074e0
Call Trace:
 [<ffffffff811264ff>] ? __seq_open_private+0x25/0x5f
 [<ffffffff811264ff>] __seq_open_private+0x25/0x5f
 [<ffffffff8115dcf9>] seq_open_net+0x65/0x8c
 [<ffffffff814217b6>] unix_seq_open+0x1a/0x1c
 [<ffffffff81155f64>] proc_reg_open+0xd7/0x163
 [<ffffffff8115dbdc>] ? seq_release_net+0x0/0x3d
 [<ffffffff81155e8d>] ? proc_reg_open+0x0/0x163
 [<ffffffff8110c4a1>] __dentry_open+0x173/0x2aa
 [<ffffffff811ce973>] ? security_inode_permission+0x21/0x23
 [<ffffffff8110c6a7>] nameidata_to_filp+0x3f/0x50
 [<ffffffff81117e21>] do_last+0x447/0x5b8
 [<ffffffff81119868>] do_filp_open+0x217/0x5fe
 [<ffffffff81214e23>] ? might_fault+0x21/0x23
 [<ffffffff811225ca>] ? alloc_fd+0x7b/0x124
 [<ffffffff8110c236>] do_sys_open+0x63/0x10f
 [<ffffffff8110c315>] sys_open+0x20/0x22
 [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
Code: 90 66 90 48 89 45 c8 fa 66 66 90 66 66 90 65 4c 8b 04 25 90 e8 00 00 49 8b 04 24 49 01 c0 49 8b 18 48 85 db 74 0e 49 63 44 24 18 <48> 8b 04 03 49 89 00 eb 11 83 ca ff 44 89 fe 4c 89 e7 e8 16 eb 
RIP  [<ffffffff81101615>] __kmalloc+0xeb/0x150
 RSP <ffff8801c9cd5be8>
---[ end trace 1a4b1fd758dd1fdd ]---
Comment 1 Andrew Morton 2010-10-18 20:49:20 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).


On Mon, 18 Oct 2010 19:05:30 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=20702
> 
>            Summary: Kernel bug, possible double free, effecting kernel.org
>                     machines
>            Product: Memory Management
>            Version: 2.5
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Slab Allocator
>         AssignedTo: akpm@linux-foundation.org
>         ReportedBy: warthog9@kernel.org
>         Regression: No
> 
> 
> Discovered this on one of the kernel.org machines - it's been happening
> semi-consistently on a pair of boxes.  Seems like a double free somewhere and
> at that point the whole box falls over dead basically.
> 
> ------------[ cut here ]------------
> kernel BUG at mm/slub.c:2835!

Gee.  That's

: void kfree(const void *x)
: {
: 	struct page *page;
: 	void *object = (void *)x;
: 
: 	trace_kfree(_RET_IP_, x);
: 
: 	if (unlikely(ZERO_OR_NULL_PTR(x)))
: 		return;
: 
: 	page = virt_to_head_page(x);
: 	if (unlikely(!PageSlab(page))) {
: 		BUG_ON(!PageCompound(page));		<<----
: 		kmemleak_free(x);
: 		put_page(page);
: 		return;
: 	}
: 	slab_free(page->slab, page, object, _RET_IP_);
: }

and yes, I'd assume that we fed garbage to kfree().

> invalid opcode: 0000 [#1] SMP 
> last sysfs file: /sys/kernel/mm/ksm/run

Question is: what is /sys/kernel/mm/ksm/run?  I can't immediately
locate any code in 2.6.34.7 which could create that.  Is it something
which RH added to that kernel?

> CPU 1 
> Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd nfs_acl
> auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager
> ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q garp
> stp
> llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs
> exportfs
> tg3 hpwdt amd64_edac_mod i2c_amd756 i2c_core edac_core shpchp k8temp amd_rng
> edac_mce_amd microcode pata_acpi ata_generic cciss pata_amd [last unloaded:
> scsi_wait_scan]
> 
> Pid: 1713, comm: snmpd Not tainted 2.6.34.7-56.fc13.x86_64 #1 /ProLiant DL385
> G1
> RIP: 0010:[<ffffffff811006d6>]  [<ffffffff811006d6>] kfree+0x5e/0xcb
> RSP: 0018:ffff8801f6433df8  EFLAGS: 00010246
> RAX: 0040000000000400 RBX: ffff8803ed0eb9b0 RCX: ffff8803e9c92340
> RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002
> RBP: ffff8801f6433e18 R08: ffff8803e9c92958 R09: 0000000000000000
> R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002
> R13: ffffffff81125e27 R14: ffffffff8115dbdc R15: ffff8803dd19ea80
> FS:  00007ff4a31917a0(0000) GS:ffff880207400000(0000) knlGS:00000000f76fa6d0
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007ff4a31b2000 CR3: 00000001f6455000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process snmpd (pid: 1713, threadinfo ffff8801f6432000, task ffff8801f5688000)
> Stack:
>  ffff8801f6433e18 ffff8803ed0eb9b0 ffff8803f61ec480 ffff8803ed0eb9b0
> <0> ffff8801f6433e48 ffffffff81125e27 ffff8801f6433e38 ffff8803dd19ea80
> <0> ffff8803ed0eb9b0 ffff8803e9c92940 ffff8801f6433e78 ffffffff8115dc10
> Call Trace:
>  [<ffffffff81125e27>] seq_release_private+0x28/0x44
>  [<ffffffff8115dc10>] seq_release_net+0x34/0x3d
>  [<ffffffff81155ada>] proc_reg_release+0xd3/0xf0
>  [<ffffffff8110efbb>] __fput+0x12a/0x1dc
>  [<ffffffff8110f087>] fput+0x1a/0x1c
>  [<ffffffff8110c0f7>] filp_close+0x68/0x72
>  [<ffffffff8110c19e>] sys_close+0x9d/0xd2
>  [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
> Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7 e8 67
> e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b eb fe 48 89
> f7
> e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e 
> RIP  [<ffffffff811006d6>] kfree+0x5e/0xcb
>  RSP <ffff8801f6433df8>
> ---[ end trace 1a4b1fd758dd1fdb ]---

But then again, that's a procfs trace, not a sysfs trace.

proc_reg_release() makes my brain hurt.
Comment 2 Alexey Dobriyan 2010-10-18 21:11:35 UTC
On Mon, Oct 18, 2010 at 01:48:56PM -0700, Andrew Morton wrote:
> > Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd nfs_acl
> > auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager
> > ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q garp
> stp
> > llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs
> exportfs
> > tg3 hpwdt amd64_edac_mod i2c_amd756 i2c_core edac_core shpchp k8temp
> amd_rng
> > edac_mce_amd microcode pata_acpi ata_generic cciss pata_amd [last unloaded:
> > scsi_wait_scan]
> > 
> > Pid: 1713, comm: snmpd Not tainted 2.6.34.7-56.fc13.x86_64 #1 /ProLiant
> DL385
> > G1
> > RIP: 0010:[<ffffffff811006d6>]  [<ffffffff811006d6>] kfree+0x5e/0xcb
> > RSP: 0018:ffff8801f6433df8  EFLAGS: 00010246
> > RAX: 0040000000000400 RBX: ffff8803ed0eb9b0 RCX: ffff8803e9c92340
> > RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002
> > RBP: ffff8801f6433e18 R08: ffff8803e9c92958 R09: 0000000000000000
> > R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002
> > R13: ffffffff81125e27 R14: ffffffff8115dbdc R15: ffff8803dd19ea80
> > FS:  00007ff4a31917a0(0000) GS:ffff880207400000(0000)
> knlGS:00000000f76fa6d0
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007ff4a31b2000 CR3: 00000001f6455000 CR4: 00000000000006e0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Process snmpd (pid: 1713, threadinfo ffff8801f6432000, task
> ffff8801f5688000)
> > Stack:
> >  ffff8801f6433e18 ffff8803ed0eb9b0 ffff8803f61ec480 ffff8803ed0eb9b0
> > <0> ffff8801f6433e48 ffffffff81125e27 ffff8801f6433e38 ffff8803dd19ea80
> > <0> ffff8803ed0eb9b0 ffff8803e9c92940 ffff8801f6433e78 ffffffff8115dc10
> > Call Trace:
> >  [<ffffffff81125e27>] seq_release_private+0x28/0x44
> >  [<ffffffff8115dc10>] seq_release_net+0x34/0x3d
> >  [<ffffffff81155ada>] proc_reg_release+0xd3/0xf0
> >  [<ffffffff8110efbb>] __fput+0x12a/0x1dc
> >  [<ffffffff8110f087>] fput+0x1a/0x1c
> >  [<ffffffff8110c0f7>] filp_close+0x68/0x72
> >  [<ffffffff8110c19e>] sys_close+0x9d/0xd2
> >  [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
> > Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7 e8
> 67
> > e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b eb fe 48
> 89 f7
> > e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e 
> > RIP  [<ffffffff811006d6>] kfree+0x5e/0xcb
> >  RSP <ffff8801f6433df8>
> > ---[ end trace 1a4b1fd758dd1fdb ]---
> 
> But then again, that's a procfs trace, not a sysfs trace.
> 
> proc_reg_release() makes my brain hurt.

That complex code triggers only when module unloads and removes
its /proc entries.

Just to confirm, rmmod wasn't executed, last module still is
scsi_wait_scan from boot sequence?
Comment 3 Diego Calleja 2010-10-18 21:56:35 UTC
> Question is: what is /sys/kernel/mm/ksm/run?  I can't immediately
> locate any code in 2.6.34.7 which could create that.  Is it something
> which RH added to that kernel?

That's Kernel Samepage Merging, the virtualization thing. It's there since 2.6.32, it lives at mm/ksm.c:ksm_init.
Comment 4 John 'Warthog9' Hawley 2010-10-18 23:56:20 UTC
(In reply to comment #2)
> On Mon, Oct 18, 2010 at 01:48:56PM -0700, Andrew Morton wrote:
> > > Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd nfs_acl
> > > auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
> ocfs2_nodemanager
> > > ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q
> garp stp
> > > llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs
> exportfs
> > > tg3 hpwdt amd64_edac_mod i2c_amd756 i2c_core edac_core shpchp k8temp
> amd_rng
> > > edac_mce_amd microcode pata_acpi ata_generic cciss pata_amd [last
> unloaded:
> > > scsi_wait_scan]
> > > 
> > > Pid: 1713, comm: snmpd Not tainted 2.6.34.7-56.fc13.x86_64 #1 /ProLiant
> DL385
> > > G1
> > > RIP: 0010:[<ffffffff811006d6>]  [<ffffffff811006d6>] kfree+0x5e/0xcb
> > > RSP: 0018:ffff8801f6433df8  EFLAGS: 00010246
> > > RAX: 0040000000000400 RBX: ffff8803ed0eb9b0 RCX: ffff8803e9c92340
> > > RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002
> > > RBP: ffff8801f6433e18 R08: ffff8803e9c92958 R09: 0000000000000000
> > > R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002
> > > R13: ffffffff81125e27 R14: ffffffff8115dbdc R15: ffff8803dd19ea80
> > > FS:  00007ff4a31917a0(0000) GS:ffff880207400000(0000)
> knlGS:00000000f76fa6d0
> > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 00007ff4a31b2000 CR3: 00000001f6455000 CR4: 00000000000006e0
> > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > Process snmpd (pid: 1713, threadinfo ffff8801f6432000, task
> ffff8801f5688000)
> > > Stack:
> > >  ffff8801f6433e18 ffff8803ed0eb9b0 ffff8803f61ec480 ffff8803ed0eb9b0
> > > <0> ffff8801f6433e48 ffffffff81125e27 ffff8801f6433e38 ffff8803dd19ea80
> > > <0> ffff8803ed0eb9b0 ffff8803e9c92940 ffff8801f6433e78 ffffffff8115dc10
> > > Call Trace:
> > >  [<ffffffff81125e27>] seq_release_private+0x28/0x44
> > >  [<ffffffff8115dc10>] seq_release_net+0x34/0x3d
> > >  [<ffffffff81155ada>] proc_reg_release+0xd3/0xf0
> > >  [<ffffffff8110efbb>] __fput+0x12a/0x1dc
> > >  [<ffffffff8110f087>] fput+0x1a/0x1c
> > >  [<ffffffff8110c0f7>] filp_close+0x68/0x72
> > >  [<ffffffff8110c19e>] sys_close+0x9d/0xd2
> > >  [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
> > > Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7
> e8 67
> > > e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b eb fe 48
> 89 f7
> > > e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e 
> > > RIP  [<ffffffff811006d6>] kfree+0x5e/0xcb
> > >  RSP <ffff8801f6433df8>
> > > ---[ end trace 1a4b1fd758dd1fdb ]---
> > 
> > But then again, that's a procfs trace, not a sysfs trace.
> > 
> > proc_reg_release() makes my brain hurt.
> 
> That complex code triggers only when module unloads and removes
> its /proc entries.
> 
> Just to confirm, rmmod wasn't executed, last module still is
> scsi_wait_scan from boot sequence?

I'm not sure that scsi_wait_scan was the last module pushed into the kernel at all.  I've got two boxes (one of which is the box with the error above), nearly identical software wise, and neither has scsi_wait_scan even loaded (they are up and running right now).  So I'm not sure where that came from, or what depended on that.

The last module, on each machine, that got loaded during boot is ocfs2.  This got loaded from the init script for it.  scsi_wait_scan isn't even listed according to lsmod.

I did not explicitly execute rmmod, restart ocfs2 or drbd (both of which mess with their respective modules being loaded), but those shouldn't have been doing much with their modules since they were up and running
Comment 5 Andrew Morton 2010-10-19 00:01:30 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

^^ See this?  Please don't use the bugzilla interface for this bug!  Nobody saw your
reply.



On Mon, 18 Oct 2010 23:56:22 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=20702
> 
> 
> 
> 
> 
> --- Comment #4 from John 'Warthog9' Hawley <warthog9@kernel.org>  2010-10-18
> 23:56:20 ---
> (In reply to comment #2)
> > On Mon, Oct 18, 2010 at 01:48:56PM -0700, Andrew Morton wrote:
> > > > Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd
> nfs_acl
> > > > auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
> ocfs2_nodemanager
> > > > ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q
> garp stp
> > > > llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs
> exportfs
> > > > tg3 hpwdt amd64_edac_mod i2c_amd756 i2c_core edac_core shpchp k8temp
> amd_rng
> > > > edac_mce_amd microcode pata_acpi ata_generic cciss pata_amd [last
> unloaded:
> > > > scsi_wait_scan]
> > > > 
> > > > Pid: 1713, comm: snmpd Not tainted 2.6.34.7-56.fc13.x86_64 #1 /ProLiant
> DL385
> > > > G1
> > > > RIP: 0010:[<ffffffff811006d6>]  [<ffffffff811006d6>] kfree+0x5e/0xcb
> > > > RSP: 0018:ffff8801f6433df8  EFLAGS: 00010246
> > > > RAX: 0040000000000400 RBX: ffff8803ed0eb9b0 RCX: ffff8803e9c92340
> > > > RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002
> > > > RBP: ffff8801f6433e18 R08: ffff8803e9c92958 R09: 0000000000000000
> > > > R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002
> > > > R13: ffffffff81125e27 R14: ffffffff8115dbdc R15: ffff8803dd19ea80
> > > > FS:  00007ff4a31917a0(0000) GS:ffff880207400000(0000)
> knlGS:00000000f76fa6d0
> > > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > CR2: 00007ff4a31b2000 CR3: 00000001f6455000 CR4: 00000000000006e0
> > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > > Process snmpd (pid: 1713, threadinfo ffff8801f6432000, task
> ffff8801f5688000)
> > > > Stack:
> > > >  ffff8801f6433e18 ffff8803ed0eb9b0 ffff8803f61ec480 ffff8803ed0eb9b0
> > > > <0> ffff8801f6433e48 ffffffff81125e27 ffff8801f6433e38 ffff8803dd19ea80
> > > > <0> ffff8803ed0eb9b0 ffff8803e9c92940 ffff8801f6433e78 ffffffff8115dc10
> > > > Call Trace:
> > > >  [<ffffffff81125e27>] seq_release_private+0x28/0x44
> > > >  [<ffffffff8115dc10>] seq_release_net+0x34/0x3d
> > > >  [<ffffffff81155ada>] proc_reg_release+0xd3/0xf0
> > > >  [<ffffffff8110efbb>] __fput+0x12a/0x1dc
> > > >  [<ffffffff8110f087>] fput+0x1a/0x1c
> > > >  [<ffffffff8110c0f7>] filp_close+0x68/0x72
> > > >  [<ffffffff8110c19e>] sys_close+0x9d/0xd2
> > > >  [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
> > > > Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7
> e8 67
> > > > e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b eb fe
> 48 89 f7
> > > > e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e 
> > > > RIP  [<ffffffff811006d6>] kfree+0x5e/0xcb
> > > >  RSP <ffff8801f6433df8>
> > > > ---[ end trace 1a4b1fd758dd1fdb ]---
> > > 
> > > But then again, that's a procfs trace, not a sysfs trace.
> > > 
> > > proc_reg_release() makes my brain hurt.
> > 
> > That complex code triggers only when module unloads and removes
> > its /proc entries.
> > 
> > Just to confirm, rmmod wasn't executed, last module still is
> > scsi_wait_scan from boot sequence?
> 
> I'm not sure that scsi_wait_scan was the last module pushed into the kernel
> at
> all.  I've got two boxes (one of which is the box with the error above),
> nearly
> identical software wise, and neither has scsi_wait_scan even loaded (they are
> up and running right now).  So I'm not sure where that came from, or what
> depended on that.
> 
> The last module, on each machine, that got loaded during boot is ocfs2.  This
> got loaded from the init script for it.  scsi_wait_scan isn't even listed
> according to lsmod.
> 
> I did not explicitly execute rmmod, restart ocfs2 or drbd (both of which mess
> with their respective modules being loaded), but those shouldn't have been
> doing much with their modules since they were up and running
> 
> -- 
> Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
> ------- You are receiving this mail because: -------
> You are the assignee for the bug.
Comment 6 John 'Warthog9' Hawley 2010-10-19 01:28:35 UTC
On 10/18/2010 05:00 PM, Andrew Morton wrote:
> 
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> ^^ See this?  Please don't use the bugzilla interface for this bug!  Nobody
> saw your
> reply.

No I hadn't seen that, my apologies.
Comment 7 John 'Warthog9' Hawley 2010-11-24 22:53:25 UTC
*** Bug 20672 has been marked as a duplicate of this bug. ***
Comment 8 John 'Warthog9' Hawley 2010-11-24 22:53:35 UTC
*** Bug 20682 has been marked as a duplicate of this bug. ***
Comment 9 John 'Warthog9' Hawley 2010-11-24 23:24:55 UTC
On 10/18/2010 02:11 PM, Alexey Dobriyan wrote:
> On Mon, Oct 18, 2010 at 01:48:56PM -0700, Andrew Morton wrote:
>>> Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd nfs_acl
>>> auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager
>>> ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q garp
>>> stp
>>> llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs
>>> exportfs
>>> tg3 hpwdt amd64_edac_mod i2c_amd756 i2c_core edac_core shpchp k8temp
>>> amd_rng
>>> edac_mce_amd microcode pata_acpi ata_generic cciss pata_amd [last unloaded:
>>> scsi_wait_scan]
>>>
>>> Pid: 1713, comm: snmpd Not tainted 2.6.34.7-56.fc13.x86_64 #1 /ProLiant
>>> DL385
>>> G1
>>> RIP: 0010:[<ffffffff811006d6>]  [<ffffffff811006d6>] kfree+0x5e/0xcb
>>> RSP: 0018:ffff8801f6433df8  EFLAGS: 00010246
>>> RAX: 0040000000000400 RBX: ffff8803ed0eb9b0 RCX: ffff8803e9c92340
>>> RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002
>>> RBP: ffff8801f6433e18 R08: ffff8803e9c92958 R09: 0000000000000000
>>> R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002
>>> R13: ffffffff81125e27 R14: ffffffff8115dbdc R15: ffff8803dd19ea80
>>> FS:  00007ff4a31917a0(0000) GS:ffff880207400000(0000)
>>> knlGS:00000000f76fa6d0
>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 00007ff4a31b2000 CR3: 00000001f6455000 CR4: 00000000000006e0
>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>> Process snmpd (pid: 1713, threadinfo ffff8801f6432000, task
>>> ffff8801f5688000)
>>> Stack:
>>>  ffff8801f6433e18 ffff8803ed0eb9b0 ffff8803f61ec480 ffff8803ed0eb9b0
>>> <0> ffff8801f6433e48 ffffffff81125e27 ffff8801f6433e38 ffff8803dd19ea80
>>> <0> ffff8803ed0eb9b0 ffff8803e9c92940 ffff8801f6433e78 ffffffff8115dc10
>>> Call Trace:
>>>  [<ffffffff81125e27>] seq_release_private+0x28/0x44
>>>  [<ffffffff8115dc10>] seq_release_net+0x34/0x3d
>>>  [<ffffffff81155ada>] proc_reg_release+0xd3/0xf0
>>>  [<ffffffff8110efbb>] __fput+0x12a/0x1dc
>>>  [<ffffffff8110f087>] fput+0x1a/0x1c
>>>  [<ffffffff8110c0f7>] filp_close+0x68/0x72
>>>  [<ffffffff8110c19e>] sys_close+0x9d/0xd2
>>>  [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
>>> Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7 e8
>>> 67
>>> e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b eb fe 48
>>> 89 f7
>>> e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e 
>>> RIP  [<ffffffff811006d6>] kfree+0x5e/0xcb
>>>  RSP <ffff8801f6433df8>
>>> ---[ end trace 1a4b1fd758dd1fdb ]---
>>
>> But then again, that's a procfs trace, not a sysfs trace.
>>
>> proc_reg_release() makes my brain hurt.
> 
> That complex code triggers only when module unloads and removes
> its /proc entries.
> 
> Just to confirm, rmmod wasn't executed, last module still is
> scsi_wait_scan from boot sequence?

Bumping on this as I'm still seeing this, and hit it again today.  Since
this seems to be happening more, is there anything I can add for
debugging to help with this?

Getting back to Alexey's comments.  No there were not modules loaded or
unloaded, and I have no idea why scsi_wait_scan would have been loaded
or unloaded.

Is today's trace (From D2):

------------[ cut here ]------------
kernel BUG at mm/slub.c:2835!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/kernel/mm/ksm/run
CPU 0
Modules linked in: ocfs2 mptctl mptbase ipmi_devintf drbd lru_cache nfsd
lockd nfs_acl auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
ocfs2_nodemanager ocfs2_stackglue configfs 8021q garp stp llc
ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs
exportfs bnx2 ipmi_si ipmi_msghandler iTCO_wdt power_meter
iTCO_vendor_support microcode hpwdt serio_raw cciss hpsa radeon ttm
drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: speedstep_lib]

Pid: 2375, comm: snmpd Not tainted 2.6.34.7-61.korg.fc13.x86_64 #1
/ProLiant DL380 G6
RIP: 0010:[<ffffffff811006d6>]  [<ffffffff811006d6>] kfree+0x5e/0xcb
RSP: 0018:ffff88080c12fdf8  EFLAGS: 00010246
RAX: 004000000010003c RBX: ffff8803e5a26e30 RCX: ffff880400d40c40
RDX: ffffea0000000000 RSI: ffffea000e000000 RDI: ffff880400000001
RBP: ffff88080c12fe18 R08: ffff880400d405d8 R09: 0000000000000000
R10: 0000000000000013 R11: 0000000000000246 R12: ffff880400000001
R13: ffffffff81125daf R14: ffffffff8115db64 R15: ffff8803f9da4b40
FS:  00007fd3dda897a0(0000) GS:ffff880002000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd3ddaab000 CR3: 000000080ce01000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process snmpd (pid: 2375, threadinfo ffff88080c12e000, task
ffff88080b58ddc0)
Stack:
 ffff88080c12fe18 ffff8803e5a26e30 ffff88040c4a3e80 ffff8803e5a26e30
<0> ffff88080c12fe48 ffffffff81125daf ffff88080c12fe38 ffff8803f9da4b40
<0> ffff8803e5a26e30 ffff880400d405c0 ffff88080c12fe78 ffffffff8115db98
Call Trace:
 [<ffffffff81125daf>] seq_release_private+0x28/0x44
 [<ffffffff8115db98>] seq_release_net+0x34/0x3d
 [<ffffffff81155a62>] proc_reg_release+0xd3/0xf0
 [<ffffffff8110efb3>] __fput+0x12a/0x1d4
 [<ffffffff8110f077>] fput+0x1a/0x1c
 [<ffffffff8110c0f7>] filp_close+0x68/0x72
 [<ffffffff8110c19e>] sys_close+0x9d/0xd2
 [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7
e8 67 e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b
eb fe 48 89 f7 e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e
RIP  [<ffffffff811006d6>] kfree+0x5e/0xcb
 RSP <ffff88080c12fdf8>
---[ end trace 60a902368ad4d4fe ]---

It seems to claim speedstep_lib was the last unloaded, which I wouldn't
be able to explain why that would have been unloaded at all, the running
system itself doesn't seem to be using it, so I can only assume it got
loaded on startup somewhere and was subsequently removed.

On D1 yesterday I saw the following:

------------[ cut here ]------------
kernel BUG at mm/slub.c:2835!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/kernel/mm/ksm/run
CPU 1
Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd
nfs_acl auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
ocfs2_nodemanager ocfs2_stackglue configfs cpufreq_ondemand powernow_k8
freq_table 8021q garp stp llc ip6t_REJECT nf_conntrack_ipv6
ip6table_filter ip6_tables ipv6 xfs exportfs amd64_edac_mod tg3
edac_core i2c_amd756 i2c_core hpwdt amd_rng k8temp shpchp edac_mce_amd
microcode pata_acpi ata_generic pata_amd cciss [last unloaded:
scsi_wait_scan]
Nov 23 06:50:04 demeter kernel:
Pid: 1737, comm: snmpd Not tainted 2.6.34.7-61.korg.fc13.x86_64 #1
/ProLiant DL385 G1
RIP: 0010:[<ffffffff811006d6>]  [<ffffffff811006d6>] kfree+0x5e/0xcb
RSP: 0018:ffff8801eef2ddf8  EFLAGS: 00010246
RAX: 0040000000000400 RBX: ffff8803effe74b0 RCX: ffff8803e18ec6c0
RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002
RBP: ffff8801eef2de18 R08: ffff8803e18ec918 R09: 0000000000000000
R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002
R13: ffffffff81125daf R14: ffffffff8115db64 R15: ffff880044570540
FS:  00007f0fadd727a0(0000) GS:ffff880207400000(0000) knlGS:00000000f4f9bb70
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0fadd93000 CR3: 00000001efbb6000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process snmpd (pid: 1737, threadinfo ffff8801eef2c000, task
ffff8801ecf2c650)
Stack:
ffff8801eef2de18 ffff8803effe74b0 ffff8803d645ac00 ffff8803effe74b0
<0> ffff8801eef2de48 ffffffff81125daf ffff8801eef2de38 ffff880044570540
<0> ffff8803effe74b0 ffff8803e18ec900 ffff8801eef2de78 ffffffff8115db98
Call Trace:
[<ffffffff81125daf>] seq_release_private+0x28/0x44
[<ffffffff8115db98>] seq_release_net+0x34/0x3d
[<ffffffff81155a62>] proc_reg_release+0xd3/0xf0
[<ffffffff8110efb3>] __fput+0x12a/0x1d4
[<ffffffff8110f077>] fput+0x1a/0x1c
[<ffffffff8110c0f7>] filp_close+0x68/0x72
[<ffffffff8110c19e>] sys_close+0x9d/0xd2
[<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7
e8 67 e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b
eb fe 48 89 f7 e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e
RIP  [<ffffffff811006d6>] kfree+0x5e/0xcb
RSP <ffff8801eef2ddf8>
---[ end trace 34b1e268e13a43a5 ]---

Which claims the last unloaded module was scsi_wait_scan.  The module
unloads I'm going to assume are coming from random normal boot code, and
the errors are happening a significant time later anywhere from hours to
days after boot.

Considering I'm seeing this on completely different hardware, completely
different CPU's (AMD vs. Intel)

It's been suggested that I setup kmemleak on the boxes, despite the
overhead that will incur, but it might help point out the problem.  Is
there anything else?

Note: I'm not seeing this on any other box I have deployed, just these.
 Now the load patterns are completely different here, and these are the
only two boxes with drbd & ocfs2 on them as well.  Not trying to throw
those under the bus, but it's worth pointing out any differences in my
setup that might be helpful for debugging.

- John 'Warthog9' Hawley
Comment 10 Andrew Morton 2010-11-24 23:39:05 UTC
On Wed, 24 Nov 2010 15:25:16 -0800
"J.H." <warthog9@kernel.org> wrote:

> On 10/18/2010 02:11 PM, Alexey Dobriyan wrote:
> > On Mon, Oct 18, 2010 at 01:48:56PM -0700, Andrew Morton wrote:
> >>> Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd nfs_acl
> >>> auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
> ocfs2_nodemanager
> >>> ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q
> garp stp
> >>> llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs
> exportfs
> >>> tg3 hpwdt amd64_edac_mod i2c_amd756 i2c_core edac_core shpchp k8temp
> amd_rng
> >>> edac_mce_amd microcode pata_acpi ata_generic cciss pata_amd [last
> unloaded:
> >>> scsi_wait_scan]
> >>>
> >>> Pid: 1713, comm: snmpd Not tainted 2.6.34.7-56.fc13.x86_64 #1 /ProLiant
> DL385
> >>> G1
> >>> RIP: 0010:[<ffffffff811006d6>]  [<ffffffff811006d6>] kfree+0x5e/0xcb
> >>> RSP: 0018:ffff8801f6433df8  EFLAGS: 00010246
> >>> RAX: 0040000000000400 RBX: ffff8803ed0eb9b0 RCX: ffff8803e9c92340
> >>> RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002
> >>> RBP: ffff8801f6433e18 R08: ffff8803e9c92958 R09: 0000000000000000
> >>> R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002
> >>> R13: ffffffff81125e27 R14: ffffffff8115dbdc R15: ffff8803dd19ea80
> >>> FS:  00007ff4a31917a0(0000) GS:ffff880207400000(0000)
> knlGS:00000000f76fa6d0
> >>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>> CR2: 00007ff4a31b2000 CR3: 00000001f6455000 CR4: 00000000000006e0
> >>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> >>> Process snmpd (pid: 1713, threadinfo ffff8801f6432000, task
> ffff8801f5688000)
> >>> Stack:
> >>>  ffff8801f6433e18 ffff8803ed0eb9b0 ffff8803f61ec480 ffff8803ed0eb9b0
> >>> <0> ffff8801f6433e48 ffffffff81125e27 ffff8801f6433e38 ffff8803dd19ea80
> >>> <0> ffff8803ed0eb9b0 ffff8803e9c92940 ffff8801f6433e78 ffffffff8115dc10
> >>> Call Trace:
> >>>  [<ffffffff81125e27>] seq_release_private+0x28/0x44
> >>>  [<ffffffff8115dc10>] seq_release_net+0x34/0x3d
> >>>  [<ffffffff81155ada>] proc_reg_release+0xd3/0xf0
> >>>  [<ffffffff8110efbb>] __fput+0x12a/0x1dc
> >>>  [<ffffffff8110f087>] fput+0x1a/0x1c
> >>>  [<ffffffff8110c0f7>] filp_close+0x68/0x72
> >>>  [<ffffffff8110c19e>] sys_close+0x9d/0xd2
> >>>  [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
> >>> Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7
> e8 67
> >>> e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b eb fe 48
> 89 f7
> >>> e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e 
> >>> RIP  [<ffffffff811006d6>] kfree+0x5e/0xcb
> >>>  RSP <ffff8801f6433df8>
> >>> ---[ end trace 1a4b1fd758dd1fdb ]---
> >>
> >> But then again, that's a procfs trace, not a sysfs trace.
> >>
> >> proc_reg_release() makes my brain hurt.
> > 
> > That complex code triggers only when module unloads and removes
> > its /proc entries.
> > 
> > Just to confirm, rmmod wasn't executed, last module still is
> > scsi_wait_scan from boot sequence?
> 
> Bumping on this as I'm still seeing this, and hit it again today.  Since
> this seems to be happening more, is there anything I can add for
> debugging to help with this?
> 
> Getting back to Alexey's comments.  No there were not modules loaded or
> unloaded, and I have no idea why scsi_wait_scan would have been loaded
> or unloaded.
> 
> Is today's trace (From D2):
> 
> ------------[ cut here ]------------
> kernel BUG at mm/slub.c:2835!

That's

	BUG_ON(!PageCompound(page));

?

> invalid opcode: 0000 [#1] SMP
> last sysfs file: /sys/kernel/mm/ksm/run
> CPU 0
> Modules linked in: ocfs2 mptctl mptbase ipmi_devintf drbd lru_cache nfsd
> lockd nfs_acl auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
> ocfs2_nodemanager ocfs2_stackglue configfs 8021q garp stp llc
> ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs
> exportfs bnx2 ipmi_si ipmi_msghandler iTCO_wdt power_meter
> iTCO_vendor_support microcode hpwdt serio_raw cciss hpsa radeon ttm
> drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: speedstep_lib]
> 
> Pid: 2375, comm: snmpd Not tainted 2.6.34.7-61.korg.fc13.x86_64 #1
> /ProLiant DL380 G6
> RIP: 0010:[<ffffffff811006d6>]  [<ffffffff811006d6>] kfree+0x5e/0xcb
> RSP: 0018:ffff88080c12fdf8  EFLAGS: 00010246
> RAX: 004000000010003c RBX: ffff8803e5a26e30 RCX: ffff880400d40c40
> RDX: ffffea0000000000 RSI: ffffea000e000000 RDI: ffff880400000001
> RBP: ffff88080c12fe18 R08: ffff880400d405d8 R09: 0000000000000000
> R10: 0000000000000013 R11: 0000000000000246 R12: ffff880400000001
> R13: ffffffff81125daf R14: ffffffff8115db64 R15: ffff8803f9da4b40
> FS:  00007fd3dda897a0(0000) GS:ffff880002000000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fd3ddaab000 CR3: 000000080ce01000 CR4: 00000000000006f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process snmpd (pid: 2375, threadinfo ffff88080c12e000, task
> ffff88080b58ddc0)
> Stack:
>  ffff88080c12fe18 ffff8803e5a26e30 ffff88040c4a3e80 ffff8803e5a26e30
> <0> ffff88080c12fe48 ffffffff81125daf ffff88080c12fe38 ffff8803f9da4b40
> <0> ffff8803e5a26e30 ffff880400d405c0 ffff88080c12fe78 ffffffff8115db98
> Call Trace:
>  [<ffffffff81125daf>] seq_release_private+0x28/0x44
>  [<ffffffff8115db98>] seq_release_net+0x34/0x3d
>  [<ffffffff81155a62>] proc_reg_release+0xd3/0xf0
>  [<ffffffff8110efb3>] __fput+0x12a/0x1d4
>  [<ffffffff8110f077>] fput+0x1a/0x1c
>  [<ffffffff8110c0f7>] filp_close+0x68/0x72
>  [<ffffffff8110c19e>] sys_close+0x9d/0xd2
>  [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
> Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7
> e8 67 e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b
> eb fe 48 89 f7 e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e
> RIP  [<ffffffff811006d6>] kfree+0x5e/0xcb
>  RSP <ffff88080c12fdf8>
> ---[ end trace 60a902368ad4d4fe ]---
> 
> It seems to claim speedstep_lib was the last unloaded, which I wouldn't
> be able to explain why that would have been unloaded at all, the running
> system itself doesn't seem to be using it, so I can only assume it got
> loaded on startup somewhere and was subsequently removed.
> 
> On D1 yesterday I saw the following:
> 
> ------------[ cut here ]------------
> kernel BUG at mm/slub.c:2835!
> invalid opcode: 0000 [#1] SMP
> last sysfs file: /sys/kernel/mm/ksm/run
> CPU 1
> Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd
> nfs_acl auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
> ocfs2_nodemanager ocfs2_stackglue configfs cpufreq_ondemand powernow_k8
> freq_table 8021q garp stp llc ip6t_REJECT nf_conntrack_ipv6
> ip6table_filter ip6_tables ipv6 xfs exportfs amd64_edac_mod tg3
> edac_core i2c_amd756 i2c_core hpwdt amd_rng k8temp shpchp edac_mce_amd
> microcode pata_acpi ata_generic pata_amd cciss [last unloaded:
> scsi_wait_scan]
> Nov 23 06:50:04 demeter kernel:
> Pid: 1737, comm: snmpd Not tainted 2.6.34.7-61.korg.fc13.x86_64 #1
> /ProLiant DL385 G1
> RIP: 0010:[<ffffffff811006d6>]  [<ffffffff811006d6>] kfree+0x5e/0xcb
> RSP: 0018:ffff8801eef2ddf8  EFLAGS: 00010246
> RAX: 0040000000000400 RBX: ffff8803effe74b0 RCX: ffff8803e18ec6c0
> RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002
> RBP: ffff8801eef2de18 R08: ffff8803e18ec918 R09: 0000000000000000
> R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002
> R13: ffffffff81125daf R14: ffffffff8115db64 R15: ffff880044570540
> FS:  00007f0fadd727a0(0000) GS:ffff880207400000(0000) knlGS:00000000f4f9bb70
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f0fadd93000 CR3: 00000001efbb6000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process snmpd (pid: 1737, threadinfo ffff8801eef2c000, task
> ffff8801ecf2c650)
> Stack:
> ffff8801eef2de18 ffff8803effe74b0 ffff8803d645ac00 ffff8803effe74b0
> <0> ffff8801eef2de48 ffffffff81125daf ffff8801eef2de38 ffff880044570540
> <0> ffff8803effe74b0 ffff8803e18ec900 ffff8801eef2de78 ffffffff8115db98
> Call Trace:
> [<ffffffff81125daf>] seq_release_private+0x28/0x44
> [<ffffffff8115db98>] seq_release_net+0x34/0x3d
> [<ffffffff81155a62>] proc_reg_release+0xd3/0xf0
> [<ffffffff8110efb3>] __fput+0x12a/0x1d4
> [<ffffffff8110f077>] fput+0x1a/0x1c
> [<ffffffff8110c0f7>] filp_close+0x68/0x72
> [<ffffffff8110c19e>] sys_close+0x9d/0xd2
> [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
> Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7
> e8 67 e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b
> eb fe 48 89 f7 e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e
> RIP  [<ffffffff811006d6>] kfree+0x5e/0xcb
> RSP <ffff8801eef2ddf8>
> ---[ end trace 34b1e268e13a43a5 ]---
> 
> Which claims the last unloaded module was scsi_wait_scan.  The module
> unloads I'm going to assume are coming from random normal boot code, and
> the errors are happening a significant time later anywhere from hours to
> days after boot.
> 
> Considering I'm seeing this on completely different hardware, completely
> different CPU's (AMD vs. Intel)
> 
> It's been suggested that I setup kmemleak on the boxes, despite the
> overhead that will incur, but it might help point out the problem.  Is
> there anything else?
> 
> Note: I'm not seeing this on any other box I have deployed, just these.
>  Now the load patterns are completely different here, and these are the
> only two boxes with drbd & ocfs2 on them as well.  Not trying to throw
> those under the bus, but it's worth pointing out any differences in my
> setup that might be helpful for debugging.
> 

I can't begin to think what could have caused this :(
Comment 11 John 'Warthog9' Hawley 2010-11-25 00:46:19 UTC
On 10/18/2010 02:11 PM, Alexey Dobriyan wrote:
> On Mon, Oct 18, 2010 at 01:48:56PM -0700, Andrew Morton wrote:
>>> Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd nfs_acl
>>> auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager
>>> ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q garp
>>> stp
>>> llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs
>>> exportfs
>>> tg3 hpwdt amd64_edac_mod i2c_amd756 i2c_core edac_core shpchp k8temp
>>> amd_rng
>>> edac_mce_amd microcode pata_acpi ata_generic cciss pata_amd [last unloaded:
>>> scsi_wait_scan]
>>>
>>> Pid: 1713, comm: snmpd Not tainted 2.6.34.7-56.fc13.x86_64 #1 /ProLiant
>>> DL385
>>> G1
>>> RIP: 0010:[<ffffffff811006d6>]  [<ffffffff811006d6>] kfree+0x5e/0xcb
>>> RSP: 0018:ffff8801f6433df8  EFLAGS: 00010246
>>> RAX: 0040000000000400 RBX: ffff8803ed0eb9b0 RCX: ffff8803e9c92340
>>> RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002
>>> RBP: ffff8801f6433e18 R08: ffff8803e9c92958 R09: 0000000000000000
>>> R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002
>>> R13: ffffffff81125e27 R14: ffffffff8115dbdc R15: ffff8803dd19ea80
>>> FS:  00007ff4a31917a0(0000) GS:ffff880207400000(0000)
>>> knlGS:00000000f76fa6d0
>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 00007ff4a31b2000 CR3: 00000001f6455000 CR4: 00000000000006e0
>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>> Process snmpd (pid: 1713, threadinfo ffff8801f6432000, task
>>> ffff8801f5688000)
>>> Stack:
>>>  ffff8801f6433e18 ffff8803ed0eb9b0 ffff8803f61ec480 ffff8803ed0eb9b0
>>> <0> ffff8801f6433e48 ffffffff81125e27 ffff8801f6433e38 ffff8803dd19ea80
>>> <0> ffff8803ed0eb9b0 ffff8803e9c92940 ffff8801f6433e78 ffffffff8115dc10
>>> Call Trace:
>>>  [<ffffffff81125e27>] seq_release_private+0x28/0x44
>>>  [<ffffffff8115dc10>] seq_release_net+0x34/0x3d
>>>  [<ffffffff81155ada>] proc_reg_release+0xd3/0xf0
>>>  [<ffffffff8110efbb>] __fput+0x12a/0x1dc
>>>  [<ffffffff8110f087>] fput+0x1a/0x1c
>>>  [<ffffffff8110c0f7>] filp_close+0x68/0x72
>>>  [<ffffffff8110c19e>] sys_close+0x9d/0xd2
>>>  [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
>>> Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7 e8
>>> 67
>>> e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b eb fe 48
>>> 89 f7
>>> e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e 
>>> RIP  [<ffffffff811006d6>] kfree+0x5e/0xcb
>>>  RSP <ffff8801f6433df8>
>>> ---[ end trace 1a4b1fd758dd1fdb ]---
>>
>> But then again, that's a procfs trace, not a sysfs trace.
>>
>> proc_reg_release() makes my brain hurt.
> 
> That complex code triggers only when module unloads and removes
> its /proc entries.
> 
> Just to confirm, rmmod wasn't executed, last module still is
> scsi_wait_scan from boot sequence?

Bumping on this as I'm still seeing this, and hit it again today.  Since
this seems to be happening more, is there anything I can add for
debugging to help with this?

Getting back to Alexey's comments.  No there were not modules loaded or
unloaded, and I have no idea why scsi_wait_scan would have been loaded
or unloaded.

Is today's trace (From D2):

------------[ cut here ]------------
kernel BUG at mm/slub.c:2835!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/kernel/mm/ksm/run
CPU 0
Modules linked in: ocfs2 mptctl mptbase ipmi_devintf drbd lru_cache nfsd
lockd nfs_acl auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
ocfs2_nodemanager ocfs2_stackglue configfs 8021q garp stp llc
ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs
exportfs bnx2 ipmi_si ipmi_msghandler iTCO_wdt power_meter
iTCO_vendor_support microcode hpwdt serio_raw cciss hpsa radeon ttm
drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: speedstep_lib]

Pid: 2375, comm: snmpd Not tainted 2.6.34.7-61.korg.fc13.x86_64 #1
/ProLiant DL380 G6
RIP: 0010:[<ffffffff811006d6>]  [<ffffffff811006d6>] kfree+0x5e/0xcb
RSP: 0018:ffff88080c12fdf8  EFLAGS: 00010246
RAX: 004000000010003c RBX: ffff8803e5a26e30 RCX: ffff880400d40c40
RDX: ffffea0000000000 RSI: ffffea000e000000 RDI: ffff880400000001
RBP: ffff88080c12fe18 R08: ffff880400d405d8 R09: 0000000000000000
R10: 0000000000000013 R11: 0000000000000246 R12: ffff880400000001
R13: ffffffff81125daf R14: ffffffff8115db64 R15: ffff8803f9da4b40
FS:  00007fd3dda897a0(0000) GS:ffff880002000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd3ddaab000 CR3: 000000080ce01000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process snmpd (pid: 2375, threadinfo ffff88080c12e000, task
ffff88080b58ddc0)
Stack:
 ffff88080c12fe18 ffff8803e5a26e30 ffff88040c4a3e80 ffff8803e5a26e30
<0> ffff88080c12fe48 ffffffff81125daf ffff88080c12fe38 ffff8803f9da4b40
<0> ffff8803e5a26e30 ffff880400d405c0 ffff88080c12fe78 ffffffff8115db98
Call Trace:
 [<ffffffff81125daf>] seq_release_private+0x28/0x44
 [<ffffffff8115db98>] seq_release_net+0x34/0x3d
 [<ffffffff81155a62>] proc_reg_release+0xd3/0xf0
 [<ffffffff8110efb3>] __fput+0x12a/0x1d4
 [<ffffffff8110f077>] fput+0x1a/0x1c
 [<ffffffff8110c0f7>] filp_close+0x68/0x72
 [<ffffffff8110c19e>] sys_close+0x9d/0xd2
 [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7
e8 67 e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b
eb fe 48 89 f7 e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e
RIP  [<ffffffff811006d6>] kfree+0x5e/0xcb
 RSP <ffff88080c12fdf8>
---[ end trace 60a902368ad4d4fe ]---

It seems to claim speedstep_lib was the last unloaded, which I wouldn't
be able to explain why that would have been unloaded at all, the running
system itself doesn't seem to be using it, so I can only assume it got
loaded on startup somewhere and was subsequently removed.

On D1 yesterday I saw the following:

------------[ cut here ]------------
kernel BUG at mm/slub.c:2835!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/kernel/mm/ksm/run
CPU 1
Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd
nfs_acl auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
ocfs2_nodemanager ocfs2_stackglue configfs cpufreq_ondemand powernow_k8
freq_table 8021q garp stp llc ip6t_REJECT nf_conntrack_ipv6
ip6table_filter ip6_tables ipv6 xfs exportfs amd64_edac_mod tg3
edac_core i2c_amd756 i2c_core hpwdt amd_rng k8temp shpchp edac_mce_amd
microcode pata_acpi ata_generic pata_amd cciss [last unloaded:
scsi_wait_scan]
Nov 23 06:50:04 demeter kernel:
Pid: 1737, comm: snmpd Not tainted 2.6.34.7-61.korg.fc13.x86_64 #1
/ProLiant DL385 G1
RIP: 0010:[<ffffffff811006d6>]  [<ffffffff811006d6>] kfree+0x5e/0xcb
RSP: 0018:ffff8801eef2ddf8  EFLAGS: 00010246
RAX: 0040000000000400 RBX: ffff8803effe74b0 RCX: ffff8803e18ec6c0
RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002
RBP: ffff8801eef2de18 R08: ffff8803e18ec918 R09: 0000000000000000
R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002
R13: ffffffff81125daf R14: ffffffff8115db64 R15: ffff880044570540
FS:  00007f0fadd727a0(0000) GS:ffff880207400000(0000) knlGS:00000000f4f9bb70
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0fadd93000 CR3: 00000001efbb6000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process snmpd (pid: 1737, threadinfo ffff8801eef2c000, task
ffff8801ecf2c650)
Stack:
ffff8801eef2de18 ffff8803effe74b0 ffff8803d645ac00 ffff8803effe74b0
<0> ffff8801eef2de48 ffffffff81125daf ffff8801eef2de38 ffff880044570540
<0> ffff8803effe74b0 ffff8803e18ec900 ffff8801eef2de78 ffffffff8115db98
Call Trace:
[<ffffffff81125daf>] seq_release_private+0x28/0x44
[<ffffffff8115db98>] seq_release_net+0x34/0x3d
[<ffffffff81155a62>] proc_reg_release+0xd3/0xf0
[<ffffffff8110efb3>] __fput+0x12a/0x1d4
[<ffffffff8110f077>] fput+0x1a/0x1c
[<ffffffff8110c0f7>] filp_close+0x68/0x72
[<ffffffff8110c19e>] sys_close+0x9d/0xd2
[<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7
e8 67 e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b
eb fe 48 89 f7 e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e
RIP  [<ffffffff811006d6>] kfree+0x5e/0xcb
RSP <ffff8801eef2ddf8>
---[ end trace 34b1e268e13a43a5 ]---

Which claims the last unloaded module was scsi_wait_scan.  The module
unloads I'm going to assume are coming from random normal boot code, and
the errors are happening a significant time later anywhere from hours to
days after boot.

Considering I'm seeing this on completely different hardware, completely
different CPU's (AMD vs. Intel)

It's been suggested that I setup kmemleak on the boxes, despite the
overhead that will incur, but it might help point out the problem.  Is
there anything else?

Note: I'm not seeing this on any other box I have deployed, just these.
 Now the load patterns are completely different here, and these are the
only two boxes with drbd & ocfs2 on them as well.  Not trying to throw
those under the bus, but it's worth pointing out any differences in my
setup that might be helpful for debugging.

- John 'Warthog9' Hawley
Comment 12 Alexey Dobriyan 2010-11-27 15:35:38 UTC
OK, nobody has ideas.

seq_release_private() does kfree(seq->private); which is void *

1. what files snmpd opens?
2. If Fedora ships with SLUB but with cache merging, what caches alias
   with 1) -- ls -l /sys/kernel/slab
Comment 13 Andy Walls 2010-12-21 16:23:50 UTC
With the exception of one instance of a lsof process opening a file, snmpd is always the process when the bug happens and it's always the sys_close system call.


Insight into the kernel bug might be obtained by auditing snmpd's use of fdopen(), fclose(), and close().  Specifically:

1. not close()-ing an fdopen()-ed fd after fclose()-ing the FILE *
(See man fdopen)

2. not fclose()-ing a NULL pointer.
(See man fclose)

Here's an actual close() after fdopen() & fclose() error in agent/snmpd.c:main():

            if ((PID = fdopen(fd, "w")) == NULL) {
                snmp_log_perror(pid_file);
                exit(1);
            } else {
                fprintf(PID, "%d\n", (int) getpid());
                fclose(PID);
            }
            close(fd);

snmpd is huge, ~65 source files call one of these functions, and most instances in the first 20 source files I checked were OK.

Maybe auditing lsof would be easier.

Regards,
Andy

Note You need to log in before you can comment on or make changes to this bug.