Bug 20702
Summary: | Kernel bug, possible double free, effecting kernel.org machines | ||
---|---|---|---|
Product: | Memory Management | Reporter: | John 'Warthog9' Hawley (warthog9) |
Component: | Slab Allocator | Assignee: | Andrew Morton (akpm) |
Status: | CLOSED OBSOLETE | ||
Severity: | normal | CC: | adobriyan, alan, awalls, diegocg |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | Subsystem: | ||
Regression: | No | Bisected commit-id: |
Description
John 'Warthog9' Hawley
2010-10-18 19:05:27 UTC
(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface). On Mon, 18 Oct 2010 19:05:30 GMT bugzilla-daemon@bugzilla.kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=20702 > > Summary: Kernel bug, possible double free, effecting kernel.org > machines > Product: Memory Management > Version: 2.5 > Platform: All > OS/Version: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: Slab Allocator > AssignedTo: akpm@linux-foundation.org > ReportedBy: warthog9@kernel.org > Regression: No > > > Discovered this on one of the kernel.org machines - it's been happening > semi-consistently on a pair of boxes. Seems like a double free somewhere and > at that point the whole box falls over dead basically. > > ------------[ cut here ]------------ > kernel BUG at mm/slub.c:2835! Gee. That's : void kfree(const void *x) : { : struct page *page; : void *object = (void *)x; : : trace_kfree(_RET_IP_, x); : : if (unlikely(ZERO_OR_NULL_PTR(x))) : return; : : page = virt_to_head_page(x); : if (unlikely(!PageSlab(page))) { : BUG_ON(!PageCompound(page)); <<---- : kmemleak_free(x); : put_page(page); : return; : } : slab_free(page->slab, page, object, _RET_IP_); : } and yes, I'd assume that we fed garbage to kfree(). > invalid opcode: 0000 [#1] SMP > last sysfs file: /sys/kernel/mm/ksm/run Question is: what is /sys/kernel/mm/ksm/run? I can't immediately locate any code in 2.6.34.7 which could create that. Is it something which RH added to that kernel? > CPU 1 > Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd nfs_acl > auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager > ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q garp > stp > llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs > exportfs > tg3 hpwdt amd64_edac_mod i2c_amd756 i2c_core edac_core shpchp k8temp amd_rng > edac_mce_amd microcode pata_acpi ata_generic cciss pata_amd [last unloaded: > scsi_wait_scan] > > Pid: 1713, comm: snmpd Not tainted 2.6.34.7-56.fc13.x86_64 #1 /ProLiant DL385 > G1 > RIP: 0010:[<ffffffff811006d6>] [<ffffffff811006d6>] kfree+0x5e/0xcb > RSP: 0018:ffff8801f6433df8 EFLAGS: 00010246 > RAX: 0040000000000400 RBX: ffff8803ed0eb9b0 RCX: ffff8803e9c92340 > RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002 > RBP: ffff8801f6433e18 R08: ffff8803e9c92958 R09: 0000000000000000 > R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002 > R13: ffffffff81125e27 R14: ffffffff8115dbdc R15: ffff8803dd19ea80 > FS: 00007ff4a31917a0(0000) GS:ffff880207400000(0000) knlGS:00000000f76fa6d0 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007ff4a31b2000 CR3: 00000001f6455000 CR4: 00000000000006e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process snmpd (pid: 1713, threadinfo ffff8801f6432000, task ffff8801f5688000) > Stack: > ffff8801f6433e18 ffff8803ed0eb9b0 ffff8803f61ec480 ffff8803ed0eb9b0 > <0> ffff8801f6433e48 ffffffff81125e27 ffff8801f6433e38 ffff8803dd19ea80 > <0> ffff8803ed0eb9b0 ffff8803e9c92940 ffff8801f6433e78 ffffffff8115dc10 > Call Trace: > [<ffffffff81125e27>] seq_release_private+0x28/0x44 > [<ffffffff8115dc10>] seq_release_net+0x34/0x3d > [<ffffffff81155ada>] proc_reg_release+0xd3/0xf0 > [<ffffffff8110efbb>] __fput+0x12a/0x1dc > [<ffffffff8110f087>] fput+0x1a/0x1c > [<ffffffff8110c0f7>] filp_close+0x68/0x72 > [<ffffffff8110c19e>] sys_close+0x9d/0xd2 > [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b > Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7 e8 67 > e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b eb fe 48 89 > f7 > e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e > RIP [<ffffffff811006d6>] kfree+0x5e/0xcb > RSP <ffff8801f6433df8> > ---[ end trace 1a4b1fd758dd1fdb ]--- But then again, that's a procfs trace, not a sysfs trace. proc_reg_release() makes my brain hurt. On Mon, Oct 18, 2010 at 01:48:56PM -0700, Andrew Morton wrote:
> > Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd nfs_acl
> > auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager
> > ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q garp
> stp
> > llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs
> exportfs
> > tg3 hpwdt amd64_edac_mod i2c_amd756 i2c_core edac_core shpchp k8temp
> amd_rng
> > edac_mce_amd microcode pata_acpi ata_generic cciss pata_amd [last unloaded:
> > scsi_wait_scan]
> >
> > Pid: 1713, comm: snmpd Not tainted 2.6.34.7-56.fc13.x86_64 #1 /ProLiant
> DL385
> > G1
> > RIP: 0010:[<ffffffff811006d6>] [<ffffffff811006d6>] kfree+0x5e/0xcb
> > RSP: 0018:ffff8801f6433df8 EFLAGS: 00010246
> > RAX: 0040000000000400 RBX: ffff8803ed0eb9b0 RCX: ffff8803e9c92340
> > RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002
> > RBP: ffff8801f6433e18 R08: ffff8803e9c92958 R09: 0000000000000000
> > R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002
> > R13: ffffffff81125e27 R14: ffffffff8115dbdc R15: ffff8803dd19ea80
> > FS: 00007ff4a31917a0(0000) GS:ffff880207400000(0000)
> knlGS:00000000f76fa6d0
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007ff4a31b2000 CR3: 00000001f6455000 CR4: 00000000000006e0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Process snmpd (pid: 1713, threadinfo ffff8801f6432000, task
> ffff8801f5688000)
> > Stack:
> > ffff8801f6433e18 ffff8803ed0eb9b0 ffff8803f61ec480 ffff8803ed0eb9b0
> > <0> ffff8801f6433e48 ffffffff81125e27 ffff8801f6433e38 ffff8803dd19ea80
> > <0> ffff8803ed0eb9b0 ffff8803e9c92940 ffff8801f6433e78 ffffffff8115dc10
> > Call Trace:
> > [<ffffffff81125e27>] seq_release_private+0x28/0x44
> > [<ffffffff8115dc10>] seq_release_net+0x34/0x3d
> > [<ffffffff81155ada>] proc_reg_release+0xd3/0xf0
> > [<ffffffff8110efbb>] __fput+0x12a/0x1dc
> > [<ffffffff8110f087>] fput+0x1a/0x1c
> > [<ffffffff8110c0f7>] filp_close+0x68/0x72
> > [<ffffffff8110c19e>] sys_close+0x9d/0xd2
> > [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
> > Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7 e8
> 67
> > e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b eb fe 48
> 89 f7
> > e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e
> > RIP [<ffffffff811006d6>] kfree+0x5e/0xcb
> > RSP <ffff8801f6433df8>
> > ---[ end trace 1a4b1fd758dd1fdb ]---
>
> But then again, that's a procfs trace, not a sysfs trace.
>
> proc_reg_release() makes my brain hurt.
That complex code triggers only when module unloads and removes
its /proc entries.
Just to confirm, rmmod wasn't executed, last module still is
scsi_wait_scan from boot sequence?
> Question is: what is /sys/kernel/mm/ksm/run? I can't immediately
> locate any code in 2.6.34.7 which could create that. Is it something
> which RH added to that kernel?
That's Kernel Samepage Merging, the virtualization thing. It's there since 2.6.32, it lives at mm/ksm.c:ksm_init.
(In reply to comment #2) > On Mon, Oct 18, 2010 at 01:48:56PM -0700, Andrew Morton wrote: > > > Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd nfs_acl > > > auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm > ocfs2_nodemanager > > > ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q > garp stp > > > llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs > exportfs > > > tg3 hpwdt amd64_edac_mod i2c_amd756 i2c_core edac_core shpchp k8temp > amd_rng > > > edac_mce_amd microcode pata_acpi ata_generic cciss pata_amd [last > unloaded: > > > scsi_wait_scan] > > > > > > Pid: 1713, comm: snmpd Not tainted 2.6.34.7-56.fc13.x86_64 #1 /ProLiant > DL385 > > > G1 > > > RIP: 0010:[<ffffffff811006d6>] [<ffffffff811006d6>] kfree+0x5e/0xcb > > > RSP: 0018:ffff8801f6433df8 EFLAGS: 00010246 > > > RAX: 0040000000000400 RBX: ffff8803ed0eb9b0 RCX: ffff8803e9c92340 > > > RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002 > > > RBP: ffff8801f6433e18 R08: ffff8803e9c92958 R09: 0000000000000000 > > > R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002 > > > R13: ffffffff81125e27 R14: ffffffff8115dbdc R15: ffff8803dd19ea80 > > > FS: 00007ff4a31917a0(0000) GS:ffff880207400000(0000) > knlGS:00000000f76fa6d0 > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > CR2: 00007ff4a31b2000 CR3: 00000001f6455000 CR4: 00000000000006e0 > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > > > Process snmpd (pid: 1713, threadinfo ffff8801f6432000, task > ffff8801f5688000) > > > Stack: > > > ffff8801f6433e18 ffff8803ed0eb9b0 ffff8803f61ec480 ffff8803ed0eb9b0 > > > <0> ffff8801f6433e48 ffffffff81125e27 ffff8801f6433e38 ffff8803dd19ea80 > > > <0> ffff8803ed0eb9b0 ffff8803e9c92940 ffff8801f6433e78 ffffffff8115dc10 > > > Call Trace: > > > [<ffffffff81125e27>] seq_release_private+0x28/0x44 > > > [<ffffffff8115dc10>] seq_release_net+0x34/0x3d > > > [<ffffffff81155ada>] proc_reg_release+0xd3/0xf0 > > > [<ffffffff8110efbb>] __fput+0x12a/0x1dc > > > [<ffffffff8110f087>] fput+0x1a/0x1c > > > [<ffffffff8110c0f7>] filp_close+0x68/0x72 > > > [<ffffffff8110c19e>] sys_close+0x9d/0xd2 > > > [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b > > > Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7 > e8 67 > > > e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b eb fe 48 > 89 f7 > > > e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e > > > RIP [<ffffffff811006d6>] kfree+0x5e/0xcb > > > RSP <ffff8801f6433df8> > > > ---[ end trace 1a4b1fd758dd1fdb ]--- > > > > But then again, that's a procfs trace, not a sysfs trace. > > > > proc_reg_release() makes my brain hurt. > > That complex code triggers only when module unloads and removes > its /proc entries. > > Just to confirm, rmmod wasn't executed, last module still is > scsi_wait_scan from boot sequence? I'm not sure that scsi_wait_scan was the last module pushed into the kernel at all. I've got two boxes (one of which is the box with the error above), nearly identical software wise, and neither has scsi_wait_scan even loaded (they are up and running right now). So I'm not sure where that came from, or what depended on that. The last module, on each machine, that got loaded during boot is ocfs2. This got loaded from the init script for it. scsi_wait_scan isn't even listed according to lsmod. I did not explicitly execute rmmod, restart ocfs2 or drbd (both of which mess with their respective modules being loaded), but those shouldn't have been doing much with their modules since they were up and running (switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface). ^^ See this? Please don't use the bugzilla interface for this bug! Nobody saw your reply. On Mon, 18 Oct 2010 23:56:22 GMT bugzilla-daemon@bugzilla.kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=20702 > > > > > > --- Comment #4 from John 'Warthog9' Hawley <warthog9@kernel.org> 2010-10-18 > 23:56:20 --- > (In reply to comment #2) > > On Mon, Oct 18, 2010 at 01:48:56PM -0700, Andrew Morton wrote: > > > > Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd > nfs_acl > > > > auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm > ocfs2_nodemanager > > > > ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q > garp stp > > > > llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs > exportfs > > > > tg3 hpwdt amd64_edac_mod i2c_amd756 i2c_core edac_core shpchp k8temp > amd_rng > > > > edac_mce_amd microcode pata_acpi ata_generic cciss pata_amd [last > unloaded: > > > > scsi_wait_scan] > > > > > > > > Pid: 1713, comm: snmpd Not tainted 2.6.34.7-56.fc13.x86_64 #1 /ProLiant > DL385 > > > > G1 > > > > RIP: 0010:[<ffffffff811006d6>] [<ffffffff811006d6>] kfree+0x5e/0xcb > > > > RSP: 0018:ffff8801f6433df8 EFLAGS: 00010246 > > > > RAX: 0040000000000400 RBX: ffff8803ed0eb9b0 RCX: ffff8803e9c92340 > > > > RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002 > > > > RBP: ffff8801f6433e18 R08: ffff8803e9c92958 R09: 0000000000000000 > > > > R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002 > > > > R13: ffffffff81125e27 R14: ffffffff8115dbdc R15: ffff8803dd19ea80 > > > > FS: 00007ff4a31917a0(0000) GS:ffff880207400000(0000) > knlGS:00000000f76fa6d0 > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > CR2: 00007ff4a31b2000 CR3: 00000001f6455000 CR4: 00000000000006e0 > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > > > > Process snmpd (pid: 1713, threadinfo ffff8801f6432000, task > ffff8801f5688000) > > > > Stack: > > > > ffff8801f6433e18 ffff8803ed0eb9b0 ffff8803f61ec480 ffff8803ed0eb9b0 > > > > <0> ffff8801f6433e48 ffffffff81125e27 ffff8801f6433e38 ffff8803dd19ea80 > > > > <0> ffff8803ed0eb9b0 ffff8803e9c92940 ffff8801f6433e78 ffffffff8115dc10 > > > > Call Trace: > > > > [<ffffffff81125e27>] seq_release_private+0x28/0x44 > > > > [<ffffffff8115dc10>] seq_release_net+0x34/0x3d > > > > [<ffffffff81155ada>] proc_reg_release+0xd3/0xf0 > > > > [<ffffffff8110efbb>] __fput+0x12a/0x1dc > > > > [<ffffffff8110f087>] fput+0x1a/0x1c > > > > [<ffffffff8110c0f7>] filp_close+0x68/0x72 > > > > [<ffffffff8110c19e>] sys_close+0x9d/0xd2 > > > > [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b > > > > Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7 > e8 67 > > > > e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b eb fe > 48 89 f7 > > > > e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e > > > > RIP [<ffffffff811006d6>] kfree+0x5e/0xcb > > > > RSP <ffff8801f6433df8> > > > > ---[ end trace 1a4b1fd758dd1fdb ]--- > > > > > > But then again, that's a procfs trace, not a sysfs trace. > > > > > > proc_reg_release() makes my brain hurt. > > > > That complex code triggers only when module unloads and removes > > its /proc entries. > > > > Just to confirm, rmmod wasn't executed, last module still is > > scsi_wait_scan from boot sequence? > > I'm not sure that scsi_wait_scan was the last module pushed into the kernel > at > all. I've got two boxes (one of which is the box with the error above), > nearly > identical software wise, and neither has scsi_wait_scan even loaded (they are > up and running right now). So I'm not sure where that came from, or what > depended on that. > > The last module, on each machine, that got loaded during boot is ocfs2. This > got loaded from the init script for it. scsi_wait_scan isn't even listed > according to lsmod. > > I did not explicitly execute rmmod, restart ocfs2 or drbd (both of which mess > with their respective modules being loaded), but those shouldn't have been > doing much with their modules since they were up and running > > -- > Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email > ------- You are receiving this mail because: ------- > You are the assignee for the bug. On 10/18/2010 05:00 PM, Andrew Morton wrote:
>
> (switched to email. Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> ^^ See this? Please don't use the bugzilla interface for this bug! Nobody
> saw your
> reply.
No I hadn't seen that, my apologies.
*** Bug 20672 has been marked as a duplicate of this bug. *** *** Bug 20682 has been marked as a duplicate of this bug. *** On 10/18/2010 02:11 PM, Alexey Dobriyan wrote:
> On Mon, Oct 18, 2010 at 01:48:56PM -0700, Andrew Morton wrote:
>>> Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd nfs_acl
>>> auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager
>>> ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q garp
>>> stp
>>> llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs
>>> exportfs
>>> tg3 hpwdt amd64_edac_mod i2c_amd756 i2c_core edac_core shpchp k8temp
>>> amd_rng
>>> edac_mce_amd microcode pata_acpi ata_generic cciss pata_amd [last unloaded:
>>> scsi_wait_scan]
>>>
>>> Pid: 1713, comm: snmpd Not tainted 2.6.34.7-56.fc13.x86_64 #1 /ProLiant
>>> DL385
>>> G1
>>> RIP: 0010:[<ffffffff811006d6>] [<ffffffff811006d6>] kfree+0x5e/0xcb
>>> RSP: 0018:ffff8801f6433df8 EFLAGS: 00010246
>>> RAX: 0040000000000400 RBX: ffff8803ed0eb9b0 RCX: ffff8803e9c92340
>>> RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002
>>> RBP: ffff8801f6433e18 R08: ffff8803e9c92958 R09: 0000000000000000
>>> R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002
>>> R13: ffffffff81125e27 R14: ffffffff8115dbdc R15: ffff8803dd19ea80
>>> FS: 00007ff4a31917a0(0000) GS:ffff880207400000(0000)
>>> knlGS:00000000f76fa6d0
>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 00007ff4a31b2000 CR3: 00000001f6455000 CR4: 00000000000006e0
>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>> Process snmpd (pid: 1713, threadinfo ffff8801f6432000, task
>>> ffff8801f5688000)
>>> Stack:
>>> ffff8801f6433e18 ffff8803ed0eb9b0 ffff8803f61ec480 ffff8803ed0eb9b0
>>> <0> ffff8801f6433e48 ffffffff81125e27 ffff8801f6433e38 ffff8803dd19ea80
>>> <0> ffff8803ed0eb9b0 ffff8803e9c92940 ffff8801f6433e78 ffffffff8115dc10
>>> Call Trace:
>>> [<ffffffff81125e27>] seq_release_private+0x28/0x44
>>> [<ffffffff8115dc10>] seq_release_net+0x34/0x3d
>>> [<ffffffff81155ada>] proc_reg_release+0xd3/0xf0
>>> [<ffffffff8110efbb>] __fput+0x12a/0x1dc
>>> [<ffffffff8110f087>] fput+0x1a/0x1c
>>> [<ffffffff8110c0f7>] filp_close+0x68/0x72
>>> [<ffffffff8110c19e>] sys_close+0x9d/0xd2
>>> [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
>>> Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7 e8
>>> 67
>>> e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b eb fe 48
>>> 89 f7
>>> e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e
>>> RIP [<ffffffff811006d6>] kfree+0x5e/0xcb
>>> RSP <ffff8801f6433df8>
>>> ---[ end trace 1a4b1fd758dd1fdb ]---
>>
>> But then again, that's a procfs trace, not a sysfs trace.
>>
>> proc_reg_release() makes my brain hurt.
>
> That complex code triggers only when module unloads and removes
> its /proc entries.
>
> Just to confirm, rmmod wasn't executed, last module still is
> scsi_wait_scan from boot sequence?
Bumping on this as I'm still seeing this, and hit it again today. Since
this seems to be happening more, is there anything I can add for
debugging to help with this?
Getting back to Alexey's comments. No there were not modules loaded or
unloaded, and I have no idea why scsi_wait_scan would have been loaded
or unloaded.
Is today's trace (From D2):
------------[ cut here ]------------
kernel BUG at mm/slub.c:2835!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/kernel/mm/ksm/run
CPU 0
Modules linked in: ocfs2 mptctl mptbase ipmi_devintf drbd lru_cache nfsd
lockd nfs_acl auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
ocfs2_nodemanager ocfs2_stackglue configfs 8021q garp stp llc
ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs
exportfs bnx2 ipmi_si ipmi_msghandler iTCO_wdt power_meter
iTCO_vendor_support microcode hpwdt serio_raw cciss hpsa radeon ttm
drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: speedstep_lib]
Pid: 2375, comm: snmpd Not tainted 2.6.34.7-61.korg.fc13.x86_64 #1
/ProLiant DL380 G6
RIP: 0010:[<ffffffff811006d6>] [<ffffffff811006d6>] kfree+0x5e/0xcb
RSP: 0018:ffff88080c12fdf8 EFLAGS: 00010246
RAX: 004000000010003c RBX: ffff8803e5a26e30 RCX: ffff880400d40c40
RDX: ffffea0000000000 RSI: ffffea000e000000 RDI: ffff880400000001
RBP: ffff88080c12fe18 R08: ffff880400d405d8 R09: 0000000000000000
R10: 0000000000000013 R11: 0000000000000246 R12: ffff880400000001
R13: ffffffff81125daf R14: ffffffff8115db64 R15: ffff8803f9da4b40
FS: 00007fd3dda897a0(0000) GS:ffff880002000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd3ddaab000 CR3: 000000080ce01000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process snmpd (pid: 2375, threadinfo ffff88080c12e000, task
ffff88080b58ddc0)
Stack:
ffff88080c12fe18 ffff8803e5a26e30 ffff88040c4a3e80 ffff8803e5a26e30
<0> ffff88080c12fe48 ffffffff81125daf ffff88080c12fe38 ffff8803f9da4b40
<0> ffff8803e5a26e30 ffff880400d405c0 ffff88080c12fe78 ffffffff8115db98
Call Trace:
[<ffffffff81125daf>] seq_release_private+0x28/0x44
[<ffffffff8115db98>] seq_release_net+0x34/0x3d
[<ffffffff81155a62>] proc_reg_release+0xd3/0xf0
[<ffffffff8110efb3>] __fput+0x12a/0x1d4
[<ffffffff8110f077>] fput+0x1a/0x1c
[<ffffffff8110c0f7>] filp_close+0x68/0x72
[<ffffffff8110c19e>] sys_close+0x9d/0xd2
[<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7
e8 67 e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b
eb fe 48 89 f7 e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e
RIP [<ffffffff811006d6>] kfree+0x5e/0xcb
RSP <ffff88080c12fdf8>
---[ end trace 60a902368ad4d4fe ]---
It seems to claim speedstep_lib was the last unloaded, which I wouldn't
be able to explain why that would have been unloaded at all, the running
system itself doesn't seem to be using it, so I can only assume it got
loaded on startup somewhere and was subsequently removed.
On D1 yesterday I saw the following:
------------[ cut here ]------------
kernel BUG at mm/slub.c:2835!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/kernel/mm/ksm/run
CPU 1
Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd
nfs_acl auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
ocfs2_nodemanager ocfs2_stackglue configfs cpufreq_ondemand powernow_k8
freq_table 8021q garp stp llc ip6t_REJECT nf_conntrack_ipv6
ip6table_filter ip6_tables ipv6 xfs exportfs amd64_edac_mod tg3
edac_core i2c_amd756 i2c_core hpwdt amd_rng k8temp shpchp edac_mce_amd
microcode pata_acpi ata_generic pata_amd cciss [last unloaded:
scsi_wait_scan]
Nov 23 06:50:04 demeter kernel:
Pid: 1737, comm: snmpd Not tainted 2.6.34.7-61.korg.fc13.x86_64 #1
/ProLiant DL385 G1
RIP: 0010:[<ffffffff811006d6>] [<ffffffff811006d6>] kfree+0x5e/0xcb
RSP: 0018:ffff8801eef2ddf8 EFLAGS: 00010246
RAX: 0040000000000400 RBX: ffff8803effe74b0 RCX: ffff8803e18ec6c0
RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002
RBP: ffff8801eef2de18 R08: ffff8803e18ec918 R09: 0000000000000000
R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002
R13: ffffffff81125daf R14: ffffffff8115db64 R15: ffff880044570540
FS: 00007f0fadd727a0(0000) GS:ffff880207400000(0000) knlGS:00000000f4f9bb70
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0fadd93000 CR3: 00000001efbb6000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process snmpd (pid: 1737, threadinfo ffff8801eef2c000, task
ffff8801ecf2c650)
Stack:
ffff8801eef2de18 ffff8803effe74b0 ffff8803d645ac00 ffff8803effe74b0
<0> ffff8801eef2de48 ffffffff81125daf ffff8801eef2de38 ffff880044570540
<0> ffff8803effe74b0 ffff8803e18ec900 ffff8801eef2de78 ffffffff8115db98
Call Trace:
[<ffffffff81125daf>] seq_release_private+0x28/0x44
[<ffffffff8115db98>] seq_release_net+0x34/0x3d
[<ffffffff81155a62>] proc_reg_release+0xd3/0xf0
[<ffffffff8110efb3>] __fput+0x12a/0x1d4
[<ffffffff8110f077>] fput+0x1a/0x1c
[<ffffffff8110c0f7>] filp_close+0x68/0x72
[<ffffffff8110c19e>] sys_close+0x9d/0xd2
[<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7
e8 67 e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b
eb fe 48 89 f7 e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e
RIP [<ffffffff811006d6>] kfree+0x5e/0xcb
RSP <ffff8801eef2ddf8>
---[ end trace 34b1e268e13a43a5 ]---
Which claims the last unloaded module was scsi_wait_scan. The module
unloads I'm going to assume are coming from random normal boot code, and
the errors are happening a significant time later anywhere from hours to
days after boot.
Considering I'm seeing this on completely different hardware, completely
different CPU's (AMD vs. Intel)
It's been suggested that I setup kmemleak on the boxes, despite the
overhead that will incur, but it might help point out the problem. Is
there anything else?
Note: I'm not seeing this on any other box I have deployed, just these.
Now the load patterns are completely different here, and these are the
only two boxes with drbd & ocfs2 on them as well. Not trying to throw
those under the bus, but it's worth pointing out any differences in my
setup that might be helpful for debugging.
- John 'Warthog9' Hawley
On Wed, 24 Nov 2010 15:25:16 -0800 "J.H." <warthog9@kernel.org> wrote: > On 10/18/2010 02:11 PM, Alexey Dobriyan wrote: > > On Mon, Oct 18, 2010 at 01:48:56PM -0700, Andrew Morton wrote: > >>> Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd nfs_acl > >>> auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm > ocfs2_nodemanager > >>> ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q > garp stp > >>> llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs > exportfs > >>> tg3 hpwdt amd64_edac_mod i2c_amd756 i2c_core edac_core shpchp k8temp > amd_rng > >>> edac_mce_amd microcode pata_acpi ata_generic cciss pata_amd [last > unloaded: > >>> scsi_wait_scan] > >>> > >>> Pid: 1713, comm: snmpd Not tainted 2.6.34.7-56.fc13.x86_64 #1 /ProLiant > DL385 > >>> G1 > >>> RIP: 0010:[<ffffffff811006d6>] [<ffffffff811006d6>] kfree+0x5e/0xcb > >>> RSP: 0018:ffff8801f6433df8 EFLAGS: 00010246 > >>> RAX: 0040000000000400 RBX: ffff8803ed0eb9b0 RCX: ffff8803e9c92340 > >>> RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002 > >>> RBP: ffff8801f6433e18 R08: ffff8803e9c92958 R09: 0000000000000000 > >>> R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002 > >>> R13: ffffffff81125e27 R14: ffffffff8115dbdc R15: ffff8803dd19ea80 > >>> FS: 00007ff4a31917a0(0000) GS:ffff880207400000(0000) > knlGS:00000000f76fa6d0 > >>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > >>> CR2: 00007ff4a31b2000 CR3: 00000001f6455000 CR4: 00000000000006e0 > >>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > >>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > >>> Process snmpd (pid: 1713, threadinfo ffff8801f6432000, task > ffff8801f5688000) > >>> Stack: > >>> ffff8801f6433e18 ffff8803ed0eb9b0 ffff8803f61ec480 ffff8803ed0eb9b0 > >>> <0> ffff8801f6433e48 ffffffff81125e27 ffff8801f6433e38 ffff8803dd19ea80 > >>> <0> ffff8803ed0eb9b0 ffff8803e9c92940 ffff8801f6433e78 ffffffff8115dc10 > >>> Call Trace: > >>> [<ffffffff81125e27>] seq_release_private+0x28/0x44 > >>> [<ffffffff8115dc10>] seq_release_net+0x34/0x3d > >>> [<ffffffff81155ada>] proc_reg_release+0xd3/0xf0 > >>> [<ffffffff8110efbb>] __fput+0x12a/0x1dc > >>> [<ffffffff8110f087>] fput+0x1a/0x1c > >>> [<ffffffff8110c0f7>] filp_close+0x68/0x72 > >>> [<ffffffff8110c19e>] sys_close+0x9d/0xd2 > >>> [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b > >>> Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7 > e8 67 > >>> e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b eb fe 48 > 89 f7 > >>> e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e > >>> RIP [<ffffffff811006d6>] kfree+0x5e/0xcb > >>> RSP <ffff8801f6433df8> > >>> ---[ end trace 1a4b1fd758dd1fdb ]--- > >> > >> But then again, that's a procfs trace, not a sysfs trace. > >> > >> proc_reg_release() makes my brain hurt. > > > > That complex code triggers only when module unloads and removes > > its /proc entries. > > > > Just to confirm, rmmod wasn't executed, last module still is > > scsi_wait_scan from boot sequence? > > Bumping on this as I'm still seeing this, and hit it again today. Since > this seems to be happening more, is there anything I can add for > debugging to help with this? > > Getting back to Alexey's comments. No there were not modules loaded or > unloaded, and I have no idea why scsi_wait_scan would have been loaded > or unloaded. > > Is today's trace (From D2): > > ------------[ cut here ]------------ > kernel BUG at mm/slub.c:2835! That's BUG_ON(!PageCompound(page)); ? > invalid opcode: 0000 [#1] SMP > last sysfs file: /sys/kernel/mm/ksm/run > CPU 0 > Modules linked in: ocfs2 mptctl mptbase ipmi_devintf drbd lru_cache nfsd > lockd nfs_acl auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm > ocfs2_nodemanager ocfs2_stackglue configfs 8021q garp stp llc > ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs > exportfs bnx2 ipmi_si ipmi_msghandler iTCO_wdt power_meter > iTCO_vendor_support microcode hpwdt serio_raw cciss hpsa radeon ttm > drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: speedstep_lib] > > Pid: 2375, comm: snmpd Not tainted 2.6.34.7-61.korg.fc13.x86_64 #1 > /ProLiant DL380 G6 > RIP: 0010:[<ffffffff811006d6>] [<ffffffff811006d6>] kfree+0x5e/0xcb > RSP: 0018:ffff88080c12fdf8 EFLAGS: 00010246 > RAX: 004000000010003c RBX: ffff8803e5a26e30 RCX: ffff880400d40c40 > RDX: ffffea0000000000 RSI: ffffea000e000000 RDI: ffff880400000001 > RBP: ffff88080c12fe18 R08: ffff880400d405d8 R09: 0000000000000000 > R10: 0000000000000013 R11: 0000000000000246 R12: ffff880400000001 > R13: ffffffff81125daf R14: ffffffff8115db64 R15: ffff8803f9da4b40 > FS: 00007fd3dda897a0(0000) GS:ffff880002000000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007fd3ddaab000 CR3: 000000080ce01000 CR4: 00000000000006f0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process snmpd (pid: 2375, threadinfo ffff88080c12e000, task > ffff88080b58ddc0) > Stack: > ffff88080c12fe18 ffff8803e5a26e30 ffff88040c4a3e80 ffff8803e5a26e30 > <0> ffff88080c12fe48 ffffffff81125daf ffff88080c12fe38 ffff8803f9da4b40 > <0> ffff8803e5a26e30 ffff880400d405c0 ffff88080c12fe78 ffffffff8115db98 > Call Trace: > [<ffffffff81125daf>] seq_release_private+0x28/0x44 > [<ffffffff8115db98>] seq_release_net+0x34/0x3d > [<ffffffff81155a62>] proc_reg_release+0xd3/0xf0 > [<ffffffff8110efb3>] __fput+0x12a/0x1d4 > [<ffffffff8110f077>] fput+0x1a/0x1c > [<ffffffff8110c0f7>] filp_close+0x68/0x72 > [<ffffffff8110c19e>] sys_close+0x9d/0xd2 > [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b > Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7 > e8 67 e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b > eb fe 48 89 f7 e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e > RIP [<ffffffff811006d6>] kfree+0x5e/0xcb > RSP <ffff88080c12fdf8> > ---[ end trace 60a902368ad4d4fe ]--- > > It seems to claim speedstep_lib was the last unloaded, which I wouldn't > be able to explain why that would have been unloaded at all, the running > system itself doesn't seem to be using it, so I can only assume it got > loaded on startup somewhere and was subsequently removed. > > On D1 yesterday I saw the following: > > ------------[ cut here ]------------ > kernel BUG at mm/slub.c:2835! > invalid opcode: 0000 [#1] SMP > last sysfs file: /sys/kernel/mm/ksm/run > CPU 1 > Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd > nfs_acl auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm > ocfs2_nodemanager ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 > freq_table 8021q garp stp llc ip6t_REJECT nf_conntrack_ipv6 > ip6table_filter ip6_tables ipv6 xfs exportfs amd64_edac_mod tg3 > edac_core i2c_amd756 i2c_core hpwdt amd_rng k8temp shpchp edac_mce_amd > microcode pata_acpi ata_generic pata_amd cciss [last unloaded: > scsi_wait_scan] > Nov 23 06:50:04 demeter kernel: > Pid: 1737, comm: snmpd Not tainted 2.6.34.7-61.korg.fc13.x86_64 #1 > /ProLiant DL385 G1 > RIP: 0010:[<ffffffff811006d6>] [<ffffffff811006d6>] kfree+0x5e/0xcb > RSP: 0018:ffff8801eef2ddf8 EFLAGS: 00010246 > RAX: 0040000000000400 RBX: ffff8803effe74b0 RCX: ffff8803e18ec6c0 > RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002 > RBP: ffff8801eef2de18 R08: ffff8803e18ec918 R09: 0000000000000000 > R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002 > R13: ffffffff81125daf R14: ffffffff8115db64 R15: ffff880044570540 > FS: 00007f0fadd727a0(0000) GS:ffff880207400000(0000) knlGS:00000000f4f9bb70 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007f0fadd93000 CR3: 00000001efbb6000 CR4: 00000000000006e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process snmpd (pid: 1737, threadinfo ffff8801eef2c000, task > ffff8801ecf2c650) > Stack: > ffff8801eef2de18 ffff8803effe74b0 ffff8803d645ac00 ffff8803effe74b0 > <0> ffff8801eef2de48 ffffffff81125daf ffff8801eef2de38 ffff880044570540 > <0> ffff8803effe74b0 ffff8803e18ec900 ffff8801eef2de78 ffffffff8115db98 > Call Trace: > [<ffffffff81125daf>] seq_release_private+0x28/0x44 > [<ffffffff8115db98>] seq_release_net+0x34/0x3d > [<ffffffff81155a62>] proc_reg_release+0xd3/0xf0 > [<ffffffff8110efb3>] __fput+0x12a/0x1d4 > [<ffffffff8110f077>] fput+0x1a/0x1c > [<ffffffff8110c0f7>] filp_close+0x68/0x72 > [<ffffffff8110c19e>] sys_close+0x9d/0xd2 > [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b > Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7 > e8 67 e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b > eb fe 48 89 f7 e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e > RIP [<ffffffff811006d6>] kfree+0x5e/0xcb > RSP <ffff8801eef2ddf8> > ---[ end trace 34b1e268e13a43a5 ]--- > > Which claims the last unloaded module was scsi_wait_scan. The module > unloads I'm going to assume are coming from random normal boot code, and > the errors are happening a significant time later anywhere from hours to > days after boot. > > Considering I'm seeing this on completely different hardware, completely > different CPU's (AMD vs. Intel) > > It's been suggested that I setup kmemleak on the boxes, despite the > overhead that will incur, but it might help point out the problem. Is > there anything else? > > Note: I'm not seeing this on any other box I have deployed, just these. > Now the load patterns are completely different here, and these are the > only two boxes with drbd & ocfs2 on them as well. Not trying to throw > those under the bus, but it's worth pointing out any differences in my > setup that might be helpful for debugging. > I can't begin to think what could have caused this :( On 10/18/2010 02:11 PM, Alexey Dobriyan wrote:
> On Mon, Oct 18, 2010 at 01:48:56PM -0700, Andrew Morton wrote:
>>> Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd nfs_acl
>>> auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager
>>> ocfs2_stackglue configfs cpufreq_ondemand powernow_k8 freq_table 8021q garp
>>> stp
>>> llc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs
>>> exportfs
>>> tg3 hpwdt amd64_edac_mod i2c_amd756 i2c_core edac_core shpchp k8temp
>>> amd_rng
>>> edac_mce_amd microcode pata_acpi ata_generic cciss pata_amd [last unloaded:
>>> scsi_wait_scan]
>>>
>>> Pid: 1713, comm: snmpd Not tainted 2.6.34.7-56.fc13.x86_64 #1 /ProLiant
>>> DL385
>>> G1
>>> RIP: 0010:[<ffffffff811006d6>] [<ffffffff811006d6>] kfree+0x5e/0xcb
>>> RSP: 0018:ffff8801f6433df8 EFLAGS: 00010246
>>> RAX: 0040000000000400 RBX: ffff8803ed0eb9b0 RCX: ffff8803e9c92340
>>> RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002
>>> RBP: ffff8801f6433e18 R08: ffff8803e9c92958 R09: 0000000000000000
>>> R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002
>>> R13: ffffffff81125e27 R14: ffffffff8115dbdc R15: ffff8803dd19ea80
>>> FS: 00007ff4a31917a0(0000) GS:ffff880207400000(0000)
>>> knlGS:00000000f76fa6d0
>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 00007ff4a31b2000 CR3: 00000001f6455000 CR4: 00000000000006e0
>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>> Process snmpd (pid: 1713, threadinfo ffff8801f6432000, task
>>> ffff8801f5688000)
>>> Stack:
>>> ffff8801f6433e18 ffff8803ed0eb9b0 ffff8803f61ec480 ffff8803ed0eb9b0
>>> <0> ffff8801f6433e48 ffffffff81125e27 ffff8801f6433e38 ffff8803dd19ea80
>>> <0> ffff8803ed0eb9b0 ffff8803e9c92940 ffff8801f6433e78 ffffffff8115dc10
>>> Call Trace:
>>> [<ffffffff81125e27>] seq_release_private+0x28/0x44
>>> [<ffffffff8115dc10>] seq_release_net+0x34/0x3d
>>> [<ffffffff81155ada>] proc_reg_release+0xd3/0xf0
>>> [<ffffffff8110efbb>] __fput+0x12a/0x1dc
>>> [<ffffffff8110f087>] fput+0x1a/0x1c
>>> [<ffffffff8110c0f7>] filp_close+0x68/0x72
>>> [<ffffffff8110c19e>] sys_close+0x9d/0xd2
>>> [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
>>> Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7 e8
>>> 67
>>> e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b eb fe 48
>>> 89 f7
>>> e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e
>>> RIP [<ffffffff811006d6>] kfree+0x5e/0xcb
>>> RSP <ffff8801f6433df8>
>>> ---[ end trace 1a4b1fd758dd1fdb ]---
>>
>> But then again, that's a procfs trace, not a sysfs trace.
>>
>> proc_reg_release() makes my brain hurt.
>
> That complex code triggers only when module unloads and removes
> its /proc entries.
>
> Just to confirm, rmmod wasn't executed, last module still is
> scsi_wait_scan from boot sequence?
Bumping on this as I'm still seeing this, and hit it again today. Since
this seems to be happening more, is there anything I can add for
debugging to help with this?
Getting back to Alexey's comments. No there were not modules loaded or
unloaded, and I have no idea why scsi_wait_scan would have been loaded
or unloaded.
Is today's trace (From D2):
------------[ cut here ]------------
kernel BUG at mm/slub.c:2835!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/kernel/mm/ksm/run
CPU 0
Modules linked in: ocfs2 mptctl mptbase ipmi_devintf drbd lru_cache nfsd
lockd nfs_acl auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
ocfs2_nodemanager ocfs2_stackglue configfs 8021q garp stp llc
ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 xfs
exportfs bnx2 ipmi_si ipmi_msghandler iTCO_wdt power_meter
iTCO_vendor_support microcode hpwdt serio_raw cciss hpsa radeon ttm
drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: speedstep_lib]
Pid: 2375, comm: snmpd Not tainted 2.6.34.7-61.korg.fc13.x86_64 #1
/ProLiant DL380 G6
RIP: 0010:[<ffffffff811006d6>] [<ffffffff811006d6>] kfree+0x5e/0xcb
RSP: 0018:ffff88080c12fdf8 EFLAGS: 00010246
RAX: 004000000010003c RBX: ffff8803e5a26e30 RCX: ffff880400d40c40
RDX: ffffea0000000000 RSI: ffffea000e000000 RDI: ffff880400000001
RBP: ffff88080c12fe18 R08: ffff880400d405d8 R09: 0000000000000000
R10: 0000000000000013 R11: 0000000000000246 R12: ffff880400000001
R13: ffffffff81125daf R14: ffffffff8115db64 R15: ffff8803f9da4b40
FS: 00007fd3dda897a0(0000) GS:ffff880002000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd3ddaab000 CR3: 000000080ce01000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process snmpd (pid: 2375, threadinfo ffff88080c12e000, task
ffff88080b58ddc0)
Stack:
ffff88080c12fe18 ffff8803e5a26e30 ffff88040c4a3e80 ffff8803e5a26e30
<0> ffff88080c12fe48 ffffffff81125daf ffff88080c12fe38 ffff8803f9da4b40
<0> ffff8803e5a26e30 ffff880400d405c0 ffff88080c12fe78 ffffffff8115db98
Call Trace:
[<ffffffff81125daf>] seq_release_private+0x28/0x44
[<ffffffff8115db98>] seq_release_net+0x34/0x3d
[<ffffffff81155a62>] proc_reg_release+0xd3/0xf0
[<ffffffff8110efb3>] __fput+0x12a/0x1d4
[<ffffffff8110f077>] fput+0x1a/0x1c
[<ffffffff8110c0f7>] filp_close+0x68/0x72
[<ffffffff8110c19e>] sys_close+0x9d/0xd2
[<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7
e8 67 e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b
eb fe 48 89 f7 e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e
RIP [<ffffffff811006d6>] kfree+0x5e/0xcb
RSP <ffff88080c12fdf8>
---[ end trace 60a902368ad4d4fe ]---
It seems to claim speedstep_lib was the last unloaded, which I wouldn't
be able to explain why that would have been unloaded at all, the running
system itself doesn't seem to be using it, so I can only assume it got
loaded on startup somewhere and was subsequently removed.
On D1 yesterday I saw the following:
------------[ cut here ]------------
kernel BUG at mm/slub.c:2835!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/kernel/mm/ksm/run
CPU 1
Modules linked in: ocfs2 mptctl mptbase drbd lru_cache nfsd lockd
nfs_acl auth_rpcgss sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
ocfs2_nodemanager ocfs2_stackglue configfs cpufreq_ondemand powernow_k8
freq_table 8021q garp stp llc ip6t_REJECT nf_conntrack_ipv6
ip6table_filter ip6_tables ipv6 xfs exportfs amd64_edac_mod tg3
edac_core i2c_amd756 i2c_core hpwdt amd_rng k8temp shpchp edac_mce_amd
microcode pata_acpi ata_generic pata_amd cciss [last unloaded:
scsi_wait_scan]
Nov 23 06:50:04 demeter kernel:
Pid: 1737, comm: snmpd Not tainted 2.6.34.7-61.korg.fc13.x86_64 #1
/ProLiant DL385 G1
RIP: 0010:[<ffffffff811006d6>] [<ffffffff811006d6>] kfree+0x5e/0xcb
RSP: 0018:ffff8801eef2ddf8 EFLAGS: 00010246
RAX: 0040000000000400 RBX: ffff8803effe74b0 RCX: ffff8803e18ec6c0
RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000002
RBP: ffff8801eef2de18 R08: ffff8803e18ec918 R09: 0000000000000000
R10: 0000000000000011 R11: 0000000000000246 R12: ffff880100000002
R13: ffffffff81125daf R14: ffffffff8115db64 R15: ffff880044570540
FS: 00007f0fadd727a0(0000) GS:ffff880207400000(0000) knlGS:00000000f4f9bb70
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0fadd93000 CR3: 00000001efbb6000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process snmpd (pid: 1737, threadinfo ffff8801eef2c000, task
ffff8801ecf2c650)
Stack:
ffff8801eef2de18 ffff8803effe74b0 ffff8803d645ac00 ffff8803effe74b0
<0> ffff8801eef2de48 ffffffff81125daf ffff8801eef2de38 ffff880044570540
<0> ffff8803effe74b0 ffff8803e18ec900 ffff8801eef2de78 ffffffff8115db98
Call Trace:
[<ffffffff81125daf>] seq_release_private+0x28/0x44
[<ffffffff8115db98>] seq_release_net+0x34/0x3d
[<ffffffff81155a62>] proc_reg_release+0xd3/0xf0
[<ffffffff8110efb3>] __fput+0x12a/0x1d4
[<ffffffff8110f077>] fput+0x1a/0x1c
[<ffffffff8110c0f7>] filp_close+0x68/0x72
[<ffffffff8110c19e>] sys_close+0x9d/0xd2
[<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
Code: ef ff 13 48 83 c3 08 48 83 3b 00 eb ec 49 83 fc 10 76 7d 4c 89 e7
e8 67 e4 ff ff 48 89 c6 48 8b 00 84 c0 78 14 66 a9 00 c0 75 04 <0f> 0b
eb fe 48 89 f7 e8 66 36 fd ff eb 57 48 8b 4d 08 48 8b 7e
RIP [<ffffffff811006d6>] kfree+0x5e/0xcb
RSP <ffff8801eef2ddf8>
---[ end trace 34b1e268e13a43a5 ]---
Which claims the last unloaded module was scsi_wait_scan. The module
unloads I'm going to assume are coming from random normal boot code, and
the errors are happening a significant time later anywhere from hours to
days after boot.
Considering I'm seeing this on completely different hardware, completely
different CPU's (AMD vs. Intel)
It's been suggested that I setup kmemleak on the boxes, despite the
overhead that will incur, but it might help point out the problem. Is
there anything else?
Note: I'm not seeing this on any other box I have deployed, just these.
Now the load patterns are completely different here, and these are the
only two boxes with drbd & ocfs2 on them as well. Not trying to throw
those under the bus, but it's worth pointing out any differences in my
setup that might be helpful for debugging.
- John 'Warthog9' Hawley
OK, nobody has ideas. seq_release_private() does kfree(seq->private); which is void * 1. what files snmpd opens? 2. If Fedora ships with SLUB but with cache merging, what caches alias with 1) -- ls -l /sys/kernel/slab With the exception of one instance of a lsof process opening a file, snmpd is always the process when the bug happens and it's always the sys_close system call. Insight into the kernel bug might be obtained by auditing snmpd's use of fdopen(), fclose(), and close(). Specifically: 1. not close()-ing an fdopen()-ed fd after fclose()-ing the FILE * (See man fdopen) 2. not fclose()-ing a NULL pointer. (See man fclose) Here's an actual close() after fdopen() & fclose() error in agent/snmpd.c:main(): if ((PID = fdopen(fd, "w")) == NULL) { snmp_log_perror(pid_file); exit(1); } else { fprintf(PID, "%d\n", (int) getpid()); fclose(PID); } close(fd); snmpd is huge, ~65 source files call one of these functions, and most instances in the first 20 source files I checked were OK. Maybe auditing lsof would be easier. Regards, Andy |